From: "Huang, Ying"
To: Donet Tom
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Aneesh Kumar, Michal Hocko, Dave Hansen, Mel Gorman, Feng Tang,
	Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel,
	Johannes Weiner, Matthew Wilcox, Vlastimil Babka, Dan Williams,
	Hugh Dickins, Kefeng Wang, Suren Baghdasaryan
Subject: Re: [PATCH v3 2/2] mm/numa_balancing: Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
In-Reply-To: (Donet Tom's message of "Fri, 22 Mar 2024 15:35:58 +0530")
References: <87h6gyr7jf.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 25 Mar 2024 10:48:58 +0800
Message-ID: <875xxbqb51.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)

Donet Tom writes:

> On 3/22/24 14:02, Huang, Ying wrote:
>> Donet Tom writes:
>>
>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
>>> nodes") added support for migrate on protnone reference with MPOL_BIND
>>> memory policy. This allowed numa fault migration when the executing node
>>> is part of the policy mask for MPOL_BIND. This patch extends migration
>>> support to MPOL_PREFERRED_MANY policy.
>>>
>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use
>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
>>> the kernel should not allocate pages from the slower memory tier via
>>> allocation control zonelist fallback. Instead, we should move cold pages
>>> from the faster memory node via memory demotion. For a page allocation,
>>> kswapd is only woken up after we try to allocate pages from all nodes in
>>> the allocation zone list. This implies that, without using memory
>>> policies, we will end up allocating hot pages in the slower memory tier.
>>>
>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
>>> allocation control when we have memory tiers in the system. With
>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
>>> of faster memory nodes. When we fail to allocate pages from the faster
>>> memory node, kswapd would be woken up, allowing demotion of cold pages
>>> to slower memory nodes.
>>>
>>> With the current kernel, such usage of memory policies implies we can't
>>> do page promotion from a slower memory tier to a faster memory tier
>>> using numa fault. This patch fixes this issue.
>>>
>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>> mask, we allow numa migration to the executing nodes. If the executing
>>> node is not in the policy node mask, we do not allow numa migration.
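(As a side note for readers of the archive: from userspace, the
combination is used roughly as in the sketch below. This is an
illustrative, untested example, not part of the patch, and the node
numbers are made up. Before this series, set_mempolicy() rejects
MPOL_PREFERRED_MANY combined with MPOL_F_NUMA_BALANCING with EINVAL.)

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>	/* MPOL_PREFERRED_MANY, MPOL_F_NUMA_BALANCING */

int main(void)
{
	/* Fast DRAM nodes N0 and N1; the slow tier (e.g. N6 in the
	 * discussion below) is deliberately left out of the mask. */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	/* Prefer the fast nodes for allocation, and let NUMA balancing
	 * promote hot pages into the mask on NUMA hint faults. */
	if (syscall(SYS_set_mempolicy,
		    MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
		    &nodemask, sizeof(nodemask) * 8) == -1) {
		perror("set_mempolicy");
		return 1;
	}

	/* ... allocate and touch memory as usual ... */
	return 0;
}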
>> Can we provide more information about this? I suggest to use an
>> example, for instance, pages may be distributed among multiple sockets
>> unexpectedly.
>
> Thank you for your suggestion. However, this commit message explains all
> the scenarios.

Yes. The commit message is correct and covers many cases. What I
suggested is to describe why we do it. An example cannot cover every
possibility, but it is easy to understand. For example, something like
the below?

For example, on a 2-socket system, there are N0, N1, N2 in socket 0 and
N3 in socket 1. N0, N1, N3 have fast memory and CPUs, while N2 has slow
memory and no CPU. For a workload, we may use MPOL_PREFERRED_MANY with
a nodemask with N0 and N1 set, because the workload runs on the CPUs of
socket 0 most of the time. Then, even if the workload runs on the CPUs
of N3 occasionally, we will not try to migrate the workload pages from
N2 to N3, because users may want to avoid cross-socket access as much
as possible in the long term.
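In code terms, that rule would be roughly the sketch below (illustrative
only, with a made-up helper name, not the literal patch). Whether and
where a page is then migrated is decided by the usual NUMA fault path;
this only covers the allow/deny decision described above.

/*
 * Sketch: for a task running on @exec_node under MPOL_PREFERRED_MANY
 * with MPOL_F_NUMA_BALANCING set, may a NUMA hint fault migrate the
 * faulting page towards the executing node?
 */
static bool preferred_many_may_numa_migrate(struct mempolicy *pol,
					    int exec_node)
{
	/*
	 * Migrate only when the executing node is one of the preferred
	 * nodes. An occasional run on a CPU outside the mask (e.g. on
	 * another socket) must not drag pages out of the preferred set.
	 */
	return node_isset(exec_node, pol->nodes);
}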
> For example, consider a system with 3 numa nodes (N0, N1 and N6).
> N0 and N1 are tier 1 DRAM nodes and N6 is a tier 2 PMEM node.
>
> Scenario 1: The process is executing on N1, and the executing node is
> in the policy node mask.
> Curr Loc Pages - the numa node where the page is present (folio node)
> ======================================================================
> Process    Policy       Curr Loc Pages    Observations
> ----------------------------------------------------------------------
> N1         N0 N1 N6     N0                Pages Migrated from N0 to N1
> N1         N0 N1 N6     N6                Pages Migrated from N6 to N1
> N1         N0 N1        N1                Pages Migrated from N1 to N6

Pages are not Migrating ?

> N1         N0 N1        N6                Pages Migrated from N6 to N1
> ----------------------------------------------------------------------
>
> Scenario 2: The process is executing on N1, and the executing node is
> NOT in the policy node mask.
> Curr Loc Pages - the numa node where the page is present (folio node)
> ======================================================================
> Process    Policy       Curr Loc Pages    Observations
> ----------------------------------------------------------------------
> N1         N0 N6        N0                Pages are not Migrating
> N1         N0 N6        N6                Pages are not Migrating
> N1         N0           N0                Pages are not Migrating
> ----------------------------------------------------------------------
>
> Scenario 3: The process is executing on N1, and the executing node and
> folio nodes are NOT in the policy node mask.
> Curr Loc Pages - the numa node where the page is present (folio node)
> ======================================================================
> Thread     Policy       Curr Loc Pages    Observations
> ----------------------------------------------------------------------
> N1         N0           N6                Pages are not Migrating
> N1         N6           N0                Pages are not Migrating
> ----------------------------------------------------------------------
>
> We can conclude that even if the pages are distributed among multiple
> sockets, if the executing node is in the policy node mask, we allow
> numa migration to the executing nodes. If the executing node is not in
> the policy node mask, we do not allow numa migration.
>
> [snip]

--
Best Regards,
Huang, Ying