Date: Mon, 18 Mar 2024 18:31:10 +0800
From: Baolin Wang
To: David Hildenbrand, "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
 wangkefeng.wang@huawei.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 ryan.roberts@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing
X-Mailing-List: linux-kernel@vger.kernel.org
References: <903bf13fc3e68b8dc1f256570d78b55b2dd9c96f.1710493587.git.baolin.wang@linux.alibaba.com>
 <871q88vzc4.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <3bf2c3e1-44fd-4bc8-a97b-9da7b606aec0@linux.alibaba.com>
 <8e13bce5-e353-4258-9891-97158b8ccd84@redhat.com>
In-Reply-To: <8e13bce5-e353-4258-9891-97158b8ccd84@redhat.com>

On 2024/3/18 18:15, David Hildenbrand wrote:
> On 18.03.24 11:13, Baolin Wang wrote:
>>
>>
>> On 2024/3/18 17:48, David Hildenbrand wrote:
>>> On 18.03.24 10:42, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/3/18 14:16, Huang, Ying wrote:
>>>>> Baolin Wang writes:
>>>>>
>>>>>> Now the anonymous page allocation already supports multi-size THP
>>>>>> (mTHP), but the numa balancing still prohibits mTHP migration even
>>>>>> though it is an exclusive mapping, which is unreasonable. Thus let's
>>>>>> support the exclusive mTHP numa balancing firstly.
>>>>>>
>>>>>> Allow scanning mTHP:
>>>>>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data
>>>>>> section pages") skips shared CoW pages' NUMA page migration to avoid
>>>>>> shared data segment migration. In addition, commit 80d47f5de5e3 ("mm:
>>>>>> don't try to NUMA-migrate COW pages that have other uses") changed to
>>>>>> use page_count() to avoid GUP pages migration, that will also skip the
>>>>>> mTHP numa scanning. Theoretically, we can use folio_maybe_dma_pinned()
>>>>>> to detect the GUP issue, although there is still a GUP race, the issue
>>>>>> seems to have been resolved by commit 80d47f5de5e3. Meanwhile, use the
>>>>>> folio_estimated_sharers() to skip shared CoW pages though this is not
>>>>>> a precise sharers count. To check if the folio is shared, ideally we
>>>>>> want to make sure every page is mapped to the same process, but doing
>>>>>> that seems expensive and using the estimated mapcount seems can work
>>>>>> when running autonuma benchmark.
>>>>>>
>>>>>> Allow migrating mTHP:
>>>>>> As mentioned in the previous thread[1], large folios are more
>>>>>> susceptible to false sharing issues, leading to pages ping-pong back
>>>>>> and forth during numa balancing, which is currently hard to resolve.
>>>>>> Therefore, as a start to support mTHP numa balancing, only exclusive
>>>>>> mappings are allowed to perform numa migration to avoid the false
>>>>>> sharing issues with large folios. Similarly, use the estimated
>>>>>> mapcount to skip shared mappings, which seems can work in most cases
>>>>>> (?), and we've used folio_estimated_sharers() to skip shared mappings
>>>>>> in migrate_misplaced_folio() for numa balancing, seems no real
>>>>>> complaints.
>>>>>
>>>>> IIUC, folio_estimated_sharers() cannot identify multi-thread
>>>>> applications.  If some mTHP is shared by multiple threads in one
>>>>
>>>> Right.
>>>>
>>>
>>> Wasn't this "false sharing" previously raised/described by Mel in this
>>> context?
>>
>> Yes, I got confused with the process's false sharing.
>>
>>>>> process, how to deal with that?
>>>>
>>>> IMHO, seems the should_numa_migrate_memory() already did something to
>>>> help?
>>>>
>>>> ......
>>>>      if (!cpupid_pid_unset(last_cpupid) &&
>>>>                  cpupid_to_nid(last_cpupid) != dst_nid)
>>>>          return false;
>>>>
>>>>      /* Always allow migrate on private faults */
>>>>      if (cpupid_match_pid(p, last_cpupid))
>>>>          return true;
>>>> ......
>>>>
>>>> If the node of the CPU that accessed the mTHP last time is different
>>>> from this time, which means there is some contention for that mTHP
>>>> among threads. So it will not allow migration.
>>>>
>>>> If the contention for the mTHP among threads is light or the accessing
>>>> is relatively stable, then we can allow migration?
>>>>
>>>>> For example, I think that we should avoid to migrate on the first
>>>>> fault for mTHP in should_numa_migrate_memory().
>>>>>
>>>>> More thoughts?  Can we add a field in struct folio for mTHP to count
>>>>> hint page faults from the same node?
>>>>
>>>> IIUC, we do not need add a new field for folio, seems we can reuse
>>>> ->_flags_2a field. But how to use it? If there are multiple consecutive
>>>> NUMA faults from the same node, then allow migration?
>>>
>>> _flags_2a cannot be used. You could place something after _deferred_list
>>
>> Could you be more explicit? I didn't see _flags_2 currently being used,
>> did I miss something?
>
> Yes, that we use it implicitly via page->flags on subpages (for example,
> some flags are still per-subpage and not per-folio).

Yes, thanks for reminding:)
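
To make the open question from earlier in the thread concrete ("If there
are multiple consecutive NUMA faults from the same node, then allow
migration?"), here is a minimal userspace sketch of that policy. It is
only an illustration, not kernel code and not a proposed patch: struct
hint_fault_state, its two helpers, NID_NONE and the threshold parameter
are all hypothetical names, and where such state could actually live in
struct folio is exactly what is still being discussed above.

```c
#include <assert.h>
#include <stdbool.h>

#define NID_NONE (-1)	/* hypothetical marker: no hinting fault seen yet */

/*
 * Hypothetical per-folio bookkeeping. This is NOT an existing struct folio
 * field; it only models "count hint page faults from the same node".
 */
struct hint_fault_state {
	int last_nid;		/* node of the most recent hinting fault */
	unsigned int streak;	/* consecutive hinting faults from last_nid */
};

static void hint_fault_state_init(struct hint_fault_state *s)
{
	s->last_nid = NID_NONE;
	s->streak = 0;
}

/*
 * Record a hinting fault from node @nid and report whether migration
 * toward @nid would be allowed under this toy policy: only after
 * @threshold consecutive faults from the same node. A fault from a
 * different node restarts the streak, so a folio that is bounced
 * between nodes by several threads never reaches the threshold.
 */
static bool hint_fault_should_migrate(struct hint_fault_state *s, int nid,
				      unsigned int threshold)
{
	assert(threshold > 0);

	if (s->last_nid == nid) {
		/* Saturate at the threshold so the counter cannot overflow. */
		if (s->streak < threshold)
			s->streak++;
	} else {
		s->last_nid = nid;
		s->streak = 1;
	}

	return s->streak >= threshold;
}
```

With threshold == 2 this also covers the "avoid to migrate on the first
fault" suggestion: the first fault from a node only starts the streak,
the second one from the same node allows migration, and any fault from
another node in between resets it.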