From: "Huang, Ying"
To: Mel Gorman
Cc: Andrew Morton, Peter Zijlstra, "Peter Xu", Johannes Weiner,
    "Vlastimil Babka", Matthew Wilcox, Will Deacon, Michel Lespinasse,
    Arjun Roy, "Kirill A. Shutemov"
Shutemov" Subject: Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault References: <20210329062651.2487905-1-ying.huang@intel.com> <20210330133310.GT15768@suse.de> <87a6qj8t92.fsf@yhuang6-desk1.ccr.corp.intel.com> <20210331131658.GV15768@suse.de> Date: Thu, 01 Apr 2021 08:52:29 +0800 In-Reply-To: <20210331131658.GV15768@suse.de> (Mel Gorman's message of "Wed, 31 Mar 2021 14:16:58 +0100") Message-ID: <875z16967m.fsf@yhuang6-desk1.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Mel Gorman writes: > On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote: >> Mel Gorman writes: >> >> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote: >> >> For NUMA balancing, in hint page fault handler, the faulting page will >> >> be migrated to the accessing node if necessary. During the migration, >> >> TLB will be shot down on all CPUs that the process has run on >> >> recently. Because in the hint page fault handler, the PTE will be >> >> made accessible before the migration is tried. The overhead of TLB >> >> shooting down is high, so it's better to be avoided if possible. In >> >> fact, if we delay mapping the page in PTE until migration, that can be >> >> avoided. This is what this patch doing. >> >> >> > >> > Why would the overhead be high? It was previously inaccessibly so it's >> > only parallel accesses making forward progress that trigger the need >> > for a flush. >> >> Sorry, I don't understand this. Although the page is inaccessible, the >> threads may access other pages, so TLB flushing is still necessary. >> > > You assert the overhead of TLB shootdown is high and yes, it can be > very high but you also said "the benchmark score has no visible changes" > indicating the TLB shootdown cost is not a major problem for the workload. > It does not mean we should ignore it though. > >> > >> > >> > If migration is attempted, then the time until the migration PTE is >> > created is variable. The page has to be isolated from the LRU so there >> > could be contention on the LRU lock, a new page has to be allocated and >> > that allocation potentially has to enter the page allocator slow path >> > etc. During that time, parallel threads make forward progress but with >> > the patch, multiple threads potentially attempt the allocation and fail >> > instead of doing real work. >> >> If my understanding of the code were correct, only the first thread will >> attempt the isolation and allocation. Because TestClearPageLRU() is >> called in >> >> migrate_misplaced_page() >> numamigrate_isolate_page() >> isolate_lru_page() >> >> And migrate_misplaced_page() will return 0 immediately if >> TestClearPageLRU() returns false. Then the second thread will make the >> page accessible and make forward progress. >> > > Ok, that's true. While additional work is done, the cost is reasonably > low -- lower than I initially imagined and with fewer side-effects. > >> But there's still some timing difference between the original and >> patched kernel. We have several choices to reduce the difference. >> >> 1. Check PageLRU() with PTL held in do_numa_page() >> >> If PageLRU() return false, do_numa_page() can make the page accessible >> firstly. So the second thread will make the page accessible earlier. >> >> 2. 
>> 2. Try to lock the page with PTL held in do_numa_page()
>>
>> If the try-lock succeeds, it's the first thread, so it can delay
>> mapping. If the try-lock fails, it may be the second thread, so it
>> will make the page accessible immediately. We need to teach
>> migrate_misplaced_page() to work with the page locked. This will
>> lengthen the time the page is locked. Is that a problem?
>>
>> 3. Check page_count() with PTL held in do_numa_page()
>>
>> The first thread will call get_page() in numa_migrate_prep(). So if
>> the second thread can detect that, it can make the page accessible
>> immediately. The difficulty is that it appears hard to identify the
>> expected page_count() for file pages. For anonymous pages, that is
>> much easier, so at least if a page passes the following test, we can
>> delay mapping,
>>
>>   PageAnon(page) && page_count(page) == page_mapcount(page) + !!PageSwapCache(page)
>>
>> This will disable the optimization for file pages. But it may be good
>> enough?
>>
>> Which one do you think is better? Maybe the first one is good enough?
>>
>
> The first one is probably the most straightforward but it's more
> important to figure out why interrupts were higher with at least one
> workload when the exact opposite is expected. Investigating which of
> options 1-3 is best and whether it's worth the duplicated check could
> be done as a separate patch.
>
>> > You should consider the following question -- is the potential saving
>> > of an IPI transmission enough to offset the cost of parallel accesses
>> > not making forward progress while one migration is set up and having
>> > different migration attempts collide?
>> >
>> > I have tests running just in case but I think the answer may be "no".
>> > So far only one useful test has completed (specjbb2005 with one VM per
>> > NUMA node) and it showed a mix of small gains and losses but with
>> > *higher* interrupts, contrary to what was expected from the changelog.
>>
>> That is hard to understand. Maybe it was caused by the bug you pointed
>> out (about was_writable)?
>>
>
> It's possible and I could not figure out what the rationale behind the
> change was :/
>
> Fix it and run it through your tests to make sure it works as you
> expect. Assuming it passes your tests and it's posted, I'll read it
> again and run it through a battery of tests. If it shows that interrupts
> are lower and is either neutral or improves performance in enough cases
> then I think it'll be ok. Even if it's only neutral in terms of
> performance but interrupts are lower, it'll be acceptable.

Will do. Thanks a lot for your help!

Best Regards,
Huang, Ying
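
For reference, a minimal C sketch of the anonymous-page test from option 3
above, assuming it would be called from do_numa_page() with the PTL held
before deciding whether to delay making the PTE accessible. The helper name
is hypothetical; the accessors are the standard struct page helpers quoted
in the thread.

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Hypothetical helper sketching option 3: only delay mapping the page
 * when no other thread can already hold an extra reference on it.
 */
static bool can_delay_numa_mapping(struct page *page)
{
	/* For file pages the expected refcount is hard to pin down. */
	if (!PageAnon(page))
		return false;

	/*
	 * An anonymous page's expected references are its page table
	 * mappings plus, possibly, the swap cache. Anything above that
	 * suggests another thread (e.g. one that already called
	 * get_page() in numa_migrate_prep()) is handling the migration,
	 * so the caller should make the page accessible immediately.
	 */
	return page_count(page) == page_mapcount(page) + !!PageSwapCache(page);
}

If a check like this fails, the caller would map the page right away,
which is the same fallback that options 1 and 2 aim for by other means.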