Date: Tue, 30 Mar 2021 14:33:10 +0100
From: Mel Gorman
To: Huang Ying
Cc: linux-mm@kvack.org, Andrew Morton, linux-kernel@vger.kernel.org,
	Peter Zijlstra, Peter Xu, Johannes Weiner, Vlastimil Babka,
	Matthew Wilcox, Will Deacon, Michel Lespinasse, Arjun Roy,
	"Kirill A. Shutemov"
Subject: Re: [RFC] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault
Message-ID: <20210330133310.GT15768@suse.de>
References: <20210329062651.2487905-1-ying.huang@intel.com>
In-Reply-To: <20210329062651.2487905-1-ying.huang@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
> For NUMA balancing, in hint page fault handler, the faulting page will
> be migrated to the accessing node if necessary.
> During the migration, the TLB will be shot down on all CPUs that the
> process has run on recently, because in the hint page fault handler,
> the PTE will be made accessible before the migration is tried. The
> overhead of TLB shootdown is high, so it's better to avoid it if
> possible. In fact, if we delay mapping the page in the PTE until
> migration, it can be avoided. This is what this patch does.
>

Why would the overhead be high? The PTE was previously inaccessible, so
it's only parallel accesses making forward progress that trigger the
need for a flush. As your changelog notes -- "The benchmark score has
no visible changes". The patch was neither a win nor a loss for your
target workload, but there are more fundamental issues to consider.

> We have tested the patch with the pmbench memory accessing benchmark
> on a 2-socket Intel server, and found that the number of TLB
> shootdown IPIs reduces by up to 99% (from ~6.0e6 to ~2.3e4) if NUMA
> balancing is triggered (~8.8e6 pages migrated). The benchmark score
> has no visible changes.
>
> Known issues:
>
> For multi-threaded applications, it's possible that the page is
> accessed by 2 threads almost at the same time. In the original
> implementation, the second thread may access the page directly
> because the first thread has installed the accessible PTE. With this
> patch, there is a window in which the second thread will find the PTE
> still inaccessible. But the inaccessible window is small, because the
> page will soon be made inaccessible for migration anyway.
>

If multiple threads trap the hinting fault, only one potentially
attempts a migration as the others observe that the PTE has changed
when the PTL is acquired and return to userspace. Such threads then
have a short window to make progress before the PTE *potentially*
becomes a migration PTE and, during that window, the parallel access
may not need the page any more and never stall on the migration.
That migration PTE may never be created if migrate_misplaced_page
chooses to ignore the PTE, in which case there is minimal disruption.
If migration is attempted, then the time until the migration PTE is
created is variable. The page has to be isolated from the LRU so there
could be contention on the LRU lock, a new page has to be allocated,
and that allocation potentially has to enter the page allocator slow
path, etc. During that time, parallel threads make forward progress,
but with the patch, multiple threads potentially attempt the
allocation and fail instead of doing real work.

You should consider the following question -- is the potential saving
of an IPI transmission enough to offset the cost of parallel accesses
not making forward progress while one migration is set up, and of
different migration attempts colliding? I have tests running just in
case, but I think the answer may be "no". So far only one useful test
has completed (specjbb2005 with one VM per NUMA node) and it showed a
mix of small gains and losses, but with *higher* interrupts, contrary
to what was expected from the changelog. For some thread counts, the
results showed large differences in variability, sometimes lower and
sometimes much higher. It makes me think a workload should be
identified whose benefit from the IPI savings is enough to justify
stalling parallel accesses that could be making forward progress.

One nit below.

> Signed-off-by: "Huang, Ying"
> Cc: Peter Zijlstra
> Cc: Mel Gorman
> Cc: Peter Xu
> Cc: Johannes Weiner
> Cc: Vlastimil Babka
> Cc: "Matthew Wilcox"
> Cc: Will Deacon
> Cc: Michel Lespinasse
> Cc: Arjun Roy
> Cc: "Kirill A. Shutemov"
> ---
>  mm/memory.c | 54 +++++++++++++++++++++++++++++++----------------------
>  1 file changed, 32 insertions(+), 22 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d3273bd69dbb..a9a8ed1ac06c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4148,29 +4148,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  		goto out;
>  	}
>
> -	/*
> -	 * Make it present again, Depending on how arch implementes non
> -	 * accessible ptes, some can allow access by kernel mode.
> -	 */
> -	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> +	/* Get the normal PTE */
> +	old_pte = ptep_get(vmf->pte);
>  	pte = pte_modify(old_pte, vma->vm_page_prot);
> -	pte = pte_mkyoung(pte);
> -	if (was_writable)
> -		pte = pte_mkwrite(pte);
> -	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
> -	update_mmu_cache(vma, vmf->address, vmf->pte);
>
>  	page = vm_normal_page(vma, vmf->address, pte);
> -	if (!page) {
> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
> -		return 0;
> -	}
> +	if (!page)
> +		goto out_map;
>
>  	/* TODO: handle PTE-mapped THP */
> -	if (PageCompound(page)) {
> -		pte_unmap_unlock(vmf->pte, vmf->ptl);
> -		return 0;
> -	}
> +	if (PageCompound(page))
> +		goto out_map;
>
>  	/*
>  	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
> @@ -4180,7 +4168,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	 * pte_dirty has unpredictable behaviour between PTE scan updates,
>  	 * background writeback, dirty balancing and application behaviour.
>  	 */
> -	if (!pte_write(pte))
> +	if (was_writable)
>  		flags |= TNF_NO_GROUP;
>
>  	/*

Not clear why this change was made.
> @@ -4194,23 +4182,45 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	page_nid = page_to_nid(page);
>  	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
>  			&flags);
> -	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	if (target_nid == NUMA_NO_NODE) {
>  		put_page(page);
> -		goto out;
> +		goto out_map;
>  	}
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
>
>  	/* Migrate to the requested node */
>  	if (migrate_misplaced_page(page, vma, target_nid)) {
>  		page_nid = target_nid;
>  		flags |= TNF_MIGRATED;
> -	} else
> +	} else {
>  		flags |= TNF_MIGRATE_FAIL;
> +		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +		spin_lock(vmf->ptl);
> +		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> +			pte_unmap_unlock(vmf->pte, vmf->ptl);
> +			goto out;
> +		}
> +		goto out_map;
> +	}
>
>  out:
>  	if (page_nid != NUMA_NO_NODE)
>  		task_numa_fault(last_cpupid, page_nid, 1, flags);
>  	return 0;
> +out_map:
> +	/*
> +	 * Make it present again, Depending on how arch implementes non
> +	 * accessible ptes, some can allow access by kernel mode.
> +	 */
> +	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> +	pte = pte_modify(old_pte, vma->vm_page_prot);
> +	pte = pte_mkyoung(pte);
> +	if (was_writable)
> +		pte = pte_mkwrite(pte);
> +	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
> +	update_mmu_cache(vma, vmf->address, vmf->pte);
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +	goto out;
>  }
>
>  static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
> --
> 2.30.2

-- 
Mel Gorman
SUSE Labs