Message-ID: <00a3ba1d-98e1-409b-ae6e-7fbcbdcd74d5@arm.com>
Date: Wed, 13 Mar 2024 09:36:52 +0000
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
To: Barry Song <21cnbao@gmail.com>
Cc: Lance Yang, Andrew Morton, David Hildenbrand, Matthew Wilcox,
    Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
    Chris Li,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240311150058.1122862-1-ryan.roberts@arm.com>
 <20240311150058.1122862-7-ryan.roberts@arm.com>
From: Ryan Roberts

On 13/03/2024 09:16, Barry Song wrote:
> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the subsequent folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would likely have led to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficient than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - it's a team effort - I had your PoC and David's previous batching work
>> to use as a template.
>>
>>>
>>>> ---
>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>          LIST_HEAD(folio_list);
>>>>          bool pageout_anon_only_filter;
>>>>          unsigned int batch_count = 0;
>>>> +        int nr;
>>>>
>>>>          if (fatal_signal_pending(current))
>>>>                  return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                  return 0;
>>>>          flush_tlb_batched_pending(mm);
>>>>          arch_enter_lazy_mmu_mode();
>>>> -        for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> +        for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> +                nr = 1;
>>>>                  ptent = ptep_get(pte);
>>>>
>>>>                  if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                          continue;
>>>>
>>>>                  /*
>>>> -                 * Creating a THP page is expensive so split it only if we
>>>> -                 * are sure it's worth. Split it if we are only owner.
>>>> +                 * If we encounter a large folio, only split it if it is not
>>>> +                 * fully mapped within the range we are operating on. Otherwise
>>>> +                 * leave it as is so that it can be swapped out whole. If we
>>>> +                 * fail to split a folio, leave it in place and advance to the
>>>> +                 * next pte in the range.
>>>>                   */
>>>>                  if (folio_test_large(folio)) {
>>>> -                        int err;
>>>> -
>>>> -                        if (folio_estimated_sharers(folio) > 1)
>>>> -                                break;
>>>> -                        if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> -                                break;
>>>> -                        if (!folio_trylock(folio))
>>>> -                                break;
>>>> -                        folio_get(folio);
>>>> -                        arch_leave_lazy_mmu_mode();
>>>> -                        pte_unmap_unlock(start_pte, ptl);
>>>> -                        start_pte = NULL;
>>>> -                        err = split_folio(folio);
>>>> -                        folio_unlock(folio);
>>>> -                        folio_put(folio);
>>>> -                        if (err)
>>>> -                                break;
>>>> -                        start_pte = pte =
>>>> -                                pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> -                        if (!start_pte)
>>>> -                                break;
>>>> -                        arch_enter_lazy_mmu_mode();
>>>> -                        pte--;
>>>> -                        addr -= PAGE_SIZE;
>>>> -                        continue;
>>>> +                        const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                                FPB_IGNORE_SOFT_DIRTY;
>>>> +                        int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> +                                             fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>> if (pte_pfn(pte) == folio_pfn(folio)
>
> What about
>
> "if (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
>
> just to account for cases where the user's end address falls within
> the middle of a large folio?

Yes, even better. I'll add this for the next version.
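
For illustration only, here is a rough, untested sketch of how that combined
check could sit in front of folio_pte_batch(). It uses ptent (the value read
by ptep_get()) rather than the pte pointer, since pte_pfn() takes a pte value,
and it assumes a local nr_pages holding folio_nr_pages(folio); treat it as a
sketch of the idea above, not the final code:

if (folio_test_large(folio)) {
        const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
                                FPB_IGNORE_SOFT_DIRTY;
        int max_nr = (end - addr) / PAGE_SIZE;
        int nr_pages = folio_nr_pages(folio);

        /*
         * Only batch if this pte maps the first page of the folio and the
         * whole folio could still fit in the remainder of the range;
         * otherwise it cannot be a full mapping, so leave nr == 1 and fall
         * through to the split path below.
         */
        if (pte_pfn(ptent) == folio_pfn(folio) && max_nr >= nr_pages)
                nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                     fpb_flags, NULL);

        if (nr < nr_pages) {
                /* existing split handling as in the diff above */
        }
}

If the check fails, nr stays at 1 and nr < nr_pages, so we fall through to
the split path just as the diff above does.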
>
> BTW, another minor issue is here:
>
>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>                         batch_count = 0;
>                         if (need_resched()) {
>                                 arch_leave_lazy_mmu_mode();
>                                 pte_unmap_unlock(start_pte, ptl);
>                                 cond_resched();
>                                 goto restart;
>                         }
>                 }
>
> We are increasing batch_count by 1 for nr ptes, thus we are holding the
> PTL longer than in the small folios case? We used to increase it by 1 for
> each PTE. Does it matter?

I thought about that, but the vast majority of the work is per-folio, not
per-pte. So I concluded it would be best to continue to increment per-folio.

>
>> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                      fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
>>
>>>
>>>> +
>>>> +                        if (nr < folio_nr_pages(folio)) {
>>>> +                                int err;
>>>> +
>>>> +                                if (folio_estimated_sharers(folio) > 1)
>>>> +                                        continue;
>>>> +                                if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> +                                        continue;
>>>> +                                if (!folio_trylock(folio))
>>>> +                                        continue;
>>>> +                                folio_get(folio);
>>>> +                                arch_leave_lazy_mmu_mode();
>>>> +                                pte_unmap_unlock(start_pte, ptl);
>>>> +                                start_pte = NULL;
>>>> +                                err = split_folio(folio);
>>>> +                                folio_unlock(folio);
>>>> +                                folio_put(folio);
>>>> +                                if (err)
>>>> +                                        continue;
>>>> +                                start_pte = pte =
>>>> +                                        pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> +                                if (!start_pte)
>>>> +                                        break;
>>>> +                                arch_enter_lazy_mmu_mode();
>>>> +                                nr = 0;
>>>> +                                continue;
>>>> +                        }
>>>>                  }
>>>>
>>>>                  /*
>>>>                   * Do not interfere with other mappings of this folio and
>>>> -                 * non-LRU folio.
>>>> +                 * non-LRU folio. If we have a large folio at this point, we
>>>> +                 * know it is fully mapped so if its mapcount is the same as its
>>>> +                 * number of pages, it must be exclusive.
>>>>                   */
>>>> -                if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> +                if (!folio_test_lru(folio) ||
>>>> +                    folio_mapcount(folio) != folio_nr_pages(folio))
>>>>                          continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>                  if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>                          continue;
>>>>
>>>> -                VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> -                if (!pageout && pte_young(ptent)) {
>>>> -                        ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> -                                                        tlb->fullmm);
>>>> -                        ptent = pte_mkold(ptent);
>>>> -                        set_pte_at(mm, addr, pte, ptent);
>>>> -                        tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +                if (!pageout) {
>>>> +                        for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> +                                if (ptep_test_and_clear_young(vma, addr, pte))
>>>> +                                        tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +                        }
>>>
>>> This looks so smart. If it is not pageout, we have increased pte
>>> and addr here, so nr is 0 and we don't need to increase them again in
>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> otherwise, nr won't be 0, so we will increase addr and
>>> pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
>>
>>
>>>
>>>
>>>>                  }
>>>>
>>>>                  /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song
>>
>> Thanks!
>>
>>