Message-ID: <6c418d70-a75d-4019-a0f5-56a61002d37a@arm.com>
Date: Wed, 24 Apr 2024 09:07:02 +0100
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH 1/5] mm: memory: extend finish_fault() to support large folio
Content-Language: en-GB
To: Baolin Wang, akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, david@redhat.com, wangkefeng.wang@huawei.com, 21cnbao@gmail.com, ying.huang@intel.com, shy828301@gmail.com, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References:
 <358aefb1858b63164894d7d8504f3dae0b495366.1713755580.git.baolin.wang@linux.alibaba.com>
 <6aa25e2a-a6b6-4ab7-8300-053ca3c0d748@arm.com>
From: Ryan Roberts
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 24/04/2024 04:23, Baolin Wang wrote:
> 
> 
> On 2024/4/23 19:03, Ryan Roberts wrote:
>> On 22/04/2024 08:02, Baolin Wang wrote:
>>> Add large folio mapping establishment support for finish_fault() as a
>>> preparation, to support multi-size THP allocation of anonymous shared
>>> pages in the following patches.
>>>
>>> Signed-off-by: Baolin Wang
>>> ---
>>>   mm/memory.c | 25 ++++++++++++++++++-------
>>>   1 file changed, 18 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b6fa5146b260..094a76730776 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4766,7 +4766,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>   {
>>>       struct vm_area_struct *vma = vmf->vma;
>>>       struct page *page;
>>> +    struct folio *folio;
>>>       vm_fault_t ret;
>>> +    int nr_pages, i;
>>> +    unsigned long addr;
>>>  
>>>       /* Did we COW the page? */
>>>       if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
>>> @@ -4797,22 +4800,30 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>               return VM_FAULT_OOM;
>>>       }
>>>  
>>> +    folio = page_folio(page);
>>> +    nr_pages = folio_nr_pages(folio);
>>> +    addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>
>> I'm not sure this is safe. IIUC, finish_fault() is called for any file-backed
>> mapping. So you could have a situation where part of a (regular) file is mapped
>> in the process, faults and hits in the pagecache. But the folio returned by the
>> pagecache is bigger than the portion that the process has mapped. So you now end
>> up mapping beyond the VMA limits? In the pagecache case, you also can't assume
>> that the folio is naturally aligned in virtual address space.
> 
> Good point.
> Yes, I think you are right, I need to consider the VMA limits, and I
> should refer to the calculations of the start pte and end pte in
> do_fault_around().

You might also need to be careful not to increase reported RSS. I have a vague
recollection that David once mentioned a problem with fault-around because it
causes the reported RSS to increase for the process, and this could lead to
different decisions in other places. IIRC Redhat had an advisory somewhere with
the suggested workaround being to disable fault-around. For the anon-shared
memory case, it shouldn't be a problem because the user has opted into
allocating bigger blocks, but there may be a need to ensure we don't also start
eagerly mapping regular files beyond what fault-around is configured for.

> 
>>>       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>> -                      vmf->address, &vmf->ptl);
>>> +                       addr, &vmf->ptl);
>>>       if (!vmf->pte)
>>>           return VM_FAULT_NOPAGE;
>>>  
>>>       /* Re-check under ptl */
>>> -    if (likely(!vmf_pte_changed(vmf))) {
>>> -        struct folio *folio = page_folio(page);
>>> -
>>> -        set_pte_range(vmf, folio, page, 1, vmf->address);
>>> -        ret = 0;
>>> -    } else {
>>> +    if (nr_pages == 1 && vmf_pte_changed(vmf)) {
>>>           update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>           ret = VM_FAULT_NOPAGE;
>>> +        goto unlock;
>>> +    } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>
>> I think you have grabbed this from do_anonymous_page()? But I'm not sure it
>> works in the same way here as it does there. For the anon case, if userfaultfd
>> is armed, alloc_anon_folio() will only ever allocate order-0. So we end up in
> 
> IMO, the userfaultfd validation should be done in the vma->vm_ops->fault()
> callback, to make sure nr_pages is always 1 if userfaultfd is armed.

OK. Are you saying there is already logic to do that today? Great!
> 
>> the vmf_pte_changed() path, which will allow overwriting a uffd entry. But here,
>> there is nothing stopping nr_pages being greater than 1 when there could be a
>> uffd entry present, and you will fail due to the pte_range_none() check. (see
>> pte_marker_handle_uffd_wp()).
> 
> So if we do the userfaultfd validation in the ->fault() callback, then here we
> can use the same logic as in the anonymous case.