Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp2561593rwd; Fri, 2 Jun 2023 11:05:36 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ5WnRDUGDhfsExJ+4Mc72amP+BAWzPvcmP4edqYY6kA5dFan1CjoMnGEiMPXWO4UXUrrISs X-Received: by 2002:a05:6a20:3d0c:b0:10b:4539:fa0a with SMTP id y12-20020a056a203d0c00b0010b4539fa0amr17576345pzi.1.1685729135944; Fri, 02 Jun 2023 11:05:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1685729135; cv=none; d=google.com; s=arc-20160816; b=0WscFEvEc5Z5ZNKDVwRbhABGz4O4nd/SGt2hkq+4y8sBy0X3Wa2xoJN4hGm3vjObVa Rx5ujW6hreyYTAhddaqM7wiTKpDJKxWdJoIQfiXs8osHaOjNQdr2HvljA0b81lLmEB+T TrUGtollOZC6XzGq45elFJoo0+zicSUavAoCbvV++/Qe/447Du2TgMukLLSWVqUNy2NO L9gpFEiylrDlt2K9DDB5xL9g/9T+Xe+7uy8PlA+ltaT6AwpuX6wn1FSY7+NXj1FHdDfN FyrZklbODTcSNLIeabc8u7pn3PjenVjo256A7nw8CRi6uRTBl4VeTemX182vfNtUqnO3 8+qw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=cSARUz2O3TmQ5YM6iQ0ZZCNg+rFHS4+v0lFyDbqBuwM=; b=VK2lQV0ZtpfRfFotTgPtXSxszDXHdnO4x0SAlrRFCVThJxIwsMtqd8oV/jm+vrGAVv 4BRXzJ1AcNlRsfs/puDpSCkA56fSWawcDgFr+2dVqob9uWYUzOqO9YEBP0SRNMWf36ts rPFXlN0ptJLFVEqphoS7Sq9CHPPSq/vfrFAhyAYd4lZGfDLaIxXWakHXlAGErcycuWFr tFKl8feYsAMff0RJnDjxgnSVfAgrQtrs/4tdHnGWngEkBBqAxw7lizv8zyEkogt2qiB1 PMZMcPDYelV/fqv/ME/jN1f8Oudaprqam1nbvpgiR3GWT+yjzi3VDt7rtgk9/JFebMeN ctNw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="G6QD/WhP"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id w2-20020aa79a02000000b006530136dafesi1131322pfj.56.2023.06.02.11.05.21; Fri, 02 Jun 2023 11:05:35 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="G6QD/WhP"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236210AbjFBR7y (ORCPT + 99 others); Fri, 2 Jun 2023 13:59:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58666 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236877AbjFBR7w (ORCPT ); Fri, 2 Jun 2023 13:59:52 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AF85B9F for ; Fri, 2 Jun 2023 10:59:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1685728743; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=cSARUz2O3TmQ5YM6iQ0ZZCNg+rFHS4+v0lFyDbqBuwM=; b=G6QD/WhPiQkN4mv+BomHgICN35/Vv4iynsi9xaK6rE4ezcpDStJh2898wv20hzQli1Tk1a 1mugQvLmc59iMaRZq7ztSX5ZUOCks9bZx+Qa/cVJc1bw+n6PC2TXxe8a3WFIdQYFlUzdDH LQG/Q2i3DsgKzEiBuYVpe6/TDU/hy3g= Received: from mail-qk1-f199.google.com (mail-qk1-f199.google.com [209.85.222.199]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-627-RubznLp0P8ayhW0usP6GbQ-1; Fri, 02 Jun 2023 13:59:02 -0400 X-MC-Unique: RubznLp0P8ayhW0usP6GbQ-1 Received: by mail-qk1-f199.google.com with SMTP id af79cd13be357-75b175cf0d1so29647985a.0 for ; Fri, 02 Jun 2023 10:59:02 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1685728742; x=1688320742; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=cSARUz2O3TmQ5YM6iQ0ZZCNg+rFHS4+v0lFyDbqBuwM=; b=bfW3Z/jlU4/FuoEt5vB3ZbKJC5a0ISriLvda1TSRw397aL1y50Ea5vccJ1ZLldW1TO zNAtTUmtGP4/ik6cKtVsRDL3NUX8wFV2ermioFdXXVheUY8V9CJ2KAS3dJ312U9djn+Q TPCifVd+5h2VUCxv0BUFgrpirBYIVfh3J60L4wx8KiiBOFg4MQFAo7nw4cOUH4sWzPSB CpDZgoDSpvQnBSHnVqzkJXvvrv3nby1q1dUUhU5cNZlWrX20IrUcB2vD/Wk1E063z8cy tWmHvRgm1ythMzaU/L+SrdX40PTkg9aUK0Q0MjDWvgYRFq5c1PMF9/cOoVDbveIIv0sr xFAw== X-Gm-Message-State: AC+VfDyzzoiuBceoWHAwBSn0Pdxm5zL7IoviAuor0Z9N0ZGuwKjE8Boq 7deUqAMaEsMj3dHIrHYMwt1CaCx9QgJqcBADZYuLnaQE/RJ7tPoi1snP0BrKfH/eqgq+ztrpqbB u/QA2eL1d/nRa6RrW80cloJ0I X-Received: by 2002:a05:620a:43a4:b0:75b:23a1:69f0 with SMTP id a36-20020a05620a43a400b0075b23a169f0mr11249970qkp.7.1685728742220; Fri, 02 Jun 2023 10:59:02 -0700 (PDT) X-Received: by 2002:a05:620a:43a4:b0:75b:23a1:69f0 with SMTP id a36-20020a05620a43a400b0075b23a169f0mr11249931qkp.7.1685728741857; Fri, 02 Jun 2023 10:59:01 -0700 (PDT) Received: from x1n (cpe5c7695f3aee0-cm5c7695f3aede.cpe.net.cable.rogers.com. [99.254.144.39]) by smtp.gmail.com with ESMTPSA id b10-20020a05620a118a00b0074ca7c33b79sm910734qkk.23.2023.06.02.10.58.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 02 Jun 2023 10:59:01 -0700 (PDT) Date: Fri, 2 Jun 2023 13:58:58 -0400 From: Peter Xu To: Muhammad Usama Anjum Cc: David Hildenbrand , Andrew Morton , =?utf-8?B?TWljaGHFgiBNaXJvc8WCYXc=?= , Andrei Vagin , Danylo Mocherniuk , Paul Gofman , Cyrill Gorcunov , Mike Rapoport , Nadav Amit , Alexander Viro , Shuah Khan , Christian Brauner , Yang Shi , Vlastimil Babka , "Liam R . Howlett" , Yun Zhou , Suren Baghdasaryan , Alex Sierra , Matthew Wilcox , Pasha Tatashin , Axel Rasmussen , "Gustavo A . R . Silva" , Dan Williams , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH , kernel@collabora.com Subject: Re: [PATCH v16 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Message-ID: References: <20230525085517.281529-1-usama.anjum@collabora.com> <20230525085517.281529-3-usama.anjum@collabora.com> <3589803d-5594-71de-d078-ad4499f233b6@collabora.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Spam-Status: No, score=-2.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jun 02, 2023 at 10:42:45PM +0500, Muhammad Usama Anjum wrote: > Thank you for reviewing and helping out. > > On 6/2/23 9:47 PM, Peter Xu wrote: > [...] > >>>> static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma, > >>>> unsigned long addr, pte_t *ptep, > >>>> pte_t ptent) > >>>> { > >>>> if (is_hugetlb_entry_hwpoisoned(ptent) || is_pte_marker(ptent)) > >>>> return; > >>>> > >>>> if (is_hugetlb_entry_migration(ptent)) > >>>> set_huge_pte_at(vma->vm_mm, addr, ptep, > >>>> pte_swp_mkuffd_wp(ptent)); > >>>> else if (!huge_pte_none(ptent)) > >>>> ptep_modify_prot_commit(vma, addr, ptep, ptent, > >>>> huge_pte_mkuffd_wp(ptent)); > >>>> else > >>>> set_huge_pte_at(vma->vm_mm, addr, ptep, > >>>> make_pte_marker(PTE_MARKER_UFFD_WP)); > >>>> } > >>> > >>> the is_pte_marker() check can be extended to double check > >>> pte_marker_uffd_wp() bit, but shouldn't matter a lot since besides the > >>> uffd-wp bit currently we only support swapin error which should sigbus when > >>> accessed, so no point in tracking anyway. > >> Yeah, we are good with what we have as even if more bits are supported in > >> pte markers, this function is only reached when UNPOPULATED + ASYNC WP are > >> enabled. So no other bit would be set on the marker. > > > > I think we don't know? swapin error bit can be set there afaiu, if someone > > swapoff -a and found a swap device broken for some swapped out ptes. > Oops, so I remove returning on pte marker detection. Instead the last else > is already setting marker wp-ed. What I meant is your current code is fine, please don't remove the check. Removing is_pte_marker() means you could potentially overwrite one swap error entry with a marker only having uffd-wp bit set. It can corrupt the guest because swapin error is not recoverable and higher priority than wp. > > > > > But again I think that's fine for now. > > > >> > >>> > >>>> > >>>> As we always set UNPOPULATED, so markers are always set on none ptes > >>>> initially. Is it possible that a none pte becomes present, then swapped and > >>>> finally none again? So I'll do the following addition for make_uffd_wp_pte(): > >>>> > >>>> --- a/fs/proc/task_mmu.c > >>>> +++ b/fs/proc/task_mmu.c > >>>> @@ -1800,6 +1800,9 @@ static inline void make_uffd_wp_pte(struct > >>>> vm_area_struct *vma, > >>>> } else if (is_swap_pte(ptent)) { > >>>> ptent = pte_swp_mkuffd_wp(ptent); > >>>> set_pte_at(vma->vm_mm, addr, pte, ptent); > >>>> + } else { > >>>> + set_pte_at(vma->vm_mm, addr, pte, > >>>> + make_pte_marker(PTE_MARKER_UFFD_WP)); > >>>> } > >>>> } > >>> > >>> Makes sense, you can leverage userfaultfd_wp_use_markers() here, and you > >>> should probably keep the protocol (only set the marker when WP_UNPOPULATED > >>> for anon). > >> This function is only reachable when UNPOPULATED + Async WP are set. So we > >> don't need to use userfaultfd_wp_use_markers(). > > > > I don't remember where you explicitly checked that to make sure it'll > > always be the case, but if you do it'll still be nice if you can add a > > comment right above here explaining. > I was only testing for WP and WP Async in pagemap_scan_test_walk() as WP > async can only be enabled if unpopulated is enabled which in turn enables > the markers. I'll add userfaultfd_wp_use_markers() to pagemap_scan_test_walk(). OK. > > > > > Or, maybe you can still use userfaultfd_wp_use_markers() here, then you can > > just WARN_ON_ONCE() on it, which looks even better? > > > > [...] > > > >>> Hmm, is it a bug for pagemap? pagemapread.buffer should be linear to the > >>> address range to be scanned to me. If it skips some unstable pmd without > >>> filling in anything it seems everything later will be shifted with > >>> PMD_SIZE.. I had a feeling that it should set walk->action==ACTION_AGAIN > >>> before return. > >> I don't think this is a bug if this is how it was implemented in the first > >> place. In this task_mmu.c file, we can find several examples of the same > >> pattern that error isn't returned if pmd_trans_unstable() succeeds. > > > > I don't see why multiple same usages mean it's correct.. maybe they're all > > buggy? > > > > I can post a patch for this to collect opinions to see if I missed > > something. I'd suggest you figure out what's the right thing to do for the > > new interface and make it right from the start, no matter how it was > > implemented elsewhere. > Alright. At first sight, it seems I should return -EAGAIN here. But there > maybe a case where there are 3 THPs. I've cleared WP over the first THP and > put data in the user buffer. If I return -EAGAIN, the data in the user > buffer would be not used by the user correctly as negative value has been > returned. So the best way here would be to skip this second VMA and don't > abort the operation. Thus we'll not be giving any information about the > second THP as it was in transition. > > I'll think about it again before sending next version. Here when reaching unstable==true I think it means a rare race happened, that a thp was _just_ installed. You can try to read the comment over pmd_trans_unstable() and pmd_none_or_trans_huge_or_clear_bad() to understand what it wants to avoid. Basically afaiu that's the ultimate guard if one wants to walk the pte pgtable (that fact can change again if Hugh's series will land, but that's another story). As said, a patch will be posted (will have you copied) soon just for that, so people can also comment and may help you / us know which is the right way, probably not in a few hours (need to go very soon..), but shouldn't be too long. I just need to double check more users out of task_mmu.c when we're at it. -- Peter Xu