Date: Thu, 1 Jun 2023 16:11:58 -0400
From: Peter Xu
To: Muhammad Usama Anjum
Cc: David Hildenbrand, Andrew Morton, Michał Mirosław, Andrei Vagin,
	Danylo Mocherniuk, Paul Gofman, Cyrill Gorcunov, Mike Rapoport,
	Nadav Amit, Alexander Viro, Shuah Khan, Christian Brauner, Yang Shi,
	Vlastimil Babka, Liam R. Howlett, Yun Zhou, Suren Baghdasaryan,
	Alex Sierra, Matthew Wilcox, Pasha Tatashin, Axel Rasmussen,
	Gustavo A. R. Silva, Dan Williams, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kselftest@vger.kernel.org, Greg KH, kernel@collabora.com
Subject: Re: [PATCH v16 2/5] fs/proc/task_mmu: Implement IOCTL to get and
	optionally clear info about PTEs
References: <20230525085517.281529-1-usama.anjum@collabora.com>
	<20230525085517.281529-3-usama.anjum@collabora.com>

On Thu, Jun 01, 2023 at 01:16:14PM +0500, Muhammad Usama Anjum wrote:
> On 6/1/23 2:46 AM, Peter Xu wrote:
> > Muhammad,
> >
> > Sorry, I probably can only review the non-interface part, and leave the
> > interface/buffer handling, etc. review for others and real potential
> > users of it..
> Thank you so much for the review.
> I think mostly we should be okay with the
> interface, as everybody has been making suggestions over the past revisions.
> >
> > On Thu, May 25, 2023 at 01:55:14PM +0500, Muhammad Usama Anjum wrote:
> >> +static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
> >> +					 unsigned long addr, pte_t *ptep,
> >> +					 pte_t ptent)
> >> +{
> >> +	pte_t old_pte;
> >> +
> >> +	if (!huge_pte_none(ptent)) {
> >> +		old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
> >> +		ptent = huge_pte_mkuffd_wp(old_pte);
> >> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, ptent);
> >
> > huge_ptep_modify_prot_start()?
> Sorry, I didn't realize that huge_ptep_modify_prot_start() is different
> from its pte version. Here I meant huge_ptep_modify_prot_commit()..
> >
> > The other thing is what if it's a pte marker already?  What if a hugetlb
> > migration entry?  Please check hugetlb_change_protection().
> I've updated it in a better way. Please let me know what you think
> about the following:
>
> static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
> 					 unsigned long addr, pte_t *ptep,
> 					 pte_t ptent)
> {
> 	if (is_hugetlb_entry_hwpoisoned(ptent) || is_pte_marker(ptent))
> 		return;
>
> 	if (is_hugetlb_entry_migration(ptent))
> 		set_huge_pte_at(vma->vm_mm, addr, ptep,
> 				pte_swp_mkuffd_wp(ptent));
> 	else if (!huge_pte_none(ptent))
> 		ptep_modify_prot_commit(vma, addr, ptep, ptent,
> 					huge_pte_mkuffd_wp(ptent));
> 	else
> 		set_huge_pte_at(vma->vm_mm, addr, ptep,
> 				make_pte_marker(PTE_MARKER_UFFD_WP));
> }

The is_pte_marker() check could be extended to double-check the
pte_marker_uffd_wp() bit, but that shouldn't matter much: besides the
uffd-wp bit, the only marker we currently support is the swapin-error
one, which should sigbus when accessed, so there's no point in tracking
it anyway.

> As we always set UNPOPULATED, markers are always set on none ptes
> initially. Is it possible that a none pte becomes present, then swapped,
> and finally none again?
> So I'll do the following addition for make_uffd_wp_pte():
>
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1800,6 +1800,9 @@ static inline void make_uffd_wp_pte(struct vm_area_struct *vma,
> 	} else if (is_swap_pte(ptent)) {
> 		ptent = pte_swp_mkuffd_wp(ptent);
> 		set_pte_at(vma->vm_mm, addr, pte, ptent);
> +	} else {
> +		set_pte_at(vma->vm_mm, addr, pte,
> +			   make_pte_marker(PTE_MARKER_UFFD_WP));
> +	}
> }

Makes sense. You can leverage userfaultfd_wp_use_markers() here, and you
should probably keep the protocol (only set the marker when WP_UNPOPULATED
is enabled for anon).

> >
> >> +	} else {
> >> +		set_huge_pte_at(vma->vm_mm, addr, ptep,
> >> +				make_pte_marker(PTE_MARKER_UFFD_WP));
> >> +	}
> >> +}
> >> +#endif
> >
> > [...]
> >
> >> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >> +				  unsigned long end, struct mm_walk *walk)
> >> +{
> >> +	struct pagemap_scan_private *p = walk->private;
> >> +	struct vm_area_struct *vma = walk->vma;
> >> +	unsigned long addr = end;
> >> +	pte_t *pte, *orig_pte;
> >> +	spinlock_t *ptl;
> >> +	bool is_written;
> >> +	int ret = 0;
> >> +
> >> +	arch_enter_lazy_mmu_mode();
> >> +
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +	ptl = pmd_trans_huge_lock(pmd, vma);
> >> +	if (ptl) {
> >> +		unsigned long n_pages = (end - start)/PAGE_SIZE;
> >> +
> >> +		if (p->max_pages && n_pages > p->max_pages - p->found_pages)
> >> +			n_pages = p->max_pages - p->found_pages;
> >> +
> >> +		is_written = !is_pmd_uffd_wp(*pmd);
> >> +
> >> +		/*
> >> +		 * Break the huge page into small pages if the WP operation
> >> +		 * needs to be performed on only a portion of the huge page.
> >> +		 */
> >> +		if (is_written && IS_PM_SCAN_WP(p->flags) &&
> >> +		    n_pages < HPAGE_SIZE/PAGE_SIZE) {
> >> +			spin_unlock(ptl);
> >> +
> >> +			split_huge_pmd(vma, pmd, start);
> >> +			goto process_smaller_pages;
> >> +		}
> >> +
> >> +		if (IS_PM_SCAN_GET(p->flags))
> >> +			ret = pagemap_scan_output(is_written, vma->vm_file,
> >> +						  pmd_present(*pmd),
> >> +						  is_swap_pmd(*pmd),
> >> +						  p, start, n_pages);
> >> +
> >> +		if (ret >= 0 && is_written && IS_PM_SCAN_WP(p->flags))
> >> +			make_uffd_wp_pmd(vma, addr, pmd);
> >> +
> >> +		if (IS_PM_SCAN_WP(p->flags))
> >> +			flush_tlb_range(vma, start, end);
> >> +
> >> +		spin_unlock(ptl);
> >> +
> >> +		arch_leave_lazy_mmu_mode();
> >> +		return ret;
> >> +	}
> >> +
> >> +process_smaller_pages:
> >> +	if (pmd_trans_unstable(pmd)) {
> >> +		arch_leave_lazy_mmu_mode();
> >> +		return 0;
> >
> > I'm not sure whether this is right.. Shouldn't you return with -EAGAIN
> > and let the user retry?  Returning 0 means you'll move on with the next
> > pmd afaict, ignoring this one.
> This has come up before. We are just replicating pagemap_pmd_range() here,
> as we are doing almost the same thing through the IOCTL. It doesn't return
> an error in this case and just skips the pmd, so we are doing the same.

Hmm, is that a bug in pagemap then?  As I read it, pagemapread.buffer
should stay linear with the address range being scanned; if the walk skips
an unstable pmd without filling anything in, everything after it gets
shifted by PMD_SIZE..  I have a feeling it should set
walk->action = ACTION_AGAIN before returning.

-- 
Peter Xu