Date: Mon, 8 Mar 2021 17:56:26 -0500
From: Peter Xu
To: Alex Williamson
Cc: Zeng Tao, linuxarm@huawei.com, Cornelia Huck, Kevin Tian,
	Andrew Morton, Giovanni Cabiddu, Michel Lespinasse, Jann Horn,
	Max Gurtovoy, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jason Gunthorpe
Subject: Re: [PATCH] vfio/pci: make the vfio_pci_mmap_fault reentrant
Message-ID: <20210308225626.GN397383@xz-x1>
References: <1615201890-887-1-git-send-email-prime.zeng@hisilicon.com>
	<20210308132106.49da42e2@omen.home.shazbot.org>
In-Reply-To: <20210308132106.49da42e2@omen.home.shazbot.org>

On Mon, Mar 08, 2021 at 01:21:06PM -0700, Alex Williamson wrote:
> On Mon, 8 Mar 2021 19:11:26 +0800
> Zeng Tao wrote:
> 
> > We have met the following error when testing with DPDK testpmd:
> > [ 1591.733256] kernel BUG at mm/memory.c:2177!
> > [ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> > [ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O)
> > [ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
> > [ 1591.770735] Hardware name: , BIOS HixxxxFPGA 1P B600 V121-1
> > [ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
> > [ 1591.786134] pc : remap_pfn_range+0x214/0x340
> > [ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
> > [ 1591.799117] sp : ffff80001068bbd0
> > [ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000
> > [ 1591.810404] x27: 0000001100910000 x26: 0000001300910000
> > [ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358
> > [ 1591.825144] x23: 0000001140000000 x22: 0000000000000041
> > [ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000
> > [ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000
> > [ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540
> > [ 1591.853570] x15: 0000000000000000 x14: 0000000000000000
> > [ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880
> > [ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000
> > [ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0
> > [ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000
> > [ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001
> > [ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3
> > [ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868
> > [ 1591.910234] Call trace:
> > [ 1591.914837]  remap_pfn_range+0x214/0x340
> > [ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
> > [ 1591.931200]  __do_fault+0x44/0x12c
> > [ 1591.937031]  handle_mm_fault+0xcc8/0x1230
> > [ 1591.942475]  do_page_fault+0x16c/0x484
> > [ 1591.948635]  do_translation_fault+0xbc/0xd8
> > [ 1591.954171]  do_mem_abort+0x4c/0xc0
> > [ 1591.960316]  el0_da+0x40/0x80
> > [ 1591.965585]  el0_sync_handler+0x168/0x1b0
> > [ 1591.971608]  el0_sync+0x174/0x180
> > [ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)
> > 
> > The cause is that the vfio_pci_mmap_fault() function is not reentrant:
> > if multiple threads fault on the same address at the same time, we hit
> > the above error.
> > 
> > Fix the issue by making vfio_pci_mmap_fault() reentrant. There is
> > also a second issue: when io_remap_pfn_range() fails, we need to undo
> > the __vfio_pci_add_vma(); fix that by moving __vfio_pci_add_vma()
> > down below io_remap_pfn_range().
> > 
> > Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
> > Signed-off-by: Zeng Tao
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 14 ++++++++++----
> >  1 file changed, 10 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b..6928c37 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1613,6 +1613,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> > +	unsigned long pfn;
> >  
> >  	mutex_lock(&vdev->vma_lock);
> >  	down_read(&vdev->memory_lock);
> > @@ -1623,18 +1624,23 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  		goto up_out;
> >  	}
> >  
> > -	if (__vfio_pci_add_vma(vdev, vma)) {
> > -		ret = VM_FAULT_OOM;
> > +	if (!follow_pfn(vma, vma->vm_start, &pfn)) {
> >  		mutex_unlock(&vdev->vma_lock);
> >  		goto up_out;
> >  	}
> >  
> > -	mutex_unlock(&vdev->vma_lock);
> 
> If I understand correctly, I think you're using (perhaps slightly
> abusing) the vma_lock to extend the serialization of the vma_list
> manipulation to include io_remap_pfn_range(), such that you can test
> whether the pte has already been populated using follow_pfn(). In that
> case we return VM_FAULT_NOPAGE without trying to repopulate the page,
> and therefore avoid the BUG_ON in remap_pte_range() triggered by
> trying to overwrite an existing pte - and, less importantly, a
> duplicate vma in our list. I wonder if use of follow_pfn() is still
> strongly discouraged for this use case.
> 
> I'm surprised that it's left to the fault handler to provide this
> serialization; is this because we're filling the entire vma rather
> than only the faulting page?

There's definitely some serialization in the process via the pgtable
locks, which gives me the feeling that the BUG_ON() in remap_pte_range()
is too strong: on "!pte_none(*pte)" it could return -EEXIST instead.
However, there would still be the issue of a duplicated vma in vma_list -
that seems to be a sign that it's still better to fix this in the vfio
layer.

> As we move to unmap_mapping_range()[1] we remove all of the complexity
> of managing a list of vmas to zap based on whether device memory is
> enabled, including the vma_lock. Are we going to need to replace that
> with another lock here, or is there a better approach to handling
> concurrency of this fault handler? Jason/Peter? Thanks,

I haven't looked into the new unmap_mapping_range() series yet. But for
the current code base: instead of follow_pfn(), maybe we could simply
enforce the ordering by searching the vma list before inserting into it?
If the vma is already there, the pte installation is done, or at least
in progress, so we could return VM_FAULT_RETRY and hope that it finishes
soon.

Then maybe it would also make sense to have vma_lock protect the whole
io_remap_pfn_range() too - not for the ordering, but to guarantee that
once we hold vma_lock the current vma has all of its ptes installed, so
that the next memory access is guaranteed to succeed. That seems more
efficient than looping through multiple VM_FAULT_RETRY page faults until
the remap is done.
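For the archives, here is a rough, untested sketch of how those two
ideas could combine. It reuses the bookkeeping vfio_pci.c already has
(struct vfio_pci_mmap_vma linked via vma_next on vdev->vma_list, plus
the __vfio_pci_add_vma() and __vfio_pci_memory_enabled() helpers); with
vma_lock held across the remap, finding the vma already on the list
means its ptes are fully installed, so plain VM_FAULT_NOPAGE suffices
and the VM_FAULT_RETRY looping goes away:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_device *vdev = vma->vm_private_data;
	struct vfio_pci_mmap_vma *mmap_vma;
	vm_fault_t ret = VM_FAULT_NOPAGE;

	mutex_lock(&vdev->vma_lock);
	down_read(&vdev->memory_lock);

	if (!__vfio_pci_memory_enabled(vdev)) {
		ret = VM_FAULT_SIGBUS;
		goto out;
	}

	/*
	 * A concurrent fault may already have populated this vma.
	 * Since vma_lock is held across io_remap_pfn_range() below,
	 * membership in vma_list implies the remap has completed, so
	 * we can return VM_FAULT_NOPAGE right away.
	 */
	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
		if (mmap_vma->vma == vma)
			goto out;
	}

	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot)) {
		/* May leave the range partially populated; not handled here */
		ret = VM_FAULT_SIGBUS;
		goto out;
	}

	/* Track the vma only once its ptes are in place */
	if (__vfio_pci_add_vma(vdev, vma))
		ret = VM_FAULT_OOM;

out:
	up_read(&vdev->memory_lock);
	mutex_unlock(&vdev->vma_lock);
	return ret;
}

The trade-off is that concurrent first-faults on the same device
serialize on vma_lock for the duration of the whole remap, and the
failure path of io_remap_pfn_range() would still need a zap to clean up
partial populations - deliberately left out of the sketch above.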
Thanks,

-- 
Peter Xu