Subject: Re: [PATCH] vdpa/mlx5: Add support for doorbell bypassing
From: Jason Wang
To: "Zhu, Lingshan", Eli Cohen
Cc: mst@redhat.com, virtualization@lists.linux-foundation.org, linux-kernel@vger.kernel.org
Date: Fri, 30 Apr 2021 21:49:54 +0800
Message-ID: <2b5dc35a-13ee-863f-16cb-cc6a96d7f738@redhat.com>
In-Reply-To: <35c30715-f24b-704c-af3c-2b0259c2fd43@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2021/4/30 5:25 PM, Zhu, Lingshan wrote:
>
>
> On 4/30/2021 3:03 PM, Jason Wang wrote:
>>
>> On 2021/4/30 2:31 PM, Zhu, Lingshan wrote:
>>>
>>>
>>> On 4/30/2021 12:40 PM, Jason Wang wrote:
>>>>
>>>> On 2021/4/29 6:00 PM, Eli Cohen wrote:
>>>>> On Thu, Apr 22, 2021 at 04:59:11PM +0800, Jason Wang wrote:
>>>>>> On 2021/4/22 4:39 PM, Eli Cohen wrote:
>>>>>>> On Thu, Apr 22, 2021 at 04:21:45PM +0800, Jason Wang wrote:
>>>>>>>> On 2021/4/22 4:07 PM, Eli Cohen wrote:
>>>>>>>>> On Thu, Apr 22, 2021 at 09:03:58AM +0300, Eli Cohen wrote:
>>>>>>>>>> On Thu, Apr 22, 2021 at 10:37:38AM
+0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/4/21 6:41 PM, Eli Cohen wrote:
>>>>>>>>>>>> Implement mlx5_get_vq_notification() to return the doorbell address.
>>>>>>>>>>>> Size is set to one system page as required.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Eli Cohen
>>>>>>>>>>>> ---
>>>>>>>>>>>>  drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 +
>>>>>>>>>>>>  drivers/vdpa/mlx5/core/resources.c | 1 +
>>>>>>>>>>>>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 6 ++++++
>>>>>>>>>>>>  3 files changed, 8 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
>>>>>>>>>>>> index b6cc53ba980c..49de62cda598 100644
>>>>>>>>>>>> --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
>>>>>>>>>>>> +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
>>>>>>>>>>>> @@ -41,6 +41,7 @@ struct mlx5_vdpa_resources {
>>>>>>>>>>>>          u32 pdn;
>>>>>>>>>>>>          struct mlx5_uars_page *uar;
>>>>>>>>>>>>          void __iomem *kick_addr;
>>>>>>>>>>>> +        u64 phys_kick_addr;
>>>>>>>>>>>>          u16 uid;
>>>>>>>>>>>>          u32 null_mkey;
>>>>>>>>>>>>          bool valid;
>>>>>>>>>>>> diff --git a/drivers/vdpa/mlx5/core/resources.c b/drivers/vdpa/mlx5/core/resources.c
>>>>>>>>>>>> index 6521cbd0f5c2..665f8fc1710f 100644
>>>>>>>>>>>> --- a/drivers/vdpa/mlx5/core/resources.c
>>>>>>>>>>>> +++ b/drivers/vdpa/mlx5/core/resources.c
>>>>>>>>>>>> @@ -247,6 +247,7 @@ int mlx5_vdpa_alloc_resources(struct mlx5_vdpa_dev *mvdev)
>>>>>>>>>>>>                  goto err_key;
>>>>>>>>>>>>          kick_addr = mdev->bar_addr + offset;
>>>>>>>>>>>> +        res->phys_kick_addr = kick_addr;
>>>>>>>>>>>>          res->kick_addr = ioremap(kick_addr, PAGE_SIZE);
>>>>>>>>>>>>          if (!res->kick_addr) {
>>>>>>>>>>>> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>>>>>>>>>>> index 10c5fef3c020..680751074d2a 100644
>>>>>>>>>>>> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>>>>>>>>>>> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
>>>>>>>>>>>> @@ -1865,8 +1865,14 @@ static void mlx5_vdpa_free(struct vdpa_device *vdev)
>>>>>>>>>>>>  static struct vdpa_notification_area mlx5_get_vq_notification(struct vdpa_device *vdev, u16 idx)
>>>>>>>>>>>>  {
>>>>>>>>>>>> +        struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
>>>>>>>>>>>>          struct vdpa_notification_area ret = {};
>>>>>>>>>>>> +        struct mlx5_vdpa_net *ndev;
>>>>>>>>>>>> +
>>>>>>>>>>>> +        ndev = to_mlx5_vdpa_ndev(mvdev);
>>>>>>>>>>>> +        ret.addr = (phys_addr_t)ndev->mvdev.res.phys_kick_addr;
>>>>>>>>>>>> +        ret.size = PAGE_SIZE;
>>>>>>>>>>> Note that the page will be mapped into the guest, so it's only
>>>>>>>>>>> safe if the doorbell exclusively owns the page. This means that if
>>>>>>>>>>> there are other registers in the page, we cannot let the doorbell
>>>>>>>>>>> bypass work.
>>>>>>>>>>>
>>>>>>>>>>> So this is suspicious at least in the case of subfunctions, where
>>>>>>>>>>> we calculate the BAR length in mlx5_sf_dev_table_create() as:
>>>>>>>>>>>
>>>>>>>>>>> table->sf_bar_length = 1 << (MLX5_CAP_GEN(dev, log_min_sf_size) + 12);
>>>>>>>>>>>
>>>>>>>>>>> It looks to me like this can only work for architectures with
>>>>>>>>>>> PAGE_SIZE = 4096; otherwise we can map more into userspace (the guest).
>>>>>>>>>>>
>>>>>>>>>> Correct, so I guess I should return here 4096.
>>>>>>>> I'm not quite sure, but since the calculation of sf_bar_length is
>>>>>>>> done via a shift of 12, it might be correct.
>>>>>>>>
>>>>>>>> And please double check whether the doorbell owns the page exclusively.
>>>>>>> I am checking if it is safe to map any part of the SF's BAR to
>>>>>>> userspace without harming other functions. If this is true, I will
>>>>>>> check if I can return PAGE_SIZE without compromising security.
>>>>>>
>>>>>> It's usually not safe, and a layer violation, if other registers are
>>>>>> placed in the same page.
>>>>>>
>>>>>>
>>>>>>> I think we may
>>>>>>> need to extend struct vdpa_notification_area to contain another
>>>>>>> field, offset, which indicates the offset from addr where the
>>>>>>> actual doorbell resides.
>>>>>>
>>>>>> The motivation of the current design is to fit seamlessly into how
>>>>>> Qemu currently models doorbell layouts:
>>>>>>
>>>>>> 1) page-per-vq: each vq has its own page-aligned doorbell
>>>>>> 2) 2-byte doorbell: each vq has its own 2-byte-aligned doorbell
>>>>>>
>>>>>> Only 1) is supported in vhost-vDPA (and vhost-user) since it's
>>>>>> rather simple and secure (page aligned) to model and implement via
>>>>>> mmap().
>>>>>>
>>>>>> Exporting a complex layout is possible but requires careful design.
>>>>>>
>>>>>> Actually, we had another option:
>>>>>>
>>>>>> 3) shared doorbell: all virtqueues share a single page-aligned
>>>>>> doorbell
>>>>> I am not sure how this could solve the problem of 64KB archs.
>>>>> The point is that in ConnectX devices, the virtio queue objects
>>>>> doorbell is aligned to 4K. For larger system page sizes, the doorbell
>>>>> may not be aligned to a system page.
>>>>> So it seems not too complex to introduce an offset within the page.
>>>>
>>>>
>>>> Three major issues:
>>>>
>>>> 1) a single mmap() works at page level, which means we need to map 64K
>>>> to the guest, and we can only do this safely if no other registers
>>>> are placed in the same page
>>>> 2) a new uAPI is needed to let userspace know the offset
>>>> 3) how to model them with virtio-pci in Qemu; this may introduce
>>>> burdens for management (some changes to the qemu command line are
>>>> needed) to deal with migration compatibility
>>>>
>>>> So, considering the complexity, we can just stick to the current code.
>>>> That means mmap() will fail and qemu will keep using the eventfd-based
>>>> kick.
>>> There is another case: mmap() works at page level, and the page size
>>> is at least 4K. Consider a device whose BAR contains the shared
>>> doorbell page in its last 4K of space. With this BAR layout, mapping
>>> an arch.page_size=64K page to userspace would lead to fatal errors.
>>
>>
>> Why is it a fatal error? Userspace should survive mmap() errors
>> and keep using the kickfd.
> I mean vhost-vdpa should not only check the alignment, but also
> check that the doorbell size is no less than arch.page_size.


The code already does this, doesn't it?

        if (vma->vm_end - vma->vm_start != PAGE_SIZE)
                return -EINVAL;
...
        notify = ops->get_vq_notification(vdpa, index);
        if (notify.addr & (PAGE_SIZE - 1))
                return -EINVAL;
        if (vma->vm_end - vma->vm_start != notify.size)
                return -ENOTSUPP;


> If the doorbell is placed in the last 4K of the BAR, mapping a 64K page
> could be an error.


mmap() will fail in this case.

Thanks


>
> Thanks
>>
>>
>>> I think we can assign the actual size of the doorbell area to
>>> vdpa_notification.size rather than arch.page_size to avoid such issues.
>>> Then upper layers like vhost_vdpa should check whether this size can
>>> work with the machine arch and its alignment; if not, they should fall
>>> back to using eventfd.
>>
>>
>> Isn't this how get_vq_notification() is designed and implemented right
>> now? All the parent needs to do is report the doorbell size; it's up to
>> the bus driver (vhost-vDPA) to decide if and how it is used.
>>
>> Thanks
>>
>>
>>> Then do we still need a uAPI to tell the offset within the page?
>>>
>>> Thanks
>>> Zhu Lingshan
>>>>
>>>>
>>>>
>>>>>
>>>>> BTW, for now, I am going to send another patch that makes sure page
>>>>> boundaries are not violated. It requires some support from mlx5_core
>>>>> which is currently being reviewed internally.
>>>>
>>>>
>>>> Sure.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>>
>>>>>> This is not yet supported by Qemu.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>>>>> I also think that the check in vhost_vdpa_mmap() should
>>>>>>>>>> verify that the returned size is not smaller than PAGE_SIZE
>>>>>>>>>> because the returned address
>>>>>>>>> Actually I think it's ok since you verify the size equals
>>>>>>>>> vma->vm_end - vma->vm_start which must be at least PAGE_SIZE.
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>
>>>>>>>>>> might just be aligned to PAGE_SIZE. I think this should be
>>>>>>>>>> enough but maybe also use the same logic in vhost_vdpa_fault().
>>>>
>>>
>>
>
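
For anyone following the thread, here is a minimal userspace sketch of the three checks quoted from vhost_vdpa_mmap(), so that the 4K and 64K cases can be tried side by side. It is only a model under stated assumptions, not the kernel code: struct notify_area, doorbell_mmap_check() and MODEL_ENOTSUPP are hypothetical names, the page size is passed as a parameter instead of using PAGE_SIZE, and the VMA is reduced to its length.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* ENOTSUPP is kernel-internal (value 524); userspace errno.h lacks it,
 * so this model defines its own constant. */
#define MODEL_ENOTSUPP 524

/* Hypothetical stand-in for struct vdpa_notification_area. */
struct notify_area {
	uint64_t addr;	/* physical address of the doorbell */
	uint64_t size;	/* size reported by the parent driver */
};

/*
 * Model of the checks quoted in the thread. vma_len plays the role of
 * vma->vm_end - vma->vm_start; page_size plays the role of PAGE_SIZE.
 * Returns 0 if the doorbell could be mapped, a negative errno otherwise.
 */
static int doorbell_mmap_check(struct notify_area n, uint64_t vma_len,
			       uint64_t page_size)
{
	/* The mask arithmetic below only works for power-of-two sizes. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* Userspace must ask for exactly one system page. */
	if (vma_len != page_size)
		return -EINVAL;

	/* The doorbell must start on a system page boundary. */
	if (n.addr & (page_size - 1))
		return -EINVAL;

	/* The reported doorbell area must cover exactly that page. */
	if (vma_len != n.size)
		return -MODEL_ENOTSUPP;

	return 0;
}
```

With a 4K page, a 4K doorbell at a 4K-aligned address passes. With a 64K page, a 4K doorbell sitting in the last 4K of a 64K BAR (offset 0xf000) fails the alignment check, and a 64K-aligned doorbell that reports only 4K fails the size check, so in both cases userspace would be expected to fall back to the eventfd-based kick.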