Received: by 2002:ac0:a594:0:0:0:0:0 with SMTP id m20-v6csp3715887imm; Thu, 17 May 2018 13:28:15 -0700 (PDT) X-Google-Smtp-Source: AB8JxZqDwBJ9l8DvvGVKQikmmUXk7l2cpX/LxX5MDlTE+HNCnbfCwK0SqfO43nrsIsWn+GB5YtBo X-Received: by 2002:a63:8049:: with SMTP id j70-v6mr5261878pgd.12.1526588895683; Thu, 17 May 2018 13:28:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1526588895; cv=none; d=google.com; s=arc-20160816; b=j6OXDsym8bUYdSOWDbX98Sqba+hz5qnFmydLA02iQf9NvdUEfVQStYguhFBUaX3tDC RwcGiACS2nF88+i0M9HhvrlSVjLZqKNQHrUwZ82AfK7Wk82QuB1/icMlmEFJ157ut4vS LmeSz9SjNt15SKL7W8OxTmLr/XFielZJMceJC/Cl4xPGJENd+XN18zA0ZDI9ymgsUCuR BDSEIHbvz9lFJuSPLhKsZWBwjskG47bJw0jlC3Zuj4jy8oTcUEvMSsI0gFO+mDVrSyYI bdJU0JWjFsPLsuksel1Piz+u/NHShJ3CdkcBDBWlYW5vCtPVlSDfyvspT3ZvRniXElf2 S0Yw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:date:message-id:from :references:cc:to:subject:arc-authentication-results; bh=JQKyqJOxouuEKhCIMlMZyC6UrHmjiI5v65Y5d/N7630=; b=pSJ8JXCnjbWIfxGJOlThE5/dOVx/ZXpIk+VQryqohYCk4yYrVJV1gmmr8IED2cGGmB 0lW40hz8/PBG3bVTDpPNbIZ/CJw/YSfhn3YiLtZ5lIrjzo+PWmocP4XE+0gWUdrq/i5l 0AovX9pGciyO7yE7O/bfy+w/C3ypdWcnMdD68A2v/g/Q50dCu4rHCpoegolyTW/m/iO9 F3Q4W2mSPXCr3tTvejYVsxcoP2nyj3ZYKABkotDaRTM7qPaDNoNCILdoIRy2aIBh4RSK afVdCK9gIo3iFNTN2x2dxeAcETD5O6pMnIMxTLHlrM2JzPf2E2qvCHrGbaUo890CHk5W DDEQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=nvidia.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l2-v6si4601932pgc.438.2018.05.17.13.28.01; Thu, 17 May 2018 13:28:15 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=nvidia.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752219AbeEQU1L (ORCPT + 99 others); Thu, 17 May 2018 16:27:11 -0400 Received: from hqemgate15.nvidia.com ([216.228.121.64]:17462 "EHLO hqemgate15.nvidia.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752075AbeEQU1J (ORCPT ); Thu, 17 May 2018 16:27:09 -0400 Received: from hqpgpgate101.nvidia.com (Not Verified[216.228.121.13]) by hqemgate15.nvidia.com id ; Thu, 17 May 2018 13:27:18 -0700 Received: from HQMAIL104.nvidia.com ([172.20.161.6]) by hqpgpgate101.nvidia.com (PGP Universal service); Thu, 17 May 2018 13:27:08 -0700 X-PGP-Universal: processed; by hqpgpgate101.nvidia.com on Thu, 17 May 2018 13:27:08 -0700 Received: from BGMAIL102.nvidia.com (10.25.59.11) by HQMAIL104.nvidia.com (172.18.146.11) with Microsoft SMTP Server (TLS) id 15.0.1347.2; Thu, 17 May 2018 20:27:06 +0000 Received: from [10.24.70.199] (10.24.70.199) by bgmail102.nvidia.com (10.25.59.11) with Microsoft SMTP Server (TLS) id 15.0.1347.2; Thu, 17 May 2018 20:27:01 +0000 Subject: Re: [PATCH v3] vfio/mdev: Check globally for duplicate devices To: Alex Williamson CC: Cornelia Huck , , , Dong Jia Shi , "Halil Pasic" References: <20180517031746.32242.17768.stgit@gimli.home> <20180517100928.5949b11e.cohuck@redhat.com> <50191254-4437-8ff3-06db-8d6b2df20b65@nvidia.com> <20180517102000.50656267@w520.home> X-Nvconfidentiality: public From: Kirti Wankhede Message-ID: <3ab0fe44-9290-8c29-5e4c-2599ba0de373@nvidia.com> Date: Fri, 18 May 2018 01:56:50 +0530 MIME-Version: 1.0 In-Reply-To: <20180517102000.50656267@w520.home> X-Originating-IP: [10.24.70.199] X-ClientProxiedBy: DRBGMAIL104.nvidia.com (10.18.16.23) To bgmail102.nvidia.com (10.25.59.11) Content-Type: text/plain; charset="utf-8" Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 5/17/2018 9:50 PM, Alex Williamson wrote: > On Thu, 17 May 2018 21:25:22 +0530 > Kirti Wankhede wrote: > >> On 5/17/2018 1:39 PM, Cornelia Huck wrote: >>> On Wed, 16 May 2018 21:30:19 -0600 >>> Alex Williamson wrote: >>> >>>> When we create an mdev device, we check for duplicates against the >>>> parent device and return -EEXIST if found, but the mdev device >>>> namespace is global since we'll link all devices from the bus. We do >>>> catch this later in sysfs_do_create_link_sd() to return -EEXIST, but >>>> with it comes a kernel warning and stack trace for trying to create >>>> duplicate sysfs links, which makes it an undesirable response. >>>> >>>> Therefore we should really be looking for duplicates across all mdev >>>> parent devices, or as implemented here, against our mdev device list. >>>> Using mdev_list to prevent duplicates means that we can remove >>>> mdev_parent.lock, but in order not to serialize mdev device creation >>>> and removal globally, we add mdev_device.active which allows UUIDs to >>>> be reserved such that we can drop the mdev_list_lock before the mdev >>>> device is fully in place. >>>> >>>> NB. there was never intended to be any serialization guarantee >>>> provided by the mdev core with respect to creation and removal of mdev >>>> devices, mdev_parent.lock provided this only as a side-effect of the >>>> implementation for locking the namespace per parent. That >>>> serialization is now removed. >>> >> >> mdev_parent.lock is to serialize create and remove of that mdev device, >> that handles race condition that Cornelia mentioned below. > > Previously it was stated: > > On Thu, 17 May 2018 01:01:40 +0530 > Kirti Wankhede wrote: >> Here lock is not for create/remove routines of vendor driver, its about >> mdev device creation and device registration, which is a common code >> path, and so is part of mdev core module. > > So the race condition was handled previously, but as a side-effect of > protecting the namespace, aiui. I'm trying to state above that the > serialization of create/remove was never intended as a guarantee > provided to mdev vendor drivers. I don't see that there's a need to > protect "mdev device creation and device registration" beyond conflicts > in the UUID namespace, which is done here. Are there others? > Sorry not being elaborative in my earlier response to > If we can > show that vendor drivers handle the create/remove paths themselves, > perhaps we can refine the locking granularity. mdev_device_create() function does : - create mdev device - register device - call vendor driver->create - create sysfs files. mdev_device_remove() removes sysfs files, unregister device and delete device. There is common code in mdev_device_create() and mdev_device_remove() independent of what vendor driver does in its create()/remove() callback. Moving this code to each vendor driver to handle create/remove themselves doesn't make sense to me. mdev_parent.lock here does take care of race conditions that could occur during mdev device creation and deletion in this common code path. What is the urge to remove mdev_parent.lock if that handles all race conditions without bothering user to handle -EAGAIN? Thanks, Kirti >>> This is probably fine; but I noted that documentation on the locking >>> conventions and serialization guarantees for mdev is a bit sparse, and >>> this topic also came up during the vfio-ap review. >>> >>> We probably want to add some more concrete documentation; would the >>> kernel doc for the _ops or vfio-mediated-device.txt be a better place >>> for that? > > I'll look to see where we can add a note withing that file, I suspect > that's the right place to put it. > >>> [Dong Jia, Halil: Can you please take a look whether vfio-ccw is really >>> ok? I don't think we open up any new races, but I'd appreciate a second >>> or third opinion.] >>> >>>> >>>> Signed-off-by: Alex Williamson >>>> --- >>>> >>>> v3: Rework locking and add a field to mdev_device so we can track >>>> completed instances vs those added to reserve the namespace. >>>> >>>> drivers/vfio/mdev/mdev_core.c | 94 +++++++++++++------------------------- >>>> drivers/vfio/mdev/mdev_private.h | 2 - >>>> 2 files changed, 34 insertions(+), 62 deletions(-) >>>> >>>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c >>>> index 126991046eb7..55ea9d34ec69 100644 >>>> --- a/drivers/vfio/mdev/mdev_core.c >>>> +++ b/drivers/vfio/mdev/mdev_core.c >>>> @@ -66,34 +66,6 @@ uuid_le mdev_uuid(struct mdev_device *mdev) >>>> } >>>> EXPORT_SYMBOL(mdev_uuid); >>>> >>>> -static int _find_mdev_device(struct device *dev, void *data) >>>> -{ >>>> - struct mdev_device *mdev; >>>> - >>>> - if (!dev_is_mdev(dev)) >>>> - return 0; >>>> - >>>> - mdev = to_mdev_device(dev); >>>> - >>>> - if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0) >>>> - return 1; >>>> - >>>> - return 0; >>>> -} >>>> - >>>> -static bool mdev_device_exist(struct mdev_parent *parent, uuid_le uuid) >>>> -{ >>>> - struct device *dev; >>>> - >>>> - dev = device_find_child(parent->dev, &uuid, _find_mdev_device); >>>> - if (dev) { >>>> - put_device(dev); >>>> - return true; >>>> - } >>>> - >>>> - return false; >>>> -} >>>> - >>>> /* Should be called holding parent_list_lock */ >>>> static struct mdev_parent *__find_parent_device(struct device *dev) >>>> { >>>> @@ -221,7 +193,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops) >>>> } >>>> >>>> kref_init(&parent->ref); >>>> - mutex_init(&parent->lock); >>>> >>>> parent->dev = dev; >>>> parent->ops = ops; >>>> @@ -304,7 +275,7 @@ static void mdev_device_release(struct device *dev) >>>> int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid) >>>> { >>>> int ret; >>>> - struct mdev_device *mdev; >>>> + struct mdev_device *mdev, *tmp; >>>> struct mdev_parent *parent; >>>> struct mdev_type *type = to_mdev_type(kobj); >>>> >>>> @@ -312,21 +283,26 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid) >>>> if (!parent) >>>> return -EINVAL; >>>> >>>> - mutex_lock(&parent->lock); >>>> + mutex_lock(&mdev_list_lock); >>>> >>>> /* Check for duplicate */ >>>> - if (mdev_device_exist(parent, uuid)) { >>>> - ret = -EEXIST; >>>> - goto create_err; >>>> + list_for_each_entry(tmp, &mdev_list, next) { >>>> + if (!uuid_le_cmp(tmp->uuid, uuid)) { >>>> + mutex_unlock(&mdev_list_lock); >>>> + return -EEXIST; >>>> + } >>>> } >>>> >> >> mdev_put_parent(parent) missing before return. >> >> >>>> mdev = kzalloc(sizeof(*mdev), GFP_KERNEL); >>>> if (!mdev) { >>>> - ret = -ENOMEM; >>>> - goto create_err; >>>> + mutex_unlock(&mdev_list_lock); >>>> + return -ENOMEM; >>>> } >>>> >> >> mdev_put_parent(parent) missing here again. > > > Oops, will fix both. > > >>>> memcpy(&mdev->uuid, &uuid, sizeof(uuid_le)); >>>> + list_add(&mdev->next, &mdev_list); >>>> + mutex_unlock(&mdev_list_lock); >>>> + >>>> mdev->parent = parent; >>>> kref_init(&mdev->ref); >>>> >>>> @@ -352,21 +328,18 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid) >>>> } >>>> >>>> mdev->type_kobj = kobj; >>>> + mdev->active = true; >>>> dev_dbg(&mdev->dev, "MDEV: created\n"); >>>> >>>> - mutex_unlock(&parent->lock); >>>> - >>>> - mutex_lock(&mdev_list_lock); >>>> - list_add(&mdev->next, &mdev_list); >>>> - mutex_unlock(&mdev_list_lock); >>>> - >>>> - return ret; >>>> + return 0; >>>> >>>> create_failed: >>>> device_unregister(&mdev->dev); >>>> >>>> create_err: >>>> - mutex_unlock(&parent->lock); >>>> + mutex_lock(&mdev_list_lock); >>>> + list_del(&mdev->next); >>>> + mutex_unlock(&mdev_list_lock); >>>> mdev_put_parent(parent); >>>> return ret; >>>> } >>>> @@ -377,44 +350,43 @@ int mdev_device_remove(struct device *dev, bool force_remove) >>>> struct mdev_parent *parent; >>>> struct mdev_type *type; >>>> int ret; >>>> - bool found = false; >>>> >>>> mdev = to_mdev_device(dev); >>>> >>>> mutex_lock(&mdev_list_lock); >>>> list_for_each_entry(tmp, &mdev_list, next) { >>>> - if (tmp == mdev) { >>>> - found = true; >>>> + if (tmp == mdev) >>>> break; >>>> - } >>>> } >>>> >>>> - if (found) >>>> - list_del(&mdev->next); >>>> + if (tmp != mdev) { >>>> + mutex_unlock(&mdev_list_lock); >>>> + return -ENODEV; >>>> + } >>>> >>>> - mutex_unlock(&mdev_list_lock); >>>> + if (!mdev->active) { >>>> + mutex_unlock(&mdev_list_lock); >>>> + return -EAGAIN; >>>> + } >>> >>> I'm not sure whether this is 100% watertight. Consider: >>> >>> - device gets registered, we have added it to the list, made it visible >>> in sysfs and have added the remove attribute, but not yet the symlinks >>> - userspace can access the remove attribute and trigger removal >>> - we do an early exit here because not yet active >>> - ??? >>> >>> (If there's any problem, it's a very pathological case, and I don't >>> think anything really bad can happen. I just want to make sure we don't >>> miss anything.) > > The presented scenario is exactly the use case that the -EAGAIN return > is intended to handle. I can't put a mutex on the mdev_device to block > this path as the mdev creation might ultimately fail and the device > released (perhaps not possible in our code flow, but that would be a > very subtle detail to rely on). So any sort of blocking approach would > require releasing mdev_list_lock and re-scanning in a busy loop. Why > bother to do that when we can indicate the same to the user through > -EAGAIN. AIUI, this is the purpose of -EAGAIN and it's up to userspace > to decide how they'd like to handle it, try again or abort. Are there > suggestions for alternatives? Thanks, >