v4: Fix the 'create' racing 'remove' gap noted by Kirti by moving
removal from mdev_list to mdev_device_release(). Fix missing
mdev_put_parent() cases in mdev_device_create(), also noted
by Kirti. Added documention update regarding serialization as
noted by Cornelia. Added additional commit log comment about
-EAGAIN vs -ENODEV for 'remove' racing 'create'. Added second
patch to re-order sysfs attributes, with this my targeted
scripts can no longer hit the gap where -EAGAIN is regurned.
BTW, the gap where the current code returns -ENODEV in this
race condition is about 50% easier to hit than it exists in
this series with patch 1 alone.
Thanks,
Alex
---
Alex Williamson (2):
vfio/mdev: Check globally for duplicate devices
vfio/mdev: Re-order sysfs attribute creation
Documentation/vfio-mediated-device.txt | 5 ++
drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
drivers/vfio/mdev/mdev_private.h | 2 -
drivers/vfio/mdev/mdev_sysfs.c | 14 ++--
4 files changed, 49 insertions(+), 74 deletions(-)
When we create an mdev device, we check for duplicates against the
parent device and return -EEXIST if found, but the mdev device
namespace is global since we'll link all devices from the bus. We do
catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
with it comes a kernel warning and stack trace for trying to create
duplicate sysfs links, which makes it an undesirable response.
Therefore we should really be looking for duplicates across all mdev
parent devices, or as implemented here, against our mdev device list.
Using mdev_list to prevent duplicates means that we can remove
mdev_parent.lock, but in order not to serialize mdev device creation
and removal globally, we add mdev_device.active which allows UUIDs to
be reserved such that we can drop the mdev_list_lock before the mdev
device is fully in place.
Two behavioral notes; first, mdev_parent.lock had the side-effect of
serializing mdev create and remove ops per parent device. This was
an implementation detail, not an intentional guarantee provided to
the mdev vendor drivers. Vendor drivers can trivially provide this
serialization internally if necessary. Second, review comments note
the new -EAGAIN behavior when the device, and in particular the remove
attribute, becomes visible in sysfs. If a remove is triggered prior
to completion of mdev_device_create() the user will see a -EAGAIN
error. While the errno is different, receiving an error during this
period is not, the previous implementation returned -ENODEV for the
same condition. Furthermore, the consistency to the user is improved
in the case where mdev_device_remove_ops() returns error. Previously
concurrent calls to mdev_device_remove() could see the device
disappear with -ENODEV and return in the case of error. Now a user
would see -EAGAIN while the device is in this transitory state.
Signed-off-by: Alex Williamson <[email protected]>
---
Documentation/vfio-mediated-device.txt | 5 ++
drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
drivers/vfio/mdev/mdev_private.h | 2 -
3 files changed, 42 insertions(+), 67 deletions(-)
diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
index 1b3950346532..c3f69bcaf96e 100644
--- a/Documentation/vfio-mediated-device.txt
+++ b/Documentation/vfio-mediated-device.txt
@@ -145,6 +145,11 @@ The functions in the mdev_parent_ops structure are as follows:
* create: allocate basic resources in a driver for a mediated device
* remove: free resources in a driver when a mediated device is destroyed
+(Note that mdev-core provides no implicit serialization of create/remove
+callbacks per mdev parent device, per mdev type, or any other categorization.
+Vendor drivers are expected to be fully asynchronous in this respect or
+provide their own internal resource protection.)
+
The callbacks in the mdev_parent_ops structure are as follows:
* open: open callback of mediated device
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 126991046eb7..0212f0ee8aea 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -66,34 +66,6 @@ uuid_le mdev_uuid(struct mdev_device *mdev)
}
EXPORT_SYMBOL(mdev_uuid);
-static int _find_mdev_device(struct device *dev, void *data)
-{
- struct mdev_device *mdev;
-
- if (!dev_is_mdev(dev))
- return 0;
-
- mdev = to_mdev_device(dev);
-
- if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
- return 1;
-
- return 0;
-}
-
-static bool mdev_device_exist(struct mdev_parent *parent, uuid_le uuid)
-{
- struct device *dev;
-
- dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
- if (dev) {
- put_device(dev);
- return true;
- }
-
- return false;
-}
-
/* Should be called holding parent_list_lock */
static struct mdev_parent *__find_parent_device(struct device *dev)
{
@@ -221,7 +193,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
}
kref_init(&parent->ref);
- mutex_init(&parent->lock);
parent->dev = dev;
parent->ops = ops;
@@ -297,6 +268,10 @@ static void mdev_device_release(struct device *dev)
{
struct mdev_device *mdev = to_mdev_device(dev);
+ mutex_lock(&mdev_list_lock);
+ list_del(&mdev->next);
+ mutex_unlock(&mdev_list_lock);
+
dev_dbg(&mdev->dev, "MDEV: destroying\n");
kfree(mdev);
}
@@ -304,7 +279,7 @@ static void mdev_device_release(struct device *dev)
int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
{
int ret;
- struct mdev_device *mdev;
+ struct mdev_device *mdev, *tmp;
struct mdev_parent *parent;
struct mdev_type *type = to_mdev_type(kobj);
@@ -312,21 +287,28 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
if (!parent)
return -EINVAL;
- mutex_lock(&parent->lock);
+ mutex_lock(&mdev_list_lock);
/* Check for duplicate */
- if (mdev_device_exist(parent, uuid)) {
- ret = -EEXIST;
- goto create_err;
+ list_for_each_entry(tmp, &mdev_list, next) {
+ if (!uuid_le_cmp(tmp->uuid, uuid)) {
+ mutex_unlock(&mdev_list_lock);
+ ret = -EEXIST;
+ goto mdev_fail;
+ }
}
mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
if (!mdev) {
+ mutex_unlock(&mdev_list_lock);
ret = -ENOMEM;
- goto create_err;
+ goto mdev_fail;
}
memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+ list_add(&mdev->next, &mdev_list);
+ mutex_unlock(&mdev_list_lock);
+
mdev->parent = parent;
kref_init(&mdev->ref);
@@ -338,35 +320,28 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
ret = device_register(&mdev->dev);
if (ret) {
put_device(&mdev->dev);
- goto create_err;
+ goto mdev_fail;
}
ret = mdev_device_create_ops(kobj, mdev);
if (ret)
- goto create_failed;
+ goto create_fail;
ret = mdev_create_sysfs_files(&mdev->dev, type);
if (ret) {
mdev_device_remove_ops(mdev, true);
- goto create_failed;
+ goto create_fail;
}
mdev->type_kobj = kobj;
+ mdev->active = true;
dev_dbg(&mdev->dev, "MDEV: created\n");
- mutex_unlock(&parent->lock);
-
- mutex_lock(&mdev_list_lock);
- list_add(&mdev->next, &mdev_list);
- mutex_unlock(&mdev_list_lock);
-
- return ret;
+ return 0;
-create_failed:
+create_fail:
device_unregister(&mdev->dev);
-
-create_err:
- mutex_unlock(&parent->lock);
+mdev_fail:
mdev_put_parent(parent);
return ret;
}
@@ -377,44 +352,39 @@ int mdev_device_remove(struct device *dev, bool force_remove)
struct mdev_parent *parent;
struct mdev_type *type;
int ret;
- bool found = false;
mdev = to_mdev_device(dev);
mutex_lock(&mdev_list_lock);
list_for_each_entry(tmp, &mdev_list, next) {
- if (tmp == mdev) {
- found = true;
+ if (tmp == mdev)
break;
- }
}
- if (found)
- list_del(&mdev->next);
+ if (tmp != mdev) {
+ mutex_unlock(&mdev_list_lock);
+ return -ENODEV;
+ }
- mutex_unlock(&mdev_list_lock);
+ if (!mdev->active) {
+ mutex_unlock(&mdev_list_lock);
+ return -EAGAIN;
+ }
- if (!found)
- return -ENODEV;
+ mdev->active = false;
+ mutex_unlock(&mdev_list_lock);
type = to_mdev_type(mdev->type_kobj);
parent = mdev->parent;
- mutex_lock(&parent->lock);
ret = mdev_device_remove_ops(mdev, force_remove);
if (ret) {
- mutex_unlock(&parent->lock);
-
- mutex_lock(&mdev_list_lock);
- list_add(&mdev->next, &mdev_list);
- mutex_unlock(&mdev_list_lock);
-
+ mdev->active = true;
return ret;
}
mdev_remove_sysfs_files(dev, type);
device_unregister(dev);
- mutex_unlock(&parent->lock);
mdev_put_parent(parent);
return 0;
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index a9cefd70a705..b5819b7d7ef7 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -20,7 +20,6 @@ struct mdev_parent {
struct device *dev;
const struct mdev_parent_ops *ops;
struct kref ref;
- struct mutex lock;
struct list_head next;
struct kset *mdev_types_kset;
struct list_head type_list;
@@ -34,6 +33,7 @@ struct mdev_device {
struct kref ref;
struct list_head next;
struct kobject *type_kobj;
+ bool active;
};
#define to_mdev_device(dev) container_of(dev, struct mdev_device, dev)
There exists a gap at the end of mdev_device_create() where the device
is visible to userspace, but we're not yet ready to handle removal, as
triggered through the 'remove' attribute. We handle this properly in
mdev_device_remove() with an -EAGAIN return, but we can marginally
reduce this gap by adding this attribute as a final step of our sysfs
setup.
Signed-off-by: Alex Williamson <[email protected]>
---
drivers/vfio/mdev/mdev_sysfs.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index 802df210929b..249472f05509 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -257,24 +257,24 @@ int mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
{
int ret;
- ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
- if (ret)
- return ret;
-
ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
if (ret)
- goto device_link_failed;
+ return ret;
ret = sysfs_create_link(&dev->kobj, &type->kobj, "mdev_type");
if (ret)
goto type_link_failed;
+ ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
+ if (ret)
+ goto create_files_failed;
+
return ret;
+create_files_failed:
+ sysfs_remove_link(&dev->kobj, "mdev_type");
type_link_failed:
sysfs_remove_link(type->devices_kobj, dev_name(dev));
-device_link_failed:
- sysfs_remove_files(&dev->kobj, mdev_device_attrs);
return ret;
}
On 5/19/2018 12:40 AM, Alex Williamson wrote:
> v4: Fix the 'create' racing 'remove' gap noted by Kirti by moving
> removal from mdev_list to mdev_device_release(). Fix missing
> mdev_put_parent() cases in mdev_device_create(), also noted
> by Kirti. Added documention update regarding serialization as
> noted by Cornelia. Added additional commit log comment about
> -EAGAIN vs -ENODEV for 'remove' racing 'create'. Added second
> patch to re-order sysfs attributes, with this my targeted
> scripts can no longer hit the gap where -EAGAIN is regurned.
> BTW, the gap where the current code returns -ENODEV in this
> race condition is about 50% easier to hit than it exists in
> this series with patch 1 alone.
>
Thanks for fixing this. This patch set looks good to me.
Thanks,
Kirti
> Thanks,
> Alex
>
> ---
>
> Alex Williamson (2):
> vfio/mdev: Check globally for duplicate devices
> vfio/mdev: Re-order sysfs attribute creation
>
>
> Documentation/vfio-mediated-device.txt | 5 ++
> drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
> drivers/vfio/mdev/mdev_private.h | 2 -
> drivers/vfio/mdev/mdev_sysfs.c | 14 ++--
> 4 files changed, 49 insertions(+), 74 deletions(-)
>
On 5/19/2018 12:40 AM, Alex Williamson wrote:
> When we create an mdev device, we check for duplicates against the
> parent device and return -EEXIST if found, but the mdev device
> namespace is global since we'll link all devices from the bus. We do
> catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
> with it comes a kernel warning and stack trace for trying to create
> duplicate sysfs links, which makes it an undesirable response.
>
> Therefore we should really be looking for duplicates across all mdev
> parent devices, or as implemented here, against our mdev device list.
> Using mdev_list to prevent duplicates means that we can remove
> mdev_parent.lock, but in order not to serialize mdev device creation
> and removal globally, we add mdev_device.active which allows UUIDs to
> be reserved such that we can drop the mdev_list_lock before the mdev
> device is fully in place.
>
> Two behavioral notes; first, mdev_parent.lock had the side-effect of
> serializing mdev create and remove ops per parent device. This was
> an implementation detail, not an intentional guarantee provided to
> the mdev vendor drivers. Vendor drivers can trivially provide this
> serialization internally if necessary. Second, review comments note
> the new -EAGAIN behavior when the device, and in particular the remove
> attribute, becomes visible in sysfs. If a remove is triggered prior
> to completion of mdev_device_create() the user will see a -EAGAIN
> error. While the errno is different, receiving an error during this
> period is not, the previous implementation returned -ENODEV for the
> same condition. Furthermore, the consistency to the user is improved
> in the case where mdev_device_remove_ops() returns error. Previously
> concurrent calls to mdev_device_remove() could see the device
> disappear with -ENODEV and return in the case of error. Now a user
> would see -EAGAIN while the device is in this transitory state.
>
> Signed-off-by: Alex Williamson <[email protected]>
Looks good to me.
Reviewed by: Kirti Wankhede <[email protected]>
> ---
> Documentation/vfio-mediated-device.txt | 5 ++
> drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
> drivers/vfio/mdev/mdev_private.h | 2 -
> 3 files changed, 42 insertions(+), 67 deletions(-)
>
> diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
> index 1b3950346532..c3f69bcaf96e 100644
> --- a/Documentation/vfio-mediated-device.txt
> +++ b/Documentation/vfio-mediated-device.txt
> @@ -145,6 +145,11 @@ The functions in the mdev_parent_ops structure are as follows:
> * create: allocate basic resources in a driver for a mediated device
> * remove: free resources in a driver when a mediated device is destroyed
>
> +(Note that mdev-core provides no implicit serialization of create/remove
> +callbacks per mdev parent device, per mdev type, or any other categorization.
> +Vendor drivers are expected to be fully asynchronous in this respect or
> +provide their own internal resource protection.)
> +
> The callbacks in the mdev_parent_ops structure are as follows:
>
> * open: open callback of mediated device
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 126991046eb7..0212f0ee8aea 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -66,34 +66,6 @@ uuid_le mdev_uuid(struct mdev_device *mdev)
> }
> EXPORT_SYMBOL(mdev_uuid);
>
> -static int _find_mdev_device(struct device *dev, void *data)
> -{
> - struct mdev_device *mdev;
> -
> - if (!dev_is_mdev(dev))
> - return 0;
> -
> - mdev = to_mdev_device(dev);
> -
> - if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
> - return 1;
> -
> - return 0;
> -}
> -
> -static bool mdev_device_exist(struct mdev_parent *parent, uuid_le uuid)
> -{
> - struct device *dev;
> -
> - dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
> - if (dev) {
> - put_device(dev);
> - return true;
> - }
> -
> - return false;
> -}
> -
> /* Should be called holding parent_list_lock */
> static struct mdev_parent *__find_parent_device(struct device *dev)
> {
> @@ -221,7 +193,6 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
> }
>
> kref_init(&parent->ref);
> - mutex_init(&parent->lock);
>
> parent->dev = dev;
> parent->ops = ops;
> @@ -297,6 +268,10 @@ static void mdev_device_release(struct device *dev)
> {
> struct mdev_device *mdev = to_mdev_device(dev);
>
> + mutex_lock(&mdev_list_lock);
> + list_del(&mdev->next);
> + mutex_unlock(&mdev_list_lock);
> +
> dev_dbg(&mdev->dev, "MDEV: destroying\n");
> kfree(mdev);
> }
> @@ -304,7 +279,7 @@ static void mdev_device_release(struct device *dev)
> int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> {
> int ret;
> - struct mdev_device *mdev;
> + struct mdev_device *mdev, *tmp;
> struct mdev_parent *parent;
> struct mdev_type *type = to_mdev_type(kobj);
>
> @@ -312,21 +287,28 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> if (!parent)
> return -EINVAL;
>
> - mutex_lock(&parent->lock);
> + mutex_lock(&mdev_list_lock);
>
> /* Check for duplicate */
> - if (mdev_device_exist(parent, uuid)) {
> - ret = -EEXIST;
> - goto create_err;
> + list_for_each_entry(tmp, &mdev_list, next) {
> + if (!uuid_le_cmp(tmp->uuid, uuid)) {
> + mutex_unlock(&mdev_list_lock);
> + ret = -EEXIST;
> + goto mdev_fail;
> + }
> }
>
> mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> if (!mdev) {
> + mutex_unlock(&mdev_list_lock);
> ret = -ENOMEM;
> - goto create_err;
> + goto mdev_fail;
> }
>
> memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> + list_add(&mdev->next, &mdev_list);
> + mutex_unlock(&mdev_list_lock);
> +
> mdev->parent = parent;
> kref_init(&mdev->ref);
>
> @@ -338,35 +320,28 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> ret = device_register(&mdev->dev);
> if (ret) {
> put_device(&mdev->dev);
> - goto create_err;
> + goto mdev_fail;
> }
>
> ret = mdev_device_create_ops(kobj, mdev);
> if (ret)
> - goto create_failed;
> + goto create_fail;
>
> ret = mdev_create_sysfs_files(&mdev->dev, type);
> if (ret) {
> mdev_device_remove_ops(mdev, true);
> - goto create_failed;
> + goto create_fail;
> }
>
> mdev->type_kobj = kobj;
> + mdev->active = true;
> dev_dbg(&mdev->dev, "MDEV: created\n");
>
> - mutex_unlock(&parent->lock);
> -
> - mutex_lock(&mdev_list_lock);
> - list_add(&mdev->next, &mdev_list);
> - mutex_unlock(&mdev_list_lock);
> -
> - return ret;
> + return 0;
>
> -create_failed:
> +create_fail:
> device_unregister(&mdev->dev);
> -
> -create_err:
> - mutex_unlock(&parent->lock);
> +mdev_fail:
> mdev_put_parent(parent);
> return ret;
> }
> @@ -377,44 +352,39 @@ int mdev_device_remove(struct device *dev, bool force_remove)
> struct mdev_parent *parent;
> struct mdev_type *type;
> int ret;
> - bool found = false;
>
> mdev = to_mdev_device(dev);
>
> mutex_lock(&mdev_list_lock);
> list_for_each_entry(tmp, &mdev_list, next) {
> - if (tmp == mdev) {
> - found = true;
> + if (tmp == mdev)
> break;
> - }
> }
>
> - if (found)
> - list_del(&mdev->next);
> + if (tmp != mdev) {
> + mutex_unlock(&mdev_list_lock);
> + return -ENODEV;
> + }
>
> - mutex_unlock(&mdev_list_lock);
> + if (!mdev->active) {
> + mutex_unlock(&mdev_list_lock);
> + return -EAGAIN;
> + }
>
> - if (!found)
> - return -ENODEV;
> + mdev->active = false;
> + mutex_unlock(&mdev_list_lock);
>
> type = to_mdev_type(mdev->type_kobj);
> parent = mdev->parent;
> - mutex_lock(&parent->lock);
>
> ret = mdev_device_remove_ops(mdev, force_remove);
> if (ret) {
> - mutex_unlock(&parent->lock);
> -
> - mutex_lock(&mdev_list_lock);
> - list_add(&mdev->next, &mdev_list);
> - mutex_unlock(&mdev_list_lock);
> -
> + mdev->active = true;
> return ret;
> }
>
> mdev_remove_sysfs_files(dev, type);
> device_unregister(dev);
> - mutex_unlock(&parent->lock);
> mdev_put_parent(parent);
>
> return 0;
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index a9cefd70a705..b5819b7d7ef7 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -20,7 +20,6 @@ struct mdev_parent {
> struct device *dev;
> const struct mdev_parent_ops *ops;
> struct kref ref;
> - struct mutex lock;
> struct list_head next;
> struct kset *mdev_types_kset;
> struct list_head type_list;
> @@ -34,6 +33,7 @@ struct mdev_device {
> struct kref ref;
> struct list_head next;
> struct kobject *type_kobj;
> + bool active;
> };
>
> #define to_mdev_device(dev) container_of(dev, struct mdev_device, dev)
>
On 5/19/2018 12:40 AM, Alex Williamson wrote:
> There exists a gap at the end of mdev_device_create() where the device
> is visible to userspace, but we're not yet ready to handle removal, as
> triggered through the 'remove' attribute. We handle this properly in
> mdev_device_remove() with an -EAGAIN return, but we can marginally
> reduce this gap by adding this attribute as a final step of our sysfs
> setup.
>
> Signed-off-by: Alex Williamson <[email protected]>
Looks good.
Reviewed by: Kirti Wankhede <[email protected]>
> ---
> drivers/vfio/mdev/mdev_sysfs.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index 802df210929b..249472f05509 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -257,24 +257,24 @@ int mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
> {
> int ret;
>
> - ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
> - if (ret)
> - return ret;
> -
> ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
> if (ret)
> - goto device_link_failed;
> + return ret;
>
> ret = sysfs_create_link(&dev->kobj, &type->kobj, "mdev_type");
> if (ret)
> goto type_link_failed;
>
> + ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
> + if (ret)
> + goto create_files_failed;
> +
> return ret;
>
> +create_files_failed:
> + sysfs_remove_link(&dev->kobj, "mdev_type");
> type_link_failed:
> sysfs_remove_link(type->devices_kobj, dev_name(dev));
> -device_link_failed:
> - sysfs_remove_files(&dev->kobj, mdev_device_attrs);
> return ret;
> }
>
>
On Fri, 18 May 2018 13:10:25 -0600
Alex Williamson <[email protected]> wrote:
> When we create an mdev device, we check for duplicates against the
> parent device and return -EEXIST if found, but the mdev device
> namespace is global since we'll link all devices from the bus. We do
> catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
> with it comes a kernel warning and stack trace for trying to create
> duplicate sysfs links, which makes it an undesirable response.
>
> Therefore we should really be looking for duplicates across all mdev
> parent devices, or as implemented here, against our mdev device list.
> Using mdev_list to prevent duplicates means that we can remove
> mdev_parent.lock, but in order not to serialize mdev device creation
> and removal globally, we add mdev_device.active which allows UUIDs to
> be reserved such that we can drop the mdev_list_lock before the mdev
> device is fully in place.
>
> Two behavioral notes; first, mdev_parent.lock had the side-effect of
> serializing mdev create and remove ops per parent device. This was
> an implementation detail, not an intentional guarantee provided to
> the mdev vendor drivers. Vendor drivers can trivially provide this
> serialization internally if necessary. Second, review comments note
> the new -EAGAIN behavior when the device, and in particular the remove
> attribute, becomes visible in sysfs. If a remove is triggered prior
> to completion of mdev_device_create() the user will see a -EAGAIN
> error. While the errno is different, receiving an error during this
> period is not, the previous implementation returned -ENODEV for the
> same condition. Furthermore, the consistency to the user is improved
> in the case where mdev_device_remove_ops() returns error. Previously
> concurrent calls to mdev_device_remove() could see the device
> disappear with -ENODEV and return in the case of error. Now a user
> would see -EAGAIN while the device is in this transitory state.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
> Documentation/vfio-mediated-device.txt | 5 ++
> drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
> drivers/vfio/mdev/mdev_private.h | 2 -
> 3 files changed, 42 insertions(+), 67 deletions(-)
Reviewed-by: Cornelia Huck <[email protected]>
I think it is better to deal with any possible vendor driver
implications on top of this (I still believe that vfio-ccw is fine).
On Fri, 18 May 2018 13:10:35 -0600
Alex Williamson <[email protected]> wrote:
> There exists a gap at the end of mdev_device_create() where the device
> is visible to userspace, but we're not yet ready to handle removal, as
> triggered through the 'remove' attribute. We handle this properly in
> mdev_device_remove() with an -EAGAIN return, but we can marginally
> reduce this gap by adding this attribute as a final step of our sysfs
> setup.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
> drivers/vfio/mdev/mdev_sysfs.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
Reviewed-by: Cornelia Huck <[email protected]>
[Cc +GVT-g maintainers/lists]
On Tue, 22 May 2018 10:13:46 +0200
Cornelia Huck <[email protected]> wrote:
> On Fri, 18 May 2018 13:10:25 -0600
> Alex Williamson <[email protected]> wrote:
>
> > When we create an mdev device, we check for duplicates against the
> > parent device and return -EEXIST if found, but the mdev device
> > namespace is global since we'll link all devices from the bus. We do
> > catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
> > with it comes a kernel warning and stack trace for trying to create
> > duplicate sysfs links, which makes it an undesirable response.
> >
> > Therefore we should really be looking for duplicates across all mdev
> > parent devices, or as implemented here, against our mdev device list.
> > Using mdev_list to prevent duplicates means that we can remove
> > mdev_parent.lock, but in order not to serialize mdev device creation
> > and removal globally, we add mdev_device.active which allows UUIDs to
> > be reserved such that we can drop the mdev_list_lock before the mdev
> > device is fully in place.
> >
> > Two behavioral notes; first, mdev_parent.lock had the side-effect of
> > serializing mdev create and remove ops per parent device. This was
> > an implementation detail, not an intentional guarantee provided to
> > the mdev vendor drivers. Vendor drivers can trivially provide this
> > serialization internally if necessary. Second, review comments note
> > the new -EAGAIN behavior when the device, and in particular the remove
> > attribute, becomes visible in sysfs. If a remove is triggered prior
> > to completion of mdev_device_create() the user will see a -EAGAIN
> > error. While the errno is different, receiving an error during this
> > period is not, the previous implementation returned -ENODEV for the
> > same condition. Furthermore, the consistency to the user is improved
> > in the case where mdev_device_remove_ops() returns error. Previously
> > concurrent calls to mdev_device_remove() could see the device
> > disappear with -ENODEV and return in the case of error. Now a user
> > would see -EAGAIN while the device is in this transitory state.
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> > Documentation/vfio-mediated-device.txt | 5 ++
> > drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
> > drivers/vfio/mdev/mdev_private.h | 2 -
> > 3 files changed, 42 insertions(+), 67 deletions(-)
>
> Reviewed-by: Cornelia Huck <[email protected]>
>
> I think it is better to deal with any possible vendor driver
> implications on top of this (I still believe that vfio-ccw is fine).
Thanks Cornelia. So if vfio-ccw is fine, presumably NVIDIA is fine,
then this leaves GVT-g to see if there's any fallout. Zhenyu & Zhi,
I've linked the series under discussion here below[1]. The question to
you is the first of the two behavioral notes listed above, does GVT-g
have any dependency on the mdev core providing serialization per mdev
parent device for the create and remove callbacks within the
mdev_parent_ops? This was never an intended feature of the
implementation and as noted it should be trivial for for an mdev vendor
driver to provide equivalent course grained serialization if
necessary. Of course it would be better to implement that sooner
rather than later if required.
I see that __intel_gvt_create_vgpu() makes use of gvt->lock, which
would seem to already provide this level of per-parent locking. The
remove path makes use of this same lock, so I think we're ok, but
looking for an explicit ack so there are no surprises. I'd like
to queue this series for v4.18. Thanks,
Alex
[1] https://lkml.org/lkml/2018/5/18/1035
From vfio-ccw perspective I join Connie's assessment: vfio-ccw should
be fine with these changes. I'm however not too deeply involved with
the mdev framework, thus I don't feel comfortable r-b-ing. That results
in
Acked-by: Halil Pasic <[email protected]>
for both patches.
While at it I have would like to ask about the semantics and intended
use of the mdev interfaces.
static int vfio_ccw_sch_probe(struct subchannel *sch)
{
/* HALIL: 8< Not so interesting stuff happens here. >8 */
ret = vfio_ccw_mdev_reg(sch);
if (ret)
goto out_disable;
/*
* HALIL:
* This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute
* is made available (it calls mdev_register_device()). For instance create will
* attempt to decrement private->avail which is initialized below. I fail to
* understand how is this well synchronized.
*/
INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo);
atomic_set(&private->avail, 1);
private->state = VFIO_CCW_STATE_STANDBY;
return 0;
out_disable:
cio_disable_subchannel(sch);
out_free:
dev_set_drvdata(&sch->dev, NULL);
kfree(private);
return ret;
}
Should not initialization of go before mdev_register_device(), and then rolled
back if necessary if mdev_register_device() fails?
In practice it does not seem very likely that userspace can trigger
mdev_device_create() before vfio_ccw_sch_probe() finishes so it should
not be a practical problem. But I would like to understand how synchronization
is supposed to work.
[Added Dong Jia, maybe he is also able to answer my question.]
Regards,
Halil
On 05/18/2018 09:10 PM, Alex Williamson wrote:
> v4: Fix the 'create' racing 'remove' gap noted by Kirti by moving
> removal from mdev_list to mdev_device_release(). Fix missing
> mdev_put_parent() cases in mdev_device_create(), also noted
> by Kirti. Added documention update regarding serialization as
> noted by Cornelia. Added additional commit log comment about
> -EAGAIN vs -ENODEV for 'remove' racing 'create'. Added second
> patch to re-order sysfs attributes, with this my targeted
> scripts can no longer hit the gap where -EAGAIN is regurned.
> BTW, the gap where the current code returns -ENODEV in this
> race condition is about 50% easier to hit than it exists in
> this series with patch 1 alone.
>
> Thanks,
> Alex
>
> ---
>
> Alex Williamson (2):
> vfio/mdev: Check globally for duplicate devices
> vfio/mdev: Re-order sysfs attribute creation
>
>
> Documentation/vfio-mediated-device.txt | 5 ++
> drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
> drivers/vfio/mdev/mdev_private.h | 2 -
> drivers/vfio/mdev/mdev_sysfs.c | 14 ++--
> 4 files changed, 49 insertions(+), 74 deletions(-)
>
On Tue, 22 May 2018 19:17:07 +0200
Halil Pasic <[email protected]> wrote:
> From vfio-ccw perspective I join Connie's assessment: vfio-ccw should
> be fine with these changes. I'm however not too deeply involved with
> the mdev framework, thus I don't feel comfortable r-b-ing. That results
> in
> Acked-by: Halil Pasic <[email protected]>
> for both patches.
>
> While at it I have would like to ask about the semantics and intended
> use of the mdev interfaces.
>
> static int vfio_ccw_sch_probe(struct subchannel *sch)
> {
>
> /* HALIL: 8< Not so interesting stuff happens here. >8 */
This was interesting:
private->state = VFIO_CCW_STATE_NOT_OPER;
> ret = vfio_ccw_mdev_reg(sch);
> if (ret)
> goto out_disable;
> /*
> * HALIL:
> * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute
> * is made available (it calls mdev_register_device()). For instance create will
> * attempt to decrement private->avail which is initialized below. I fail to
> * understand how is this well synchronized.
> */
> INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo);
> atomic_set(&private->avail, 1);
> private->state = VFIO_CCW_STATE_STANDBY;
>
> return 0;
>
> out_disable:
> cio_disable_subchannel(sch);
> out_free:
> dev_set_drvdata(&sch->dev, NULL);
> kfree(private);
> return ret;
> }
>
> Should not initialization of go before mdev_register_device(), and then rolled
> back if necessary if mdev_register_device() fails?
>
> In practice it does not seem very likely that userspace can trigger
> mdev_device_create() before vfio_ccw_sch_probe() finishes so it should
> not be a practical problem. But I would like to understand how synchronization
> is supposed to work.
>
> [Added Dong Jia, maybe he is also able to answer my question.]
vfio_ccw_mdev_create() requires that private->state is not
VFIO_CCW_STATE_NOT_OPER but vfio_ccw_sch_probe() explicitly sets state
to this value before calling vfio_ccw_mdev_reg(), so a create should
return -ENODEV if racing with parent registration. Is there something
else that I'm missing? Thanks,
Alex
On 2018.05.22 09:53:37 -0600, Alex Williamson wrote:
> [Cc +GVT-g maintainers/lists]
>
> On Tue, 22 May 2018 10:13:46 +0200
> Cornelia Huck <[email protected]> wrote:
>
> > On Fri, 18 May 2018 13:10:25 -0600
> > Alex Williamson <[email protected]> wrote:
> >
> > > When we create an mdev device, we check for duplicates against the
> > > parent device and return -EEXIST if found, but the mdev device
> > > namespace is global since we'll link all devices from the bus. We do
> > > catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
> > > with it comes a kernel warning and stack trace for trying to create
> > > duplicate sysfs links, which makes it an undesirable response.
> > >
> > > Therefore we should really be looking for duplicates across all mdev
> > > parent devices, or as implemented here, against our mdev device list.
> > > Using mdev_list to prevent duplicates means that we can remove
> > > mdev_parent.lock, but in order not to serialize mdev device creation
> > > and removal globally, we add mdev_device.active which allows UUIDs to
> > > be reserved such that we can drop the mdev_list_lock before the mdev
> > > device is fully in place.
> > >
> > > Two behavioral notes; first, mdev_parent.lock had the side-effect of
> > > serializing mdev create and remove ops per parent device. This was
> > > an implementation detail, not an intentional guarantee provided to
> > > the mdev vendor drivers. Vendor drivers can trivially provide this
> > > serialization internally if necessary. Second, review comments note
> > > the new -EAGAIN behavior when the device, and in particular the remove
> > > attribute, becomes visible in sysfs. If a remove is triggered prior
> > > to completion of mdev_device_create() the user will see a -EAGAIN
> > > error. While the errno is different, receiving an error during this
> > > period is not, the previous implementation returned -ENODEV for the
> > > same condition. Furthermore, the consistency to the user is improved
> > > in the case where mdev_device_remove_ops() returns error. Previously
> > > concurrent calls to mdev_device_remove() could see the device
> > > disappear with -ENODEV and return in the case of error. Now a user
> > > would see -EAGAIN while the device is in this transitory state.
> > >
> > > Signed-off-by: Alex Williamson <[email protected]>
> > > ---
> > > Documentation/vfio-mediated-device.txt | 5 ++
> > > drivers/vfio/mdev/mdev_core.c | 102 +++++++++++---------------------
> > > drivers/vfio/mdev/mdev_private.h | 2 -
> > > 3 files changed, 42 insertions(+), 67 deletions(-)
> >
> > Reviewed-by: Cornelia Huck <[email protected]>
> >
> > I think it is better to deal with any possible vendor driver
> > implications on top of this (I still believe that vfio-ccw is fine).
>
> Thanks Cornelia. So if vfio-ccw is fine, presumably NVIDIA is fine,
> then this leaves GVT-g to see if there's any fallout. Zhenyu & Zhi,
> I've linked the series under discussion here below[1]. The question to
> you is the first of the two behavioral notes listed above, does GVT-g
> have any dependency on the mdev core providing serialization per mdev
> parent device for the create and remove callbacks within the
> mdev_parent_ops? This was never an intended feature of the
> implementation and as noted it should be trivial for for an mdev vendor
> driver to provide equivalent course grained serialization if
> necessary. Of course it would be better to implement that sooner
> rather than later if required.
>
> I see that __intel_gvt_create_vgpu() makes use of gvt->lock, which
> would seem to already provide this level of per-parent locking. The
> remove path makes use of this same lock, so I think we're ok, but
> looking for an explicit ack so there are no surprises. I'd like
> to queue this series for v4.18. Thanks,
>
yeah, we don't expect mdev core for parent serialization for create and
remove of mdev device. Series look good to me.
Acked-by: Zhenyu Wang <[email protected]>
> Alex
>
> [1] https://lkml.org/lkml/2018/5/18/1035
--
Open Source Technology Center, Intel ltd.
$gpg --keyserver wwwkeys.pgp.net --recv-keys 4D781827
On Tue, 22 May 2018 12:38:29 -0600
Alex Williamson <[email protected]> wrote:
> On Tue, 22 May 2018 19:17:07 +0200
> Halil Pasic <[email protected]> wrote:
>
> > From vfio-ccw perspective I join Connie's assessment: vfio-ccw should
> > be fine with these changes. I'm however not too deeply involved with
> > the mdev framework, thus I don't feel comfortable r-b-ing. That results
> > in
> > Acked-by: Halil Pasic <[email protected]>
> > for both patches.
> >
> > While at it I have would like to ask about the semantics and intended
> > use of the mdev interfaces.
> >
> > static int vfio_ccw_sch_probe(struct subchannel *sch)
> > {
> >
> > /* HALIL: 8< Not so interesting stuff happens here. >8 */
>
> This was interesting:
>
> private->state = VFIO_CCW_STATE_NOT_OPER;
>
> > ret = vfio_ccw_mdev_reg(sch);
> > if (ret)
> > goto out_disable;
> > /*
> > * HALIL:
> > * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute
> > * is made available (it calls mdev_register_device()). For instance create will
> > * attempt to decrement private->avail which is initialized below. I fail to
> > * understand how is this well synchronized.
> > */
> > INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo);
> > atomic_set(&private->avail, 1);
> > private->state = VFIO_CCW_STATE_STANDBY;
> >
> > return 0;
> >
> > out_disable:
> > cio_disable_subchannel(sch);
> > out_free:
> > dev_set_drvdata(&sch->dev, NULL);
> > kfree(private);
> > return ret;
> > }
> >
> > Should not initialization of go before mdev_register_device(), and then rolled
> > back if necessary if mdev_register_device() fails?
> >
> > In practice it does not seem very likely that userspace can trigger
> > mdev_device_create() before vfio_ccw_sch_probe() finishes so it should
> > not be a practical problem. But I would like to understand how synchronization
> > is supposed to work.
> >
> > [Added Dong Jia, maybe he is also able to answer my question.]
>
> vfio_ccw_mdev_create() requires that private->state is not
> VFIO_CCW_STATE_NOT_OPER but vfio_ccw_sch_probe() explicitly sets state
> to this value before calling vfio_ccw_mdev_reg(), so a create should
> return -ENODEV if racing with parent registration. Is there something
> else that I'm missing? Thanks,
>
> Alex
No, I think your understanding is correct. We move the state from
NOT_OPER to STANDBY only after we're set up completely, so our create
callback will simply fail early with -ENODEV. This looks fine to me.
On 05/23/2018 10:56 AM, Cornelia Huck wrote:
> On Tue, 22 May 2018 12:38:29 -0600
> Alex Williamson <[email protected]> wrote:
>
>> On Tue, 22 May 2018 19:17:07 +0200
>> Halil Pasic <[email protected]> wrote:
>>
>>> From vfio-ccw perspective I join Connie's assessment: vfio-ccw should
>>> be fine with these changes. I'm however not too deeply involved with
>>> the mdev framework, thus I don't feel comfortable r-b-ing. That results
>>> in
>>> Acked-by: Halil Pasic <[email protected]>
>>> for both patches.
>>>
>>> While at it I have would like to ask about the semantics and intended
>>> use of the mdev interfaces.
>>>
>>> static int vfio_ccw_sch_probe(struct subchannel *sch)
>>> {
>>>
>>> /* HALIL: 8< Not so interesting stuff happens here. >8 */
>>
>> This was interesting:
>>
>> private->state = VFIO_CCW_STATE_NOT_OPER;
>>
>>> ret = vfio_ccw_mdev_reg(sch);
>>> if (ret)
>>> goto out_disable;
>>> /*
>>> * HALIL:
>>> * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute
>>> * is made available (it calls mdev_register_device()). For instance create will
>>> * attempt to decrement private->avail which is initialized below. I fail to
>>> * understand how is this well synchronized.
>>> */
>>> INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo);
>>> atomic_set(&private->avail, 1);
>>> private->state = VFIO_CCW_STATE_STANDBY;
>>>
>>> return 0;
>>>
>>> out_disable:
>>> cio_disable_subchannel(sch);
>>> out_free:
>>> dev_set_drvdata(&sch->dev, NULL);
>>> kfree(private);
>>> return ret;
>>> }
>>>
>>> Should not initialization of go before mdev_register_device(), and then rolled
>>> back if necessary if mdev_register_device() fails?
>>>
>>> In practice it does not seem very likely that userspace can trigger
>>> mdev_device_create() before vfio_ccw_sch_probe() finishes so it should
>>> not be a practical problem. But I would like to understand how synchronization
>>> is supposed to work.
>>>
>>> [Added Dong Jia, maybe he is also able to answer my question.]
>>
>> vfio_ccw_mdev_create() requires that private->state is not
>> VFIO_CCW_STATE_NOT_OPER but vfio_ccw_sch_probe() explicitly sets state
>> to this value before calling vfio_ccw_mdev_reg(), so a create should
>> return -ENODEV if racing with parent registration. Is there something
>> else that I'm missing? Thanks,
>>
Disclaimer: I did not do much kernel work up until now. I still have
much to learn.
I mostly agree with your analysis but I'm not sure if the conclusion should be
'and thus everything is good' or 'and thus indeed we do have a race, a
poorly handled one'.
One thing I'm not sure about is: can atomic_set(&private->avail, 1) and
private->state = VFIO_CCW_STATE_STANDBY be perceived as reordered by
e.g. some other cpu and thus vfio_ccw_mdev_create() or not. I tried to
figure it out based on Documentation/atomic_t.txt but was not very successful.
If these can be reordered we could observe -EPERM instead of -ENODEV, I
think.
Furthermore from your analysis I deduce that the client code (I think mdev
calls it vendor code) may rely on mdev_register_device() containing a
(RELEASE) barrier. We use a mutex in there so the barrier is there. And
the client code may rely on a (ACQUIRE) barrier before the create callback
is called. That should also be true and was true in the past too again because
of mutex usage.
>> Alex
>
> No, I think your understanding is correct. We move the state from
> NOT_OPER to STANDBY only after we're set up completely, so our create
> callback will simply fail early with -ENODEV. This looks fine to me.
>
This -ENODEV looks strange to me. Which device does not exist? The
userspace were supposed to retry on this? It's not even -EAGAIN. Is it
documented somewhere?
If it's unavoidable (which I don't see why) I would prefer -EAGAIN. I
think throwing an -ENODEV at our userspace once in a blue moon (if ever)
because that is the way we 'handle' races in our code instead of avoiding
them is not very friendly.
And I'm not sure -EPERM is not possible (see my statement
about reordering of the writes above).
Regards,
Halil
On Wed, 23 May 2018 14:29:28 +0200
Halil Pasic <[email protected]> wrote:
> On 05/23/2018 10:56 AM, Cornelia Huck wrote:
> > On Tue, 22 May 2018 12:38:29 -0600
> > Alex Williamson <[email protected]> wrote:
> >
> >> On Tue, 22 May 2018 19:17:07 +0200
> >> Halil Pasic <[email protected]> wrote:
> >>
> >>> From vfio-ccw perspective I join Connie's assessment: vfio-ccw should
> >>> be fine with these changes. I'm however not too deeply involved with
> >>> the mdev framework, thus I don't feel comfortable r-b-ing. That results
> >>> in
> >>> Acked-by: Halil Pasic <[email protected]>
> >>> for both patches.
> >>>
> >>> While at it I have would like to ask about the semantics and intended
> >>> use of the mdev interfaces.
> >>>
> >>> static int vfio_ccw_sch_probe(struct subchannel *sch)
> >>> {
> >>>
> >>> /* HALIL: 8< Not so interesting stuff happens here. >8 */
> >>
> >> This was interesting:
> >>
> >> private->state = VFIO_CCW_STATE_NOT_OPER;
> >>
> >>> ret = vfio_ccw_mdev_reg(sch);
> >>> if (ret)
> >>> goto out_disable;
> >>> /*
> >>> * HALIL:
> >>> * This might be racy. Somewhere in vfio_ccw_mdev_reg() the create attribute
> >>> * is made available (it calls mdev_register_device()). For instance create will
> >>> * attempt to decrement private->avail which is initialized below. I fail to
> >>> * understand how is this well synchronized.
> >>> */
> >>> INIT_WORK(&private->io_work, vfio_ccw_sch_io_todo);
> >>> atomic_set(&private->avail, 1);
> >>> private->state = VFIO_CCW_STATE_STANDBY;
> >>>
> >>> return 0;
> >>>
> >>> out_disable:
> >>> cio_disable_subchannel(sch);
> >>> out_free:
> >>> dev_set_drvdata(&sch->dev, NULL);
> >>> kfree(private);
> >>> return ret;
> >>> }
> >>>
> >>> Should not initialization of go before mdev_register_device(), and then rolled
> >>> back if necessary if mdev_register_device() fails?
> >>>
> >>> In practice it does not seem very likely that userspace can trigger
> >>> mdev_device_create() before vfio_ccw_sch_probe() finishes so it should
> >>> not be a practical problem. But I would like to understand how synchronization
> >>> is supposed to work.
> >>>
> >>> [Added Dong Jia, maybe he is also able to answer my question.]
> >>
> >> vfio_ccw_mdev_create() requires that private->state is not
> >> VFIO_CCW_STATE_NOT_OPER but vfio_ccw_sch_probe() explicitly sets state
> >> to this value before calling vfio_ccw_mdev_reg(), so a create should
> >> return -ENODEV if racing with parent registration. Is there something
> >> else that I'm missing? Thanks,
> >>
>
>
> Disclaimer: I did not do much kernel work up until now. I still have
> much to learn.
>
> I mostly agree with your analysis but I'm not sure if the conclusion should be
> 'and thus everything is good' or 'and thus indeed we do have a race, a
> poorly handled one'.
Let me throw in that there is more than one way to handle a race, and
one of them is to return an error if something happens at an
inconvenient time :)
>
> One thing I'm not sure about is: can atomic_set(&private->avail, 1) and
> private->state = VFIO_CCW_STATE_STANDBY be perceived as reordered by
> e.g. some other cpu and thus vfio_ccw_mdev_create() or not. I tried to
> figure it out based on Documentation/atomic_t.txt but was not very successful.
> If these can be reordered we could observe -EPERM instead of -ENODEV, I
> think.
I don't think that matters (see below).
>
> Furthermore from your analysis I deduce that the client code (I think mdev
> calls it vendor code) may rely on mdev_register_device() containing a
> (RELEASE) barrier. We use a mutex in there so the barrier is there. And
> the client code may rely on a (ACQUIRE) barrier before the create callback
> is called. That should also be true and was true in the past too again because
> of mutex usage.
>
>
> >> Alex
> >
> > No, I think your understanding is correct. We move the state from
> > NOT_OPER to STANDBY only after we're set up completely, so our create
> > callback will simply fail early with -ENODEV. This looks fine to me.
> >
>
> This -ENODEV looks strange to me. Which device does not exist? The
> userspace were supposed to retry on this? It's not even -EAGAIN. Is it
> documented somewhere?
-ENODEV looks very reasonable if we consider a device in the NOT_OPER
state.
>
> If it's unavoidable (which I don't see why) I would prefer -EAGAIN. I
> think throwing an -ENODEV at our userspace once in a blue moon (if ever)
> because that is the way we 'handle' races in our code instead of avoiding
> them is not very friendly.
>
> And I'm not sure -EPERM is not possible (see my statement
> about reordering of the writes above).
I don't think the actual return code does matter in this case. User
space must be prepared for an error (and -ENODEV was even possible
before, see the discussion in the v3 thread.)
We're dealing with a hard to trigger corner case that is easily handled
by user space here: let's not overthink this.