Currently it is possible that the final put of a KVM reference comes from
vfio during its device close operation. This occurs while the vfio group
lock is held; however, if the vfio device is still in the kvm device list,
then the following call chain could result in a deadlock:
VFIO holds group->group_lock/group_rwsem
-> kvm_put_kvm
-> kvm_destroy_vm
-> kvm_destroy_devices
-> kvm_vfio_destroy
-> kvm_vfio_file_set_kvm
-> vfio_file_set_kvm
-> try to hold group->group_lock/group_rwsem
The key function is kvm_destroy_devices(), which triggers the destroy cb
of kvm_device_ops. That callback reaches back into vfio and tries to take
group_lock, so the deadlock goes away if this path never calls back into
vfio. KVM already provides a suitable alternative point to free the
kvm-vfio device: the point when the device file descriptor is closed.
This can be achieved by providing the release cb instead of the destroy
cb. Also rename kvm_vfio_destroy() to kvm_vfio_release().
/*
* Destroy is responsible for freeing dev.
*
* Destroy may be called before or after destructors are called
* on emulated I/O regions, depending on whether a reference is
* held by a vcpu or other kvm component that gets destroyed
* after the emulated I/O.
*/
void (*destroy)(struct kvm_device *dev);
/*
* Release is an alternative method to free the device. It is
* called when the device file descriptor is closed. Once
* release is called, the destroy method will not be called
* anymore as the device is removed from the device list of
* the VM. kvm->lock is held.
*/
void (*release)(struct kvm_device *dev);
Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
Reported-by: Alex Williamson <[email protected]>
Suggested-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
---
virt/kvm/vfio.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 495ceabffe88..e94f3ea718e5 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -336,7 +336,7 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
return -ENXIO;
}
-static void kvm_vfio_destroy(struct kvm_device *dev)
+static void kvm_vfio_release(struct kvm_device *dev)
{
struct kvm_vfio *kv = dev->private;
struct kvm_vfio_group *kvg, *tmp;
@@ -363,7 +363,7 @@ static int kvm_vfio_create(struct kvm_device *dev, u32 type);
static struct kvm_device_ops kvm_vfio_ops = {
.name = "kvm-vfio",
.create = kvm_vfio_create,
- .destroy = kvm_vfio_destroy,
+ .release = kvm_vfio_release,
.set_attr = kvm_vfio_set_attr,
.has_attr = kvm_vfio_has_attr,
};
--
2.34.1
> From: Liu, Yi L <[email protected]>
> Sent: Friday, January 20, 2023 11:05 PM
>
> [...]
More background can be found in Matthew's work.
https://lore.kernel.org/kvm/[email protected]/T/#u
Regards,
Yi Liu
On 1/20/23 10:08 AM, Liu, Yi L wrote:
>> From: Liu, Yi L <[email protected]>
>> Sent: Friday, January 20, 2023 11:05 PM
>>
>> [...]
>
> More background can be found in Mathew's work.
> https://lore.kernel.org/kvm/[email protected]/T/#u
>
Thanks Yi.
Reviewed-by: Matthew Rosato <[email protected]>
One small nit: there is a comment at the very end of kvm_vfio_release, on the kfree(dev), that still references .destroy; this should be updated to .release.
On Fri, 20 Jan 2023 10:45:40 -0500
Matthew Rosato <[email protected]> wrote:
> On 1/20/23 10:08 AM, Liu, Yi L wrote:
> >> [...]
>
> Thanks Yi.
>
> Reviewed-by: Matthew Rosato <[email protected]>
>
> One small nit: There is a comment at the very end of
> kvm_vfio_release on the kfree(dev) that still references .destroy,
> this should be updated to .release
I've fixed this locally, s/destroy/release/ in that comment. Thanks,
Alex
On Fri, 20 Jan 2023 07:05:28 -0800
Yi Liu <[email protected]> wrote:
> [...]
Applied to vfio for-linus branch for v6.2, along with Matthew's R-b,
the comment update, and the extra reference link. Once we get a
linux-next build I'll send a pull request, along with Matthew's
reserved region fix. Thanks,
Alex
I encountered a lockdep splat while running some regression tests today
(see below). I suspected it might be this patch, so I reverted it,
rebuilt the kernel, and ran the regression tests again; this time, the
tests ran cleanly. It looks like this patch may not have fixed the
problem for which it was intended. Here is the relevant dmesg output:
[ 579.471402] hades[1099]: Start test run
[ 579.473486] hades[1099]: Start
'tests.test_vfio_ap.VfioAPAssignMdevToGuestTest.runTest' test
[ 579.505804] vfio_ap matrix: MDEV: Registered
[ 579.604024] vfio_ap_mdev 529654a9-bea4-461a-b64d-9d9c63df0deb: Adding
to iommu group 0
[ 585.043898] ======================================================
[ 585.043900] WARNING: possible circular locking dependency detected
[ 585.043902] 6.2.0-rc6-00057-g41c03ba9beea-dirty #18 Not tainted
[ 585.043904] ------------------------------------------------------
[ 585.043905] CPU 0/KVM/1173 is trying to acquire lock:
[ 585.043907] 000000008cfb24b0 (&group->group_lock){+.+.}-{3:3}, at:
vfio_file_set_kvm+0x50/0x68 [vfio]
[ 585.043919]
but task is already holding lock:
[ 585.043920] 00000000b2960ba0 (&kvm->lock){+.+.}-{3:3}, at:
kvm_device_release+0x4a/0xb8 [kvm]
[ 585.043960]
which lock already depends on the new lock.
[ 585.043962]
the existing dependency chain (in reverse order) is:
[ 585.043963]
-> #3 (&kvm->lock){+.+.}-{3:3}:
[ 585.043967] __lock_acquire+0x3e2/0x750
[ 585.043974] lock_acquire.part.0+0xe2/0x250
[ 585.043977] lock_acquire+0xac/0x1d0
[ 585.043980] __mutex_lock+0x9e/0x868
[ 585.043985] mutex_lock_nested+0x32/0x40
[ 585.043988] vfio_ap_mdev_open_device+0x9a/0x198 [vfio_ap]
[ 585.043991] vfio_device_open+0x122/0x168 [vfio]
[ 585.043995] vfio_device_open_file+0x64/0x120 [vfio]
[ 585.043999] vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[ 585.044002] __s390x_sys_ioctl+0xc0/0x100
[ 585.044007] do_syscall+0xee/0x118
[ 585.044032] __do_syscall+0xd2/0x120
[ 585.044035] system_call+0x82/0xb0
[ 585.044037]
-> #2 (&matrix_dev->guests_lock){+.+.}-{3:3}:
[ 585.044041] __lock_acquire+0x3e2/0x750
[ 585.044044] lock_acquire.part.0+0xe2/0x250
[ 585.044047] lock_acquire+0xac/0x1d0
[ 585.044049] __mutex_lock+0x9e/0x868
[ 585.044052] mutex_lock_nested+0x32/0x40
[ 585.044054] vfio_ap_mdev_open_device+0x8c/0x198 [vfio_ap]
[ 585.044057] vfio_device_open+0x122/0x168 [vfio]
[ 585.044060] vfio_device_open_file+0x64/0x120 [vfio]
[ 585.044064] vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[ 585.044068] __s390x_sys_ioctl+0xc0/0x100
[ 585.044070] do_syscall+0xee/0x118
[ 585.044072] __do_syscall+0xd2/0x120
[ 585.044074] system_call+0x82/0xb0
[ 585.044076]
-> #1 (&new_dev_set->lock){+.+.}-{3:3}:
[ 585.044080] __lock_acquire+0x3e2/0x750
[ 585.044082] lock_acquire.part.0+0xe2/0x250
[ 585.044085] lock_acquire+0xac/0x1d0
[ 585.044088] __mutex_lock+0x9e/0x868
[ 585.044090] mutex_lock_nested+0x32/0x40
[ 585.044093] vfio_device_open+0x3e/0x168 [vfio]
[ 585.044096] vfio_device_open_file+0x64/0x120 [vfio]
[ 585.044100] vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[ 585.044104] __s390x_sys_ioctl+0xc0/0x100
[ 585.044106] do_syscall+0xee/0x118
[ 585.044108] __do_syscall+0xd2/0x120
[ 585.044110] system_call+0x82/0xb0
[ 585.044112]
-> #0 (&group->group_lock){+.+.}-{3:3}:
[ 585.044115] check_prev_add+0xd4/0xf10
[ 585.044118] validate_chain+0x698/0x8e8
[ 585.044120] __lock_acquire+0x3e2/0x750
[ 585.044123] lock_acquire.part.0+0xe2/0x250
[ 585.044125] lock_acquire+0xac/0x1d0
[ 585.044128] __mutex_lock+0x9e/0x868
[ 585.044130] mutex_lock_nested+0x32/0x40
[ 585.044133] vfio_file_set_kvm+0x50/0x68 [vfio]
[ 585.044137] kvm_vfio_release+0x5e/0xf8 [kvm]
[ 585.044156] kvm_device_release+0x90/0xb8 [kvm]
[ 585.044175] __fput+0xaa/0x2a0
[ 585.044180] task_work_run+0x76/0xd0
[ 585.044183] do_exit+0x248/0x538
[ 585.044186] do_group_exit+0x40/0xb0
[ 585.044188] get_signal+0x614/0x698
[ 585.044192] arch_do_signal_or_restart+0x58/0x370
[ 585.044195] exit_to_user_mode_loop+0xe8/0x1b8
[ 585.044200] exit_to_user_mode_prepare+0x164/0x190
[ 585.044203] __do_syscall+0xd2/0x120
[ 585.044205] system_call+0x82/0xb0
[ 585.044207]
other info that might help us debug this:
[ 585.044209] Chain exists of:
&group->group_lock --> &matrix_dev->guests_lock -->
&kvm->lock
[ 585.044213] Possible unsafe locking scenario:
[ 585.044214] CPU0 CPU1
[ 585.044216] ---- ----
[ 585.044217] lock(&kvm->lock);
[ 585.044219] lock(&matrix_dev->guests_lock);
[ 585.044221] lock(&kvm->lock);
[ 585.044223] lock(&group->group_lock);
[ 585.044225]
*** DEADLOCK ***
[ 585.044227] 1 lock held by CPU 0/KVM/1173:
[ 585.044228] #0: 00000000b2960ba0 (&kvm->lock){+.+.}-{3:3}, at:
kvm_device_release+0x4a/0xb8 [kvm]
[ 585.044251]
stack backtrace:
[ 585.044253] CPU: 3 PID: 1173 Comm: CPU 0/KVM Not tainted
6.2.0-rc6-00057-g41c03ba9beea-dirty #18
[ 585.044256] Hardware name: IBM 8561 T01 772 (LPAR)
[ 585.044257] Call Trace:
[ 585.044258] [<000000011a818936>] dump_stack_lvl+0x8e/0xc8
[ 585.044261] [<0000000119aca3f2>] check_noncircular+0x132/0x158
[ 585.044264] [<0000000119acba44>] check_prev_add+0xd4/0xf10
[ 585.044267] [<0000000119accf18>] validate_chain+0x698/0x8e8
[ 585.044270] [<0000000119ace70a>] __lock_acquire+0x3e2/0x750
[ 585.044273] [<0000000119acf682>] lock_acquire.part.0+0xe2/0x250
[ 585.044276] [<0000000119acf89c>] lock_acquire+0xac/0x1d0
[ 585.044279] [<000000011a823c66>] __mutex_lock+0x9e/0x868
[ 585.044282] [<000000011a824462>] mutex_lock_nested+0x32/0x40
[ 585.044285] [<000003ff7fbcd6a0>] vfio_file_set_kvm+0x50/0x68 [vfio]
[ 585.044289] [<000003ff7feacab6>] kvm_vfio_release+0x5e/0xf8 [kvm]
[ 585.044308] [<000003ff7fea6d58>] kvm_device_release+0x90/0xb8 [kvm]
[ 585.044328] [<0000000119dbb83a>] __fput+0xaa/0x2a0
[ 585.044331] [<0000000119a67c66>] task_work_run+0x76/0xd0
[ 585.044333] [<0000000119a3ec18>] do_exit+0x248/0x538
[ 585.044335] [<0000000119a3f0c8>] do_group_exit+0x40/0xb0
[ 585.044338] [<0000000119a50dec>] get_signal+0x614/0x698
[ 585.044340] [<00000001199ea030>] arch_do_signal_or_restart+0x58/0x370
[ 585.044343] [<0000000119b0bb50>] exit_to_user_mode_loop+0xe8/0x1b8
[ 585.044346] [<0000000119b0bd84>] exit_to_user_mode_prepare+0x164/0x190
[ 585.044349] [<000000011a818d2a>] __do_syscall+0xd2/0x120
[ 585.044351] [<000000011a82c462>] system_call+0x82/0xb0
[ 585.044354] INFO: lockdep is turned off.
[ 610.595528] vfio_ap_mdev 529654a9-bea4-461a-b64d-9d9c63df0deb:
Removing from iommu group 0
[ 610.604408] vfio_ap matrix: MDEV: Unregistering
[ 610.826074] hades[1099]: Stop
'tests.test_vfio_ap.VfioAPAssignMdevToGuestTest.runTest'
On 1/20/23 10:05 AM, Yi Liu wrote:
> [...]
On Tue, Jan 31, 2023 at 09:27:54AM -0500, Anthony Krowiak wrote:
> I encountered a lockdep splat while running some regression tests today (see
> below). I suspected it might be this patch so I reverted it, rebuilt the
> kernel and ran the regression tests again; this time, the test ran cleanly.
> It looks like this patch may not have fixed the problem for which it was
> intended. Here is the relevant dmesg output:
Well, it fixes the deadlock it intended to fix and created another one :)
It means device drivers cannot obtain the kvm lock from their open
functions in this new model.
Why does ap need to touch kvm->lock? (via get_update_locks_for_kvm)
Maybe you should split that lock and have a dedicated apcb lock?
Jason
On 1/31/23 9:27 AM, Anthony Krowiak wrote:
> I encountered a lockdep splat while running some regression tests today (see below). I suspected it might be this patch so I reverted it, rebuilt the kernel and ran the regression tests again; this time, the test ran cleanly. It looks like this patch may not have fixed the problem for which it was intended. Here is the relevant dmesg output:
>
Damn it, the config I used to test must not have had lockdep enabled.
It looks like the issue is that .destroy did not previously acquire the kvm lock before calling into vfio_file_set_kvm, whereas .release now does. This means .release expects a hierarchy of kvm->lock ... vfio->group_lock, but your .open_device does vfio->group_lock ... kvm->lock.
I can reproduce something similar for vfio_pci_zdev.
> [...]
>
> On 1/20/23 10:05 AM, Yi Liu wrote:
>> [...]
On 1/31/23 9:34 AM, Jason Gunthorpe wrote:
> On Tue, Jan 31, 2023 at 09:27:54AM -0500, Anthony Krowiak wrote:
>> I encountered a lockdep splat while running some regression tests today (see
>> below). I suspected it might be this patch so I reverted it, rebuilt the
>> kernel and ran the regression tests again; this time, the test ran cleanly.
>> It looks like this patch may not have fixed the problem for which it was
>> intended. Here is the relevant dmesg output:
> Well, it fixes the deadlock it intended to fix and created another one
> :)
>
> It means device drivers cannot obtain the kvm lock from their open
> functions in this new model.
>
> Why does ap need to touch kvm->lock? (via get_update_locks_for_kvm)
We need the kvm->lock because we take the vCPUs out of SIE in order to
dynamically change values in the APCB.
>
> Maybe you should split that lock and have a dedicated apcb lock?
I don't think that would suffice for taking the vCPUs out of SIE.
>
> Jason
On Tue, Jan 31, 2023 at 09:46:18AM -0500, Anthony Krowiak wrote:
> > Maybe you should split that lock and have a dedicated apcb lock?
>
> I don't think that would suffice for taking the vCPUs out of SIE.
Then I think we have to keep this patch and also do Matthew's patch to
keep kvm refs inside vfio as well.
Jason
On 1/31/23 9:48 AM, Jason Gunthorpe wrote:
> On Tue, Jan 31, 2023 at 09:46:18AM -0500, Anthony Krowiak wrote:
>
>>> Maybe you should split that lock and have a dedicated apcb lock?
>>
>> I don't think that would suffice for taking the vCPUs out of SIE.
>
> Then I think we have to keep this patch and also do Matthew's patch to
> keep kvm refs inside vfio as well.
>
I don't think keeping kvm refs inside vfio solves this issue though -- Even if we handle the kvm_put_kvm asynchronously within vfio as previously proposed, kvm_vfio_release will eventually get called, and it gets called with the kvm->lock already held. It then proceeds to call vfio_file_set_kvm, which takes the group->lock. That order conflicts with the hierarchy used by the driver during open_device of vfio group->lock ... kvm->lock.
On Tue, Jan 31, 2023 at 10:00:14AM -0500, Matthew Rosato wrote:
> On 1/31/23 9:48 AM, Jason Gunthorpe wrote:
> > On Tue, Jan 31, 2023 at 09:46:18AM -0500, Anthony Krowiak wrote:
> >
> >>> Maybe you should split that lock and have a dedicated apcb lock?
> >>
> >> I don't think that would suffice for taking the vCPUs out of SIE.
> >
> > Then I think we have to keep this patch and also do Matthew's patch to
> > keep kvm refs inside vfio as well.
> >
>
> I don't think keeping kvm refs inside vfio solves this issue though
> -- Even if we handle the kvm_put_kvm asynchronously within vfio as
> previously proposed, kvm_vfio_release will eventually get called and
> it gets called with the kvm->lock already held, then proceeds to
> call vfio_file_set_kvm which gets the group->lock. That order
> conflicts with the hierarchy used by the driver during open_device
> of vfio->group_lock ... kvm->lock.
The group lock is held by vfio_file_set_kvm() only because we don't
have a refcount and we have to hold it across the open call to keep
the pointer alive.
With proper refcounting you'd split this into a spinlock and hold it
only while taking a reference for the open thread.
Jason