On 8/20/19, 8:48 PM, "Nadav Amit" <[email protected]> wrote:
> Francois reported that VMware balloon gets stuck after a balloon reset,
> when the VMCI doorbell is removed. A similar error can occur when the
> balloon driver is removed with the following splat:
>
> [ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds.
> [ 1088.622035] Tainted: G W 5.2.0 #4
> [ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1088.622205] modprobe D 0 3565 1450 0x00000000
> [ 1088.622210] Call Trace:
> [ 1088.622246] __schedule+0x2a8/0x690
> [ 1088.622248] schedule+0x2d/0x90
> [ 1088.622250] schedule_timeout+0x1d3/0x2f0
> [ 1088.622252] wait_for_completion+0xba/0x140
> [ 1088.622320] ? wake_up_q+0x80/0x80
> [ 1088.622370] vmci_resource_remove+0xb9/0xc0 [vmw_vmci]
> [ 1088.622373] vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci]
> [ 1088.622379] vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon]
> [ 1088.622381] vmballoon_exit+0x18/0xcc8 [vmw_balloon]
> [ 1088.622394] __x64_sys_delete_module+0x146/0x280
> [ 1088.622408] do_syscall_64+0x5a/0x130
> [ 1088.622410] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1088.622415] RIP: 0033:0x7f54f62791b7
> [ 1088.622421] Code: Bad RIP value.
> [ 1088.622421] RSP: 002b:00007fff2a949008 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [ 1088.622426] RAX: ffffffffffffffda RBX: 000055dff8b55d00 RCX: 00007f54f62791b7
> [ 1088.622426] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055dff8b55d68
> [ 1088.622427] RBP: 000055dff8b55d00 R08: 00007fff2a947fb1 R09: 0000000000000000
> [ 1088.622427] R10: 00007f54f62f5cc0 R11: 0000000000000206 R12: 000055dff8b55d68
> [ 1088.622428] R13: 0000000000000001 R14: 000055dff8b55d68 R15: 00007fff2a94a3f0
>
> The cause of the bug is that when the "delayed" doorbell is invoked, it
> takes a reference on the doorbell entry and schedules work that is
> supposed to run the appropriate code and drop the doorbell entry
> reference. The code ignores the fact that if the work is already queued,
> it will not be queued a second time, so it runs only once. As a result,
> one of the references is never dropped. When the code waits for the
> reference count to reach zero, during balloon reset or module removal,
> it gets stuck.
>
> Fix it. Drop the reference if schedule_work() indicates that the work is
> already queued.
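
For readers following along without the diff, the leak and the fix described
above can be sketched in userspace roughly as follows. This is only an
illustration, not the code in vmci_doorbell.c: every fake_* name, both ring_*
functions and leftover_refs() are hypothetical stand-ins, and the only
behaviour borrowed from the kernel is that schedule_work() returns false when
the work item is already queued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are not the kernel's real types. */
struct fake_work {
	bool pending;		/* true while the work item sits in the queue */
};

struct fake_doorbell {
	int refcount;		/* removal would wait for this to reach zero */
	struct fake_work work;
};

/* Mimics schedule_work(): returns false if the work was already queued. */
static bool fake_schedule_work(struct fake_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Runs the queued work once; the handler drops the single reference it owns. */
static void fake_run_work(struct fake_doorbell *db)
{
	if (!db->work.pending)
		return;
	db->work.pending = false;
	assert(db->refcount > 0);
	db->refcount--;
}

/* Buggy pattern: takes a reference on every ring and ignores the return value. */
static void ring_buggy(struct fake_doorbell *db)
{
	db->refcount++;
	fake_schedule_work(&db->work);
}

/* Fixed pattern: gives the reference back when the work was already queued. */
static void ring_fixed(struct fake_doorbell *db)
{
	db->refcount++;
	if (!fake_schedule_work(&db->work))
		db->refcount--;
}

/* Rings twice before the work gets to run, then reports the leftover count. */
static int leftover_refs(void (*ring)(struct fake_doorbell *))
{
	struct fake_doorbell db = { .refcount = 0, .work = { .pending = false } };

	ring(&db);
	ring(&db);
	fake_run_work(&db);
	return db.refcount;
}
```

Calling leftover_refs(ring_buggy) leaves one reference behind, which in this
model corresponds to the wait in vmci_resource_remove() never completing, while
leftover_refs(ring_fixed) gets back to zero.
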
>
> Note that this bug became more apparent (or apparent at all) due to
> commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").
>
> Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.")
> Reported-by: Francois Rigault <[email protected]>
> Cc: Jorgen Hansen <[email protected]>
> Cc: Adit Ranadive <[email protected]>
> Cc: Alexios Zavras <[email protected]>
> Cc: Vishnu DASA <[email protected]>
> Cc: [email protected]
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> drivers/misc/vmw_vmci/vmci_doorbell.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
Thanks for the fix, looks good to me.
Reviewed-by: Vishnu Dasa <[email protected]>
--
vishnu