Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752331AbdIRIba (ORCPT ); Mon, 18 Sep 2017 04:31:30 -0400 Received: from mx1.redhat.com ([209.132.183.28]:51274 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752181AbdIRIb0 (ORCPT ); Mon, 18 Sep 2017 04:31:26 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com A50BE7E427 Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx03.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=vkuznets@redhat.com From: Vitaly Kuznetsov To: Stephen Hemminger Cc: KY Srinivasan , leann.ogasawara@canonical.com, Stephen Hemminger , "apw\@canonical.com" , "olaf\@aepfle.de" , "marcelo.cerri\@canonical.com" , "gregkh\@linuxfoundation.org" , Haiyang Zhang , "linux-kernel\@vger.kernel.org" , "jasowang\@redhat.com" , "devel\@linuxdriverproject.org" Subject: Re: [PATCH 1/1] Drivers: hv: vmbus: Fix rescind handling issues References: <1502471039-5281-1-git-send-email-kys@exchange.microsoft.com> <20170824154102.62a02190@xeon-e3> <87ingkulhp.fsf@vitty.brq.redhat.com> <87efr8ul71.fsf@vitty.brq.redhat.com> Date: Mon, 18 Sep 2017 10:31:21 +0200 In-Reply-To: (Stephen Hemminger's message of "Fri, 15 Sep 2017 14:54:30 -0700") Message-ID: <87vakgtnl2.fsf@vitty.brq.redhat.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Mon, 18 Sep 2017 08:31:26 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7904 Lines: 181 Stephen Hemminger writes: > On Sep 15, 2017 11:01 AM, "KY Srinivasan" wrote: > > > Vitaly Kuznetsov writes: > > > > > > > > I'm seeing the same issue, reverting the offending > > > > > > commit 6f3d791f300618caf82a2be0c27456edd76d5164 > > > Author: K. Y. Srinivasan > > > Date: Fri Aug 11 10:03:59 2017 -0700 > > > > > > Drivers: hv: vmbus: Fix rescind handling issues > > > > > > which made it upstream helps. Did you guys do some investigation here? > > > In case not I can take a look next week. > > > > Sorry, I have to resend - K.Y.'s kys@exchange.microsoft.com doesn't > > accept mail from me :-( > > I think this turned out to be a bug in netvsc - Stephen can elaborate. I think the fix > has been submitted upstream as well. > > K. Y > > > > -- > > Vitaly > > It turned out that subchannel call back is run from primary channel call back, and was deadlocking waiting for vmbus open. > > Original code had broken wait for sub channels. > The first for that induced this bug. > > The resolution is in latest network driver was to bring up sub channels from work queue This is a bit confusing, in case you mean commit 8195b1396ec86dddbba443c74b2188b423556c74 Author: Stephen Hemminger Date: Wed Sep 6 13:53:05 2017 -0700 hv_netvsc: fix deadlock on hotplug is supposed to fix the issue. I'm testing the latest net-next which has it: $ git log --oneline drivers/net/hyperv/ 5023a6db7319 netvsc: increase default receive buffer size 8f2bb1de7334 hv_netvsc: avoid unnecessary wakeups on subchannel creation 8195b1396ec8 hv_netvsc: fix deadlock on hotplug db3cd7af9d0f hv_netvsc: Fix the channel limit in netvsc_set_rxfh() 06be580ac7b6 hv_netvsc: Simplify the limit check in netvsc_set_channels() ... and when I do # ip link set dev eth0 mtu 1400 # ip link set dev eth0 mtu 1500 (actually, you can do mtu change just once the deadlock has already happened, doing it twice just reveals the issue faster - it will hang permanently trying to get rtnl lock which is already taken) The log is: [ 248.800089] INFO: task kworker/u480:0:5 blocked for more than 120 seconds. [ 248.804065] Not tainted 4.14.0-rc1 #63 [ 248.806486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 248.810879] kworker/u480:0 D 0 5 2 0x80000000 [ 248.814225] Workqueue: netns cleanup_net [ 248.816745] Call Trace: [ 248.818286] __schedule+0x1d5/0x790 [ 248.820398] ? sched_clock+0x9/0x10 [ 248.822297] ? select_idle_sibling+0x24/0x420 [ 248.824883] schedule+0x31/0x80 [ 248.826835] schedule_preempt_disabled+0x9/0x10 [ 248.829444] __mutex_lock.isra.1+0x1a3/0x4f0 [ 248.831973] ? sched_clock_cpu+0x5d/0xb0 [ 248.834300] __mutex_lock_slowpath+0xe/0x10 [ 248.836869] ? __mutex_lock_slowpath+0xe/0x10 [ 248.839488] mutex_lock+0x2a/0x30 [ 248.841378] rtnl_lock+0x10/0x20 [ 248.843431] cleanup_net+0x94/0x2e0 [ 248.845479] process_one_work+0x193/0x390 [ 248.847838] worker_thread+0x48/0x3c0 [ 248.850121] kthread+0x120/0x140 [ 248.851927] ? process_one_work+0x390/0x390 [ 248.854425] ? kthread_create_on_node+0x60/0x60 [ 248.856884] ? kthread_create_on_node+0x60/0x60 [ 248.859585] ret_from_fork+0x25/0x30 [ 248.861581] INFO: task kworker/9:2:374 blocked for more than 120 seconds. [ 248.865286] Not tainted 4.14.0-rc1 #63 [ 248.867655] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 248.872135] kworker/9:2 D 0 374 2 0x80000000 [ 248.875175] Workqueue: events rndis_set_subchannel [hv_netvsc] [ 248.878473] Call Trace: [ 248.879825] __schedule+0x1d5/0x790 [ 248.882122] schedule+0x31/0x80 [ 248.884160] rndis_set_subchannel+0x186/0x1e0 [hv_netvsc] [ 248.887741] ? finish_wait+0x80/0x80 [ 248.889960] process_one_work+0x193/0x390 [ 248.892349] worker_thread+0x48/0x3c0 [ 248.894484] kthread+0x120/0x140 [ 248.896364] ? process_one_work+0x390/0x390 [ 248.898756] ? kthread_create_on_node+0x60/0x60 [ 248.901229] ret_from_fork+0x25/0x30 [ 248.903180] INFO: task hypervkvpd:656 blocked for more than 120 seconds. [ 248.907120] Not tainted 4.14.0-rc1 #63 [ 248.909966] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 248.914387] hypervkvpd D 0 656 1 0x00000000 [ 248.917618] Call Trace: [ 248.919119] __schedule+0x1d5/0x790 [ 248.921343] schedule+0x31/0x80 [ 248.924454] schedule_preempt_disabled+0x9/0x10 [ 248.927792] __mutex_lock.isra.1+0x1a3/0x4f0 [ 248.930951] __mutex_lock_slowpath+0xe/0x10 [ 248.934078] ? __mutex_lock_slowpath+0xe/0x10 [ 248.937466] mutex_lock+0x2a/0x30 [ 248.940228] __netlink_dump_start+0x44/0x190 [ 248.943286] rtnetlink_rcv_msg+0x1a2/0x280 [ 248.946237] ? vsnprintf+0xea/0x4d0 [ 248.948888] ? rtnl_getlink+0x1c0/0x1c0 [ 248.951713] ? rtnl_getlink+0x1c0/0x1c0 [ 248.954644] ? rtnl_calcit.isra.26+0x110/0x110 [ 248.958010] netlink_rcv_skb+0x89/0x130 [ 248.960945] rtnetlink_rcv+0x10/0x20 [ 248.963566] netlink_unicast+0x186/0x220 [ 248.966442] netlink_sendmsg+0x2a8/0x3a0 [ 248.969369] sock_sendmsg+0x33/0x40 [ 248.971955] SYSC_sendto+0x13f/0x180 [ 248.974552] ? SYSC_getsockname+0xc7/0xe0 [ 248.977393] ? fd_install+0x20/0x30 [ 248.980044] ? sock_map_fd+0x3f/0x60 [ 248.982995] SyS_sendto+0x9/0x10 [ 248.986106] entry_SYSCALL_64_fastpath+0x1a/0xa5 [ 248.989785] RIP: 0033:0x7f544421f0d3 [ 248.992576] RSP: 002b:00007ffcb8a81dc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 248.997458] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f544421f0d3 [ 249.002140] RDX: 0000000000000014 RSI: 00007ffcb8a82e20 RDI: 000000000000000a [ 249.006863] RBP: 00007ffcb8a84e50 R08: 00007ffcb8a82e00 R09: 000000000000000c [ 249.011590] R10: 0000000000000000 R11: 0000000000000246 R12: 00005598e3c4d963 [ 249.016558] R13: 00005598e3bb1110 R14: 00005598e3c4d8c0 R15: 0000000000000011 [ 249.021481] INFO: task ip:1004 blocked for more than 120 seconds. [ 249.025703] Not tainted 4.14.0-rc1 #63 [ 249.028713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 249.039417] ip D 0 1004 938 0x00000000 [ 249.043418] Call Trace: [ 249.045633] __schedule+0x1d5/0x790 [ 249.048335] schedule+0x31/0x80 [ 249.050795] schedule_preempt_disabled+0x9/0x10 [ 249.054109] __mutex_lock.isra.1+0x1a3/0x4f0 [ 249.057251] ? security_capable+0x43/0x60 [ 249.060188] __mutex_lock_slowpath+0xe/0x10 [ 249.063299] ? __netlink_ns_capable+0x36/0x40 [ 249.066416] ? __mutex_lock_slowpath+0xe/0x10 [ 249.069520] mutex_lock+0x2a/0x30 [ 249.072032] rtnetlink_rcv_msg+0x1cb/0x280 [ 249.075189] ? rtnl_calcit.isra.26+0x110/0x110 [ 249.079047] netlink_rcv_skb+0x89/0x130 [ 249.082209] rtnetlink_rcv+0x10/0x20 [ 249.085369] netlink_unicast+0x186/0x220 [ 249.088767] netlink_sendmsg+0x2a8/0x3a0 [ 249.091777] sock_sendmsg+0x33/0x40 [ 249.094353] SYSC_sendto+0x13f/0x180 [ 249.096938] ? handle_mm_fault+0x65/0xf0 [ 249.099824] ? __do_page_fault+0x250/0x4a0 [ 249.102817] SyS_sendto+0x9/0x10 [ 249.105240] entry_SYSCALL_64_fastpath+0x1a/0xa5 [ 249.108489] RIP: 0033:0x7f972cc28f7d [ 249.111176] RSP: 002b:00007ffed31097e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 249.115958] RAX: ffffffffffffffda RBX: 000000000066fc90 RCX: 00007f972cc28f7d [ 249.120548] RDX: 0000000000000020 RSI: 00007ffed3109800 RDI: 0000000000000003 [ 249.125439] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000 [ 249.130397] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006 [ 249.135432] R13: 00007ffed3109ed0 R14: 00007ffed3109ed8 R15: 00007ffed310ae1f Reverting 6f3d791f300618caf82a2be0c27456edd76d5164 still helps. Or do you mean some other fix which is not yet in net-next? -- Vitaly