Date: Tue, 6 Apr 2010 10:48:24 -0400
From: Andy Gospodarek
To: Cong Wang
Cc: linux-kernel@vger.kernel.org, Matt Mackall, netdev@vger.kernel.org, bridge@lists.linux-foundation.org, Andy Gospodarek, Neil Horman, Jeff Moyer, Stephen Hemminger, bonding-devel@lists.sourceforge.net, Jay Vosburgh, David Miller
Subject: Re: [v2 Patch 3/3] bonding: make bonding support netpoll
Message-ID: <20100406144824.GB10488@gospo.rdu.redhat.com>
In-Reply-To: <4BBABAB8.4010401@redhat.com>

On Tue, Apr 06, 2010 at 12:38:16PM +0800, Cong Wang wrote:
> Cong Wang wrote:
>> Before I try to reproduce it, could you please try to replace the
>> 'read_lock()' in slaves_support_netpoll() with 'read_lock_bh()'?
>> (read_unlock() too) Try if this helps.
>
> Confirmed. Please use the attached patch instead, for your testing.
>
> Thanks!

Moving those locks to bh-locks will not resolve this. I tried that yesterday and tried your new patch today, without success. That warning is a WARN_ON_ONCE, so you need to reboot to see that it is still a problem; simply unloading and reloading the new module is not an accurate test.
Also, my system still hangs when removing the bonding module. I do not think you intended to fix this with the patch, but I wanted that to be clear to everyone on the list.

You should also configure your kernel with some of the lock debugging enabled. I've been using the following:

CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y

Here is the output when I remove a slave from the bond. My xmit_roundrobin patch from earlier (replacing read_lock with read_trylock) was applied. It might be helpful for you when debugging these issues.

------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable+0x43/0xba()
Hardware name: HP xw4400 Workstation
Modules linked in: netconsole bonding ipt_REJECT bridge stp autofs4 i2c_dev i2c_core hidp rfcomm l2cap crc16 bluetooth rfki]
Pid: 10, comm: events/1 Not tainted 2.6.34-rc3 #6
Call Trace:
 [] ? cpu_clock+0x2d/0x41
 [] ? local_bh_enable+0x43/0xba
 [] warn_slowpath_common+0x77/0x8f
 [] ? dev_queue_xmit+0x408/0x467
 [] warn_slowpath_null+0xf/0x11
 [] local_bh_enable+0x43/0xba
 [] dev_queue_xmit+0x408/0x467
 [] ? dev_queue_xmit+0x10d/0x467
 [] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
 [] bond_start_xmit+0x139/0x3e9 [bonding]
 [] queue_process+0xa8/0x160
 [] ? queue_process+0x0/0x160
 [] worker_thread+0x1af/0x2ae
 [] ? worker_thread+0x156/0x2ae
 [] ? autoremove_wake_function+0x0/0x38
 [] ? worker_thread+0x0/0x2ae
 [] kthread+0x7d/0x85
 [] kernel_thread_helper+0x4/0x10
 [] ? restore_args+0x0/0x30
 [] ? kthread+0x0/0x85
 [] ?
kernel_thread_helper+0x0/0x10
---[ end trace 241f49bf65e0f4f0 ]---

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.34-rc3 #6
---------------------------------------------------------
events/1/10 just changed the state of lock:
 (&bonding_netdev_xmit_lock_key){+.+...}, at: [] queue_process+0x83/0x160
but this lock was taken by another, SOFTIRQ-safe lock in the past:
 (&(&dev->tx_global_lock)->rlock){+.-...}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
4 locks held by events/1/10:
 #0:  (events){+.+.+.}, at: [] worker_thread+0x156/0x2ae
 #1:  ((&(&npinfo->tx_work)->work)){+.+...}, at: [] worker_thread+0x156/0x2ae
 #2:  (&bonding_netdev_xmit_lock_key){+.+...}, at: [] queue_process+0x83/0x160
 #3:  (&bond->lock){++.+..}, at: [] bond_start_xmit+0x52/0x3e9 [bonding]

the shortest dependencies between 2nd lock and 1st lock:
 -> (&(&dev->tx_global_lock)->rlock){+.-...} ops: 129 {
    HARDIRQ-ON-W at:
      [] __lock_acquire+0x643/0x813
      [] lock_acquire+0xc8/0xed
      [] _raw_spin_lock+0x31/0x66
      [] dev_deactivate+0x6f/0x195
      [] linkwatch_do_dev+0x9a/0xae
      [] __linkwatch_run_queue+0x106/0x14a
      [] linkwatch_event+0x2a/0x31
      [] worker_thread+0x1af/0x2ae
      [] kthread+0x7d/0x85
      [] kernel_thread_helper+0x4/0x10
    IN-SOFTIRQ-W at:
      [] __lock_acquire+0x5f7/0x813
      [] lock_acquire+0xc8/0xed
      [] _raw_spin_lock+0x31/0x66
      [] dev_watchdog+0x23/0x1f2
      [] run_timer_softirq+0x1d1/0x285
      [] __do_softirq+0xdb/0x1ab
      [] call_softirq+0x1c/0x34
      [] do_softirq+0x38/0x83
      [] irq_exit+0x45/0x47
      [] smp_apic_timer_interrupt+0x88/0x98
      [] apic_timer_interrupt+0x13/0x20
      [] cpu_idle+0x4d/0x6b
      [] rest_init+0xbe/0xc2
      [] start_kernel+0x38c/0x399
      [] x86_64_start_reservations+0xb5/0xb9
      [] x86_64_start_kernel+0xe6/0xed
    INITIAL USE at:
      [] __lock_acquire+0x6b0/0x813
      [] lock_acquire+0xc8/0xed
      [] _raw_spin_lock+0x31/0x66
      [] dev_deactivate+0x6f/0x195
      [] linkwatch_do_dev+0x9a/0xae
      [] __linkwatch_run_queue+0x106/0x14a
      []
      linkwatch_event+0x2a/0x31
      [] worker_thread+0x1af/0x2ae
      [] kthread+0x7d/0x85
      [] kernel_thread_helper+0x4/0x10
  }
  ... key at: [] __key.51521+0x0/0x8
  ... acquired at:
    [] validate_chain+0xb87/0xd3a
    [] __lock_acquire+0x7ad/0x813
    [] lock_acquire+0xc8/0xed
    [] _raw_spin_lock+0x31/0x66
    [] dev_deactivate+0x96/0x195
    [] __dev_close+0x69/0x86
    [] __dev_change_flags+0xa8/0x12b
    [] dev_change_flags+0x1c/0x51
    [] devinet_ioctl+0x26e/0x5d0
    [] inet_ioctl+0x8a/0xa2
    [] sock_do_ioctl+0x26/0x45
    [] sock_ioctl+0x213/0x226
    [] vfs_ioctl+0x2a/0x9d
    [] do_vfs_ioctl+0x491/0x4e2
    [] sys_ioctl+0x57/0x7a
    [] system_call_fastpath+0x16/0x1b

 -> (&bonding_netdev_xmit_lock_key){+.+...} ops: 2 {
    HARDIRQ-ON-W at:
      [] __lock_acquire+0x643/0x813
      [] lock_acquire+0xc8/0xed
      [] _raw_spin_lock+0x31/0x66
      [] dev_deactivate+0x96/0x195
      [] __dev_close+0x69/0x86
      [] __dev_change_flags+0xa8/0x12b
      [] dev_change_flags+0x1c/0x51
      [] devinet_ioctl+0x26e/0x5d0
      [] inet_ioctl+0x8a/0xa2
      [] sock_do_ioctl+0x26/0x45
      [] sock_ioctl+0x213/0x226
      [] vfs_ioctl+0x2a/0x9d
      [] do_vfs_ioctl+0x491/0x4e2
      [] sys_ioctl+0x57/0x7a
      [] system_call_fastpath+0x16/0x1b
    SOFTIRQ-ON-W at:
      [] mark_held_locks+0x49/0x69
      [] trace_hardirqs_on_caller+0x113/0x13e
      [] trace_hardirqs_on+0xd/0xf
      [] local_bh_enable+0xb2/0xba
      [] dev_queue_xmit+0x408/0x467
      [] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
      [] bond_start_xmit+0x139/0x3e9 [bonding]
      [] queue_process+0xa8/0x160
      [] worker_thread+0x1af/0x2ae
      [] kthread+0x7d/0x85
      [] kernel_thread_helper+0x4/0x10
    INITIAL USE at:
      [] __lock_acquire+0x6b0/0x813
      [] lock_acquire+0xc8/0xed
      [] _raw_spin_lock+0x31/0x66
      [] dev_deactivate+0x96/0x195
      [] __dev_close+0x69/0x86
      [] __dev_change_flags+0xa8/0x12b
      [] dev_change_flags+0x1c/0x51
      [] devinet_ioctl+0x26e/0x5d0
      [] inet_ioctl+0x8a/0xa2
      [] sock_do_ioctl+0x26/0x45
      [] sock_ioctl+0x213/0x226
      [] vfs_ioctl+0x2a/0x9d
      [] do_vfs_ioctl+0x491/0x4e2
      [] sys_ioctl+0x57/0x7a
      [] system_call_fastpath+0x16/0x1b
  }
  ... key at: [] bonding_netdev_xmit_lock_key+0x0/0xffffffffffffa78c [bonding]
  ...
  acquired at:
    [] check_usage_backwards+0xb8/0xc7
    [] mark_lock+0x311/0x54d
    [] mark_held_locks+0x49/0x69
    [] trace_hardirqs_on_caller+0x113/0x13e
    [] trace_hardirqs_on+0xd/0xf
    [] local_bh_enable+0xb2/0xba
    [] dev_queue_xmit+0x408/0x467
    [] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
    [] bond_start_xmit+0x139/0x3e9 [bonding]
    [] queue_process+0xa8/0x160
    [] worker_thread+0x1af/0x2ae
    [] kthread+0x7d/0x85
    [] kernel_thread_helper+0x4/0x10

stack backtrace:
Pid: 10, comm: events/1 Tainted: G        W  2.6.34-rc3 #6
Call Trace:
 [] print_irq_inversion_bug+0x121/0x130
 [] check_usage_backwards+0xb8/0xc7
 [] ? check_usage_backwards+0x0/0xc7
 [] mark_lock+0x311/0x54d
 [] mark_held_locks+0x49/0x69
 [] ? local_bh_enable+0xb2/0xba
 [] trace_hardirqs_on_caller+0x113/0x13e
 [] ? dev_queue_xmit+0x408/0x467
 [] trace_hardirqs_on+0xd/0xf
 [] local_bh_enable+0xb2/0xba
 [] dev_queue_xmit+0x408/0x467
 [] ? dev_queue_xmit+0x10d/0x467
 [] bond_dev_queue_xmit+0x1cd/0x1f9 [bonding]
 [] bond_start_xmit+0x139/0x3e9 [bonding]
 [] queue_process+0xa8/0x160
 [] ? queue_process+0x0/0x160
 [] worker_thread+0x1af/0x2ae
 [] ? worker_thread+0x156/0x2ae
 [] ? autoremove_wake_function+0x0/0x38
 [] ? worker_thread+0x0/0x2ae
 [] kthread+0x7d/0x85
 [] kernel_thread_helper+0x4/0x10
 [] ? restore_args+0x0/0x30
 [] ? kthread+0x0/0x85
 [] ? kernel_thread_helper+0x0/0x10

Dead loop on virtual device bond0, fix it urgently!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
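For anyone skimming the lockdep report above, my reading of the inversion, reduced to a C-style sketch (this is my summary of the two paths lockdep saw, not lockdep output; lock names are as reported):

```c
/*
 * Path 1: dev_deactivate() (tx_global_lock is SOFTIRQ-safe):
 *     spin_lock(&dev->tx_global_lock);
 *     spin_lock(&txq->_xmit_lock);     // bonding_netdev_xmit_lock_key
 *
 * Path 2: netpoll (queue_process -> bond_start_xmit):
 *     spin_lock(&txq->_xmit_lock);     // taken with softirqs disabled...
 *     ...
 *     local_bh_enable();               // ...but dev_queue_xmit() re-enables
 *                                      // softirqs while the xmit lock is
 *                                      // still held (the SOFTIRQ-ON-W event)
 *
 * Once softirqs can run with the xmit lock held, dev_watchdog() may take
 * tx_global_lock from timer context on the same CPU, producing the inverse
 * ordering of path 1 and a potential deadlock.
 */
```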