Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752353Ab3CAVmO (ORCPT ); Fri, 1 Mar 2013 16:42:14 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:28726 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751974Ab3CAVmM (ORCPT ); Fri, 1 Mar 2013 16:42:12 -0500 Message-ID: <513120AF.5030501@oracle.com> Date: Fri, 01 Mar 2013 13:42:07 -0800 From: Srinivas Eeda User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: richard -rw- weinberger CC: ocfs2-devel@oss.oracle.com, linux-fsdevel , ocfs2-users@oss.oracle.com, LKML Subject: Re: [Ocfs2-users] [OCFS2] Crash at o2net_shutdown_sc() References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: acsinet21.oracle.com [141.146.126.237] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12990 Lines: 242 Yes that was the crash I was referring to which stopped me from testing my other patch on mainline. I think the crashes started since some workqueue patches introduced by commit 57b30ae77bf00d2318df711ef9a4d2a9be0a3a2a Earlier kernels should be fine. Patch https://lkml.org/lkml/2012/10/18/592 tried to address one fix which helped ramster that uses same ocfs2/o2net code. There still seems to be another problem that crashes ocfs2. On 03/01/2013 01:25 PM, richard -rw- weinberger wrote: > Hi! > > Using 3.8.1 OCFS2 crashes while joining nodes to the cluster. > The cluster consists of 10 nodes, while node3 joins the kernel on node3 crashes. > (Somtimes later...) > See dmesg below. > Is this a known issue? I didn't test older kernels so far. > > node1: > [ 1471.881922] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997 > ( 0 ) 1 nodes > [ 1471.919522] JBD2: Ignoring recovery information on journal > [ 1471.947027] ocfs2: Mounting device (253,16) on (node 0, slot 0) > with ordered data mode. > [ 1475.802497] o2net: Accepted connection from node node2 (num 1) at > 192.168.66.2:7777 > [ 1481.814048] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 8 > [ 1481.814955] o2net: No longer connected to node node2 (num 1) at > 192.168.66.2:7777 > [ 1482.468827] o2net: Accepted connection from node node3 (num 2) at > 192.168.66.3:7777 > [ 1511.904100] o2net: No connection established with node 1 after 30.0 > seconds, giving up. > [ 1514.472995] o2net: Connection to node node3 (num 2) at > 192.168.66.3:7777 shutdown, state 8 > [ 1514.473960] o2net: No longer connected to node node3 (num 2) at > 192.168.66.3:7777 > [ 1516.076044] o2net: Accepted connection from node node2 (num 1) at > 192.168.66.2:7777 > [ 1520.181430] o2dlm: Node 1 joins domain > 1352E2692E704EEB8040E5B8FF560997 ( 0 1 ) 2 nodes > [ 1544.544030] o2net: No connection established with node 2 after 30.0 > seconds, giving up. > [ 1574.624029] o2net: No connection established with node 2 after 30.0 > seconds, giving up. > > node2: > [ 1475.613170] o2net: Connected to node node1 (num 0) at 192.168.66.1:7777 > [ 1481.620253] o2hb: Unable to stabilize heartbeart on region > 1352E2692E704EEB8040E5B8FF560997 (vdb) > [ 1481.622489] o2net: No longer connected to node node1 (num 0) at > 192.168.66.1:7777 > [ 1515.886605] o2net: Connected to node node1 (num 0) at 192.168.66.1:7777 > [ 1519.992766] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997 > ( 0 1 ) 2 nodes > [ 1520.017054] JBD2: Ignoring recovery information on journal > [ 1520.078888] ocfs2: Mounting device (253,16) on (node 1, slot 1) > with ordered data mode. > [ 1520.159590] mount.ocfs2 (2186) used greatest stack depth: 2568 bytes left > > node3: > [ 1482.836865] o2net: Connected to node node1 (num 0) at 192.168.66.1:7777 > [ 1482.837542] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1484.840952] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1486.844994] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1488.848952] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1490.853052] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1492.857046] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1494.861042] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1496.865024] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1498.869021] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1500.873016] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1502.877056] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1504.881042] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1506.885040] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1508.888991] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1510.893077] o2net: Connection to node node2 (num 1) at > 192.168.66.2:7777 shutdown, state 7 > [ 1512.843172] (mount.ocfs2,2179,0):dlm_request_join:1477 ERROR: Error > -107 when sending message 510 (key 0x666c6172) to node 1 > [ 1512.845580] (mount.ocfs2,2179,0):dlm_try_to_join_domain:1653 ERROR: > status = -107 > [ 1512.847778] (mount.ocfs2,2179,0):dlm_join_domain:1955 ERROR: status = -107 > [ 1512.849334] (mount.ocfs2,2179,0):dlm_register_domain:2214 ERROR: > status = -107 > [ 1512.850921] (mount.ocfs2,2179,0):o2cb_cluster_connect:368 ERROR: > status = -107 > [ 1512.852511] (mount.ocfs2,2179,0):ocfs2_dlm_init:3004 ERROR: status = -107 > [ 1512.854090] (mount.ocfs2,2179,0):ocfs2_mount_volume:1881 ERROR: status = -107 > [ 1512.855476] ocfs2: Unmounting device (253,16) on (node 0) > [ 1512.855915] (mount.ocfs2,2179,0):ocfs2_fill_super:1230 ERROR: status = -107 > [ 1514.839138] o2net: No longer connected to node node1 (num 0) at > 192.168.66.1:7777 > [ 1514.840690] BUG: unable to handle kernel NULL pointer dereference > at 0000000000000028 > [ 1514.841627] IP: [] kernel_sock_ioctl+0x50/0x50 > [ 1514.841627] PGD 1d980067 PUD 1db0c067 PMD 0 > [ 1514.841627] Oops: 0000 [#1] PREEMPT SMP > [ 1514.841627] Modules linked in: > [ 1514.841627] CPU 0 > [ 1514.841627] Pid: 6, comm: kworker/u:0 Not tainted 3.8.1+ #8 Bochs Bochs > [ 1514.841627] RIP: 0010:[] [] > kernel_sock_ioctl+0x50/0x50 > [ 1514.841627] RSP: 0018:ffff88001f949d40 EFLAGS: 00010292 > [ 1514.841627] RAX: ffff88001e6e9c01 RBX: ffff88001e6e9000 RCX: 0000000180080006 > [ 1514.841627] RDX: 0000000180080007 RSI: 0000000000000002 RDI: 0000000000000000 > [ 1514.841627] RBP: ffff88001f949dc8 R08: 0000000000016090 R09: 0000000000000001 > [ 1514.841627] R10: ffffea000079ba40 R11: ffffffff81321207 R12: ffff88001db69440 > [ 1514.841627] R13: ffffffff8204c760 R14: ffff88001db695a8 R15: ffff88001e6e9058 > [ 1514.841627] FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) > knlGS:0000000000000000 > [ 1514.841627] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 1514.841627] CR2: 0000000000000028 CR3: 000000001c009000 CR4: 00000000000006f0 > [ 1514.841627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 1514.841627] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 1514.841627] Process kworker/u:0 (pid: 6, threadinfo > ffff88001f948000, task ffff88001f90a670) > [ 1514.841627] Stack: > [ 1514.841627] ffffffff81323776 ffff88001fad2e20 0000000000010500 > ffff88001f949d98 > [ 1514.841627] ffffffff810013fa ffff88001f949d78 ffffffff818a5c22 > ffff88001f949da8 > [ 1514.841627] ffffffff81069d26 0000000000012900 0000000000012900 > 0000000000000000 > [ 1514.841627] Call Trace: > [ 1514.841627] [] ? o2net_shutdown_sc+0x106/0x1e0 > [ 1514.841627] [] ? __switch_to+0x2a/0x4a0 > [ 1514.841627] [] ? _raw_spin_unlock_irq+0x12/0x40 > [ 1514.841627] [] ? finish_task_switch+0x56/0xc0 > [ 1514.841627] [] process_one_work+0x133/0x510 > [ 1514.841627] [] ? o2net_sc_connect_completed+0xf0/0xf0 > [ 1514.841627] [] worker_thread+0x15d/0x450 > [ 1514.841627] [] ? busy_worker_rebind_fn+0x100/0x100 > [ 1514.841627] [] kthread+0xbb/0xc0 > [ 1514.841627] [] ? e1000_regdump+0x262/0x3be > [ 1514.841627] [] ? kthread_create_on_node+0x130/0x130 > [ 1514.841627] [] ret_from_fork+0x7c/0xb0 > [ 1514.841627] [] ? kthread_create_on_node+0x130/0x130 > [ 1514.841627] Code: ff ff ff ff ff ff 48 8b 47 28 ff 50 48 4c 89 a3 > 48 e0 ff ff 48 8b 5d f0 4c 8b 65 f8 c9 c3 66 66 66 66 2e 0f 1f 84 00 > 00 00 00 00 <48> 8b 47 28 55 48 89 e5 ff 50 60 5d c3 0f 1f 00 55 41 b8 > 4b 43 > [ 1514.841627] RIP [] kernel_sock_ioctl+0x50/0x50 > [ 1514.841627] RSP > [ 1514.841627] CR2: 0000000000000028 > [ 1514.867242] ---[ end trace 36ffe9378168cdc2 ]--- > [ 1514.867619] BUG: unable to handle kernel paging request at ffffffffffffffd8 > [ 1514.868007] IP: [] kthread_data+0xb/0x20 > [ 1514.868007] PGD 1e0d067 PUD 1e0e067 PMD 0 > [ 1514.868007] Oops: 0000 [#2] PREEMPT SMP > [ 1514.868007] Modules linked in: > [ 1514.868007] CPU 0 > [ 1514.868007] Pid: 6, comm: kworker/u:0 Tainted: G D 3.8.1+ > #8 Bochs Bochs > [ 1514.868007] RIP: 0010:[] [] > kthread_data+0xb/0x20 > [ 1514.868007] RSP: 0018:ffff88001f949928 EFLAGS: 00010092 > [ 1514.868007] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006 > [ 1514.868007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88001f90a670 > [ 1514.868007] RBP: ffff88001f949928 R08: 0000000000000001 R09: 0000000000000196 > [ 1514.868007] R10: ffff88001f9116a0 R11: 0000000000000004 R12: ffff88001f90aa40 > [ 1514.868007] R13: 0000000000000000 R14: ffff88001f90a660 R15: ffff88001f90a670 > [ 1514.868007] FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) > knlGS:0000000000000000 > [ 1514.868007] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 1514.868007] CR2: ffffffffffffffd8 CR3: 000000001d83d000 CR4: 00000000000006f0 > [ 1514.868007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 1514.868007] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 1514.868007] Process kworker/u:0 (pid: 6, threadinfo > ffff88001f948000, task ffff88001f90a670) > [ 1514.868007] Stack: > [ 1514.868007] ffff88001f949948 ffffffff81059980 ffff88001f949948 > ffff88001fc12900 > [ 1514.868007] ffff88001f9499c8 ffffffff818a4c48 ffff88001f949978 > 0000000000000000 > [ 1514.868007] ffff88001f949fd8 ffff88001f90a670 ffff88001f949fd8 > ffff88001f949fd8 > [ 1514.868007] Call Trace: > [ 1514.868007] [] wq_worker_sleeping+0x10/0xc0 > [ 1514.868007] [] __schedule+0x528/0x770 > [ 1514.868007] [] schedule+0x24/0x70 > [ 1514.868007] [] do_exit+0x702/0xad0 > [ 1514.868007] [] oops_end+0x91/0xe0 > [ 1514.868007] [] no_context+0x24e/0x279 > [ 1514.868007] [] ? tcp_rearm_rto+0xa8/0xd0 > [ 1514.868007] [] __bad_area_nosemaphore+0x1b5/0x1d4 > [ 1514.868007] [] bad_area_nosemaphore+0xe/0x10 > [ 1514.868007] [] __do_page_fault+0x296/0x490 > [ 1514.868007] [] ? sock_destroy_inode+0x2e/0x40 > [ 1514.868007] [] ? destroy_inode+0x37/0x60 > [ 1514.868007] [] ? evict+0x11a/0x1b0 > [ 1514.868007] [] ? sc_kref_release+0x77/0x160 > [ 1514.868007] [] ? kfree+0x121/0x160 > [ 1514.868007] [] ? sc_kref_release+0x77/0x160 > [ 1514.868007] [] do_page_fault+0x9/0x10 > [ 1514.868007] [] page_fault+0x22/0x30 > [ 1514.868007] [] ? sc_kref_release+0x77/0x160 > [ 1514.868007] [] ? kernel_sock_ioctl+0x50/0x50 > [ 1514.868007] [] ? o2net_shutdown_sc+0x106/0x1e0 > [ 1514.868007] [] ? __switch_to+0x2a/0x4a0 > [ 1514.868007] [] ? _raw_spin_unlock_irq+0x12/0x40 > [ 1514.868007] [] ? finish_task_switch+0x56/0xc0 > [ 1514.868007] [] process_one_work+0x133/0x510 > [ 1514.868007] [] ? o2net_sc_connect_completed+0xf0/0xf0 > [ 1514.868007] [] worker_thread+0x15d/0x450 > [ 1514.868007] [] ? busy_worker_rebind_fn+0x100/0x100 > [ 1514.868007] [] kthread+0xbb/0xc0 > [ 1514.868007] [] ? e1000_regdump+0x262/0x3be > [ 1514.868007] [] ? kthread_create_on_node+0x130/0x130 > [ 1514.868007] [] ret_from_fork+0x7c/0xb0 > [ 1514.868007] [] ? kthread_create_on_node+0x130/0x130 > [ 1514.868007] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 > c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 87 78 03 00 00 > 55 48 89 e5 <48> 8b 40 d8 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 > 00 00 > [ 1514.868007] RIP [] kthread_data+0xb/0x20 > [ 1514.868007] RSP > [ 1514.868007] CR2: ffffffffffffffd8 > [ 1514.868007] ---[ end trace 36ffe9378168cdc3 ]--- > [ 1514.868007] Fixing recursive fault but reboot is needed! > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/