Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755826AbcDBBuX (ORCPT ); Fri, 1 Apr 2016 21:50:23 -0400 Received: from mail.crc.id.au ([203.56.246.92]:59566 "EHLO mail.crc.id.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752095AbcDBBuV (ORCPT ); Fri, 1 Apr 2016 21:50:21 -0400 Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU To: Boris Ostrovsky , xen-devel , linux-kernel@vger.kernel.org References: <56F4A816.3050505@crc.id.au> <56F52DBF.5080006@oracle.com> <56F545B1.8080609@crc.id.au> <56F54EE0.6030004@oracle.com> <56F56172.9020805@crc.id.au> <56F5653B.1090700@oracle.com> <56F5A87A.8000903@crc.id.au> <56FA4336.2030301@crc.id.au> <56FA8DDD.7070406@oracle.com> From: Steven Haigh Message-ID: <56FF254F.3050103@crc.id.au> Date: Sat, 2 Apr 2016 12:50:07 +1100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.0 MIME-Version: 1.0 In-Reply-To: <56FA8DDD.7070406@oracle.com> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="capVQOPVvLGCVX0Jv3WfLpNN07ulcmmlM" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12175 Lines: 265 This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --capVQOPVvLGCVX0Jv3WfLpNN07ulcmmlM Content-Type: multipart/mixed; boundary="EweqmdkLFPEBxd4IeMwsNkngIlm7tF3ja" From: Steven Haigh To: Boris Ostrovsky , xen-devel , linux-kernel@vger.kernel.org Message-ID: <56FF254F.3050103@crc.id.au> Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU References: <56F4A816.3050505@crc.id.au> <56F52DBF.5080006@oracle.com> <56F545B1.8080609@crc.id.au> <56F54EE0.6030004@oracle.com> <56F56172.9020805@crc.id.au> <56F5653B.1090700@oracle.com> <56F5A87A.8000903@crc.id.au> <56FA4336.2030301@crc.id.au> <56FA8DDD.7070406@oracle.com> In-Reply-To: <56FA8DDD.7070406@oracle.com> --EweqmdkLFPEBxd4IeMwsNkngIlm7tF3ja Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 30/03/2016 1:14 AM, Boris Ostrovsky wrote: > On 03/29/2016 04:56 AM, Steven Haigh wrote: >> >> Interestingly enough, this just happened again - but on a different >> virtual machine. I'm starting to wonder if this may have something to = do >> with the uptime of the machine - as the system that this seems to happ= en >> to is always different. >> >> Destroying it and monitoring it again has so far come up blank. >> >> I've thrown the latest lot of kernel messages here: >> http://paste.fedoraproject.org/346802/59241532 >=20 > Would be good to see full console log. The one that you posted starts > with an error so I wonder what was before that. Ok, so I had a virtual machine do this again today. Both vcpus went to 100% usage and essentially hung. I attached to the screen console that was connected via 'xl console' and copied the entire buffer to paste belo= w: yum-cron[30740]: segfault at 1781ab8 ip 00007f2a7fcd282f sp 00007ffe8655fe90 error 5 in libpython2.7.so.1.0[7f2a7fbf5000+178000] swap_free: Bad swap file entry 2a2b7d5bb69515d8 BUG: Bad page map in process yum-cron pte:56fab76d2a2bb06a pmd:0309e067 addr:0000000001780000 vm_flags:00100073 anon_vma:ffff88007b974c08 mapping: (null) index:1780 file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 0 PID: 30740 Comm: yum-cron Tainted: G B 4.4.6-4.el7xen.x86_64 #1 0000000000000000 ffff88004176bac0 ffffffff81323d17 0000000001780000 00003ffffffff000 ffff88004176bb08 ffffffff8117e574 ffffffff81193d6e 0000000000001780 ffff88000309ec00 0000000001780000 56fab76d2a2bb06a Call Trace: [] dump_stack+0x63/0x8c [] print_bad_pte+0x1e4/0x290 [] ? swap_info_get+0x7e/0xe0 [] unmap_single_vma+0x4ff/0x840 [] unmap_vmas+0x47/0x90 [] exit_mmap+0x98/0x150 [] mmput+0x47/0x100 [] do_exit+0x24e/0xad0 [] do_group_exit+0x3f/0xa0 [] get_signal+0x1c3/0x5e0 [] do_signal+0x28/0x630 [] ? printk+0x4d/0x4f [] ? vprintk_default+0x1f/0x30 [] ? bad_area_access_error+0x43/0x4a [] ? __do_page_fault+0x22c/0x3f0 [] exit_to_usermode_loop+0x4c/0x95 [] prepare_exit_to_usermode+0x18/0x20 [] retint_user+0x8/0x13 BUG: Bad page state in process yum-cron pfn:0f3bf page:ffffea00003cefc0 count:0 mapcount:7 mapping:ffff88000f3bf008 index:0xffff88000f3bf000 flags: 0x10000000000094(referenced|dirty|slab) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x80(slab) Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel xen_gntalloc xen_evtchn ipv6 autofs4 CPU: 0 PID: 30740 Comm: yum-cron Tainted: G B 4.4.6-4.el7xen.x86_64 #1 0000000000000000 ffff88004176b958 ffffffff81323d17 ffffea00003cefc0 ffffffff817ab348 ffff88004176b980 ffffffff811c1ab5 ffffea00003cefc0 0000000000000001 0000000000000000 ffff88004176b9d0 ffffffff81159584 Call Trace: [] dump_stack+0x63/0x8c [] bad_page.part.69+0xdf/0xfc [] free_pages_prepare+0x294/0x2a0 [] free_hot_cold_page+0x31/0x160 [] free_hot_cold_page_list+0x49/0xb0 [] release_pages+0xc5/0x260 [] free_pages_and_swap_cache+0x7d/0x90 [] tlb_flush_mmu_free+0x36/0x60 [] unmap_single_vma+0x664/0x840 [] unmap_vmas+0x47/0x90 [] exit_mmap+0x98/0x150 [] mmput+0x47/0x100 [] do_exit+0x24e/0xad0 [] do_group_exit+0x3f/0xa0 [] get_signal+0x1c3/0x5e0 [] do_signal+0x28/0x630 [] ? printk+0x4d/0x4f [] ? vprintk_default+0x1f/0x30 [] ? bad_area_access_error+0x43/0x4a [] ? __do_page_fault+0x22c/0x3f0 [] exit_to_usermode_loop+0x4c/0x95 [] prepare_exit_to_usermode+0x18/0x20 [] retint_user+0x8/0x13 BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:0 val:-1 BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:1 val:2 BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:2 val:-1 yum-cron[4197]: segfault at 32947fcb ip 00007ff0fa1bf8bd sp 00007ffdb1c54990 error 4 in libpython2.7.so.1.0[7ff0fa13a000+178000] BUG: unable to handle kernel paging request at ffff88010f3beffe IP: [] free_block+0x119/0x190 PGD 188b063 PUD 0 Oops: 0002 [#1] SMP Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel xen_gntalloc xen_evtchn ipv6 autofs4 CPU: 1 PID: 8519 Comm: kworker/1:2 Tainted: G B 4.4.6-4.el7xen.x86_64 #1 Workqueue: events cache_reap task: ffff8800346bf1c0 ti: ffff880051700000 task.ti: ffff880051700000 RIP: 0010:[] [] free_block+0x119/0x1= 90 RSP: 0018:ffff880051703d40 EFLAGS: 00010082 RAX: ffffea00003cefc0 RBX: ffffea0000000000 RCX: ffff88000f3bf000 RDX: 00000000fffffffe RSI: ffff88007fd19c40 RDI: ffff88007d012100 RBP: ffff880051703d68 R08: ffff880051703d88 R09: 0000000000000006 R10: ffff88007d01a940 R11: 0000000080000000 R12: 000077ff80000000 R13: ffff88007fd19c58 R14: ffff88007d01a948 R15: ffff88007d01a968 FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000= 000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88010f3beffe CR3: 000000007bc8d000 CR4: 0000000000040620 Stack: ffff88007fd19bf0 ffff880051703d88 ffff88007d01a940 000000000000000b ffff88007d012100 ffff880051703dc0 ffffffff811a871b 0000000b51703dd8 ffff88007fd19c00 ffff880051703d88 ffff880051703d88 ffff88007d012100 Call Trace: [] drain_array+0xab/0x130 [] cache_reap+0x6c/0x230 [] process_one_work+0x151/0x400 [] worker_thread+0x11a/0x460 [] ? __schedule+0x2bf/0x880 [] ? rescuer_thread+0x2f0/0x2f0 [] kthread+0xc9/0xe0 [] ? kthread_park+0x60/0x60 [] ret_from_fork+0x3f/0x70 [] ? kthread_park+0x60/0x60 Code: 41 89 d1 4c 0f af c9 0f b6 4f 1c 49 c1 e9 20 44 29 ca d3 ea 0f b6 4f 1d 41 01 d1 8b 50 18 41 d3 e9 48 8b 48 10 83 ea 01 89 50 18 <44> 88 0c 11 49 8b 52 38 48 83 c2 01 49 89 52 38 8b 48 18 85 c9 RIP [] free_block+0x119/0x190 RSP CR2: ffff88010f3beffe ---[ end trace 99d22a980c39d4db ]--- BUG: unable to handle kernel paging request at ffffffffffffffd8 IP: [] kthread_data+0x10/0x20 PGD 188d067 PUD 188f067 PMD 0 Oops: 0000 [#2] SMP Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel xen_gntalloc xen_evtchn ipv6 autofs4 CPU: 1 PID: 8519 Comm: kworker/1:2 Tainted: G B D 4.4.6-4.el7xen.x86_64 #1 task: ffff8800346bf1c0 ti: ffff880051700000 task.ti: ffff880051700000 RIP: 0010:[] [] kthread_data+0x10/0x= 20 RSP: 0018:ffff880051703a60 EFLAGS: 00010002 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001 RDX: ffffffff81bd4080 RSI: 0000000000000001 RDI: ffff8800346bf1c0 RBP: ffff880051703a60 R08: ffff88007c8e19d0 R09: 0000c88ed0b2b602 R10: 000000000000001f R11: 0000000000000000 R12: ffff8800346bf1c0 R13: 0000000000000000 R14: 00000000000163c0 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000= 000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 000000007bc8d000 CR4: 0000000000040620 Stack: ffff880051703a78 ffffffff81094741 ffff88007fd163c0 ffff880051703ac0 ffffffff8165858f ffffffff810d2e27 ffff8800346bf1c0 ffff880051704000 ffff8800346bf510 0000000000000046 ffff8800346bf1c0 ffff88007d3facc0 Call Trace: [] wq_worker_sleeping+0x11/0x90 [] __schedule+0x3bf/0x880 [] ? call_rcu_sched+0x17/0x20 [] schedule+0x35/0x80 [] do_exit+0x659/0xad0 [] oops_end+0x9a/0xd0 [] no_context+0x10d/0x360 [] __bad_area_nosemaphore+0x109/0x210 [] bad_area_nosemaphore+0x13/0x20 [] __do_page_fault+0x80/0x3f0 [] do_page_fault+0x22/0x30 [] page_fault+0x28/0x30 [] ? free_block+0x119/0x190 [] drain_array+0xab/0x130 [] cache_reap+0x6c/0x230 [] process_one_work+0x151/0x400 [] worker_thread+0x11a/0x460 [] ? __schedule+0x2bf/0x880 [] ? rescuer_thread+0x2f0/0x2f0 [] kthread+0xc9/0xe0 [] ? kthread_park+0x60/0x60 [] ret_from_fork+0x3f/0x70 [] ? kthread_park+0x60/0x60 Code: ff ff eb 92 be 3a 02 00 00 48 c7 c7 0b 5e 7a 81 e8 e6 34 fe ff e9 d5 fe ff ff 90 66 66 66 66 90 55 48 8b 87 00 04 00 00 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 RIP [] kthread_data+0x10/0x20 RSP CR2: ffffffffffffffd8 ---[ end trace 99d22a980c39d4dc ]--- Fixing recursive fault but reboot is needed! There is nothing further logged on the console after, and any previous messages are concerning an NFS server. --=20 Steven Haigh Email: netwiz@crc.id.au Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897 --EweqmdkLFPEBxd4IeMwsNkngIlm7tF3ja-- --capVQOPVvLGCVX0Jv3WfLpNN07ulcmmlM Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIbBAEBCAAGBQJW/yVYAAoJEEGvNdV6fTHcUHMP9iC8fmhVOu0t/xkF09BkhLns y+6ddzo7CkQZy7NngRrKPomRBS+NERjtVQdRonHEs5W5lYIbJg7MhSMVSIj6HBIN ZHdrDWKM45vU34+XXRS/Sbs9jyk7NsAyOy9KdHqrqcs6CIqhmaJ35zl4WaWf3982 9K+OTISWUvAOracutAYOi4gCCHuXzbrWKirGpi566FJmdQKB6uhio3XCDSS1J/eA JZ/ZtWjCBohEFDrH1gpMre8xX4aosz0eL7FWI/rp7/iojM5RKOD9cHyHeuq/OXfG x7WXuYE+W8xHW1UpdMtHtR2HZBSdCpXXeB+mostXkXnoHJpeLlvaKB56QyspNfIM 7F6Ac9uWOu2xxFpP+dnE2yBWqYPODMnfSpYWCKlvO/rwO3m75kf+awN3yniQJIWK DXo1u3KP+d5V1b5YbYHG1mtbALXFucUtuzu1Q2RLk76Z6wFPcqkhsv/41vcV96Ki +EHvGVjb3+soVj2WABmva2jfBp4ee1XkEuBBK7J/3quk0yYvlFaH6/2cfw242p/y HImGxcROdhRw6YT6yIRM9EP5iAFxmFEtR1u0xI4YtyhRVgkLIp7y0PzGZFAoHJ92 vRhttW5uW9gpEX4uVh4FI3q9MPi28QlO5zKUHwMyLPnK51MAj1tMfH/m9QfputI9 ggKQZxzRH0ceeyiUGf0= =xHDm -----END PGP SIGNATURE----- --capVQOPVvLGCVX0Jv3WfLpNN07ulcmmlM--