Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752903AbcCYDA7 (ORCPT ); Thu, 24 Mar 2016 23:00:59 -0400 Received: from mail.crc.id.au ([203.56.246.92]:39230 "EHLO mail.crc.id.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751340AbcCYDA5 (ORCPT ); Thu, 24 Mar 2016 23:00:57 -0400 X-Greylist: delayed 464 seconds by postgrey-1.27 at vger.kernel.org; Thu, 24 Mar 2016 23:00:56 EDT To: xen-devel , linux-kernel@vger.kernel.org From: Steven Haigh Subject: 4.4: INFO: rcu_sched self-detected stall on CPU Cc: "gregkh@linuxfoundation.org" Message-ID: <56F4A816.3050505@crc.id.au> Date: Fri, 25 Mar 2016 13:53:10 +1100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.0 MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="OgSGBQvEDbATpqvWvJHhNdxFw9aiA7lLr" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6526 Lines: 147 This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --OgSGBQvEDbATpqvWvJHhNdxFw9aiA7lLr Content-Type: multipart/mixed; boundary="w3Nh9dN0o1VtU282APgPmxxnxow925LGs" From: Steven Haigh To: xen-devel , linux-kernel@vger.kernel.org Cc: "gregkh@linuxfoundation.org" Message-ID: <56F4A816.3050505@crc.id.au> Subject: 4.4: INFO: rcu_sched self-detected stall on CPU --w3Nh9dN0o1VtU282APgPmxxnxow925LGs Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hi all, Firstly, I've cross-posted this to xen-devel and the lkml - as this problem seems to only exist when using kernel 4.4 as a Xen DomU kernel. I have also CC'ed Greg KH for his awesome insight as maintainer. Please CC myself into replies - as I'm not a member of the kernel mailing list - I may miss replies from monitoring the archives. I've noticed recently that heavy disk IO is causing rcu_sched to detect stalls. The process mentioned usually goes to 100% CPU usage, and eventually processes start segfaulting and dying. The only fix to recover the system is to use 'xl destroy' to force-kill the VM and to start it again. The majority of these issues seem to mention ext4 in the trace. This may indicate an issue there - or may be a red herring. The gritty details: INFO: rcu_sched self-detected stall on CPU #0110-...: (20999 ticks this GP) idle=3D327/140000000000001/0 softirq=3D1101493/1101493 fqs=3D6973 #011 (t=3D21000 jiffies g=3D827095 c=3D827094 q=3D524) Task dump for CPU 0: rsync R running task 0 2446 2444 0x00000088 ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000 ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001 ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00 Call Trace: [] sched_show_task+0xaf/0x110 [] dump_cpu_task+0x39/0x40 [] rcu_dump_cpu_stacks+0x8a/0xc0 [] rcu_check_callbacks+0x424/0x7a0 [] ? account_system_time+0x81/0x110 [] ? account_process_tick+0x61/0x160 [] ? tick_sched_do_timer+0x30/0x30 [] update_process_times+0x39/0x60 [] tick_sched_handle.isra.15+0x36/0x50 [] tick_sched_timer+0x3d/0x70 [] __hrtimer_run_queues+0xf2/0x250 [] hrtimer_interrupt+0xa8/0x190 [] xen_timer_interrupt+0x2e/0x140 [] handle_irq_event_percpu+0x55/0x1e0 [] handle_percpu_irq+0x3a/0x50 [] generic_handle_irq+0x22/0x30 [] __evtchn_fifo_handle_events+0x15f/0x180 [] evtchn_fifo_handle_events+0x10/0x20 [] __xen_evtchn_do_upcall+0x43/0x80 [] xen_evtchn_do_upcall+0x30/0x50 [] xen_hvm_callback_vector+0x82/0x90 [] ? queued_write_lock_slowpath+0x3d/0x80 [] _raw_write_lock+0x1e/0x30 [] ext4_es_remove_extent+0x43/0xc0 [] ext4_clear_inode+0x39/0x80 [] ext4_evict_inode+0x8d/0x4e0 [] evict+0xb7/0x180 [] dispose_list+0x36/0x50 [] prune_icache_sb+0x4b/0x60 [] super_cache_scan+0x141/0x190 [] shrink_slab.part.37+0x1ee/0x390 [] shrink_zone+0x26c/0x280 [] do_try_to_free_pages+0x15c/0x410 [] try_to_free_pages+0xba/0x170 [] __alloc_pages_nodemask+0x525/0xa60 [] ? kmem_cache_free+0xcc/0x2c0 [] alloc_pages_current+0x8d/0x120 [] __page_cache_alloc+0x91/0xc0 [] pagecache_get_page+0x56/0x1e0 [] grab_cache_page_write_begin+0x26/0x40 [] ext4_da_write_begin+0xa1/0x300 [] ? ext4_da_write_end+0x124/0x2b0 [] generic_perform_write+0xc0/0x1a0 [] __generic_file_write_iter+0x188/0x1e0 [] ext4_file_write_iter+0xf0/0x340 [] __vfs_write+0xaa/0xe0 [] vfs_write+0xa2/0x1a0 [] SyS_write+0x46/0xa0 [] entry_SYSCALL_64_fastpath+0x12/0x71 Some 11 hours later: sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error 4 in ld-2.17.so[7f03bb940000+21000] sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4 in ld-2.17.so[7f6b4a0d5000+21000] systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000] sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error 4 in ld-2.17.so[7ff8c8a82000+21000] sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error 4 in ld-2.17.so[7f183cf6b000+21000] sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error 4 in ld-2.17.so[7f3c665e2000+21000] =46rom isolated testing, this does not occur on kernel 4.5.x - however I have not verified this myself. The kernel config used can be found in the kernel-xen git repo if it assists in debugging: http://xen.crc.id.au/git/?p=3Dkernel-xen;a=3Dsummary --=20 Steven Haigh Email: netwiz@crc.id.au Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897 --w3Nh9dN0o1VtU282APgPmxxnxow925LGs-- --OgSGBQvEDbATpqvWvJHhNdxFw9aiA7lLr Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJW9KgWAAoJEEGvNdV6fTHcgK4QAJEBM3/3qTNGr4s8gIq36weh o8RGa96gLN1jeH2iLhA+gC/QoGXkqfukPfmo/zLy8vHC1dtIb3GFwjY0LDs3mUSO ewrTELtdYF3s46rHTbaWAbEyW5eAHQU+LbNz3miR4ECnbmz1HuTzZNFoh6f7DDL5 UGNDq3QPVv+SUzXLDJXCv+LgfLLluP8PCov3Z9344CHN8MEkFUosome8WtUtdOyP zqTDskh92V/QdNJsBkTVPGzup4RBXxaffiX7zmdYEGHPy1fpWO9IFsck1Gnx6PwS 91GnFamsBj+ZZMWgHtyXp8qNtkZsqYFIbU/ygokyBcg6HNxB7dJWvFkhsB3a0p87 BqNSuP8SipjkH4pVcsSfV1hK4JbZvs9rDUMRScEQ9vdfDt3DXQNTNSA5V2mOd52Q AUT2klagJavJTf5mnm+WEE553F8rUUTpx1/TFLrlrlJBoF3ZsB+kuYP2+qZ1yMt0 oPiUutjKPLt0f1ko6+Du4xLdjCgAoCzQqzB0Dt2evPqFvHyoQmIwiGPqb62YV6mW 1d9xNb51u5MsvwM7dIfGRb0LJ7qYlf7SzfWTUNB3QPcAS12O4/r5tB/7Pat0c60c WKLu1mUsdTBN/6lWJJe3qPeIDLM/Wpv88hqrGnFQ0kyBufcEFGUF7fdu2Rzy7rm0 PcoCNdlhpksARM4OWa8/ =2cs3 -----END PGP SIGNATURE----- --OgSGBQvEDbATpqvWvJHhNdxFw9aiA7lLr--