Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753628AbcCYQEK (ORCPT ); Fri, 25 Mar 2016 12:04:10 -0400 Received: from mail.crc.id.au ([203.56.246.92]:41036 "EHLO mail.crc.id.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752256AbcCYQEI (ORCPT ); Fri, 25 Mar 2016 12:04:08 -0400 Subject: Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU To: Boris Ostrovsky , xen-devel , linux-kernel@vger.kernel.org References: <56F4A816.3050505@crc.id.au> <56F52DBF.5080006@oracle.com> <56F545B1.8080609@crc.id.au> <56F54EE0.6030004@oracle.com> Cc: "gregkh@linuxfoundation.org" From: Steven Haigh Message-ID: <56F56172.9020805@crc.id.au> Date: Sat, 26 Mar 2016 03:04:02 +1100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.0 MIME-Version: 1.0 In-Reply-To: <56F54EE0.6030004@oracle.com> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="6uipt3URi8Na8W6LCS5PgrnGEcbsk0Dbx" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9961 Lines: 237 This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --6uipt3URi8Na8W6LCS5PgrnGEcbsk0Dbx Content-Type: multipart/mixed; boundary="6PrMP0sxGewkV1sV9Dd60EvVWLTv1kNTv" From: Steven Haigh To: Boris Ostrovsky , xen-devel , linux-kernel@vger.kernel.org Cc: "gregkh@linuxfoundation.org" Message-ID: <56F56172.9020805@crc.id.au> Subject: Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU References: <56F4A816.3050505@crc.id.au> <56F52DBF.5080006@oracle.com> <56F545B1.8080609@crc.id.au> <56F54EE0.6030004@oracle.com> In-Reply-To: <56F54EE0.6030004@oracle.com> --6PrMP0sxGewkV1sV9Dd60EvVWLTv1kNTv Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 26/03/2016 1:44 AM, Boris Ostrovsky wrote: > On 03/25/2016 10:05 AM, Steven Haigh wrote: >> On 25/03/2016 11:23 PM, Boris Ostrovsky wrote: >>> On 03/24/2016 10:53 PM, Steven Haigh wrote: >>>> Hi all, >>>> >>>> Firstly, I've cross-posted this to xen-devel and the lkml - as this >>>> problem seems to only exist when using kernel 4.4 as a Xen DomU kern= el. >>>> I have also CC'ed Greg KH for his awesome insight as maintainer. >>>> >>>> Please CC myself into replies - as I'm not a member of the kernel >>>> mailing list - I may miss replies from monitoring the archives. >>>> >>>> I've noticed recently that heavy disk IO is causing rcu_sched to det= ect >>>> stalls. The process mentioned usually goes to 100% CPU usage, and >>>> eventually processes start segfaulting and dying. The only fix to >>>> recover the system is to use 'xl destroy' to force-kill the VM and t= o >>>> start it again. >>>> >>>> The majority of these issues seem to mention ext4 in the trace. This= >>>> may >>>> indicate an issue there - or may be a red herring. >>>> >>>> The gritty details: >>>> INFO: rcu_sched self-detected stall on CPU >>>> #0110-...: (20999 ticks this GP) idle=3D327/140000000000001/0 >>>> softirq=3D1101493/1101493 fqs=3D6973 >>>> #011 (t=3D21000 jiffies g=3D827095 c=3D827094 q=3D524) >>>> Task dump for CPU 0: >>>> rsync R running task 0 2446 2444 0x00000088 >>>> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000 >>>> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001 >>>> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00 >>>> Call Trace: >>>> [] sched_show_task+0xaf/0x110 >>>> [] dump_cpu_task+0x39/0x40 >>>> [] rcu_dump_cpu_stacks+0x8a/0xc0 >>>> [] rcu_check_callbacks+0x424/0x7a0 >>>> [] ? account_system_time+0x81/0x110 >>>> [] ? account_process_tick+0x61/0x160 >>>> [] ? tick_sched_do_timer+0x30/0x30 >>>> [] update_process_times+0x39/0x60 >>>> [] tick_sched_handle.isra.15+0x36/0x50 >>>> [] tick_sched_timer+0x3d/0x70 >>>> [] __hrtimer_run_queues+0xf2/0x250 >>>> [] hrtimer_interrupt+0xa8/0x190 >>>> [] xen_timer_interrupt+0x2e/0x140 >>>> [] handle_irq_event_percpu+0x55/0x1e0 >>>> [] handle_percpu_irq+0x3a/0x50 >>>> [] generic_handle_irq+0x22/0x30 >>>> [] __evtchn_fifo_handle_events+0x15f/0x180 >>>> [] evtchn_fifo_handle_events+0x10/0x20 >>>> [] __xen_evtchn_do_upcall+0x43/0x80 >>>> [] xen_evtchn_do_upcall+0x30/0x50 >>>> [] xen_hvm_callback_vector+0x82/0x90 >>>> [] ? queued_write_lock_slowpath+0x3d/0x80 >>>> [] _raw_write_lock+0x1e/0x30 >>> This looks to me like ext4 failing to grab a lock. Everything above i= t >>> (in Xen code) is regular tick interrupt handling which detects the >>> stall. >>> >>> Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminate= s >>> any possible issues with pv locks. >>> >>> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, >>> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)? >> There is nothing special being logged anywhere that I can see. dmesg /= >> xl dmesg on the Dom0 show nothing unusual. >> >> I do share CPUs - but I don't give any DomU more than 2 vcpus. The >> physical host has 4 cores - 1 pinned to the Dom0. >> >> I log to a remote syslog on this system - and I've uploaded the entire= >> log to a pastebin (don't want to do a 45Kb attachment here): >> http://paste.fedoraproject.org/345095/58914452 >=20 > That doesn't look like a full log. In any case, the RCU stall may be a > secondary problem --- there is a bunch of splats before the stall. It may not actually be the full logs. Once the system gets really upset, you can't run anything - as such, grabbing anything from dmesg is not possible. The logs provided above is all that gets spat out to the syslog server. I'll try tinkering with a few things to see if I can get more output - but right now, that's all I've been able to achieve. So far, my only ideas are to remove the 'quiet' options from the kernel command line - but I'm not sure how much that would help. Suggestions gladly accepted on this front. >> >> Not sure if it makes any difference at all, but my DomU config is: >> # cat /etc/xen/backup.vm >> name =3D "backup.vm" >> memory =3D 2048 >> vcpus =3D 2 >> cpus =3D "1-3" >> disk =3D [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ] >> vif =3D [ "mac=3D00:11:36:35:35:09, bridge=3Dbr203, >> vifname=3Dvm.backup, script=3Dvif-bridge" ] >> bootloader =3D 'pygrub' >> pvh =3D 1 >> >> on_poweroff =3D 'destroy' >> on_reboot =3D 'restart' >> on_crash =3D 'restart' >> cpu_weight =3D 64 >> >> I never had this problem when running kernel 4.1.x - it only started >> when I upgraded everything to 4.4 - not exactly a great help - but may= >> help narrow things down? >> >>>> [] ext4_es_remove_extent+0x43/0xc0 >>>> [] ext4_clear_inode+0x39/0x80 >>>> [] ext4_evict_inode+0x8d/0x4e0 >>>> [] evict+0xb7/0x180 >>>> [] dispose_list+0x36/0x50 >>>> [] prune_icache_sb+0x4b/0x60 >>>> [] super_cache_scan+0x141/0x190 >>>> [] shrink_slab.part.37+0x1ee/0x390 >>>> [] shrink_zone+0x26c/0x280 >>>> [] do_try_to_free_pages+0x15c/0x410 >>>> [] try_to_free_pages+0xba/0x170 >>>> [] __alloc_pages_nodemask+0x525/0xa60 >>>> [] ? kmem_cache_free+0xcc/0x2c0 >>>> [] alloc_pages_current+0x8d/0x120 >>>> [] __page_cache_alloc+0x91/0xc0 >>>> [] pagecache_get_page+0x56/0x1e0 >>>> [] grab_cache_page_write_begin+0x26/0x40 >>>> [] ext4_da_write_begin+0xa1/0x300 >>>> [] ? ext4_da_write_end+0x124/0x2b0 >>>> [] generic_perform_write+0xc0/0x1a0 >>>> [] __generic_file_write_iter+0x188/0x1e0 >>>> [] ext4_file_write_iter+0xf0/0x340 >>>> [] __vfs_write+0xaa/0xe0 >>>> [] vfs_write+0xa2/0x1a0 >>>> [] SyS_write+0x46/0xa0 >>>> [] entry_SYSCALL_64_fastpath+0x12/0x71 >>>> >>>> Some 11 hours later: >>>> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 >>>> error >>>> 4 in ld-2.17.so[7f03bb940000+21000] >>>> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 >>>> error 4 >>>> in ld-2.17.so[7f6b4a0d5000+21000] >>>> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp >>>> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000] >>>> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 >>>> error >>>> 4 in ld-2.17.so[7ff8c8a82000+21000] >>>> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 >>>> error >>>> 4 in ld-2.17.so[7f183cf6b000+21000] >>>> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 >>>> error >>>> 4 in ld-2.17.so[7f3c665e2000+21000] >>>> >>>> From isolated testing, this does not occur on kernel 4.5.x - >>>> however I >>>> have not verified this myself. >>>> >>>> The kernel config used can be found in the kernel-xen git repo if it= >>>> assists in debugging: >>>> http://xen.crc.id.au/git/?p=3Dkernel-xen;a=3Dsummary >>>> >>>> >>>> >>>> _______________________________________________ >>>> Xen-devel mailing list >>>> Xen-devel@lists.xen.org >>>> http://lists.xen.org/xen-devel >=20 --=20 Steven Haigh Email: netwiz@crc.id.au Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897 --6PrMP0sxGewkV1sV9Dd60EvVWLTv1kNTv-- --6uipt3URi8Na8W6LCS5PgrnGEcbsk0Dbx Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJW9WFyAAoJEEGvNdV6fTHcGsQQAJ6987x0/QacI3mVbr3ld5IN TwAch2ViCLTh6LRet9v4jqjAmmK4HPpNIsr4tw4XZ8yHpxMcDP/WHzPFiYwqDsli sn6JaRVFnueeqhkbIdpUmLCEiGOdU5M7laqjQ63QbJL6tmA9WS/l1aTBph15sj9o IRCtna22KqV+CjaI05t7y08VnQ0xhMcqjGzpKbLhWJexCWYD+oHVYtmXG89UPBb3 SzBauvSlBKVky/i67REDpPWBktjwwMZGXt4UrUocmO0wrQTbSzPzq37r8Bs2tBmy g1qb8X9udzTqc1v9UMqaDa/HCFxlkL9Yj3OLCG970Tr5qoMvUzWd0nnoQ8QtLiRP 86/PtlxaVkHZ8yGPRyuGTWuOjpE6sP3n1zFZ5bFZwlwhEJJ25sOm+WqNcCkxzgjx SUe1wc++JvOFKQOzTecXINQRRvku8kQMgq97paBWT8MN1nWhMivrxLkB+hQdoZOr QvrG+LNPYQ7RDNjkc2iAIB0QIDER3BuYZVcF7MgyofwGrhFnw/hrKrp4fsJkW8rV n8g2shwoFHItU9NelHynOeLsZ/oKXOEdcvKcno45YIzkdGpsr5iru0Vnueig5CgD S6HHE9ELWoY4Lm5JmjQcMNqM22vJAjhS43ISfqRLva+SzI4CZL4mnI2ucc+roYph J2qdsXEM9R63PsXylsyd =toV0 -----END PGP SIGNATURE----- --6uipt3URi8Na8W6LCS5PgrnGEcbsk0Dbx--