Subject: Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU
To: Steven Haigh, xen-devel, linux-kernel@vger.kernel.org
References: <56F4A816.3050505@crc.id.au> <56F52DBF.5080006@oracle.com> <56F545B1.8080609@crc.id.au>
Cc: "gregkh@linuxfoundation.org"
From: Boris Ostrovsky
Message-ID: <56F54EE0.6030004@oracle.com>
Date: Fri, 25 Mar 2016 10:44:48 -0400
In-Reply-To: <56F545B1.8080609@crc.id.au>

On 03/25/2016 10:05 AM, Steven Haigh wrote:
> On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
>> On 03/24/2016 10:53 PM, Steven Haigh wrote:
>>> Hi all,
>>>
>>> Firstly, I've cross-posted this to xen-devel and the lkml, as this
>>> problem seems to exist only when using kernel 4.4 as a Xen DomU kernel.
>>> I have also CC'ed Greg KH for his awesome insight as maintainer.
>>>
>>> Please CC me on replies - as I'm not a member of the kernel mailing
>>> list, I may miss replies when only monitoring the archives.
>>>
>>> I've noticed recently that heavy disk IO is causing rcu_sched to detect
>>> stalls. The process mentioned usually goes to 100% CPU usage, and
>>> eventually processes start segfaulting and dying. The only way to
>>> recover the system is to use 'xl destroy' to force-kill the VM and to
>>> start it again.
>>>
>>> The majority of these issues mention ext4 in the trace. This may
>>> indicate an issue there - or may be a red herring.
>>>
>>> The gritty details:
>>> INFO: rcu_sched self-detected stall on CPU
>>> #0110-...: (20999 ticks this GP) idle=327/140000000000001/0
>>> softirq=1101493/1101493 fqs=6973
>>> #011 (t=21000 jiffies g=827095 c=827094 q=524)
>>> Task dump for CPU 0:
>>> rsync           R  running task        0  2446   2444 0x00000088
>>> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
>>> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
>>> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
>>> Call Trace:
>>> [] sched_show_task+0xaf/0x110
>>> [] dump_cpu_task+0x39/0x40
>>> [] rcu_dump_cpu_stacks+0x8a/0xc0
>>> [] rcu_check_callbacks+0x424/0x7a0
>>> [] ? account_system_time+0x81/0x110
>>> [] ? account_process_tick+0x61/0x160
>>> [] ? tick_sched_do_timer+0x30/0x30
>>> [] update_process_times+0x39/0x60
>>> [] tick_sched_handle.isra.15+0x36/0x50
>>> [] tick_sched_timer+0x3d/0x70
>>> [] __hrtimer_run_queues+0xf2/0x250
>>> [] hrtimer_interrupt+0xa8/0x190
>>> [] xen_timer_interrupt+0x2e/0x140
>>> [] handle_irq_event_percpu+0x55/0x1e0
>>> [] handle_percpu_irq+0x3a/0x50
>>> [] generic_handle_irq+0x22/0x30
>>> [] __evtchn_fifo_handle_events+0x15f/0x180
>>> [] evtchn_fifo_handle_events+0x10/0x20
>>> [] __xen_evtchn_do_upcall+0x43/0x80
>>> [] xen_evtchn_do_upcall+0x30/0x50
>>> [] xen_hvm_callback_vector+0x82/0x90
>>> [] ? queued_write_lock_slowpath+0x3d/0x80
>>> [] _raw_write_lock+0x1e/0x30
>> This looks to me like ext4 failing to grab a lock. Everything above it
>> (in Xen code) is regular tick interrupt handling which detects the stall.
>>
>> Your config does not have CONFIG_PARAVIRT_SPINLOCKS, so that eliminates
>> any possible issues with pv locks.
>>
>> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
>> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?
> There is nothing special being logged anywhere that I can see. dmesg /
> xl dmesg on the Dom0 show nothing unusual.
>
> I do share CPUs - but I don't give any DomU more than 2 vcpus. The
> physical host has 4 cores - 1 pinned to the Dom0.
>
> I log to a remote syslog on this system - and I've uploaded the entire
> log to a pastebin (don't want to do a 45Kb attachment here):
> http://paste.fedoraproject.org/345095/58914452

That doesn't look like a full log. In any case, the RCU stall may be a
secondary problem --- there is a bunch of splats before the stall.

-boris
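
P.S. For reference, the checks discussed above could be gathered roughly
like this. This is only a sketch: the /boot/config-* path assumes a distro
kernel, and the xl commands must be run in dom0 with sufficient privileges;
output formats vary between Xen versions.

  # In the guest: confirm pv spinlocks are not built in
  grep CONFIG_PARAVIRT_SPINLOCKS /boot/config-$(uname -r)

  # In dom0: hypervisor log tail, vcpu-to-pcpu placement, host CPU count
  xl dmesg | tail -n 50
  xl vcpu-list
  xl info | grep nr_cpus

If the total number of online vcpus across all domains (including dom0's)
is well above nr_cpus, the guest is oversubscribed in the sense asked
about here.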
>
> Not sure if it makes any difference at all, but my DomU config is:
> # cat /etc/xen/backup.vm
> name = "backup.vm"
> memory = 2048
> vcpus = 2
> cpus = "1-3"
> disk = [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
> vif = [ "mac=00:11:36:35:35:09, bridge=br203,
> vifname=vm.backup, script=vif-bridge" ]
> bootloader = 'pygrub'
> pvh = 1
>
> on_poweroff = 'destroy'
> on_reboot = 'restart'
> on_crash = 'restart'
> cpu_weight = 64
>
> I never had this problem when running kernel 4.1.x - it only started
> when I upgraded everything to 4.4 - not exactly a great help - but may
> help narrow things down?
>
>>> [] ext4_es_remove_extent+0x43/0xc0
>>> [] ext4_clear_inode+0x39/0x80
>>> [] ext4_evict_inode+0x8d/0x4e0
>>> [] evict+0xb7/0x180
>>> [] dispose_list+0x36/0x50
>>> [] prune_icache_sb+0x4b/0x60
>>> [] super_cache_scan+0x141/0x190
>>> [] shrink_slab.part.37+0x1ee/0x390
>>> [] shrink_zone+0x26c/0x280
>>> [] do_try_to_free_pages+0x15c/0x410
>>> [] try_to_free_pages+0xba/0x170
>>> [] __alloc_pages_nodemask+0x525/0xa60
>>> [] ? kmem_cache_free+0xcc/0x2c0
>>> [] alloc_pages_current+0x8d/0x120
>>> [] __page_cache_alloc+0x91/0xc0
>>> [] pagecache_get_page+0x56/0x1e0
>>> [] grab_cache_page_write_begin+0x26/0x40
>>> [] ext4_da_write_begin+0xa1/0x300
>>> [] ? ext4_da_write_end+0x124/0x2b0
>>> [] generic_perform_write+0xc0/0x1a0
>>> [] __generic_file_write_iter+0x188/0x1e0
>>> [] ext4_file_write_iter+0xf0/0x340
>>> [] __vfs_write+0xaa/0xe0
>>> [] vfs_write+0xa2/0x1a0
>>> [] SyS_write+0x46/0xa0
>>> [] entry_SYSCALL_64_fastpath+0x12/0x71
>>>
>>> Some 11 hours later:
>>> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
>>> 4 in ld-2.17.so[7f03bb940000+21000]
>>> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
>>> in ld-2.17.so[7f6b4a0d5000+21000]
>>> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
>>> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
>>> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
>>> 4 in ld-2.17.so[7ff8c8a82000+21000]
>>> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
>>> 4 in ld-2.17.so[7f183cf6b000+21000]
>>> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
>>> 4 in ld-2.17.so[7f3c665e2000+21000]
>>>
>>> From isolated testing, this does not occur on kernel 4.5.x - however I
>>> have not verified this myself.
>>>
>>> The kernel config used can be found in the kernel-xen git repo if it
>>> assists in debugging:
>>> http://xen.crc.id.au/git/?p=kernel-xen;a=summary
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xen.org
>>> http://lists.xen.org/xen-devel