Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU
From: Steven Haigh
To: Boris Ostrovsky, xen-devel, linux-kernel@vger.kernel.org
Cc: "gregkh@linuxfoundation.org"
Date: Wed, 30 Mar 2016 04:44:55 +1100
Message-ID: <56FABF17.7090608@crc.id.au>
In-Reply-To: <56FA8DDD.7070406@oracle.com>

On 30/03/2016 1:14 AM, Boris Ostrovsky wrote:
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>> http://paste.fedoraproject.org/346802/59241532
>
> Would be good to see full console log. The one that you posted starts
> with an error so I wonder what was before that.

Agreed. It started off with me observing this on one VM - but since
trying to get details on that VM, others have started showing issues as
well. It's frustrating, as it seems I've been playing whack-a-mole to get
more debug on what is going on.

So, I've changed the kernel command line to the following on ALL VMs on
this system:

enforcemodulesig=1 selinux=0 fsck.repair=yes loglevel=7 console=tty0 console=ttyS0,38400n8

In the Dom0 (which runs the same kernel package), I've started a screen
session with a window for each of the running DomUs, attached to its
console via 'xl console blah' - so hopefully the next one that goes down
(whichever one that is) will get caught in the console.

> Have you tried this on bare metal, BTW? And you said this is only
> observed on 4.4, not 4.5, right?

I use the same kernel package as the Dom0 kernel - and so far haven't
seen any issues running this as the Dom0. I haven't used it on bare metal
as a non-Xen kernel as yet.

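For anyone wanting to reproduce the console-capture setup described above (one screen window per DomU, each attached with 'xl console'), a rough sketch follows. It only builds the command lines and does not run them. The session name "domu-consoles" and the domain names "domu1" and "domu2" are hypothetical placeholders, not the names used on the affected system.

```python
import shlex


def console_capture_commands(domains, session="domu-consoles"):
    """Build shell command lines that open one screen window per DomU.

    The session name and domain names are illustrative assumptions.
    """
    # Start a detached screen session that will hold the console windows.
    commands = [["screen", "-dmS", session]]
    for name in domains:
        # Ask the running session to open a new window, titled after the
        # domain, that attaches to the domain's console with 'xl console'.
        commands.append(
            ["screen", "-S", session, "-X", "screen", "-t", name,
             "xl", "console", name]
        )
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Hypothetical DomU names; substitute the names reported by 'xl list'.
    for line in console_capture_commands(["domu1", "domu2"]):
        print(line)
```

The printed lines can be reviewed and then run by hand in the Dom0. Enabling logging inside screen, so that the output survives scrollback limits, is left out here.
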
The kernel package I'm currently running is for CentOS / Scientific
Linux / RHEL at:
http://au1.mirror.crc.id.au/repo/el7-testing/x86_64/

I'm using 4.4.6-3 at the moment - which has CONFIG_PREEMPT_VOLUNTARY set
- which *MAY* have increased the time between occurrences - or may have
no effect at all. I'm not convinced either way as yet.

With respect to 4.5, I have had reports from another user of my packages
that they haven't seen the same crash using the same Xen packages but
with kernel 4.5. I have not verified this myself, as I haven't yet gone
down the path of making 4.5 packages for testing. As such, I wouldn't
treat this as a conclusive test case as yet.

I'm hoping that the steps I've taken above may give some more
information with which we can drill down into exactly what is going on -
or at least give more pointers to the root cause.

>>
>> Interestingly, around the same time, /var/log/messages on the remote
>> syslog server shows:
>> Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
>> Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
>> Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94
>> offset -0.002247 sec
>> Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout
>> (limit 1min)!
>> Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed
>> out. Terminating.
>> Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed
>> out. Killing.
>> Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after
>> SIGKILL. Ignoring.
>> Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm
>> timed out. Killing.
>> Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after
>> final SIGKILL. Entering failed mode.
>> Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed
>> state.
>> Mar 29 17:38:26 zeus systemd: systemd-logind.service failed.
>
>
> These may be result of your system not feeling well, which is not
> surprising.
>
> -boris

--
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897