Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751845AbcDUUEm (ORCPT ); Thu, 21 Apr 2016 16:04:42 -0400
Received: from torres.zugschlus.de ([85.214.131.164]:34764 "EHLO torres.zugschlus.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751597AbcDUUEk (ORCPT ); Thu, 21 Apr 2016 16:04:40 -0400
Date: Thu, 21 Apr 2016 22:04:33 +0200
From: Marc Haber
To: Borislav Petkov
Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm ML
Subject: Re: Major KVM issues with kernel 4.5 on the host
Message-ID: <20160421200433.GL21755@torres.zugschlus.de>
References: <56EBD20A.1020608@redhat.com> <20160413183701.GC7600@torres.zugschlus.de> <570EADD2.8030300@redhat.com> <20160413222942.GD7600@torres.zugschlus.de> <570EEF6D.40307@redhat.com> <20160414052220.GE7600@torres.zugschlus.de> <20160421083948.GF21755@torres.zugschlus.de> <20160421123711.GD28821@pd.tnic> <20160421145005.GI21755@torres.zugschlus.de> <20160421165106.GK28821@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20160421165106.GK28821@pd.tnic>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4607
Lines: 126

On Thu, Apr 21, 2016 at 06:51:06PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 04:50:05PM +0200, Marc Haber wrote:
> > What bothers me is that since I ended up with a "suspect" commit that
> > actually results in a "good" kernel (running for 22 hours now), I must
> > have said "bad" to an actually "good" kernel, which means that I had
> > an unrelated crash or corruption. Is that reasoning correct?
>
> Hmm, did that "unrelated crash or corruption" have the same symptoms as
> the original one?

Yes, but there are two symptoms. The VM either suffers file system
issues (garbage read from files, or an aborted ext4 journal and
following ro remount) or it stops dead in its tracks.

> > That one qualified as "good" six days ago. I'll retry, maybe I just
> > didn't wait long enough.
>
> So if the trigger time is varying so much, I'd try to double that to
> make sure I'm fairly certain about each commit I'm testing.

The longest trigger time I have seen was three hours; I tripled that
to nine hours, which probably was not enough.

> Also, this is a single box we're talking about, right? And you're sure
> it hasn't had any corruption issues so far?

It is a single box, and it runs perfectly with kernel 4.4.

> I see you have amd64_edac loading, so it must have ECC DIMMs. Have you
> had any reports in the past of ECC errors in dmesg? Or other MCEs,
> lockups, etc? Can you grep your logs for stuff like "hardware error",
> "mce", "edac" etc? Do a case-insensitive search.

The box reports about one correctable error per week, so I probably
have a faulty DIMM, but since the issue only surfaces in VMs while the
host system is in perfect working order... And yes, I am considering
simply replacing the box with an Intel-based one.

I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
30 "Machine check events logged" since January. How do I see which
events were logged?

> > "Trying" means make oldconfig, make deb-pkg in my case right? Does it
> > matter what I answer to the numerous config questions that keep coming
> > up during the oldconfig step?
>
> What I do is:
>
> $ git bisect
>
> to mark the current commit after having tested it. Then I do
>
> $ yes "" | make oldconfig
>
> to set the new config options.

So you basically select the default for new options.

> Then
>
> $ make -j7
> $ make modules_install install
>
> and reboot into the new kernel. Kernel name will possibly change each
> time so I write down on paper which kernel I'm testing.

I go the way of Debian packages since it is easier to handle the
crypto file systems when the machine is booting up.
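
A minimal sketch of the case-insensitive log search asked for above. The
function name scan_hw_errors is hypothetical, the log path in the usage
comment is an assumption (it is not stated anywhere in this thread), and
"machine check" is added to the pattern only because that string shows
up in the log lines quoted above.

```shell
# Hypothetical helper: print lines that mention hardware errors, MCEs,
# EDAC or machine checks, ignoring case. It reads the files given as
# arguments, or standard input if no file is given.
scan_hw_errors() {
    grep -iE 'hardware error|mce|edac|machine check' "$@"
}

# Usage (the log path is an assumption, adjust it for the system):
#   scan_hw_errors /var/log/kern.log
#   dmesg | scan_hw_errors
```

The pattern is a plain substring match, so it may also catch unrelated
lines that happen to contain "mce" or "edac"; it only narrows down what
to read, it does not decode the logged events.
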
And yes, I think about doing a test reinstall on unencrypted disk to
find out whether encryption plays a role, but I currently need the
machine too urgently to take it out of service for half a month, and,
again, the host system is in perfect working order, it is just VMs
that barf.

> You can verify when booting it by doing:
>
> $ dmesg | head
> [ 0.000000] Linux version 4.6.0-rc2+ (boris@pd) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP PREEMPT Wed Apr 6 20:22:51 CEST 2016
> ...
>
> that date at the end of the line and number "#1" should be current.

I check the date of the package I am installing and the date stamp of
the kernels being installed to /boot. I'm reasonably sure I have that
under control.

> > Would it help to explicitly mark
> > 0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
> > gained during the last week is not completely lost?
>
> I'd do the whole thing again, just to be sure.
>
> I know, bisection is very time-consuming :-\ And it is particularly
> annoying if it is done on the box I'm normally using daily.

... and if testing a "good" kernel means a day.

> > So I need to git log | grep 46896c73c1a4 and apply the patch again
> > each time the commit is found?
>
> I think you can let git do that for ya:
>
> $ git branch --contains 46896c73c1a4
> * (HEAD detached at 46896c73c1a4)
>
> that lists that the current checked out HEAD contains that commit. If you do
>
> $ git checkout 46896c73c1a4~1
>
> then that "(HEAD detached..." line is not in the list of branches
> containing it.

And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
right?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421