2016-03-17 17:09:57

by Marc Haber

Subject: Major KVM issues with kernel 4.5 on the host

Hi,

I have a (semi-productive[1]) system ("host") running Debian unstable.
On this system, a few VMs (Debian unstable, Debian testing) ("vm1",
"vm2", "vm3") are running. I roll my own kernels and take vanilla
upstream sources. No distribution patches.

Since the host was updated to kernel 4.5, the VMs have started acting up.
All of them. The range of strangeness includes "relocation error,
system halted" on system startup, corrupted data files on disk,
filesystems remounted read-only, libraries rejected with "invalid ELF
format", and binaries segfaulting all of a sudden. Downgrading the host to
kernel 4.4.5 fixed all of those issues.

Going back to 4.5 makes the issues reappear. Here, for example, are ext4 fs
errors logged in one of the VMs:

Mar 17 17:39:57 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #415065: comm aide: deleted inode referenced: 546538
Mar 17 17:39:57 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #415065: comm aide: deleted inode referenced: 546530
Mar 17 17:39:57 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4269: inode #546543: comm aide: bad extra_isize (44800 != 256)
Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4466: inode #546568: comm aide: bogus i_mode (144)
Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #546548: comm aide: deleted inode referenced: 546564
Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #546548: comm aide: deleted inode referenced: 546562
Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4269: inode #546563: comm aide: bad extra_isize (6464 != 256)
Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4466: inode #546561: comm aide: bogus i_mode (0)
Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4269: inode #546529: comm aide: bad extra_isize (1152 != 256)
Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_xattr_block_get:297: inode #546359: comm aide: bad block 677784
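
A quick way to see which kinds of corruption dominate a log like the one above is to tally the reporting function for each EXT4-fs error line. The following is a minimal sketch and not part of the original report; the function name `tally_ext4_errors` and the regular expression are assumptions derived from the line format shown above.

```python
import re
from collections import Counter

# Matches the "EXT4-fs error (device X): func:line:" prefix seen in the log above.
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): (?P<func>\w+):\d+:"
)


def tally_ext4_errors(lines):
    """Count EXT4-fs error lines per (device, reporting function)."""
    counts = Counter()
    for line in lines:
        match = EXT4_ERROR.search(line)
        if match:
            counts[(match.group("dev"), match.group("func"))] += 1
    return counts
```

Feeding it the guest's syslog (for example `tally_ext4_errors(open(path))`, with `path` chosen by the reader) gives one count per device and function such as `ext4_lookup` or `ext4_iget`, which makes it easier to compare runs on different host kernels.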

I'm going to try reproducing the issue on a less "important" machine
so that bisecting is less painful, but maybe you guys have an idea
what's going wrong here.

jftr, kernel 4.5 in guest and in standalone systems seems to be
unproblematic.

Greetings
Marc


[1] my main workstation, running enough services for the local network
that disturbances in its operation cause reasonable discomfort, but not the
Enterprise kind of "productive"

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421


2016-03-17 18:11:58

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

+ kvm ML.

Do you have any funky messages in host's dmesg ? Can you upload a full
dmesg from both a good and a bad host kernel?

On Thu, Mar 17, 2016 at 05:54:35PM +0100, Marc Haber wrote:
> Hi,
>
> I have a (semi-productive[1]) system ("host") running Debian unstable.
> On this system, a few VMs (Debian unstable, Debian testing) ("vm1",
> "vm2", "vm3") are running. I roll my own kernels and take vanilla
> upstream sources. No distribution patches.
>
> Since host was updated to Kernel 4.5, the VMs have started acting up.
> All of them. The range of strangeness begins with "relocation error,
> system halted" on system startup, corrupted data files on disk,
> filesystems remounted read-only, libraries rejected with "invalid ELF
> format", binaries segfaulting all of a sudden. Downgrading host to
> kernel 4.4.5 magically fixed all those issues.
>
> Going back to 4.5 lets the issues reappear. Here, for example, ext4 fs
> errors, logged in one of the VMs:
>
> Mar 17 17:39:57 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #415065: comm aide: deleted inode referenced: 546538
> Mar 17 17:39:57 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #415065: comm aide: deleted inode referenced: 546530
> Mar 17 17:39:57 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4269: inode #546543: comm aide: bad extra_isize (44800 != 256)
> Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4466: inode #546568: comm aide: bogus i_mode (144)
> Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #546548: comm aide: deleted inode referenced: 546564
> Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_lookup:1602: inode #546548: comm aide: deleted inode referenced: 546562
> Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4269: inode #546563: comm aide: bad extra_isize (6464 != 256)
> Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4466: inode #546561: comm aide: bogus i_mode (0)
> Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_iget:4269: inode #546529: comm aide: bad extra_isize (1152 != 256)
> Mar 17 17:39:58 spinturn kernel: EXT4-fs error (device dm-0): ext4_xattr_block_get:297: inode #546359: comm aide: bad block 677784
>
> I'm going to try reproducing the issue on a less "important" machine
> so that bisecting is less painful, but maybe you guys have an idea
> what's going wrong here.
>
> jftr, kernel 4.5 in guest and in standalone systems seems to be
> unproblematic.
>
> Greetings
> Marc
>
>
> [1] my main workstation, running enough services for the local network
> that disturbances in its operation cause reasonable discomfort, but not the
> Enterprise kind of "productive"
>
> --
> -----------------------------------------------------------------------------
> Marc Haber | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
>

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-18 10:01:59

by Paolo Bonzini

Subject: Re: Major KVM issues with kernel 4.5 on the host



On 17/03/2016 19:11, Borislav Petkov wrote:
> I'm going to try reproducing the issue on a less "important" machine
> so that bisecting is less painful, but maybe you guys have an idea
> what's going wrong here.

No idea, sorry. :( Bisecting would be great. I'll also try reproducing
and bisecting next week; in the meantime, just having the host dmesg
would help a lot.

Paolo

2016-03-18 18:49:33

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

Hi Borislav,

On Thu, Mar 17, 2016 at 07:11:28PM +0100, Borislav Petkov wrote:
> Do you have any funky messages in host's dmesg ?

Not that I see.

> Can you upload a full dmesg from both a good and a bad host kernel?

http://q.bofh.de/~mh/stuff/20160317-fan-syslog-kvm-4.4.5
http://q.bofh.de/~mh/stuff/20160317-fan-syslog-kvm-4.5

Hope this helps.

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-03-18 22:04:36

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Fri, Mar 18, 2016 at 07:49:29PM +0100, Marc Haber wrote:
> http://q.bofh.de/~mh/stuff/20160317-fan-syslog-kvm-4.4.5

This one I got.

> http://q.bofh.de/~mh/stuff/20160317-fan-syslog-kvm-4.5

This one doesn't want:

HTTP request sent, awaiting response... 403 Forbidden
2016-03-18 22:57:46 ERROR 403: Forbidden.

So I have a similar system to yours, I'll try to reproduce on it with
4.5.

Anything special you're doing to cause the host kernel to barf which I
should do here?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-19 00:08:42

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

Hi Borislav,

On Fri, Mar 18, 2016 at 11:04:29PM +0100, Borislav Petkov wrote:
> On Fri, Mar 18, 2016 at 07:49:29PM +0100, Marc Haber wrote:
> > http://q.bofh.de/~mh/stuff/20160317-fan-syslog-kvm-4.4.5
>
> This one I got.
>
> > http://q.bofh.de/~mh/stuff/20160317-fan-syslog-kvm-4.5
>
> This one doesn't want:
>
> HTTP request sent, awaiting response... 403 Forbidden
> 2016-03-18 22:57:46 ERROR 403: Forbidden.

Idiot me. File permissions fixed.

> Anything special you're doing to cause the host kernel to barf which I
> should do here?

Booting Debian Linux, apt-get update, apt-get upgrade, and run aide
(which builds checksums for the entire filesystem, a rather disk-bound
activity).

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-03-20 13:32:06

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sat, Mar 19, 2016 at 01:08:37AM +0100, Marc Haber wrote:
> Booting Debian Linux, apt-get update, apt-get upgrade, and run aide
> (which builds checksums for the entire filesystem, a rather disk-bound
> activity).

So I did that and aide ran a whole init and check all the way through
and all fine. I don't see anything out of the ordinary in your dmesg
outputs either.

The next things we should look at are:

* diff .configs - there might be something there

* try to reproduce on debian testing or even stable. I have had similar
issues with debian unstable in the past.

* something else which I'm not thinking of right now.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-20 17:15:25

by Andrey Korolyov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sun, Mar 20, 2016 at 4:31 PM, Borislav Petkov <[email protected]> wrote:
> On Sat, Mar 19, 2016 at 01:08:37AM +0100, Marc Haber wrote:
>> Booting Debian Linux, apt-get update, apt-get upgrade, and run aide
>> (which builds checksums for the entire filesystem, a rather disk-bound
>> activity).
>
> So I did that and aide ran a whole init and check all the way through
> and all fine. I don't see anything out of the ordinary in your dmesg
> outputs either.
>
> The next things we should look at are:
>
> * diff .configs - there might be something there
>
> * try to reproduce on debian testing or even stable. I have had similar
> issues with debian unstable in the past.
>
> * something else which I'm not thinking of right now.
>
> --
> Regards/Gruss,
> Boris.
>

Kinda naive question - do you run the same ucode version as Marc does on his device?

2016-03-20 18:25:29

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sun, Mar 20, 2016 at 08:14:58PM +0300, Andrey Korolyov wrote:
> Kinda naive question - do you run same ucode version as Marc on his device?

Yeah, we both have 0x010000dc.

In case you're referring to the recent faulty AMD microcode patch -
it doesn't apply here. The boxes in question are family 0x10 and the
microcode patch is for family 0x15.
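
Note that /proc/cpuinfo prints the family in decimal, so family 0x10 shows up there as "cpu family : 16". As a minimal sketch (the helper name `parse_cpuinfo` is made up for illustration, and only the first processor entry is read), the two values compared here can be pulled out like this:

```python
def parse_cpuinfo(text):
    """Return (family, microcode) of the first processor in /proc/cpuinfo-style text."""
    family = microcode = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processors
        key, value = key.strip(), value.strip()
        if key == "cpu family" and family is None:
            family = int(value)         # printed in decimal
        elif key == "microcode" and microcode is None:
            microcode = int(value, 16)  # printed in hex, e.g. 0x10000dc
    return family, microcode
```

On a live system the input would be the contents of /proc/cpuinfo; comparing the result against `(0x10, 0x010000dc)` checks both the family and the microcode revision mentioned above.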

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-20 18:52:34

by Andrey Korolyov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sun, Mar 20, 2016 at 9:25 PM, Borislav Petkov <[email protected]> wrote:
> On Sun, Mar 20, 2016 at 08:14:58PM +0300, Andrey Korolyov wrote:
>> Kinda naive question - do you run same ucode version as Marc on his device?
>
> Yeah, we both have 0x010000dc.
>
> In case you're referring to the recent faulty AMD microcode patch -
> it doesn't apply here. The boxes in question are family 0x10 and the
> microcode patch is for family 0x15.
>

Yes, I was suggesting that the issue could also affect a different family
and show up as explicit corruption of guest pages (as opposed to the
generic corruption in the known case). Since there is no direct evidence
of what exactly (data or pgt) is getting corrupted, would disabling
npt for testing purposes be helpful?

2016-03-20 18:59:19

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sun, Mar 20, 2016 at 09:42:15PM +0300, Andrey Korolyov wrote:
> Yes, I suggested that the issue could fall over a different family as
> well to expose explicit corruption of a guest pages (as opposed to a
> generic corruption in a known case).

Probably, but I don't think it is microcode patch related.

> Since there is no direct evidence of what exactly (data or pgt) is
> getting corrupted, would disabling npt for a testing purposes be
> helpful?

So I'm not sure what even happens here yet. I haven't seen anything out
of the ordinary in Marc's dmesg and I wasn't able to reproduce either.
So would it be good to try with "npt=0"? Sure, why not.

Marc, you could give that a try to see if it changes anything...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-21 09:08:56

by Paolo Bonzini

Subject: Re: Major KVM issues with kernel 4.5 on the host



On 19/03/2016 01:08, Marc Haber wrote:
>>> http://q.bofh.de/~mh/stuff/20160317-fan-syslog-kvm-4.5
>>
>> This one doesn't want:
>>
>> HTTP request sent, awaiting response... 403 Forbidden
>> 2016-03-18 22:57:46 ERROR 403: Forbidden.
>
> Idiot me. File permissions fixed.
>
>> Anything special you're doing to cause the host kernel to barf which I
>> should do here?
>
> Booting Debian Linux, apt-get update, apt-get upgrade, and run aide
> (which builds checksums for the entire filesystem, a rather disk-bound
> activity).

Ok, so this is AMD. I'll take a look.

Paolo

2016-04-13 18:20:16

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sun, Mar 20, 2016 at 02:31:58PM +0100, Borislav Petkov wrote:
> On Sat, Mar 19, 2016 at 01:08:37AM +0100, Marc Haber wrote:
> > Booting Debian Linux, apt-get update, apt-get upgrade, and run aide
> > (which builds checksums for the entire filesystem, a rather disk-bound
> > activity).
>
> So I did that and aide ran a whole init and check all the way through
> and all fine. I don't see anything out of the ordinary in your dmesg
> outputs either.
>
> The next things we should look at are:
>
> * diff .configs - there might be something there

Here we go:

[2/501]mh@fan:~$ diff -u0 /boot/config-4.4.6-zgws1 /boot/config-4.5.1-zgws1
--- /boot/config-4.4.6-zgws1 2016-03-28 15:50:36.000000000 +0200
+++ /boot/config-4.5.1-zgws1 2016-04-13 08:32:44.000000000 +0200
@@ -3 +3 @@
-# Linux/x86_64 4.4.6 Kernel Configuration
+# Linux/x86_64 4.5.1 Kernel Configuration
@@ -14 +13,0 @@
-CONFIG_HAVE_LATENCYTOP_SUPPORT=y
@@ -15,0 +15,4 @@
+CONFIG_ARCH_MMAP_RND_BITS_MIN=28
+CONFIG_ARCH_MMAP_RND_BITS_MAX=32
+CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
+CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
@@ -147,7 +149,0 @@
-# CONFIG_CGROUP_DEBUG is not set
-CONFIG_CGROUP_FREEZER=y
-CONFIG_CGROUP_PIDS=y
-CONFIG_CGROUP_DEVICE=y
-CONFIG_CPUSETS=y
-CONFIG_PROC_PID_CPUSET=y
-CONFIG_CGROUP_CPUACCT=y
@@ -158,3 +154,3 @@
-# CONFIG_MEMCG_KMEM is not set
-# CONFIG_CGROUP_HUGETLB is not set
-CONFIG_CGROUP_PERF=y
+CONFIG_BLK_CGROUP=y
+# CONFIG_DEBUG_BLK_CGROUP is not set
+CONFIG_CGROUP_WRITEBACK=y
@@ -165,3 +161,9 @@
-CONFIG_BLK_CGROUP=y
-# CONFIG_DEBUG_BLK_CGROUP is not set
-CONFIG_CGROUP_WRITEBACK=y
+CONFIG_CGROUP_PIDS=y
+CONFIG_CGROUP_FREEZER=y
+# CONFIG_CGROUP_HUGETLB is not set
+CONFIG_CPUSETS=y
+CONFIG_PROC_PID_CPUSET=y
+CONFIG_CGROUP_DEVICE=y
+CONFIG_CGROUP_CPUACCT=y
+CONFIG_CGROUP_PERF=y
+# CONFIG_CGROUP_DEBUG is not set
@@ -254 +255,0 @@
-CONFIG_HAVE_DMA_ATTRS=y
@@ -288,0 +290,4 @@
+CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
+CONFIG_ARCH_MMAP_RND_BITS=28
+CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
+CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
@@ -377,0 +383 @@
+CONFIG_X86_FAST_FEATURE_TESTS=y
@@ -383 +389 @@
-CONFIG_IOSF_MBI=m
+CONFIG_IOSF_MBI=y
@@ -390,0 +397 @@
+# CONFIG_QUEUED_LOCK_STAT is not set
@@ -769,0 +777 @@
+# CONFIG_VMD is not set
@@ -772,0 +781 @@
+CONFIG_NET_EGRESS=y
@@ -824,0 +834 @@
+# CONFIG_INET_DIAG_DESTROY is not set
@@ -945,0 +956,3 @@
+CONFIG_NF_DUP_NETDEV=m
+CONFIG_NFT_DUP_NETDEV=m
+CONFIG_NFT_FWD_NETDEV=m
@@ -1252,0 +1266 @@
+# CONFIG_6LOWPAN_DEBUGFS is not set
@@ -1344,0 +1359 @@
+CONFIG_SOCK_CGROUP_DATA=y
@@ -1411 +1425,0 @@
-CONFIG_WEXT_SPY=y
@@ -1423,5 +1437 @@
-CONFIG_LIB80211=m
-CONFIG_LIB80211_CRYPT_WEP=m
-CONFIG_LIB80211_CRYPT_CCMP=m
-CONFIG_LIB80211_CRYPT_TKIP=m
-# CONFIG_LIB80211_DEBUG is not set
+# CONFIG_LIB80211 is not set
@@ -1469 +1479,2 @@
-# CONFIG_NFC_ST_NCI is not set
+# CONFIG_NFC_ST_NCI_I2C is not set
+# CONFIG_NFC_ST_NCI_SPI is not set
@@ -1616,2 +1627,2 @@
-CONFIG_PARPORT_PC=m
-CONFIG_PARPORT_SERIAL=m
+CONFIG_PARPORT_PC=y
+CONFIG_PARPORT_SERIAL=y
@@ -1619 +1630 @@
-CONFIG_PARPORT_PC_SUPERIO=y
+# CONFIG_PARPORT_PC_SUPERIO is not set
@@ -1968,0 +1980 @@
+# CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is not set
@@ -1971 +1982,0 @@
-# CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is not set
@@ -2131,0 +2143 @@
+# CONFIG_NET_VENDOR_NETRONOME is not set
@@ -2263,43 +2275,6 @@
-# CONFIG_PCMCIA_RAYCS is not set
-# CONFIG_LIBERTAS_THINFIRM is not set
-# CONFIG_AIRO is not set
-# CONFIG_ATMEL is not set
-# CONFIG_AT76C50X_USB is not set
-# CONFIG_AIRO_CS is not set
-# CONFIG_PCMCIA_WL3501 is not set
-# CONFIG_PRISM54 is not set
-# CONFIG_USB_ZD1201 is not set
-# CONFIG_USB_NET_RNDIS_WLAN is not set
-# CONFIG_ADM8211 is not set
-# CONFIG_RTL8180 is not set
-# CONFIG_RTL8187 is not set
-# CONFIG_MAC80211_HWSIM is not set
-# CONFIG_MWL8K is not set
-# CONFIG_ATH_CARDS is not set
-CONFIG_B43=m
-CONFIG_B43_BCMA=y
-CONFIG_B43_SSB=y
-CONFIG_B43_BUSES_BCMA_AND_SSB=y
-# CONFIG_B43_BUSES_BCMA is not set
-# CONFIG_B43_BUSES_SSB is not set
-CONFIG_B43_PCI_AUTOSELECT=y
-CONFIG_B43_PCICORE_AUTOSELECT=y
-CONFIG_B43_SDIO=y
-CONFIG_B43_BCMA_PIO=y
-CONFIG_B43_PIO=y
-CONFIG_B43_PHY_G=y
-CONFIG_B43_PHY_N=y
-CONFIG_B43_PHY_LP=y
-CONFIG_B43_PHY_HT=y
-CONFIG_B43_LEDS=y
-CONFIG_B43_HWRNG=y
-# CONFIG_B43_DEBUG is not set
-# CONFIG_B43LEGACY is not set
-# CONFIG_BRCMSMAC is not set
-# CONFIG_BRCMFMAC is not set
-CONFIG_HOSTAP=m
-CONFIG_HOSTAP_FIRMWARE=y
-# CONFIG_HOSTAP_FIRMWARE_NVRAM is not set
-CONFIG_HOSTAP_PLX=m
-CONFIG_HOSTAP_PCI=m
-CONFIG_HOSTAP_CS=m
+# CONFIG_WLAN_VENDOR_ADMTEK is not set
+# CONFIG_WLAN_VENDOR_ATH is not set
+# CONFIG_WLAN_VENDOR_ATMEL is not set
+# CONFIG_WLAN_VENDOR_BROADCOM is not set
+# CONFIG_WLAN_VENDOR_CISCO is not set
+CONFIG_WLAN_VENDOR_INTEL=y
@@ -2307,0 +2283,2 @@
+# CONFIG_IWL4965 is not set
+# CONFIG_IWL3945 is not set
@@ -2321,14 +2298,13 @@
-# CONFIG_IWL4965 is not set
-# CONFIG_IWL3945 is not set
-# CONFIG_LIBERTAS is not set
-# CONFIG_HERMES is not set
-# CONFIG_P54_COMMON is not set
-# CONFIG_RT2X00 is not set
-# CONFIG_WL_MEDIATEK is not set
-# CONFIG_RTL_CARDS is not set
-# CONFIG_RTL8XXXU is not set
-# CONFIG_WL_TI is not set
-# CONFIG_ZD1211RW is not set
-# CONFIG_MWIFIEX is not set
-# CONFIG_CW1200 is not set
-# CONFIG_RSI_91X is not set
+# CONFIG_WLAN_VENDOR_INTERSIL is not set
+# CONFIG_WLAN_VENDOR_MARVELL is not set
+# CONFIG_WLAN_VENDOR_MEDIATEK is not set
+# CONFIG_WLAN_VENDOR_RALINK is not set
+# CONFIG_WLAN_VENDOR_REALTEK is not set
+# CONFIG_WLAN_VENDOR_RSI is not set
+# CONFIG_WLAN_VENDOR_ST is not set
+# CONFIG_WLAN_VENDOR_TI is not set
+# CONFIG_WLAN_VENDOR_ZYDAS is not set
+# CONFIG_PCMCIA_RAYCS is not set
+# CONFIG_PCMCIA_WL3501 is not set
+# CONFIG_MAC80211_HWSIM is not set
+# CONFIG_USB_NET_RNDIS_WLAN is not set
@@ -2466,0 +2443 @@
+# CONFIG_TOUCHSCREEN_EGALAX_SERIAL is not set
@@ -2612 +2589 @@
-CONFIG_PRINTER=m
+CONFIG_PRINTER=y
@@ -2614 +2591 @@
-CONFIG_PPDEV=m
+CONFIG_PPDEV=y
@@ -2766,0 +2744 @@
+# CONFIG_SPI_LOOPBACK_TEST is not set
@@ -2826,0 +2805 @@
+# CONFIG_GPIO_104_IDI_48 is not set
@@ -2993 +2971,0 @@
-# CONFIG_SENSORS_HTU21 is not set
@@ -3090,0 +3069 @@
+CONFIG_WATCHDOG_SYSFS=y
@@ -3096,0 +3076 @@
+# CONFIG_ZIIRAVE_WATCHDOG is not set
@@ -3151 +3130,0 @@
-CONFIG_SSB_BLOCKIO=y
@@ -3154 +3133 @@
-CONFIG_SSB_B43_PCI_BRIDGE=y
+# CONFIG_SSB_B43_PCI_BRIDGE is not set
@@ -3159 +3137,0 @@
-# CONFIG_SSB_HOST_SOC is not set
@@ -3171 +3148,0 @@
-CONFIG_BCMA_BLOCKIO=y
@@ -3256,0 +3234,2 @@
+# CONFIG_REGULATOR_PV88060 is not set
+# CONFIG_REGULATOR_PV88090 is not set
@@ -3565,0 +3545 @@
+# CONFIG_VIDEO_CS3308 is not set
@@ -3875 +3854,0 @@
-# CONFIG_DRM_RADEON_UMS is not set
@@ -3878,0 +3858 @@
+CONFIG_DRM_AMD_POWERPLAY=y
@@ -3917,0 +3898 @@
+CONFIG_FB_NOTIFY=y
@@ -4621,0 +4603 @@
+# CONFIG_RTC_DRV_RX8010 is not set
@@ -4842 +4823,0 @@
-# CONFIG_IIO_SIMPLE_DUMMY is not set
@@ -4870 +4851,2 @@
-# CONFIG_WILC1000_DRIVER is not set
+# CONFIG_WILC1000_SDIO is not set
+# CONFIG_WILC1000_SPI is not set
@@ -4896,0 +4879 @@
+# CONFIG_ASUS_WIRELESS is not set
@@ -4901,0 +4885 @@
+CONFIG_INTEL_HID_EVENT=m
@@ -4912,0 +4897 @@
+CONFIG_INTEL_PUNIT_IPC=m
@@ -4921,0 +4907,2 @@
+# CONFIG_COMMON_CLK_CS2000_CP is not set
+# CONFIG_COMMON_CLK_NXP is not set
@@ -4978,0 +4966 @@
+# CONFIG_IIO_CONFIGFS is not set
@@ -4980,0 +4969 @@
+# CONFIG_IIO_SW_TRIGGER is not set
@@ -4989,0 +4979,2 @@
+# CONFIG_MMA7455_I2C is not set
+# CONFIG_MMA7455_SPI is not set
@@ -4993,0 +4985 @@
+# CONFIG_MXC6255 is not set
@@ -5010,0 +5003 @@
+# CONFIG_INA2XX_ADC is not set
@@ -5028,0 +5022 @@
+# CONFIG_IAQCORE is not set
@@ -5061,0 +5056,5 @@
+# IIO dummy driver
+#
+# CONFIG_IIO_SIMPLE_DUMMY is not set
+
+#
@@ -5087,0 +5087,5 @@
+# Health sensors
+#
+# CONFIG_MAX30100 is not set
+
+#
@@ -5188,0 +5193,2 @@
+CONFIG_ARM_GIC_MAX_NR=1
+# CONFIG_TS4800_IRQ is not set
@@ -5297,0 +5304 @@
+# CONFIG_MANDATORY_FILE_LOCKING is not set
@@ -5574,0 +5582 @@
+# CONFIG_WQ_WATCHDOG is not set
@@ -5616,0 +5625 @@
+# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
@@ -5692,0 +5702,3 @@
+CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
+# CONFIG_UBSAN is not set
+CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
@@ -5693,0 +5706 @@
+CONFIG_IO_STRICT_DEVMEM=y
@@ -5749,0 +5763 @@
+# CONFIG_INTEGRITY_TRUSTED_KEYRING is not set
@@ -5930,0 +5945,2 @@
+CONFIG_CRYPTO_DEV_QAT_C3XXX=m
+CONFIG_CRYPTO_DEV_QAT_C62X=m
@@ -5931,0 +5948,2 @@
+CONFIG_CRYPTO_DEV_QAT_C3XXXVF=m
+CONFIG_CRYPTO_DEV_QAT_C62XVF=m
@@ -6040,0 +6059 @@
+# CONFIG_IRQ_POLL is not set
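
Much of the diff above is options that merely moved within the file (the cgroup block, for instance), which a line-based diff reports as removals plus additions. A minimal sketch of an order-independent comparison follows; the function names `load_config` and `config_changes` are hypothetical helpers, not an existing tool, and the parser only understands the `CONFIG_X=value` and `# CONFIG_X is not set` line forms seen above.

```python
def load_config(text):
    """Map option name -> value ('y', 'm', a literal, or None for 'is not set')."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def config_changes(old_text, new_text):
    """Return {option: (old, new)} for options whose value differs; order is ignored."""
    old, new = load_config(old_text), load_config(new_text)
    missing = object()  # sentinel meaning "option does not appear in this file"
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name, missing), new.get(name, missing)
        if before != after:
            changes[name] = (
                "<absent>" if before is missing else before,
                "<absent>" if after is missing else after,
            )
    return changes
```

Run on the two files in /boot, this would leave out the reshuffled cgroup and wireless entries and keep real changes such as `CONFIG_IOSF_MBI` going from `m` to `y`.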


> * try to reproduce on debian testing or even stable. I have had similar
> issues with debian unstable in the past.

Gut feeling is that this is not the case. Why would it only appear on
VM hosts then? Debian unstable and testing are pretty close together
these days, and the issue has been around for a month now in current
unstable. Btw, I am a DD, I know my way around Debian and my gut
feeling is pretty well calibrated here.

I can try stable in the VM, but I'd rather not take the host out of
business. Would that help?

My CPU is also rather old:

processor : 5
vendor_id : AuthenticAMD
cpu family : 16
model : 10
model name : AMD Phenom(tm) II X6 1090T Processor
stepping : 0
microcode : 0x10000dc
cpu MHz : 1600.000
cache size : 512 KB
physical id : 0
siblings : 6
core id : 5
cpu cores : 6
apicid : 5
initial apicid : 5
fpu : yes
fpu_exception : yes
cpuid level : 6
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr cpb hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter
bugs : tlb_mmatch apic_c1e fxsave_leak sysret_ss_attrs
bogomips : 6428.52
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate cpb

afaik, this CPU is not affected by the current microcode issues, is
it? I do have the amd64-microcode package installed, which is supposed
to do everything automatically, and my initramfs doesn't have a
microcode "partition" prepended; gunzip | cpio -i gives the plain
initramfs contents directly.

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-13 18:22:22

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sun, Mar 20, 2016 at 07:58:13PM +0100, Borislav Petkov wrote:
> So I'm not sure what even happens here yet. I haven't seen anything out
> of the ordinary in Marc's dmesg and I wasn't able to reproduce either.
> So would it be good to try with "npt=0"? Sure, why not.

npt=0 goes on the kernel command line of the host or of the guest? Or
is it a KVM option?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-13 18:37:18

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Fri, Mar 18, 2016 at 11:01:46AM +0100, Paolo Bonzini wrote:
> On 17/03/2016 19:11, Borislav Petkov wrote:
> > I'm going to try reproducing the issue on a less "important" machine
> > so that bisecting is less painful, but maybe you guys have an idea
> > what's going wrong here.
>
> No idea, sorry. :( Bisecting would be great.

Working on that now.

> I'll also try reproducing and bisecting next week, in the meanwhile
> just having the host dmesg would help a lot.

Attached. I hope the message will get through to the list.

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421


Attachments:
(No filename) (864.00 B)
dmesg.fan.4.5.1 (74.47 kB)

2016-04-13 20:36:46

by Paolo Bonzini

Subject: Re: Major KVM issues with kernel 4.5 on the host



On 13/04/2016 20:37, Marc Haber wrote:
> On Fri, Mar 18, 2016 at 11:01:46AM +0100, Paolo Bonzini wrote:
>> On 17/03/2016 19:11, Borislav Petkov wrote:
>>> I'm going to try reproducing the issue on a less "important" machine
>>> so that bisecting is less painful, but maybe you guys have an idea
>>> what's going wrong here.
>>
>> No idea, sorry. :( Bisecting would be great.
>
> Working on that now.
>
>> I'll also try reproducing and bisecting next week, in the meanwhile
>> just having the host dmesg would help a lot.
>
> Attached. I hope the message will get through to the list.

The dmesg didn't help, but a fresh look at the list of 4.5 patches did.
What the hell was I thinking: I missed write_rdtscp_aux, which
obviously uses MSR_TSC_AUX.

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 31346a3f20a5..1481dea15844 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -39,6 +39,7 @@
 #include <asm/kvm_para.h>

 #include <asm/virtext.h>
+#include <asm/vgtod.h>
 #include "trace.h"

 #define __ex(x) __kvm_handle_fault_on_reboot(x)
@@ -1240,9 +1241,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 			wrmsrl(MSR_AMD64_TSC_RATIO, tsc_ratio);
 		}
 	}
-	/* This assumes that the kernel never uses MSR_TSC_AUX */
-	if (static_cpu_has(X86_FEATURE_RDTSCP))
-		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
 }

 static void svm_vcpu_put(struct kvm_vcpu *vcpu)
@@ -3847,6 +3845,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 	svm->vmcb->save.cr2 = vcpu->arch.cr2;

 	clgi();
+	if (static_cpu_has(X86_FEATURE_RDTSCP))
+		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);

 	local_irq_enable();

@@ -3923,6 +3923,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
 		);

+	if (static_cpu_has(X86_FEATURE_RDTSCP))
+		wrmsrl(MSR_TSC_AUX, __getcpu());
 #ifdef CONFIG_X86_64
 	wrmsrl(MSR_GS_BASE, svm->host.gs_base);
 #else


Paolo

2016-04-13 20:37:27

by Paolo Bonzini

Subject: Re: Major KVM issues with kernel 4.5 on the host



On 13/04/2016 20:22, Marc Haber wrote:
>> So I'm not sure what even happens here yet. I haven't seen anything out
>> of the ordinary in Marc's dmesg and I wasn't able to reproduce either.
>> So would it be good to try with "npt=0"? Sure, why not.
>
> npt=0 goes on the kernel command line of the host or of the guest? Or
> is it a KVM option?

It is an option to the kvm-amd module, but I think I found it.
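
For anyone who still wants to run the npt=0 test: a module option like this is given when the module is loaded on the host, e.g. `modprobe kvm-amd npt=0` or an `options kvm-amd npt=0` line in a modprobe.d file. The sketch below only reads the value back through sysfs to confirm what the loaded module is using. The helper name `read_module_param` is made up for illustration, and the `sysfs_root` argument exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return the current value of a loaded module's parameter, or None if unavailable."""
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None  # module not loaded, or parameter not exported via sysfs
```

On the host, `read_module_param("kvm_amd", "npt")` would be the call to use; note that sysfs spells the module name with an underscore.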

Paolo

2016-04-13 20:52:44

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> Didn't help, but a fresh look at the list of 4.5 patches helped.
> What the hell was I thinking: I missed write_rdtscp_aux, which
> obviously uses MSR_TSC_AUX.

So you want me to apply that to 4.5 or 4.5.1 and try that?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-13 22:29:49

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> Didn't help, but a fresh look at the list of 4.5 patches helped.
> What the hell was I thinking: I missed write_rdtscp_aux, which
> obviously uses MSR_TSC_AUX.

I applied this patch to 4.5. It didn't apply cleanly, so I had to do it
manually, and there is no change in behavior. Sometimes the VM just
crashes, but most of the time the filesystem is remounted read-only.

[ 84.658968] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27903
[ 84.664877] Aborting journal on device dm-0-8.
[ 84.667992] EXT4-fs (dm-0): Remounting filesystem read-only
[ 84.670972] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[ 84.763331] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27898
[ 84.825412] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27895
[ 84.907959] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27893
[ 84.915187] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27900
[ 84.961062] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27889
[ 84.983700] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27891
[ 98.315538] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #23567: comm aide: deleted inode referenced: 27897
[ 98.323606] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #23567: comm aide: deleted inode referenced: 27904
[ 99.889927] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27892
[ 99.893823] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27901
[ 99.901140] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27890
[ 99.904898] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27896
[ 99.909758] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27899
[ 99.914394] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27894
[ 207.132045] serial8250: too much work for irq4
[ 207.220043] serial8250: too much work for irq4
[ 207.312028] serial8250: too much work for irq4


Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-14 01:16:43

by Paolo Bonzini

Subject: Re: Major KVM issues with kernel 4.5 on the host



On 14/04/2016 00:29, Marc Haber wrote:
> On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
>> Didn't help, but a fresh look at the list of 4.5 patches helped.
>> What the hell was I thinking, I missed write_rdtscp_aux who
>> obviously uses MSR_TSC_AUX.
>
> I applied this patch to 4.5, which didn't go cleanly, I had to do it
> manually, and there is no change in behavior. Sometimes, the VM just
> crashes, but most times the filesystem is remounted ro.

Ok, then I guess bisection is needed. Please first try commit
45bdbcfdf241. If it fails, then the bug came in together with KVM's merge
window changes for 4.5-rc1. Please apply the patch I sent here when
bisection is past 46896c73c1a4dde527c3a3cc43379deeb41985a1 (which means
that probably that should be the commit you try second; the bisection
then becomes much easier).

Thanks,

Paolo

2016-04-14 05:22:32

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> On 14/04/2016 00:29, Marc Haber wrote:
> > On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> >> Didn't help, but a fresh look at the list of 4.5 patches helped.
> >> What the hell was I thinking, I missed write_rdtscp_aux who
> >> obviously uses MSR_TSC_AUX.
> >
> > I applied this patch to 4.5, which didn't go cleanly, I had to do it
> > manually, and there is no change in behavior. Sometimes, the VM just
> > crashes, but most times the filesystem is remounted ro.
>
> Ok, then I guess bisection is needed. Please first try commit
> 45bdbcfdf241. If it fails, then the bug came in together with KVM's merge
> window changes for 4.5-rc1. Please apply the patch I sent here when
> bisection is past 46896c73c1a4dde527c3a3cc43379deeb41985a1 (which means
> that probably that should be the commit you try second; the bisection
> then becomes much easier).

I have never bisected this deeply. Can you please give more advice,
with which two commits to start? And how do I find out whether I am
"past" a commit? I am also not a git expert; a few command lines would
be appreciated.

Things have not become any easier this night; 4.5-rc7 ran for more
than three hours before it failed :-(

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-14 06:07:28

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> Ok, then I guess bisection is needed. Please first try commit
> 45bdbcfdf241.

That kernel labels itself as "4.4.0-rc5+", is that correct?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-14 16:48:10

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> Ok, then I guess bisection is needed. Please first try commit
> 45bdbcfdf241.

I did git checkout 45bdbcfdf241 and built the resulting kernel
4.4.0-rc5. This one has now been running for ten hours, which is
threefold the longest time that a faulty kernel has held before a VM
experienced corruption. So I guess, that one is fine.

Since 4.5.0-rc1 is bad, I guess I do:

git checkout 45bdbcfdf241
git bisect start
git bisect good
git bisect bad v4.5-rc1

right?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-14 17:30:50

by Paolo Bonzini

Subject: Re: Major KVM issues with kernel 4.5 on the host



On 14/04/2016 18:47, Marc Haber wrote:
>> > Ok, then I guess bisection is needed. Please first try commit
>> > 45bdbcfdf241.
> I did git checkout 45bdbcfdf241 and built the resulting kernel
> 4.4.0-rc5. This one has now been running for ten hours, which is
> threefold the longest time that a faulty kernel has held before a VM
> experienced corruption. So I guess, that one is fine.

Interesting, this means it's not a KVM bug. You can ignore my patch
from yesterday (though we'll get it in anyway).

> Since 4.5.0-rc1 is bad, I guess I do:
>
> git checkout 45bdbcfdf241
> git bisect start
> git bisect good
> git bisect bad v4.5-rc1

This is correct but you also want to do

git bisect good v4.4
git bisect good v4.4-rc5

so that bisection basically works through the commits in the merge window.

Thanks,

Paolo

2016-04-14 17:47:45

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 14, 2016 at 07:30:43PM +0200, Paolo Bonzini wrote:
> On 14/04/2016 18:47, Marc Haber wrote:
> >> > Ok, then I guess bisection is needed. Please first try commit
> >> > 45bdbcfdf241.
> > I did git checkout 45bdbcfdf241 and built the resulting kernel
> > 4.4.0-rc5. This one has now been running for ten hours, which is
> > threefold the longest time that a faulty kernel has held before a VM
> > experienced corruption. So I guess, that one is fine.
>
> Interesting, this means it's not a KVM bug. You can ignore my patch
> from yesterday (though we'll get it in anyway).
>
> > Since 4.5.0-rc1 is bad, I guess I do:
> >
> > git checkout 45bdbcfdf241
> > git bisect start
> > git bisect good
> > git bisect bad v4.5-rc1
>
> This is correct but you also want to do
>
> git bisect good v4.4
> git bisect good v4.4-rc5
>
> so that bisection basically works through the commits in the merge window.

So I start over from this:

[47/544]mh@fan:~/linux/debug/linux$ git checkout 45bdbcfdf241
HEAD is now at 45bdbcf... kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
[48/545]mh@fan:~/linux/debug/linux$ git bisect start
[49/546]mh@fan:~/linux/debug/linux$ git bisect good
[50/547]mh@fan:~/linux/debug/linux$ git bisect bad v4.5-rc1
Bisecting: 5761 revisions left to test after this (roughly 13 steps)
[cbd88cd4c07f9361914ab7fd7e21c9227986fe68] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
[51/548]mh@fan:~/linux/debug/linux$ git bisect good v4.4
Bisecting: 5468 revisions left to test after this (roughly 12 steps)
[f9a03ae123c92c1f45cd2ca88d0f6edd787be78c] Merge tag 'for-f2fs-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
[52/549]mh@fan:~/linux/debug/linux$ git bisect good v4.4-rc5
Bisecting: 5468 revisions left to test after this (roughly 12 steps)
[f9a03ae123c92c1f45cd2ca88d0f6edd787be78c] Merge tag 'for-f2fs-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
[53/550]mh@fan:~/linux/debug/linux$

This is going to take a few days as detecting a "bad" version may take
a few hours.

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-21 08:39:57

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 14, 2016 at 07:22:20AM +0200, Marc Haber wrote:
> On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> > On 14/04/2016 00:29, Marc Haber wrote:
> > > On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> > >> Didn't help, but a fresh look at the list of 4.5 patches helped.
> > >> What the hell was I thinking, I missed write_rdtscp_aux who
> > >> obviously uses MSR_TSC_AUX.
> > >
> > > I applied this patch to 4.5, which didn't go cleanly, I had to do it
> > > > manually, and there is no change in behavior. Sometimes, the VM just
> > > crashes, but most times the filesystem is remounted ro.
> >
> > Ok, then I guess bisection is needed. Please first try commit
> > 45bdbcfdf241. If it fails, then the bug came in together with KVM's merge
> > window changes for 4.5-rc1. Please apply the patch I sent here when
> > bisection is past 46896c73c1a4dde527c3a3cc43379deeb41985a1 (which means
> > that probably that should be the commit you try second; the bisection
> > then becomes much easier).
>
> I have never bisected this deeply. Can you please give more advice,
> with which two commits to start? And how do I find out whether I am
> "past" a commit? I am also not a git expert; a few command lines would
> be appreciated.

I have tried bisecting, and finally bisect says that the bad commit is
0e749e54244eec87b2a3cd0a4314e60bc6781115 dax: increase granularity of dax_clear_blocks() operations

However, a kernel built after
$ git checkout 0e749e54244eec87b2a3cd0a4314e60bc6781115
seems to be fine, at least my VM is running for 15 hours now.

I guess I need to start over again with git bisect good
0e749e54244eec87b2a3cd0a4314e60bc6781115 and git bisect bad v4.5.

Currently, I cannot explain how this has happened, I must have flagged
an actually good kernel as bad from my understanding of git bisect.
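A minimal sketch of how the recorded verdicts could be reviewed before
starting over, assuming the bisection state has not yet been cleared with
git bisect reset. git bisect log and git bisect replay are standard git
subcommands; the helper names and the log file path are made up for
illustration.

```shell
# Hypothetical helper: write the verdicts recorded so far to the file $1.
save_bisect_log() {
    git bisect log > "$1"
}

# Hypothetical helper: restart the bisection from the (possibly edited)
# log file $1. Lines for verdicts that are now in doubt can be deleted
# from the file first, so only the trusted steps are replayed.
replay_bisect_log() {
    git bisect replay "$1"
}

# Possible use (the path is an assumption):
#   save_bisect_log /tmp/bisect.log
#   $EDITOR /tmp/bisect.log        # drop the doubtful "git bisect bad" line
#   replay_bisect_log /tmp/bisect.log
```

Replaying a trimmed log keeps the verdicts that are still trusted, so only
the steps after the doubtful one have to be tested again.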

Can you give advice how to continue here?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-21 12:37:33

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 21, 2016 at 10:39:48AM +0200, Marc Haber wrote:
> Currently, I cannot explain how this has happened, I must have flagged
> an actually good kernel as bad from my understanding of git bisect.
>
> Can you give advice how to continue here?

Yap, sounds like you marked a bisection step incorrectly, which led
into the wrong direction. How reliable is your reproducer?

Also, do the bisection as Paolo suggested:

* try 45bdbcfdf241.

* then do

$ git bisect start v4.5-rc1 v4.4

which marks -rc1 as bad and 4.4 as good.

While you're doing that bisect, do what Paolo said by applying the diff
here

https://lkml.kernel.org/r/[email protected]

whenever the bisection point you're at contains

46896c73c1a4 ("KVM: svm: add support for RDTSCP")

You should first check that the above hunk applies by doing

$ patch -p1 --dry-run -i /tmp/hunk

If it applies fine, you then apply it

$ patch -p1 -i /tmp/hunk

All clear?

If not, do not hesitate to ask.

Thanks

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-04-21 14:50:18

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 21, 2016 at 02:37:11PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 10:39:48AM +0200, Marc Haber wrote:
> > Currently, I cannot explain how this has happened, I must have flagged
> > an actually good kernel as bad from my understanding of git bisect.
> >
> > Can you give advice how to continue here?
>
> Yap, sounds like you marked a bisection step incorrectly, which led
> into the wrong direction. How reliable is your reproducer?

Usually, the crash or filesystem corruption happens in the first 15 to
30 minutes. I have had one instance running three hours before
corrupting, I have therefore upped the run time to nine hours before
saying "this kernel is good".

What bothers me is that since I ended up with a "suspect" commit that
actually results in a "good" kernel (running for 22 hours now), I must
have said "bad" to an actually "good" kernel, which means that I had
an unrelated crash or corruption. Is that reasoning correct?

> Also, do the bisection as Paolo suggested:
>
> * try 45bdbcfdf241.

That one qualified as "good" six days ago. I'll retry, maybe I just
didn't wait long enough.

"Trying" means make oldconfig, make deb-pkg in my case right? Does it
matter what I answer to the numerous config questions that keep coming
up during the oldconfig step?

> * then do
>
> $ git bisect start v4.5-rc1 v4.4
>
> which marks -rc1 as bad and 4.4 as good.

Would it help to explicitly mark
0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
gained during the last week is not completely lost?

> While you're doing that bisect, do what Paolo said by applying the diff
> here
>
> https://lkml.kernel.org/r/[email protected]
>
> whenever the bisection point you're at contains
>
> 46896c73c1a4 ("KVM: svm: add support for RDTSCP")
>
> You should first check that the above hunk applies by doing
>
> $ patch -p1 --dry-run -i /tmp/hunk
>
> If it applies fine, you then apply it
>
> $ patch -p1 -i /tmp/hunk
>
> All clear?

So I need to git log | grep 46896c73c1a4 and apply the patch again
each time the commit is found?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-21 16:51:29

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 21, 2016 at 04:50:05PM +0200, Marc Haber wrote:
> What bothers me is that since I ended up with a "suspect" commit that
> actually results in a "good" kernel (running for 22 hours now), I must
> have said "bad" to an actually "good" kernel, which means that I had
> an unrelated crash or corruption. Is that reasoning correct?

Hmm, did that "unrelated crash or corruption" have the same symptoms as
the original one?

> That one qualified as "good" six days ago. I'll retry, maybe I just
> didn't wait long enough.

So if the trigger time is varying so much, I'd try to double that to
make sure I'm fairly certain about each commit I'm testing.

Also, this is a single box we're talking about, right? And you're sure
it hasn't had any corruption issues so far?

I see you have amd64_edac loading, so it must have ECC DIMMs. Have you
had any reports in the past of ECC errors in dmesg? Or other MCEs,
lockups, etc? Can you grep your logs for stuff like "hardware error",
"mce", "edac" etc? Do a case-insensitive search.
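The case-insensitive search could look roughly like the sketch below. The
helper name is made up, and the log path in the usage comment is an
assumption that varies by distribution.

```shell
# Hypothetical helper: keep only log lines that hint at hardware errors.
# Reads log text on stdin; -i makes the match case-insensitive and -E
# allows the alternation between the three search terms.
scan_hw_errors() {
    grep -iE 'hardware error|mce|edac'
}

# Possible use on the running kernel's ring buffer (may need root):
#   dmesg | scan_hw_errors
# or on persisted logs (path is an assumption):
#   scan_hw_errors < /var/log/kern.log
```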

> "Trying" means make oldconfig, make deb-pkg in my case right? Does it
> matter what I answer to the numerous config questions that keep coming
> up during the oldconfig step?

What I do is:

$ git bisect <good|bad>

to mark the current commit after having tested it. Then I do

$ yes "" | make oldconfig

to set the new config options. Then

$ make -j7
$ make modules_install install

and reboot into the new kernel. Kernel name will possibly change each
time so I write down on paper which kernel I'm testing. You can verify
when booting it by doing:

$ dmesg | head
[ 0.000000] Linux version 4.6.0-rc2+ (boris@pd) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP PREEMPT Wed Apr 6 20:22:51 CEST 2016
...

that date at the end of the line and number "#1" should be current.
The number is also in .version and is printed when you finish building:

Kernel: arch/x86/boot/bzImage is ready (#1)
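As a rough sketch, the build number can also be pulled out of the version
banner with a small filter, so it can be compared against the number
printed at the end of the build. The helper name is made up for
illustration; /proc/version carries the same banner as the first dmesg
line.

```shell
# Hypothetical helper: print the build number (the digits after '#') from
# a kernel version banner read on stdin.
kernel_build_number() {
    sed -n 's/.*#\([0-9][0-9]*\) .*/\1/p'
}

# Possible use on the running kernel:
#   kernel_build_number < /proc/version
#   cat .version            # in the build tree, for comparison
```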

> Would it help to explicitly mark
> 0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
> gained during the last week is not completely lost?

I'd do the whole thing again, just to be sure.

I know, bisection is very time-consuming :-\ And it is particularly
annoying if it is done on the box I'm normally using daily.

> So I need to git log | grep 46896c73c1a4 and apply the patch again
> each time the commit is found?

I think you can let git do that for ya:

$ git branch --contains 46896c73c1a4
* (HEAD detached at 46896c73c1a4)

which shows that the currently checked-out HEAD contains that commit. If you do

$ git checkout 46896c73c1a4~1

then that "(HEAD detached..." line is not in the list of branches
containing it.
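For scripting the same check, a sketch along these lines may be easier to
use than reading the branch list. It relies on git merge-base
--is-ancestor, which exits with status 0 when the first commit is an
ancestor of (or the same as) the second. The helper name is made up, and
/tmp/hunk is the patch file path used earlier in this thread.

```shell
# Hypothetical helper: succeed if commit $1 is contained in the currently
# checked-out HEAD, fail otherwise.
contains_commit() {
    git merge-base --is-ancestor "$1" HEAD
}

# Possible use at each bisection step, before building:
#   if contains_commit 46896c73c1a4; then
#       patch -p1 --dry-run -i /tmp/hunk && patch -p1 -i /tmp/hunk
#   fi
```

Since the patched files count as local modifications, they need to be
reverted (for example with git reset --hard) before marking the step good
or bad, otherwise git may refuse to check out the next revision.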

HTH.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-04-21 20:04:42

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 21, 2016 at 06:51:06PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 04:50:05PM +0200, Marc Haber wrote:
> > What bothers me is that since I ended up with a "suspect" commit that
> > actually results in a "good" kernel (running for 22 hours now), I must
> > have said "bad" to an actually "good" kernel, which means that I had
> > an unrelated crash or corruption. Is that reasoning correct?
>
> Hmm, did that "unrelated crash or corruption" have the same symptoms as
> the original one?

Yes, but there are two symptoms. The VM either suffers file system
issues (garbage read from files, or an aborted ext4 journal and
following ro remount) or it stops dead in its tracks.


> > That one qualified as "good" six days ago. I'll retry, maybe I just
> > didn't wait long enough.
>
> So if the trigger time is varying so much, I'd try to double that to
> make sure I'm fairly certain about each commit I'm testing.

The longest trigger time I have seen was three hours, I tripled that
to nine hours, that probably was not enough.

> Also, this is a single box we're talking about, right? And you're sure
> it hasn't had any corruption issues so far?

It is a single box, and it runs perfectly with kernel 4.4.

> I see you have amd64_edac loading, so it must have ECC DIMMs. Have you
> had any reports in the past of ECC errors in dmesg? Or other MCEs,
> lockups, etc? Can you grep your logs for stuff like "hardware error",
> "mce", "edac" etc? Do a case-insensitive search.

The box reports about one correctable error per week, so I probably
have a faulty DIMM, but since the issue only surfaces in VMs while the
host system is in perfect working order...

And yes, I am considering simply replacing the box with an Intel-based one.

I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
30 "Machine check events logged" since January. How do I see which
events were logged?

> > "Trying" means make oldconfig, make deb-pkg in my case right? Does it
> > matter what I answer to the numerous config questions that keep coming
> > up during the oldconfig step?
>
> What I do is:
>
> $ git bisect <good|bad>
>
> to mark the current commit after having tested it. Then I do
>
> $ yes "" | make oldconfig
>
> to set the new config options.

So you basically select the default for new options.

> Then
>
> $ make -j7
> $ make modules_install install
>
> and reboot into the new kernel. Kernel name will possibly change each
> time so I write down on paper which kernel I'm testing.

I go the way of Debian packages since it is easier to handle the
crypto file systems when the machine is booting up.

And yes, I think about doing a test reinstall on unencrypted disk to
find out whether encryption plays a role, but I currently need the
machine too urgently to take it out of service for half a month, and,
again, the host system is in perfect working order, it is just VMs
that barf.

> You can verify when booting it by doing:
>
> $ dmesg | head
> [ 0.000000] Linux version 4.6.0-rc2+ (boris@pd) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP PREEMPT Wed Apr 6 20:22:51 CEST 2016
> ...
>
> that date at the end of the line and number "#1" should be current.

I check the date of the package I am installing and the date stamp of
the kernels being installed to /boot. I'm reasonably sure I have that
under control.

> > Would it help to explicitly mark
> > 0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
> > gained during the last week is not completely lost?
>
> I'd do the whole thing again, just to be sure.
>
> I know, bisection is very time-consuming :-\ And it is particularly
> annoying if it is done on the box I'm normally using daily.

... and if testing a "good" kernel means a day.

> > So I need to git log | grep 46896c73c1a4 and apply the patch again
> > each time the commit is found?
>
> I think you can let git do that for ya:
>
> $ git branch --contains 46896c73c1a4
> * (HEAD detached at 46896c73c1a4)
>
> which shows that the currently checked-out HEAD contains that commit. If you do
>
> $ git checkout 46896c73c1a4~1
>
> then that "(HEAD detached..." line is not in the list of branches
> containing it.

And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
right?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-23 16:04:55

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> Yes, but there are two symptoms. The VM either suffers file system
> issues (garbage read from files, or an aborted ext4 journal and
> following ro remount) or it stops dead in its tracks.

Stops dead? What does that mean exactly? Box is wedged solid and it
doesn't react to any key presses?

Because if so, this could really be a DRAM going bad and a correctable
error turning into an uncorrectable. How old is the DRAM in that box?
Judging by your CPU, it should be a couple of years...

> The longest trigger time I have seen was three hours, I tripled that
> to nine hours, that probably was not enough.

So enlarge even more I guess.

> The box reports about one correctable error per week, so I probably
> have a faulty DIMM, but since the issue only surfaces in VMs while the
> host system is in perfect working order...

So it could be that correctable error turns into an uncorrectable one at
some point. But then you should be getting an exception...

> And yes, I am considering simply replacing the box with an Intel-based one.

Your CPU is fine, from what I've seen so far.

> I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
> 30 "Machine check events logged" since January. How do I see which
> events were logged?

Hmm, you have

[ 18.149300] MCE: In-kernel MCE decoding enabled.

that's CONFIG_EDAC_DECODE_MCE, so you should have some "Hardware Error"
lines in dmesg, I'd guess, decoding the errors.

> So you basically select the default for new options.

Yap.

> I go the way of Debian packages since it is easier to handle the
> crypto file systems when the machine is booting up.

As long as you're testing the correct bisection kernels...

> And yes, I think about doing a test reinstall on unencrypted disk to
> find out whether encryption plays a role, but I currently need the
> machine too urgently to take it out of service for half a month, and,
> again, the host system is in perfect working order, it is just VMs
> that barf.

Yeah, I can't reproduce it here and I have a very similar box to yours
which is otherwise idle, more or less.

Another fact which points to a DIMM potentially going bad...

> I check the date of the package I am installing and the date stamp of
> the kernels being installed to /boot. I'm reasonably sure I have that
> under control.

Good.

> ... and if testing a "good" kernel means a day.

Yeah, it is annoying. In a perfect world, we all should have two
identical boxes so that we use one as a workstation and the second for
testing when the first one, the workstation barfs. I should bring that
up with my manager next time... :-)

> And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
> right?

Yap.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-04-23 18:43:48

by Marc Haber

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sat, Apr 23, 2016 at 06:04:29PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> > Yes, but there are two symptoms. The VM either suffers file system
> > issues (garbage read from files, or an aborted ext4 journal and
> > following ro remount) or it stops dead in its tracks.
>
> Stops dead? What does that mean exactly? Box is wedged solid and it
> doesn't react to any key presses?

No ping, no reaction on serial console, no reaction on virtual
console, no syslog entries.

> Because if so, this could really be a DRAM going bad and a correctable
> error turning into an uncorrectable. How old is the DRAM in that box?
> Judging by your CPU, it should be a couple of years...

Uncorrectable errors would still be identified by the ECC hardware,
and the box wouldn't be perfectly fine with an "old" kernel.

> > The box reports about one correctable error per week, so I probably
> > have a faulty DIMM, but since the issue only surfaces in VMs while the
> > host system is in perfect working order...
>
> So it could be that correctable error turns into an uncorrectable one at
> some point. But then you should be getting an exception...

Yes, that would be in the logs.

> > And yes, I am considering simply replacing the box with an Intel-based one.
>
> Your CPU is fine, from what I've seen so far.

But we still postulate that the issue only shows on older AMD
CPUs. Otherwise, I wouldn't be the only one experiencing this.

> > I go the way of Debian packages since it is easier to handle the
> > crypto file systems when the machine is booting up.
>
> As long as you're testing the correct bisection kernels...

I am reasonably sure about that, yes.

> > And yes, I think about doing a test reinstall on unencrypted disk to
> > find out whether encryption plays a role, but I currently need the
> > machine too urgently to take it out of service for half a month, and,
> > again, the host system is in perfect working order, it is just VMs
> > that barf.
>
> Yeah, I can't reproduce it here and I have a very similar box to yours
> which is otherwise idle, more or less.
>
> Another fact which points to a DIMM potentially going bad...

Do you want me to memtest for 24 hours?

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421

2016-04-23 19:12:08

by Dr. David Alan Gilbert

Subject: Re: Major KVM issues with kernel 4.5 on the host

* Marc Haber ([email protected]) wrote:
> On Sat, Apr 23, 2016 at 06:04:29PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> > > Yes, but there are two symptoms. The VM either suffers file system
> > > issues (garbage read from files, or an aborted ext4 journal and
> > > following ro remount) or it stops dead in its tracks.
> >
> > Stops dead? What does that mean exactly? Box is wedged solid and it
> > doesn't react to any key presses?
>
> No ping, no reaction on serial console, no reaction on virtual
> console, no syslog entries.
>
> > Because if so, this could really be a DRAM going bad and a correctable
> > error turning into an uncorrectable. How old is the DRAM in that box?
> > Judging by your CPU, it should be a couple of years...
>
> Uncorrectable errors would still be identified by the ECC hardware,
> and the box wouldn't be perfectly fine with an "old" kernel.

Hmm, your problem does sound like bad hardware, but....
If you've got a nice reliable crash, can you try turning transparent huge pages
off on the host:

  echo never > /sys/kernel/mm/transparent_hugepage/enabled
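To confirm the setting took effect, the same sysfs file can be read back;
the active mode is the word shown in square brackets, as in
"always [madvise] never". A small sketch, with a made-up helper name:

```shell
# Hypothetical helper: print the active transparent hugepage mode, i.e.
# the bracketed word of the sysfs line read on stdin.
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# Possible use on the host:
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
```

Writing to the sysfs file needs root and does not survive a reboot, so
the setting has to be reapplied after each boot into a test kernel.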

Dave

> > > The box reports about one correctable error per week, so I probably
> > > have a faulty DIMM, but since the issue only surfaces in VMs while the
> > > host system is in perfect working order...
> >
> > So it could be that correctable error turns into an uncorrectable one at
> > some point. But then you should be getting an exception...
>
> Yes, that would be in the logs.
>
> > > And yes, I am pondering to simply replace the box with an Intel CPU.
> >
> > Your CPU is fine, from what I've seen so far.
>
> But we still postulate that the issue does only show on older AMD
> CPUs. Otherwise, I wouldn't be the only one making this experience.
>
> > > I go the way of Debian packages since it is easier to handle the
> > > crypto file systems when the machine is booting up.
> >
> > As long as you're testing the correct bisection kernels...
>
> I am reasonably sure about that, yes.
>
> > > And yes, I think about doing a test reinstall on unencrypted disk to
> > > find out whether encryption plays a role, but I currently need the
> > > machine to urgently to take it out of serice for half a month, and,
> > > again, the host system is in perfect working order, it is just VMs
> > > that barf.
> >
> > Yeah, I can't reproduce it here and I have a very similar box to yours
> > which is otherwise idle, more or less.
> >
> > Another fact which points to a DIMM potentially going bad...
>
> Do you want me to memtest for 24 hours?
>
> Greetings
> Marc
>
> --
> -----------------------------------------------------------------------------
> Marc Haber | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/

2016-04-23 23:58:25

by Borislav Petkov

Subject: Re: Major KVM issues with kernel 4.5 on the host

On Sat, Apr 23, 2016 at 08:43:41PM +0200, Marc Haber wrote:
> Uncorrectable errors would still be identified by the ECC hardware,

Not if the hardware decides to syncflood so that we don't even get to
run the #MC handler...

> and the box wouldn't be perfectly fine with an "old" kernel.

Maybe the "old" kernel is not causing all the required ingredients to
come together for the uncorrectable error to happen. But yeah, I agree,
the fact that 4.4 is fine kinda doesn't fit with the uncorrectable error
theory.

> Yes, that would be in the logs.

Presumably. And see above.
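
For whatever the kernel did get to record, the EDAC counters in sysfs give a quick total of correctable and uncorrectable errors. A rough sketch, assuming an EDAC driver is loaded so that /sys/devices/system/edac/mc/mc*/ exists; the helper name edac_total and its optional directory argument are made up for illustration. A sync flood would not show up here, since the #MC handler never runs in that case.

```shell
#!/bin/sh
# Sum one EDAC counter (ce_count or ue_count) across all memory controllers.
# edac_total is a hypothetical helper. $1 is the counter name; the optional
# $2 overrides the EDAC base directory and exists only for testing.
edac_total() {
    total=0
    for f in "${2:-/sys/devices/system/edac/mc}"/mc*/"$1"; do
        # Skip unmatched globs and unreadable files (no EDAC driver loaded).
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}

echo "correctable:   $(edac_total ce_count)"
echo "uncorrectable: $(edac_total ue_count)"
```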

> But we still postulate that the issue does only show on older AMD
> CPUs. Otherwise, I wouldn't be the only one making this experience.

It actually shows only on this one system. At least I'm not aware of any
other report of the same issue. My system with a F10h, rev E is just
fine.

> Do you want me to memtest for 24 hours?

Yeah, that memtest crap never triggers any ECCs. But if you're bored,
why not...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.