2017-04-03 20:46:22

by Samuel Sieb

Subject: AMD IOMMU causing filesystem corruption

I filed a bug in bugzilla, but I wasn't sure what category to put it in,
so I suspect I ended up picking one that doesn't get looked at much.

https://bugzilla.kernel.org/show_bug.cgi?id=195051

The issue is that on a specific Acer laptop with a dual-core A9, if I
don't disable the IOMMU using iommu=off, it has immediate and rapidly
fatal filesystem corruption by the time a user logs into the desktop.
What led me to try that was at one point I noticed an error message
about the iommu in the logs. However, I did not have a chance to save
that due to the corruption obliterating the log files.


2017-04-03 21:39:12

by Joerg Roedel

Subject: Re: AMD IOMMU causing filesystem corruption

Hi Samuel,

On Mon, Apr 03, 2017 at 01:38:08PM -0700, Samuel Sieb wrote:
> I filed a bug in bugzilla, but I wasn't sure what category to put it
> in, so I suspect I ended up picking one that doesn't get looked at
> much.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=195051
>
> The issue is that on a specific Acer laptop with a dual-core A9, if
> I don't disable the IOMMU using iommu=off, it has immediate and
> rapidly fatal filesystem corruption by the time a user logs into the
> desktop. What led me to try that was at one point I noticed an error
> message about the iommu in the logs. However, I did not have a
> chance to save that due to the corruption obliterating the log
> files.

You have a system based on the AMD Stoney platform, on which the PCI-ATS
feature of the GPU is broken, as we recently found out.

Can you please test whether the attached patch fixes the issue on your
machine?

From 09cbdcbbd23f0823e7651b4f35b13ae633b3fbe2 Mon Sep 17 00:00:00 2001
From: Joerg Roedel <[email protected]>
Date: Tue, 28 Mar 2017 13:20:27 +0200
Subject: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS

ATS is broken on these devices. Under invalidation load, the
GPU does not reply to invalidations anymore, causing
Completion-wait loop timeouts on the AMD IOMMU driver side.
Fix it by not enabling ATS on these devices.

Note that below mentioned commit is not broken, it just
triggers the issue because it might cause invalidation
storms on devices.

Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
Reported-by: Daniel Drake <[email protected]>
Cc: Alexander Deucher <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
---
drivers/pci/ats.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index eeb9fb2..711bdb2 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -17,10 +17,18 @@
 
 #include "pci.h"
 
+static const struct pci_device_id broken_ats_tbl[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, 0x98e4) }, /* AMD Stoney GPU part */
+	{ 0 }
+};
+
 void pci_ats_init(struct pci_dev *dev)
 {
 	int pos;
 
+	if (pci_match_id(broken_ats_tbl, dev))
+		return;
+
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
 	if (!pos)
 		return;
--
1.9.1
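
For readers unfamiliar with the failure mode named in the commit message, here is a minimal userspace sketch of a bounded completion-wait loop. It is not the amd-iommu driver code: `struct fake_dev`, `fake_read_sem()` and `completion_wait_model()` are hypothetical names invented for this illustration, and the fake device only models "replies after N polls" versus "never replies".

```c
#include <stdint.h>

/* Hypothetical stand-in for a device that should acknowledge a command. */
struct fake_dev {
    unsigned polls_until_reply; /* 0 means the device never replies */
    unsigned polls_seen;        /* how often the semaphore was read */
};

/* Read the (modelled) completion semaphore; non-zero means "done". */
static uint64_t fake_read_sem(struct fake_dev *d)
{
    d->polls_seen++;
    if (d->polls_until_reply != 0 && d->polls_seen >= d->polls_until_reply)
        return 1;
    return 0;
}

/*
 * Poll the semaphore at most max_polls times.
 * Returns 0 when the device replied, -1 when the loop timed out.
 */
int completion_wait_model(struct fake_dev *d, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (fake_read_sem(d) != 0)
            return 0;
    }
    return -1;
}
```

A device that stops answering makes every such wait run to its bound and return the timeout result, which is the pattern the commit message describes for the GPU under invalidation load.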

2017-04-04 03:45:51

by Samuel Sieb

Subject: Re: AMD IOMMU causing filesystem corruption

On 04/03/2017 02:39 PM, Joerg Roedel wrote:
> You have a system based on the AMD Stoney platform, on which the PCI-ATS
> feature of the GPU is broken, as we recently found out.
>
> Can you please test whether the attached patch fixes the issue on your
> machine?
>
Yes, that works, thank you!

Now I'm curious what Windows does. Either they don't use that feature
or they already knew to avoid it. In which case, why did AMD take so
long to let the kernel developers know?

2017-04-04 07:11:38

by Samuel Sieb

Subject: Re: AMD IOMMU causing filesystem corruption

On 04/03/2017 08:37 PM, Samuel Sieb wrote:
> On 04/03/2017 02:39 PM, Joerg Roedel wrote:
>> You have a system based on the AMD Stoney platform, on which the PCI-ATS
>> feature of the GPU is broken, as we recently found out.
>>
>> Can you please test whether the attached patch fixes the issue on your
>> machine?
>>
> Yes, that works, thank you!
>
Unfortunately, that turned out to be a bit premature. After compiling a
kernel on it remotely over ssh (not interacting with the logged-in
desktop), a reboot failed with endless completion-wait loop timeout
messages, and after a forced power-off and restart it won't boot. The EFI
filesystem was even destroyed, probably because the kernel installation
modified a file on it.

2017-04-04 07:32:19

by Joerg Roedel

Subject: Re: AMD IOMMU causing filesystem corruption

On Tue, Apr 04, 2017 at 12:11:31AM -0700, Samuel Sieb wrote:
> Unfortunately, that turned out to be a bit premature. After
> compiling a kernel on it remotely over ssh (not interacting with the
> logged in desktop), a reboot failed with endless completion-wait
> loop timeout messages and after a force poweroff and restart it
> won't boot. The EFI filesystem was even destroyed, probably because
> the kernel installation modified a file on there.

Yeah, please boot the machine with amd_iommu=off during re-installation
and when you install the modified kernel. And then boot into the patched
kernel with amd_iommu=on (which is the default).


Thanks,

Joerg

2017-04-04 16:29:44

by Samuel Sieb

Subject: Re: AMD IOMMU causing filesystem corruption

On 04/04/2017 12:32 AM, Joerg Roedel wrote:
> On Tue, Apr 04, 2017 at 12:11:31AM -0700, Samuel Sieb wrote:
>> Unfortunately, that turned out to be a bit premature. After
>> compiling a kernel on it remotely over ssh (not interacting with the
>> logged in desktop), a reboot failed with endless completion-wait
>> loop timeout messages and after a force poweroff and restart it
>> won't boot. The EFI filesystem was even destroyed, probably because
>> the kernel installation modified a file on there.
>
> Yeah, please boot the machine with amd_iommu=off during re-installation
> and when you install the modified kernel. And then boot into the patched
> kernel with amd_iommu=on (which is the default).
>
That's what I did. While running with iommu=off, I compiled and
installed a 4.11rc kernel with the patch. I rebooted to use that kernel
and then compiled and installed a 4.10 kernel with that patch and
another unrelated patch. That is what I described above. The
filesystem destruction happened while running the 4.11rc kernel with
that patch. Is there any way to verify that the patch was actually
having any effect? Can I check if ATS is enabled or not? I will have
to rebuild the system before I can test again.

2017-04-07 10:23:05

by Joerg Roedel

Subject: Re: AMD IOMMU causing filesystem corruption

On Tue, Apr 04, 2017 at 09:29:37AM -0700, Samuel Sieb wrote:
> That's what I did. While running with iommu=off, I compiled and
> installed a 4.11rc kernel with the patch. I rebooted to use that
> kernel and then compiled and installed a 4.10 kernel with that patch
> and another unrelated patch. That is what I described above. The
> filesystem destruction happened while running the 4.11rc kernel with
> that patch. Is there any way to verify that the patch was actually
> having any effect? Can I check if ATS is enabled or not? I will
> have to rebuild the system before I can test again.

Can you please send me output of 'lspci -nv' on your system?


Thanks,

Joerg

2017-04-07 10:27:49

by Joerg Roedel

Subject: Re: AMD IOMMU causing filesystem corruption

On Tue, Apr 04, 2017 at 09:29:37AM -0700, Samuel Sieb wrote:
> That's what I did. While running with iommu=off, I compiled and
> installed a 4.11rc kernel with the patch. I rebooted to use that
> kernel and then compiled and installed a 4.10 kernel with that patch
> and another unrelated patch. That is what I described above. The
> filesystem destruction happened while running the 4.11rc kernel with
> that patch. Is there any way to verify that the patch was actually
> having any effect? Can I check if ATS is enabled or not? I will
> have to rebuild the system before I can test again.

Also, please try the attached debug-diff on your kernel. It completely
disables the use of ATS in the amd-iommu driver.

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 98940d1392cb..f019aa67c54c 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -467,7 +467,7 @@ static int iommu_init_device(struct device *dev)
 		struct amd_iommu *iommu;
 
 		iommu = amd_iommu_rlookup_table[dev_data->devid];
-		dev_data->iommu_v2 = iommu->is_iommu_v2;
+		dev_data->iommu_v2 = false;
 	}
 
 	dev->archdata.iommu = dev_data;
diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c
index 6130278c5d71..41d0e645960c 100644
--- a/drivers/iommu/amd_iommu_init.c
+++ b/drivers/iommu/amd_iommu_init.c
@@ -171,7 +171,7 @@ int amd_iommus_present;
 
 /* IOMMUs have a non-present cache? */
 bool amd_iommu_np_cache __read_mostly;
-bool amd_iommu_iotlb_sup __read_mostly = true;
+bool amd_iommu_iotlb_sup __read_mostly = false;
 
 u32 amd_iommu_max_pasid __read_mostly = ~0;


2017-04-08 06:49:40

by Samuel Sieb

Subject: Re: AMD IOMMU causing filesystem corruption

On 04/07/2017 03:22 AM, Joerg Roedel wrote:
> Can you please send me output of 'lspci -nv' on your system?
>
I have to figure out how to rebuild the system and find the time to do
it before I can test that patch, but here's the lspci output:

00:00.0 0600: 1022:1576
Subsystem: 1025:1099
Flags: bus master, fast devsel, latency 0

00:00.2 0806: 1022:1577
Subsystem: 1025:1099
Flags: fast devsel, IRQ 24
Capabilities: [40] Secure device <?>
Capabilities: [64] MSI: Enable+ Count=1/4 Maskable- 64bit+
Capabilities: [74] HyperTransport: MSI Mapping Enable+ Fixed+

00:01.0 0300: 1002:98e4 (rev c1) (prog-if 00 [VGA controller])
Subsystem: 1025:1099
Flags: bus master, fast devsel, latency 0, IRQ 37
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at f0000000 (64-bit, prefetchable) [size=8M]
I/O ports at 3000 [size=256]
Memory at f0d00000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at f0d80000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270] #19
Capabilities: [2b0] Address Translation Service (ATS)
Capabilities: [2c0] Page Request Interface (PRI)
Capabilities: [2d0] Process Address Space ID (PASID)
Kernel driver in use: amdgpu
Kernel modules: amdgpu

00:01.1 0403: 1002:15b3
Subsystem: 1002:15b3
Flags: bus master, fast devsel, latency 0, IRQ 255
Memory at f0d60000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Kernel modules: snd_hda_intel

00:02.0 0600: 1022:157b
Flags: fast devsel

00:02.1 0604: 1022:157c (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 26
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 00001000-00001fff
Memory behind bridge: f0e00000-f0ffffff
Prefetchable memory behind bridge: 00000000f1000000-00000000f11fffff
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Root Port (Slot+), MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [c0] Subsystem: 1022:1234
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270] #19
Kernel driver in use: pcieport
Kernel modules: shpchp

00:02.2 0604: 1022:157c (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 27
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 00002000-00002fff
Memory behind bridge: f0c00000-f0cfffff
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Root Port (Slot+), MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [c0] Subsystem: 1022:1234
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270] #19
Kernel driver in use: pcieport
Kernel modules: shpchp

00:02.3 0604: 1022:157c (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 28
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
Memory behind bridge: f0800000-f09fffff
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Root Port (Slot+), MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [c0] Subsystem: 1022:1234
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270] #19
Kernel driver in use: pcieport
Kernel modules: shpchp

00:03.0 0600: 1022:157b
Flags: fast devsel

00:08.0 1080: 1022:1578
Subsystem: 1025:1099
Flags: bus master, fast devsel, latency 0, IRQ 255
Memory at f0d40000 (64-bit, prefetchable) [size=128K]
Memory at f0b00000 (32-bit, non-prefetchable) [size=1M]
Memory at f0d6f000 (32-bit, non-prefetchable) [size=4K]
Memory at f0d6a000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [50] MSI-X: Enable- Count=2 Masked-
Capabilities: [5c] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [60] Power Management version 3
Capabilities: [a4] PCI Advanced Features

00:09.0 0600: 1022:157d
Flags: fast devsel

00:09.2 0403: 1022:157a
Subsystem: 1025:1099
Flags: bus master, fast devsel, latency 0, IRQ 255
Memory at f0d64000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Capabilities: [a4] PCI Advanced Features
Kernel modules: snd_hda_intel

00:10.0 0c03: 1022:7914 (rev 20) (prog-if 30 [XHCI])
Subsystem: 1025:1099
Flags: bus master, fast devsel, latency 0, IRQ 18
Memory at f0d68000 (64-bit, non-prefetchable) [size=8K]
Capabilities: [50] Power Management version 3
Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
Capabilities: [a0] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [100] Latency Tolerance Reporting
Kernel driver in use: xhci_hcd

00:11.0 0106: 1022:7901 (rev 4b) (prog-if 01 [AHCI 1.0])
Subsystem: 1025:1099
Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 19
I/O ports at 3118 [size=8]
I/O ports at 3124 [size=4]
I/O ports at 3110 [size=8]
I/O ports at 3120 [size=4]
I/O ports at 3100 [size=16]
Memory at f0d6c000 (32-bit, non-prefetchable) [size=1K]
Capabilities: [60] Power Management version 3
Capabilities: [70] SATA HBA v1.0
Kernel driver in use: ahci

00:12.0 0c03: 1022:7908 (rev 49) (prog-if 20 [EHCI])
Subsystem: 1025:1099
Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 18
Memory at f0d6d000 (32-bit, non-prefetchable) [size=256]
Capabilities: [c0] Power Management version 2
Capabilities: [e4] Debug port: BAR=1 offset=00e0
Kernel driver in use: ehci-pci

00:14.0 0c05: 1022:790b (rev 4b)
Subsystem: 1025:1099
Flags: 66MHz, medium devsel
Kernel driver in use: piix4_smbus
Kernel modules: i2c_piix4, sp5100_tco

00:14.3 0601: 1022:790e (rev 11)
Subsystem: 1025:1099
Flags: bus master, 66MHz, medium devsel, latency 0

00:18.0 0600: 1022:15b0
Flags: fast devsel

00:18.1 0600: 1022:15b1
Flags: fast devsel

00:18.2 0600: 1022:15b2
Flags: fast devsel

00:18.3 0600: 1022:15b3
Flags: fast devsel
Capabilities: [f0] Secure device <?>

00:18.4 0600: 1022:15b4
Flags: fast devsel
Kernel modules: fam15h_power

00:18.5 0600: 1022:15b5
Flags: fast devsel

02:00.0 ff00: 10ec:5287 (rev 01)
Subsystem: 1025:1099
Flags: bus master, fast devsel, latency 0, IRQ 33
Memory at f0c05000 (32-bit, non-prefetchable) [size=4K]
Expansion ROM at f0c10000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [170] Latency Tolerance Reporting
Capabilities: [178] L1 PM Substates
Kernel driver in use: rtsx_pci
Kernel modules: rtsx_pci

02:00.1 0200: 10ec:8168 (rev 12)
Subsystem: 1025:1099
Flags: bus master, fast devsel, latency 0, IRQ 35
I/O ports at 2000 [size=256]
Memory at f0c04000 (64-bit, non-prefetchable) [size=4K]
Memory at f0c00000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [160] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [170] Latency Tolerance Reporting
Capabilities: [178] L1 PM Substates
Kernel driver in use: r8169
Kernel modules: r8169

03:00.0 0280: 168c:0042 (rev 31)
Subsystem: 11ad:08a6
Flags: bus master, fast devsel, latency 0, IRQ 39
Memory at f0800000 (64-bit, non-prefetchable) [size=2M]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable+ Count=8/8 Maskable+ 64bit-
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] Virtual Channel
Capabilities: [168] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [178] Latency Tolerance Reporting
Capabilities: [180] L1 PM Substates
Kernel driver in use: ath10k_pci
Kernel modules: ath10k_pci
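
One detail in this listing can be checked against the first patch: the GPU at 00:01.0 reports vendor:device 1002:98e4, while the patch's table entry uses PCI_VENDOR_ID_AMD, which is 0x1022 in the kernel's pci_ids.h (0x1002 is PCI_VENDOR_ID_ATI). The sketch below is a userspace model of zero-terminated ID-table matching, not the kernel's pci_match_id(); `struct id_entry`, `match_id()` and the two tables are hypothetical names for illustration. Under that model the AMD-vendor entry does not match this device, which may be why the first patch appeared to have no effect. The thread itself does not state this.

```c
#include <stddef.h>
#include <stdint.h>

#define VENDOR_ID_AMD 0x1022 /* value of PCI_VENDOR_ID_AMD in pci_ids.h */
#define VENDOR_ID_ATI 0x1002 /* value of PCI_VENDOR_ID_ATI in pci_ids.h */

/* Simplified ID entry: only vendor and device, unlike struct pci_device_id. */
struct id_entry {
    uint16_t vendor;
    uint16_t device;
};

/* Walk a zero-terminated table; return 1 on an exact vendor/device match. */
int match_id(const struct id_entry *tbl, uint16_t vendor, uint16_t device)
{
    for (; tbl->vendor != 0; tbl++) {
        if (tbl->vendor == vendor && tbl->device == device)
            return 1;
    }
    return 0;
}

/* Table as written in the first patch (vendor 0x1022). */
static const struct id_entry tbl_amd[] = {
    { VENDOR_ID_AMD, 0x98e4 },
    { 0, 0 }
};

/* Same device ID under the vendor the GPU reports in the listing. */
static const struct id_entry tbl_ati[] = {
    { VENDOR_ID_ATI, 0x98e4 },
    { 0, 0 }
};
```

Calling `match_id(tbl_amd, 0x1002, 0x98e4)` returns 0, while `match_id(tbl_ati, 0x1002, 0x98e4)` returns 1.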

2017-04-25 17:55:39

by Samuel Sieb

Subject: Re: AMD IOMMU causing filesystem corruption

On 04/07/2017 03:27 AM, Joerg Roedel wrote:
> Also, please try the attached debug-diff on your kernel. It completely
> disables the use of ATS in the amd-iommu driver.
>
I applied this patch to 4.11.0 rc8 and then stress tested the laptop
with another kernel build while running graphical applications and there
appears to be no damage to the filesystem. Is there any way to
determine if ATS is enabled or disabled?

2017-04-26 10:14:16

by Joerg Roedel

Subject: Re: AMD IOMMU causing filesystem corruption

Hi Samuel,

On Tue, Apr 25, 2017 at 10:55:24AM -0700, Samuel Sieb wrote:
> On 04/07/2017 03:27 AM, Joerg Roedel wrote:
> >Also, please try the attached debug-diff on your kernel. It completely
> >disables the use of ATS in the amd-iommu driver.
> >
> I applied this patch to 4.11.0 rc8 and then stress tested the laptop
> with another kernel build while running graphical applications and
> there appears to be no damage to the filesystem. Is there any way
> to determine if ATS is enabled or disabled?

Great, thanks for testing the patch. The lspci tool should be able to
tell you whether the ATS capability is enabled on the GPU. Running
'lspci -vvv -s <GPUDEV>' should give you that info.


Joerg
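
What lspci reports here comes from the device's PCIe extended configuration space, where the ATS capability (extended capability ID 0x0f) carries a control register at offset 6 whose bit 15 is the Enable bit. In verbose lspci output this typically shows up on an "ATSCtl:" line as Enable+ or Enable-. The sketch below does the same check on a raw config-space dump held in memory. `ats_enabled()` and the helper names are hypothetical, and obtaining the dump (for example from the device's sysfs `config` file, which generally needs root to expose the extended area) is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_CAP_START   0x100  /* extended capabilities begin at this offset */
#define EXT_CAP_ID_ATS  0x0f   /* ATS extended capability ID */
#define ATS_CTRL_OFFSET 0x06   /* ATS Control register, relative to the capability */
#define ATS_CTRL_ENABLE 0x8000 /* Enable bit in the ATS Control register */

/* Config space is little-endian; assemble values byte by byte. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Walk the extended capability list in a config-space dump.
 * Returns 1 if ATS is present and enabled, 0 if present but disabled,
 * and -1 if no ATS capability was found or the dump is too short.
 */
int ats_enabled(const uint8_t *cfg, size_t len)
{
    size_t pos = EXT_CAP_START;
    int guard = 480; /* upper bound on list length, guards against loops */

    while (pos >= EXT_CAP_START && pos + 8 <= len && guard-- > 0) {
        uint32_t hdr = read_le32(cfg + pos);

        if (hdr == 0 || hdr == 0xffffffffu)
            break; /* empty or unreadable header */
        if ((hdr & 0xffff) == EXT_CAP_ID_ATS) {
            uint16_t ctrl = read_le16(cfg + pos + ATS_CTRL_OFFSET);
            return (ctrl & ATS_CTRL_ENABLE) ? 1 : 0;
        }
        pos = (hdr >> 20) & 0xffc; /* next-capability pointer, dword aligned */
    }
    return -1;
}
```

With a quirk that skips ATS setup for the GPU, one would expect this check (or the lspci line) to report the capability as present but not enabled.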

2017-04-26 21:31:54

by Samuel Sieb

Subject: Re: AMD IOMMU causing filesystem corruption

On 04/26/2017 03:14 AM, Joerg Roedel wrote:
> On Tue, Apr 25, 2017 at 10:55:24AM -0700, Samuel Sieb wrote:
>> On 04/07/2017 03:27 AM, Joerg Roedel wrote:
>>> Also, please try the attached debug-diff on your kernel. It completely
>>> disables the use of ATS in the amd-iommu driver.
>>>
>> I applied this patch to 4.11.0 rc8 and then stress tested the laptop
>> with another kernel build while running graphical applications and
>> there appears to be no damage to the filesystem. Is there any way
>> to determine if ATS is enabled or disabled?
>
> Great, thanks for testing the patch. The lspci tool should be able to
> tell you whether the ATS capability is enabled on the GPU. Running
> 'lspci -vvv -s <GPUDEV>' should give you that info.
>
This test was done with the patch that always disables ATS. Which is
the current patch to selectively disable it? The last patch I tried
didn't seem to work.

2017-04-26 21:44:04

by Joerg Roedel

Subject: Re: AMD IOMMU causing filesystem corruption

On Wed, Apr 26, 2017 at 02:31:40PM -0700, Samuel Sieb wrote:
> This test was done with the patch that always disables ATS. Which
> is the current patch to selectively disable it? The last patch I
> tried didn't seem to work.

It's

[PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs


You should have received it as you were on the Cc list.



Joerg

2017-04-27 19:33:02

by Samuel Sieb

Subject: Re: AMD IOMMU causing filesystem corruption

On 04/26/2017 02:43 PM, Joerg Roedel wrote:
> On Wed, Apr 26, 2017 at 02:31:40PM -0700, Samuel Sieb wrote:
>> This test was done with the patch that always disables ATS. Which
>> is the current patch to selectively disable it? The last patch I
>> tried didn't seem to work.
>
> It's
>
> [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs
>
>
> You should have received it as you were on the Cc list.
>
Yes, but there was some discussion about it, so I wanted to make sure
that was the latest. I can verify that the patch works. ATS is
disabled and there is no filesystem corruption.