Subject: Re: I got an IOMMU IO page fault. What to do now?
From: Robin Murphy
Date: Wed, 27 Oct 2021 18:18:54 +0100
To: Paul Menzel
Cc: x86@kernel.org, Xinhui Pan, LKML, amd-gfx@lists.freedesktop.org,
    iommu@lists.linux-foundation.org, Ingo Molnar, Borislav Petkov,
    Alex Deucher, it+linux-iommu@molgen.mpg.de, Thomas Gleixner,
    Christian König, Jörg Rödel, Suravee Suthikulpanit
Message-ID: <82fccb9d-43e8-4485-0ddb-7ff260f3ed32@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 27/10/2021 5:45 pm, Paul Menzel wrote:
> Dear Robin,
>
>
> On 25.10.21 18:01, Robin Murphy wrote:
>> On 2021-10-25 12:23, Christian König wrote:
>
>>> not sure how the IOMMU gives out addresses, but the printed ones look
>>> suspicious to me. Something like we are using an invalid address like
>>> -1 or similar.
>>
>> FWIW those look like believable DMA addresses to me, assuming that the
>> DMA mapping APIs are being backed iommu_dma_ops and the device has a
>> 40-bit DMA mask, since the IOVA allocator works top-down.
>>
>> Likely causes are either a race where the dma_unmap_*() call happens
>> before the hardware has really stopped accessing the relevant
>> addresses, or the device's DMA mask has been set larger than it should
>> be, and thus the upper bits have been truncated in the round-trip
>> through the hardware.
>>
>> Given the addresses involved, my suspicions would initially lean
>> towards the latter case - the faults are in the very topmost pages
>> which imply they're the first things mapped in that range. The other
>> contributing factor being the trick that the IOVA allocator plays for
>> PCI devices, where it tries to prefer 32-bit addresses. Thus you're
>> only likely to see this happen once you already have ~3.5-4GB of live
>> DMA-mapped memory to exhaust the 32-bit IOVA space (minus some
>> reserved areas) and start allocating from the full DMA mask. You
>> should be able to check that with a 5.13 or newer kernel by booting
>> with "iommu.forcedac=1" and seeing if it breaks immediately
>> (unfortunately with an older kernel you'd have to manually hack
>> iommu_dma_alloc_iova() to the same effect).
>
> I booted Linux 5.15-rc7 with `iommu.forcedac=1` and the system booted,
> and I could log in remotely over SSH. Please find the Linux kernel
> messages attached. (The system logs say lightdm failed to start, but it
> might be some other issue due to a change in the operating system.)

OK, that looks like it's made the GPU blow up straight away, which is
what I was hoping for (and also appears to reveal another bug where it's
not handling probe failure very well - possibly trying to remove a
non-existent audio device?). Lightdm presumably fails to start because
it doesn't find any display devices, since amdgpu failed to probe.

If you can boot the same kernel without "iommu.forcedac" and get a
successful probe and working display, that will imply that it is
managing to work OK with 32-bit DMA addresses, at which point I'd have
to leave it to Christian and Alex to figure out exactly where DMA
addresses are getting mangled. The only thing that stands out to me is
the reference to "gfx_v6_0", which makes me wonder whether it's related
to gmc_v6_0_sw_init() where a 44-bit DMA mask gets set.
If so, that would suggest that either this particular model of GPU is
more limited than expected, or that SoC only has 40 bits of address
wired up between the PCI host bridge and the IOMMU.

Cheers,
Robin.
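
For illustration only, the truncation arithmetic behind the 44-bit versus 40-bit suspicion can be sketched in a few lines of userspace Python. This is not kernel code: dma_bit_mask() is a stand-in for the kernel's DMA_BIT_MASK(n) macro, while top_page(), seen_by_iommu() and the 4 KiB page size are assumptions made up for this sketch, and the real IOVA allocator is only approximated as "hand out the highest page under the mask first".

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def dma_bit_mask(bits: int) -> int:
    """Userspace stand-in for the kernel's DMA_BIT_MASK(n) macro."""
    return (1 << bits) - 1


def top_page(mask_bits: int) -> int:
    """Highest page-aligned address under the mask (hypothetical helper).

    Roughly what a top-down allocator would return first once it starts
    using the full DMA mask.
    """
    return dma_bit_mask(mask_bits) & ~((1 << PAGE_SHIFT) - 1)


def seen_by_iommu(iova: int, wired_bits: int) -> int:
    """Address left after bits above wired_bits are dropped (hypothetical)."""
    return iova & dma_bit_mask(wired_bits)


iova = top_page(44)                 # address mapped for the device
faulting = seen_by_iommu(iova, 40)  # address the IOMMU would actually see

print(hex(iova))      # 0xffffffff000
print(hex(faulting))  # 0xfffffff000
```

Under these assumptions, a mapping at the top of a 44-bit space arrives at the IOMMU looking like the topmost page of a 40-bit space, where nothing is mapped, whereas any address below 4GB passes through unchanged. That would be consistent with things working until the 32-bit IOVA space is exhausted, or breaking immediately with "iommu.forcedac=1".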