From: Bjorn Helgaas
Date: Thu, 10 Apr 2014 09:14:59 -0600
Subject: Re: hpsa driver bug crack kernel down!
To: "Woodhouse, David"
Cc: "joro@8bytes.org", "linux-kernel@vger.kernel.org", "bhe@redhat.com", "jiang.liu@linux.intel.com", "linux-scsi@vger.kernel.org", "iommu@lists.linux-foundation.org", "James.Bottomley@hansenpartnership.com", "linux-pci@vger.kernel.org", "scameron@beardog.cce.hp.com", "davidlohr@hp.com"
In-Reply-To: <1397119587.19944.14.camel@shinybook.infradead.org>

On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David wrote:
>> > > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
>> > > >> > > > > dmar: DRHD: handling fault status reg 602
>> > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>
> That "Present bit in context entry is clear" fault means that we have
> not set up *any* mappings for this PCI device… on this IOMMU.
>
>> > Yes, specifically (finally done bisecting):
>> >
>> > commit 2e45528930388658603ea24d49cf52867b928d3e
>> > Author: Jiang Liu
>> > Date: Wed Feb 19 14:07:36 2014 +0800
>> >
>> > iommu/vt-d: Unify the way to process DMAR device scope array
>
> This commit is about how we decide which IOMMU a given PCI device is
> attached to.
>
> Thus, my first guess would be that we are quite happily setting up the
> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> device actually tries to do DMA.
>
> However, I'm not 100% convinced of that. The fault address looks
> suspiciously like a true physical address, not a virtual bus address of
> the type that we'd normally allocate for a dma_map_* operation. Those
> would start at 0xfffff000 and work downwards, typically.

I like the "wrong IOMMU (or no IOMMU at all)" theory. If we didn't
connect the device with an IOMMU at all, that would explain the device
DMAing directly to a physical address, wouldn't it?

> Do you have 'iommu=pt' on the kernel command line? Can I see the full
> dmesg as this system boots, and also a copy of the DMAR table?
>
> We should also rate-limit DMA faults, which would avoid the lockup
> failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> that a device is creating an endless stream of DMA faults and isn't
> aborting the transaction?

You mentioned that POWER with EEH does something intelligent in this
case, but I'm not familiar with that code. We have AER support, which
can result in resetting a device, but I think DMA faults are reported
differently, and I don't think there's any nice existing way for PCI
to deal with them. Maybe there should be, though.

Bjorn
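
For illustration only, the "rate-limit DMA faults" idea quoted above could
look roughly like the following userspace sketch. It is not the intel-iommu
code and not a proposed patch; struct fault_ratelimit, fault_ratelimit_init()
and fault_report_allowed() are hypothetical names, and the caller is assumed
to supply a monotonic timestamp in milliseconds. It only bounds how many
faults get reported per time window; it does nothing to stop the device from
generating them, which is the open question in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: allow at most `burst` fault reports per window. */
struct fault_ratelimit {
	uint64_t interval_ms;   /* length of one window, in milliseconds */
	unsigned int burst;     /* reports allowed per window */
	uint64_t window_start;  /* timestamp at which the current window began */
	unsigned int printed;   /* reports emitted in the current window */
	unsigned int suppressed;/* faults dropped in the current window */
};

/* Set up the limiter; the first window is assumed to start at time 0. */
void fault_ratelimit_init(struct fault_ratelimit *rl,
			  uint64_t interval_ms, unsigned int burst)
{
	assert(rl != NULL);
	rl->interval_ms = interval_ms;
	rl->burst = burst;
	rl->window_start = 0;
	rl->printed = 0;
	rl->suppressed = 0;
}

/*
 * Return true if a fault seen at now_ms should be reported, false if it
 * should be counted silently. A new window opens once interval_ms has
 * elapsed since the start of the current one.
 */
bool fault_report_allowed(struct fault_ratelimit *rl, uint64_t now_ms)
{
	if (now_ms - rl->window_start >= rl->interval_ms) {
		rl->window_start = now_ms;
		rl->printed = 0;
		rl->suppressed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->suppressed++;
	return false;
}
```

A fault handler built along these lines would call fault_report_allowed()
before logging each fault and skip the message when it returns false. The
suppressed count could be logged when a new window opens, so that the
existence of a fault storm is still visible.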