From: Bjorn Helgaas
Date: Mon, 8 Jan 2018 17:23:01 -0600
Subject: Re: [BISECTED] v4.15-rc: Boot regression on x86_64/AMD
To: Linus Torvalds
Cc: Aaro Koskinen, Christian König, Andy Shevchenko, Linux Kernel Mailing List, linux-pci@vger.kernel.org, Boris Ostrovsky, Juergen Gross
References: <20180105220412.fzpwqe4zljdawr36@darkstar.musicnaut.iki.fi>

[+cc Boris, Juergen, linux-pci]

On Fri, Jan 5, 2018 at 6:00 PM, Linus Torvalds wrote:
> On Fri, Jan 5, 2018 at 2:04 PM, Aaro Koskinen wrote:
>>
>> After v4.14, I've been unable to boot my AMD compilation box with the
>> v4.15-rc mainline Linux. It just ends up in a silent reboot loop.
>>
>> I bisected this to:
>>
>> commit fa564ad9636651fd11ec2c79c48dee844066f73a
>> Author: Christian König
>> Date: Tue Oct 24 14:40:29 2017 -0500
>>
>> x86/PCI: Enable a 64bit BAR on AMD Family 15h (Models 00-1f, 30-3f, 60-7f)
>
> Hmm. That was reported to break boot earlier already.
>
> The breakage was supposedly fixed by three patches from Christian:
>
> a19e2696135e: "x86/PCI: Only enable a 64bit BAR on single-socket AMD
> Family 15h"
>
> 470195f82e4e: "x86/PCI: Fix infinite loop in search for 64bit BAR placement"
>
> and a third one that was apparently never applied.
>
> I'm not sure why that third patch was never applied, I'm including it here.
>
> Does the system work for you if you apply that patch (instead of
> reverting all of them)?
>
> I wonder why that patch wasn't applied, but if it doesn't fix things,
> I think we do need to revert it all.
>
> Christian? Bjorn?

I didn't apply the third patch ("x86/PCI: limit the size of the 64bit
BAR to 256GB") because (a) we thought it was optional ("just a
precaution against eventual problems"), (b) we didn't have a good
explanation of why 256GB was the correct number, and (c) it seemed to
be a workaround for a Xen issue that we hoped to fix in a better way.

It does apparently make Aaro's system work, but I still hesitate to
apply it because it's magical -- avoiding the address space from
0x1_00000000 to 0xbd_00000000 makes things work, but we don't know
why. I assume there's some unreported device in that area, but I
don't think we have any real assurance that the
0xbd_00000000-0xfd_00000000 area we now use is any safer.

I would feel better about this if we made it opt-in via a kernel
parameter and/or some kind of whitelist. I still don't really *like*
it, since ACPI does provide a mechanism (_PRS/_SRS) for doing this
safely, and we could just say "if you want to use big BARs, the BIOS
should enable big windows or at least make them available via ACPI
resources." The only problem is that BIOSes don't do that and we
don't yet have Linux support for _PRS/_SRS for host bridges.

I'll prepare a revert as a back-up plan in case we don't come up with
a better solution.

Bjorn
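
For reference, the address arithmetic in the ranges above can be checked with a short sketch. It is not kernel code and does not reflect how the patch computes anything; the names `AVOIDED`, `USED`, `size_gib` and `overlaps` are made up for illustration, and treating each range's end address as exclusive is an assumption. It only shows that the window now in use spans 256 GiB, which appears to match the limit named in the third patch's title, and that it sits directly above the avoided region.

```python
# Sketch only: checks the address-range arithmetic quoted in the mail.
GIB = 1 << 30

# Ranges as written in the mail. Treating the end address as exclusive
# is an assumption; the mail does not say.
AVOIDED = (0x1_00000000, 0xbd_00000000)   # region the third patch steers clear of
USED = (0xbd_00000000, 0xfd_00000000)     # region used with the third patch applied


def size_gib(rng):
    """Return the size of a (start, end) range in GiB."""
    start, end = rng
    return (end - start) // GIB


def overlaps(a, b):
    """Return True if two half-open (start, end) ranges share any address."""
    return a[0] < b[1] and b[0] < a[1]


print(size_gib(USED))           # 256
print(size_gib(AVOIDED))        # 752
print(overlaps(AVOIDED, USED))  # False
```

The numbers say nothing about whether either region is actually free of unreported devices, which is the open question in the mail.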