Subject: Re: mlx4 BUG_ON in probe path
From: Yishai Hadas
To: Bjorn Helgaas
Cc: Yishai Hadas, netdev@vger.kernel.org, linux-rdma@vger.kernel.org, Johannes Thumshirn, linux-kernel@vger.kernel.org
Date: Thu, 17 Nov 2016 12:22:44 +0200
In-Reply-To: <20161116182527.GC26600@bhelgaas-glaptop.roam.corp.google.com>

On 11/16/2016 8:25 PM, Bjorn Helgaas wrote:
> Hi Yishai,
>
> Johannes has been working on an mlx4 initialization problem on an
> IBM x3850 X6. The underlying problem is a PCI core issue -- we're
> setting RCB in the Mellanox device, which means it thinks it can
> generate 128-byte Completions, even though the Root Port above it
> can't handle them. That issue is
> https://bugzilla.kernel.org/show_bug.cgi?id=187781
>
> The machine crashed when this happened, apparently not because of any
> error reported via AER, but because mlx4 contains a BUG_ON, probably
> the one in mlx4_enter_error_state().
>
> That one happens if pci_channel_offline() returns false. Is this
> telling us about a problem in PCI error handling, or is it just a case
> where mlx4 isn't as smart as it could be?

Yes, at that step we expect a problem/bug in the PCI layer that should
be fixed (e.g. the channel is reported as online when it is really
offline). Can you please evaluate and confirm that?
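
To make the scenario concrete, here is a minimal userspace sketch of the
situation discussed above. It is not the mlx4 code: every name in it
(mock_pci_dev, mock_pci_channel_offline, mock_reset, enter_error_state)
is a hypothetical stand-in, and the control flow is only an assumption
based on the description in this thread. It models a device whose channel
is reported as online while a reset nevertheless fails, and returns a
distinct outcome for that case instead of crashing.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct mock_pci_dev {
	bool channel_offline; /* what the (mock) PCI core reports */
	bool reset_works;     /* whether the device really responds to a reset */
};

/* Models the role of pci_channel_offline(): the PCI core's view of the channel. */
static bool mock_pci_channel_offline(const struct mock_pci_dev *pdev)
{
	return pdev->channel_offline;
}

/* Models a device reset: 0 on success, negative value on failure. */
static int mock_reset(const struct mock_pci_dev *pdev)
{
	return pdev->reset_works ? 0 : -1;
}

enum reset_outcome {
	RESET_OK,                  /* channel online, reset succeeded */
	RESET_SKIPPED_OFFLINE,     /* channel reported offline, reset not attempted */
	RESET_FAILED_WHILE_ONLINE  /* channel reported online, but reset failed */
};

/*
 * Assumed flow: the reset is attempted only when the channel is reported
 * as online. A failure at that point is the inconsistent state described
 * in the reply (reported online, really offline). A driver that asserts
 * on this would crash; this sketch logs and reports it to the caller.
 */
static enum reset_outcome enter_error_state(const struct mock_pci_dev *pdev)
{
	if (mock_pci_channel_offline(pdev))
		return RESET_SKIPPED_OFFLINE;

	if (mock_reset(pdev) != 0) {
		fprintf(stderr, "reset failed although channel is reported online\n");
		return RESET_FAILED_WHILE_ONLINE;
	}

	return RESET_OK;
}
```

Calling enter_error_state() on a mock device with channel_offline = false
and reset_works = false yields RESET_FAILED_WHILE_ONLINE, which is the
case that would point at the PCI layer rather than at the driver.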