Subject: RE: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3
Date: Wed, 8 Oct 2008 08:25:49 -0700
From: "Graham, David"
To: "Hillier, Gernot"
Cc: "Allan, Bruce W", "Hockert, Jeff W", "Graham, David"
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Gernot,

Thanks for reporting this issue. We have witnessed this in our labs too, only on platforms that have BMC management firmware. I'm very familiar with the problem, and believe that we have fixed it, though the application of the fix may not be simple. The problem is a result of improper synchronization between the platform FW and the e1000e driver when they attempt concurrent access to LAN resources, and fixes were made both on the driver side, and on the FW side. On some platforms a simple driver update resolves the problem, others require FW fixes too.
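
To make the synchronization issue concrete, here is a minimal toy model of two parties sharing one resource through an ownership flag with a bounded wait. It is only a sketch under stated assumptions: the class and function names are hypothetical, it does not reproduce the e1000e driver code or the firmware interface, and this thread does not say how a failed handshake maps to the "Hardware Error" message.

```python
import threading
import time


class SharedLanResource:
    """Hypothetical stand-in for a LAN resource used by both firmware and driver."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None  # "firmware", "driver", or None

    def acquire(self, who, timeout_s=0.05, poll_s=0.001):
        """Poll for ownership until timeout_s elapses; return True on success."""
        deadline = time.monotonic() + timeout_s
        while True:
            if self._lock.acquire(blocking=False):
                self.owner = who
                return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)

    def release(self, who):
        """Give up ownership; only the current owner may release."""
        if self.owner != who:
            raise RuntimeError(f"{who} does not own the resource")
        self.owner = None
        self._lock.release()


def driver_probe(resource, timeout_s=0.05):
    """Toy probe: touch the resource only while owning it (assumed behaviour)."""
    if not resource.acquire("driver", timeout_s=timeout_s):
        return "busy"  # the other side still holds the resource
    try:
        return "ok"
    finally:
        resource.release("driver")
```

In this model, a probe that runs while the "firmware" side holds the resource returns "busy" once the timeout expires, and succeeds after the resource is released. Both sides have to follow the same acquire/release protocol for this to work, which is consistent with fixes being needed on the driver side and on the FW side.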

The 0.2.0 driver in 2.6.25 has no fixes for this problem, and so I am not surprised that you see it there. The first set of changes for this issue is already in the 0.3.3.3-k2 driver that you are still seeing the problem with on 2.6.26, so either those changes are not good, or your issue requires one of the additional fixes.

There have been further improvements made to the driver synchronization code since the 0.3.3.3-k2 driver, and it is possible that a newer driver would resolve the issue. It'd be good for us to know if that's the case. The driver version is not yet (AFAICS) upstream, but is already available in the standalone e1000e-0.4.1.7 driver on sourceforge (google "sourceforge e1000e"). Would you be able to try that as a first step?

If this does not resolve the issue for the Supermicro board, you likely also require a "FW-side" fix, and this comes in one of two flavors. If the board has an INTEL BMC, then we will need to update it with a new BMC version. If the board has a Supermicro BMC (I expect that it does), then we can provide a patch to some of the platform microcode using an EEPROM update.

To determine which is appropriate for you, we'll need to know more about the platform. There's probably a BMC version number on one of the BIOS menus. I can work with you to find the info we need, and then help you through the steps needed to perform the upgrade.

Dave

-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Hillier, Gernot
Sent: Tuesday, October 07, 2008 7:26 AM
To: Brandeburg, Jesse
Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Allan, Bruce W
Subject: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi there,

On at least two machines using the Supermicro X7DB3 board with Intel 82563EB (a.k.a.
PCI device 8086:1096), we see sporadic problems on modprobe (about 1 time in some hundred tries):

e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
e1000e: Copyright (c) 1999-2008 Intel Corporation.
e1000e 0000:06:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
e1000e 0000:06:00.0: setting latency timer to 64
0000:06:00.0: 0000:06:00.0: Hardware Error
0000:06:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:67:f5:f6
0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:06:00.0: eth0: MAC: 3, PHY: 5, PBA No: 2050ff-0ff
e1000e 0000:06:00.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
e1000e 0000:06:00.1: setting latency timer to 64
0000:06:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:67:f5:f7
0000:06:00.1: eth1: Intel(R) PRO/1000 Network Connection
0000:06:00.1: eth1: MAC: 3, PHY: 5, PBA No: 2050ff-0ff
0000:06:00.0: eth0: Hardware Error

eth0 is not available after module loading. During boot, this means the machine won't come up correctly. Problem can be "fixed" by removing and reloading the module.

This happens on the rather old SUSE-patched 2.6.25.11 with e1000e 0.2.0 as well as with vanilla 2.6.27-rc8 including e1000e 0.3.3.3-k2. The machines are equipped with two Quad-Core Xeons E5440 and 8GB of RAM. Both kernels are compiled for x86_64.

Supermicro claims that there's no known hardware problem with these boards and that the Windows driver doesn't show any issue...

Is there anything I can do to help narrowing down the problem? Anything I can test?

Any help greatly appreciated... TIA!

--
Gernot Hillier
Siemens AG, CT SE 2, Corporate Competence Center Embedded Linux
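
The interim workaround from the original report (removing and reloading the module when a port reports "Hardware Error") could be scripted roughly as follows. This is a sketch, not a tested recovery procedure: the function names are hypothetical, the regular expression assumes log lines formatted exactly like the excerpt in the report (no timestamp prefix), and the log text has to be supplied by the caller.

```python
import re
import subprocess

# Matches lines such as "0000:06:00.0: eth0: Hardware Error" from the report.
HW_ERROR_RE = re.compile(
    r"^(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): \S+: Hardware Error\s*$",
    re.MULTILINE,
)


def devices_with_hardware_error(log_text):
    """Return the sorted PCI addresses that logged a 'Hardware Error' line."""
    return sorted({m.group("pci") for m in HW_ERROR_RE.finditer(log_text)})


def reload_commands(module="e1000e"):
    """Commands to remove and then reload the module."""
    return [["modprobe", "-r", module], ["modprobe", module]]


def reload_if_failed(log_text, run=subprocess.run, module="e1000e"):
    """Reload the module once if any device reported a hardware error.

    `run` is injectable so the logic can be exercised without root privileges.
    """
    failed = devices_with_hardware_error(log_text)
    if failed:
        for cmd in reload_commands(module):
            run(cmd, check=True)
    return failed
```

Since the failure is sporadic, a reload can fail again, so a real boot script would presumably re-check the log afterwards and cap the number of retries. This only hides the symptom; the driver and FW updates discussed in the reply address the cause.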