Date: Wed, 1 Jul 2015 12:58:59 -0500
From: Shaun Ruffell
To: Ian Kumlien
Cc: linux-netdev@vger.kernel.org, "linux-kernel@vger.kernel.org", Russ Meyerriecks
Subject: Re: [igb] AER timeout - resend.
Message-ID: <20150701175859.GA89727@digium.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Feb 23, 2015 at 03:56:56PM +0100, Ian Kumlien wrote:
> Sending this to both netdev and kernel since i don't know if it's the
> driver or the pcie AER that does something odd - the machine was
> stable before 3.19 and PCIE AER.
>
> Everything started out like i first sent to linux nics () intel:
> ------
>
> And today i had some issues and wondered why things was broken, i was met with:
>
> [950016.366477] pcieport 0000:00:04.0: AER: Uncorrected (Non-Fatal)
> error received: id=0500
> [950016.366495] igb 0000:05:00.0: PCIe Bus Error: severity=Uncorrected
> (Non-Fatal), type=Transaction Layer, id=0500(Requester ID)
> [950016.366502] igb 0000:05:00.0: device [8086:1521] error
> status/mask=00004000/00000000
> [950016.366509] igb 0000:05:00.0: [14] Completion Timeout
> [950016.366519] igb 0000:05:00.0: broadcast error_detected message
> [950016.379742] br0: port 1(enp5s0f0) entered disabled state
> [950016.488213] igb 0000:05:00.0: broadcast slot_reset message
> [950016.588014] igb 0000:05:00.0: broadcast resume message
> [950016.752654] igb 0000:05:00.0: AER: Device recovery successful
> [950019.817249] igb 0000:05:00.1 enp5s0f1: igb: enp5s0f1 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> [950020.699773] igb 0000:05:00.0 enp5s0f0: igb: enp5s0f0 NIC Link is
> Up 1000 Mbps Full Duplex, Flow Control: RX
> [950020.701485] br0: port 1(enp5s0f0) entered forwarding state
> [950020.701504] br0: port 1(enp5s0f0) entered forwarding state
> [976152.448092] ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800
> action 0xe frozen
> [976152.448100] ata5: irq_stat 0x00400040, connection status changed
> [976152.448107] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
> [976152.448117] ata5: hard resetting link
> [976152.448134] ata6: exception Emask 0x50 SAct 0x0 SErr 0x4090800
> action 0xe frozen
> [976152.448140] ata6: irq_stat 0x00400040, connection status changed
> [976152.448147] ata6: SError: { HostInt PHYRdyChg 10B8B DevExch }
> [976152.448155] ata6: hard resetting link
> [976153.171195] ata6: SATA link down (SStatus 0 SControl 300)
> [976158.174058] ata6: hard resetting link
> [976158.174110] ata5: SATA link down (SStatus 0 SControl 300)
> [976163.176997] ata5: hard resetting link
> [976163.480133] ata6: SATA link down (SStatus 0 SControl 300)
> [976163.480147] ata6: limiting SATA link speed to 1.5 Gbps
> [976168.483028] ata6: hard resetting link
> [976168.483095] ata5: SATA link down (SStatus 0 SControl 300)
> [976168.483108] ata5: limiting SATA link speed to 1.5 Gbps
> [976173.485907] ata5: hard resetting link
> [976173.789066] ata6: SATA link down (SStatus 0 SControl 310)
> [976173.789080] ata6.00: disabled
> [976173.791066] ata6: EH complete
> [976173.791078] ata5: SATA link down (SStatus 0 SControl 310)
> [976173.791085] ata6.00: detaching (SCSI 5:0:0:0)
> [976173.791090] ata5.00: disabled
> [976173.794073] ata5: EH complete
> [976173.794100] ata5.00: detaching (SCSI 4:0:0:0)
> [976173.794968] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> [976173.795073] sd 5:0:0:0: [sdb] Synchronize Cache(10) failed:
> Result: hostbyte=0x04 driverbyte=0x00
> [976173.795080] sd 5:0:0:0: [sdb] Stopping disk
> [976173.795108] sd 5:0:0:0: [sdb] Start/Stop Unit failed: Result:
> hostbyte=0x04 driverbyte=0x00
> [976173.797180] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> [976173.797254] sd 4:0:0:0: [sda] Synchronize Cache(10) failed:
> Result: hostbyte=0x04 driverbyte=0x00
> [976173.797261] sd 4:0:0:0: [sda] Stopping disk
> [976173.797285] sd 4:0:0:0: [sda] Start/Stop Unit failed: Result:
> hostbyte=0x04 driverbyte=0x00
>
> So two out of two disks just failed and isn't replying anymore?
>
> Seven hours after a AER this machine who's intel ssd:s are idle just
> fail to respond? ;)
>
> Anyway, will reboot it when i get home - any idea/suggestion is more
> than welcome.

Hi Ian,

Did you ever find a resolution to this? I'm seeing something very similar
where a customer upgrades to 3.19 and then there are AER errors and the
links are brought down but 3.10 works fine.
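
For anyone comparing logs: the status/mask=00004000/00000000 value in the
quoted report has only bit 14 set and nothing masked, which matches the
"[14] Completion Timeout" line the kernel printed. Below is a minimal
sketch of that decoding, assuming the bit positions of the PCIe AER
Uncorrectable Error Status register. The function name decode_aer_status
is made up for illustration, and dropping masked bits is an assumption
about how one would want to read the pair, not a copy of the driver code.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Only the commonly reported bits are listed; others decode as "Unknown".
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC",
    20: "Unsupported Request",
}


def decode_aer_status(status_hex, mask_hex="0"):
    """Return (bit, name) pairs for every unmasked bit set in the status.

    Hypothetical helper: both arguments are the hex strings as they appear
    in the 'status/mask=' part of the log line.
    """
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    active = status & ~mask  # assumption: ignore bits that are masked
    return [
        (bit, UNCORRECTABLE_BITS.get(bit, "Unknown"))
        for bit in range(32)
        if active & (1 << bit)
    ]


if __name__ == "__main__":
    # Values taken from the quoted log line.
    for bit, name in decode_aer_status("00004000", "00000000"):
        print(f"[{bit}] {name}")
```

Running it on the values from the log prints a single line for bit 14.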

Thanks,
Shaun