Date: Wed, 1 Jul 2015 12:58:59 -0500
From: Shaun Ruffell <sruffell@digium.com>
To: Ian Kumlien <ian.kumlien@gmail.com>
Cc: linux-netdev@vger.kernel.org,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Russ Meyerriecks <rmeyerriecks@digium.com>
Subject: Re: [igb] AER timeout - resend.
Message-ID: <20150701175859.GA89727@digium.com>
References: <CAA85sZvM_cOq8JEVjBkBvP7BZpNTjuN3_4E8b=xpyZfm_8Vr-Q@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAA85sZvM_cOq8JEVjBkBvP7BZpNTjuN3_4E8b=xpyZfm_8Vr-Q@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4305
Lines: 91

On Mon, Feb 23, 2015 at 03:56:56PM +0100, Ian Kumlien wrote:
> Sending this to both netdev and kernel since I don't know if it's the
> driver or the PCIe AER that does something odd - the machine was
> stable before 3.19 and PCIe AER.
> 
> Everything started out as in what I first sent to linux nics () intel:
> ------
> 
> And today I had some issues and wondered why things were broken. I was met with:
> 
> [950016.366477] pcieport 0000:00:04.0: AER: Uncorrected (Non-Fatal) error received: id=0500
> [950016.366495] igb 0000:05:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0500(Requester ID)
> [950016.366502] igb 0000:05:00.0:   device [8086:1521] error status/mask=00004000/00000000
> [950016.366509] igb 0000:05:00.0:    [14] Completion Timeout
> [950016.366519] igb 0000:05:00.0: broadcast error_detected message
> [950016.379742] br0: port 1(enp5s0f0) entered disabled state
> [950016.488213] igb 0000:05:00.0: broadcast slot_reset message
> [950016.588014] igb 0000:05:00.0: broadcast resume message
> [950016.752654] igb 0000:05:00.0: AER: Device recovery successful
> [950019.817249] igb 0000:05:00.1 enp5s0f1: igb: enp5s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> [950020.699773] igb 0000:05:00.0 enp5s0f0: igb: enp5s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
> [950020.701485] br0: port 1(enp5s0f0) entered forwarding state
> [950020.701504] br0: port 1(enp5s0f0) entered forwarding state
> [976152.448092] ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
> [976152.448100] ata5: irq_stat 0x00400040, connection status changed
> [976152.448107] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
> [976152.448117] ata5: hard resetting link
> [976152.448134] ata6: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
> [976152.448140] ata6: irq_stat 0x00400040, connection status changed
> [976152.448147] ata6: SError: { HostInt PHYRdyChg 10B8B DevExch }
> [976152.448155] ata6: hard resetting link
> [976153.171195] ata6: SATA link down (SStatus 0 SControl 300)
> [976158.174058] ata6: hard resetting link
> [976158.174110] ata5: SATA link down (SStatus 0 SControl 300)
> [976163.176997] ata5: hard resetting link
> [976163.480133] ata6: SATA link down (SStatus 0 SControl 300)
> [976163.480147] ata6: limiting SATA link speed to 1.5 Gbps
> [976168.483028] ata6: hard resetting link
> [976168.483095] ata5: SATA link down (SStatus 0 SControl 300)
> [976168.483108] ata5: limiting SATA link speed to 1.5 Gbps
> [976173.485907] ata5: hard resetting link
> [976173.789066] ata6: SATA link down (SStatus 0 SControl 310)
> [976173.789080] ata6.00: disabled
> [976173.791066] ata6: EH complete
> [976173.791078] ata5: SATA link down (SStatus 0 SControl 310)
> [976173.791085] ata6.00: detaching (SCSI 5:0:0:0)
> [976173.791090] ata5.00: disabled
> [976173.794073] ata5: EH complete
> [976173.794100] ata5.00: detaching (SCSI 4:0:0:0)
> [976173.794968] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> [976173.795073] sd 5:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
> [976173.795080] sd 5:0:0:0: [sdb] Stopping disk
> [976173.795108] sd 5:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
> [976173.797180] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> [976173.797254] sd 4:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
> [976173.797261] sd 4:0:0:0: [sda] Stopping disk
> [976173.797285] sd 4:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
> 
> So two out of two disks just failed and aren't replying anymore?
> 
> Seven hours after an AER event, this machine's idle Intel SSDs just
> fail to respond? ;)
> 
> Anyway, I will reboot it when I get home - any ideas or suggestions are
> more than welcome.

Hi Ian,

Did you ever find a resolution to this? I'm seeing something very
similar: after a customer upgrades to 3.19, AER errors appear and the
links are brought down, whereas 3.10 works fine.

Thanks,
Shaun
