Reply-To: markh@compro.net
To: Linux-kernel
From: Mark Hounschell
Subject: ata1.00: failed command: WRITE FPDMA QUEUED on new AMD AM4 MSI B350 Motherboard
Organization: Compro Computer Svcs.
Message-ID: <2a50ea45-6948-bade-9eea-85cd655a9bf7@compro.net>
Date: Fri, 7 Jul 2017 15:02:50 -0400
X-Mailing-List: linux-kernel@vger.kernel.org

With both 4.11 and 4.12 kernels I get the following when doing heavy
disk I/O, like a kernel build with "make -j 15". Even copying the kernel
source tree from one place to another triggers it. The hardware is an
MSI B350 Tomahawk Arctic motherboard with 16GB of memory and a Ryzen 1700
processor. The disk being used is a 160GB Seagate ST3160815AS that has
error-free media according to "badblocks -w".

Jul 6 13:34:43 cpu0 kernel: ata1.00: exception Emask 0x11 SAct 0x7ffbffff SErr 0x400000 action 0x6 frozen
Jul 6 13:34:43 cpu0 kernel: ata1.00: irq_stat 0x48000008, interface fatal error
Jul 6 13:34:43 cpu0 kernel: ata1: SError: { Handshk }
Jul 6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul 6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:00:57:89:90/00:00:03:00:00/40 tag 0 ncq dma 4096 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul 6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul 6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul 6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:08:87:89:90/00:00:03:00:00/40 tag 1 ncq dma 4096 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul 6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul 6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul 6 13:34:43 cpu0 kernel: ata1.00: cmd 61/20:10:97:89:90/00:00:03:00:00/40 tag 2 ncq dma 16384 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)

When I set the kernel cmdline option libata.force=noncq, the messages
change to:

[ 1724.372101] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 1724.375888] ata1.00: irq_stat 0x48000001, interface fatal error
[ 1724.379721] ata1: SError: { Handshk }
[ 1724.383691] ata1.00: failed command: WRITE DMA EXT
[ 1724.383695] ata1.00: cmd 35/00:50:67:0d:e4/00:09:02:00:00/e0 tag 10 dma 1220608 out
         res 51/84:50:67:0d:e4/00:09:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1724.383699] ata1.00: status: { DRDY ERR }
[ 1724.383700] ata1.00: error: { ICRC ABRT }
[ 1724.383706] ata1: hard resetting link
[ 1724.850060] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1724.959883] ata1.00: configured for UDMA/133
[ 1724.959910] ata1: EH complete
[ 1921.704356] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 1921.708292] ata1.00: irq_stat 0x48000001, interface fatal error
[ 1921.712210] ata1: SError: { Handshk }
[ 1921.716294] ata1.00: failed command: WRITE DMA EXT
[ 1921.716297] ata1.00: cmd 35/00:90:ef:93:86/00:03:02:00:00/e0 tag 18 dma 466944 out
         res 51/84:90:ef:93:86/00:03:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1921.716298] ata1.00: status: { DRDY ERR }
[ 1921.716298] ata1.00: error: { ICRC ABRT }
[ 1921.716303] ata1: hard resetting link
[ 1922.175312] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1922.284165] ata1.00: configured for UDMA/133
[ 1922.288602] ata1: EH complete

smartctl shows no issues with the drive. In fact I can take this very
drive and install it in an AM3 machine and everything works just fine. I
have also installed a PCI-e SATA card and connected the drive to that,
and that works just fine too.

So I have either a Linux kernel problem or a hardware problem on this
brand new AM4 motherboard. I don't really know what it is, other than
that it is something related to the AMD B350 chipset. It is a fairly new
chipset, so I am suspicious of the kernel.

# smartctl -a /dev/sda
smartctl 6.2 2013-11-07 r3856 [x86_64-linux-4.11.6-lcrs] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3160815AS
Serial Number:    6RACD737
Firmware Version: 4.AAB
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Fri Jul 7 13:50:50 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  54) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       416
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       68
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       100916113
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       48052
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       416
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   066   045    Old_age   Always       -       30 (Min/Max 26/30)
194 Temperature_Celsius     0x0022   030   034   000    Old_age   Always       -       30 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   079   065   000    Old_age   Always       -       168805116
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       46
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Any pointers would be greatly appreciated.

Regards
Mark
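
For anyone reading the logs above, the two register values that repeat in every report (SErr 0x400000 and SStatus 113) can be decoded by hand. The following is only a minimal sketch: the bit-to-name table is an assumption based on the SATA SError register layout and the abbreviations libata prints inside "SError: { ... }", and the helper names decode_serror and link_speed are made up for illustration, not part of any kernel or tool API.

```python
# Assumed mapping of SError bit positions to the short names libata prints.
# Taken from the SATA SError register layout as I understand it; treat it
# as illustrative, not authoritative.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value):
    """Return the names of all SError bits set in value (hypothetical helper)."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def link_speed(sstatus):
    """Read the SPD field (bits 7:4) of SStatus and map it to a link speed."""
    spd = (sstatus >> 4) & 0xF
    return {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}.get(spd, "unknown")


if __name__ == "__main__":
    # Values copied from the kernel messages in this report.
    print(decode_serror(0x400000))
    print(link_speed(0x113))
```

With the values from this report, the only SError bit set is bit 22, which matches the "Handshk" the kernel prints, and the SPD field of SStatus corresponds to the 1.5 Gbps link speed shown after each hard reset.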