Reply-To: markh@compro.net
To: Linux-kernel <linux-kernel@vger.kernel.org>
From: Mark Hounschell <markh@compro.net>
Subject: ata1.00: failed command: WRITE FPDMA QUEUED on new AMD AM4 MSI B350
 Motherboard
Organization: Compro Computer Svcs.
Message-ID: <2a50ea45-6948-bade-9eea-85cd655a9bf7@compro.net>
Date: Fri, 7 Jul 2017 15:02:50 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 8378
Lines: 147

With both the 4.11 and 4.12 kernels I get the following errors under heavy disk I/O, such as a kernel build with "make -j 15", or even just copying the kernel source tree from one place to another. The hardware is an MSI B350 Tomahawk Arctic motherboard with 16GB of memory and a Ryzen 1700 processor. The disk is a 160GB Seagate ST3160815AS whose media is error free according to "badblocks -w".

Jul  6 13:34:43 cpu0 kernel: ata1.00: exception Emask 0x11 SAct 0x7ffbffff SErr 0x400000 action 0x6 frozen
Jul  6 13:34:43 cpu0 kernel: ata1.00: irq_stat 0x48000008, interface fatal error
Jul  6 13:34:43 cpu0 kernel: ata1: SError: { Handshk }
Jul  6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul  6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:00:57:89:90/00:00:03:00:00/40 tag 0 ncq dma 4096 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul  6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul  6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul  6 13:34:43 cpu0 kernel: ata1.00: cmd 61/08:08:87:89:90/00:00:03:00:00/40 tag 1 ncq dma 4096 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Jul  6 13:34:43 cpu0 kernel: ata1.00: status: { DRDY }
Jul  6 13:34:43 cpu0 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jul  6 13:34:43 cpu0 kernel: ata1.00: cmd 61/20:10:97:89:90/00:00:03:00:00/40 tag 2 ncq dma 16384 out
         res 40/00:b8:2f:ff:b3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)

When I set the kernel cmdline option libata.force=noncq, the messages change to:

[ 1724.372101] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 1724.375888] ata1.00: irq_stat 0x48000001, interface fatal error
[ 1724.379721] ata1: SError: { Handshk }
[ 1724.383691] ata1.00: failed command: WRITE DMA EXT
[ 1724.383695] ata1.00: cmd 35/00:50:67:0d:e4/00:09:02:00:00/e0 tag 10 dma 1220608 out
                        res 51/84:50:67:0d:e4/00:09:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1724.383699] ata1.00: status: { DRDY ERR }
[ 1724.383700] ata1.00: error: { ICRC ABRT }
[ 1724.383706] ata1: hard resetting link
[ 1724.850060] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1724.959883] ata1.00: configured for UDMA/133
[ 1724.959910] ata1: EH complete
[ 1921.704356] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
[ 1921.708292] ata1.00: irq_stat 0x48000001, interface fatal error
[ 1921.712210] ata1: SError: { Handshk }
[ 1921.716294] ata1.00: failed command: WRITE DMA EXT
[ 1921.716297] ata1.00: cmd 35/00:90:ef:93:86/00:03:02:00:00/e0 tag 18 dma 466944 out
                        res 51/84:90:ef:93:86/00:03:02:00:00/e0 Emask 0x10 (ATA bus error)
[ 1921.716298] ata1.00: status: { DRDY ERR }
[ 1921.716298] ata1.00: error: { ICRC ABRT }
[ 1921.716303] ata1: hard resetting link
[ 1922.175312] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 1922.284165] ata1.00: configured for UDMA/133
[ 1922.288602] ata1: EH complete
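
As a reading aid for the exception lines above, here is a minimal Python sketch that lists which bits are set in the SAct and SErr values from the logs. The helper name set_bits is made up for illustration. It assumes each set bit in SAct stands for one outstanding NCQ tag; the only SErr bit name used is the one the kernel itself printed ({ Handshk }).

```python
def set_bits(value):
    """Return the positions (0-31) of the bits set in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


# SAct from the NCQ log: assumed to be one bit per outstanding NCQ tag.
ncq_tags = set_bits(0x7ffbffff)
print("outstanding tags with NCQ:", len(ncq_tags))   # 30 (bits 0-30 except 18)

# SAct after booting with libata.force=noncq: no queued commands at all.
print("outstanding tags without NCQ:", len(set_bits(0x0)))   # 0

# SErr is the same in both logs: a single bit, reported as { Handshk }.
print("SErr bits set:", set_bits(0x400000))   # [22]
```

So the same single SErr bit is raised with and without NCQ; only the number of commands caught in flight differs.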


smartctl shows no issues with the drive. In fact, I can take this very drive
and install it in an AM3 machine and everything works just fine. I have
also installed a PCI-e SATA card and connected the drive to that, and that
works just fine as well.

So I have either a Linux kernel problem or a hardware problem on
this brand new AM4 motherboard. I don't really know which it
is, other than that it is something related to the AMD B350 chipset.

It is a fairly new chipset, so I am suspicious of the kernel.

# smartctl -a  /dev/sda
smartctl 6.2 2013-11-07 r3856 [x86_64-linux-4.11.6-lcrs] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3160815AS
Serial Number:    6RACD737
Firmware Version: 4.AAB
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Fri Jul  7 13:50:50 2017 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  54) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       416
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       68
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       100916113
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       48052
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       416
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   066   045    Old_age   Always       -       30 (Min/Max 26/30)
194 Temperature_Celsius     0x0022   030   034   000    Old_age   Always       -       30 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   079   065   000    Old_age   Always       -       168805116
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       46
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Any pointers would be greatly appreciated. 

Regards
Mark