2019-05-05 08:56:07

by Paul Menzel

[permalink] [raw]
Subject: Re: failed to set xfermode (err_mask=0x40): READ LOG DMA EXT failed, trying PIO

Dear Linux folks,


On 01.05.19 11:34, Paul Menzel wrote:

> On an MSI B350M MORTAR (MS-7A37) with all firmwares, and the AHCI SSD
> Crucial MX500, *sometimes* boot and resume are delayed by four to five
> seconds, and the log contains the message *failed to set xfermode*.
>
>> [    0.474150] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 0)
>> [    0.474267] microcode: Microcode Update Driver: v2.2.
>> [    0.474294] AVX2 version of gcm_enc/dec engaged.
>> [    0.474295] AES CTR mode by8 optimization enabled
>> [    0.474304] ata9.00: supports DRM functions and may not be fully accessible
>> [    0.474368] ata9.00: ATA-10: CT1000MX500SSD4, M3CR020, max UDMA/133
>> [    0.474370] ata9.00: 1953525168 sectors, multi 1: LBA48 NCQ (depth 32), AA
>> [    0.474569] ata9.00: READ LOG DMA EXT failed, trying PIO
>> [    0.474571] ata9.00: failed to get Identify Device Data, Emask 0x40
>> [    0.474572] ata9.00: ATA Identify Device Log not supported
>> [    0.474573] ata9.00: Security Log not supported
>> [    0.474576] ata9.00: failed to set xfermode (err_mask=0x40)
>> [    0.493193] sched_clock: Marking stable (504159603, -10991650)->(615393192, -122225239)
>> [    0.493768] registered taskstats version 1
>> [    0.493770] Loading compiled-in X.509 certificates
>> [    0.497894] Loaded X.509 cert 'Build time autogenerated kernel key: 4682756457c0b7b167f0b726943cf53ade93bd78'
>> [    0.497984] zswap: loaded using pool lzo/zbud
>> [    0.498237] kmemleak: Kernel memory leak detector initialized
>> [    0.498243] kmemleak: Automatic memory scanning thread started
>> [    0.506469] Key type big_key registered
>> [    0.508864]   Magic number: 11:580:189
>> [    0.508914] acpi device:48: hash matches
>> [    0.508922] acpi device:1b: hash matches
>> [    0.509017] rtc_cmos 00:02: setting system clock to 2019-03-29T17:09:41 UTC (1553879381)
>> [    0.509485] After kernel_init_freeable
>> [    1.364358] tsc: Refined TSC clocksource calibration: 3499.982 MHz
>> [    1.364370] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x32734013f67, max_idle_ns: 440795323134 ns
>> [    1.364419] clocksource: Switched to clocksource tsc
>> [    5.593141] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 0)
>> [    5.593270] ata9.00: supports DRM functions and may not be fully accessible
>> [    5.594006] ata9.00: supports DRM functions and may not be fully accessible
>> [    5.594600] ata9.00: configured for UDMA/133
>> [    5.595893] scsi 8:0:0:0: Direct-Access     ATA      CT1000MX500SSD4  020  PQ: 0 ANSI: 5
>> [    5.596611] sd 8:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
>> [    5.596613] sd 8:0:0:0: [sda] 4096-byte physical blocks
>> [    5.596626] sd 8:0:0:0: [sda] Write Protect is off
>> [    5.596629] sd 8:0:0:0: [sda] Mode Sense: 00 3a 00 00
>> [    5.596650] sd 8:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> [    5.596837] sd 8:0:0:0: Attached scsi generic sg0 type 0
>> [    5.597329]  sda: sda1 sda2 sda3
>> [    5.599079] sd 8:0:0:0: [sda] Attached SCSI disk
>
> This happens with all Linux kernel versions – tested since 4.14.
> Searching for this error on the Web, several people seem to be affected
> with different boards and drives [1].
>
> In a lot of those reports, I also see the message *Security Log not
> supported*.
>
> Do you have an idea, why this is only happening sometimes? Is it likely
> a drive firmware issue? As it affects several vendors, can Linux work
> around this problem (if it can be pinpointed)?

The first error seems to be `READ LOG DMA EXT failed, trying PIO`. The
code looks like below [2].

> retry:
> ata_tf_init(dev, &tf);
> if (dev->dma_mode && ata_id_has_read_log_dma_ext(dev->id) &&
> !(dev->horkage & ATA_HORKAGE_NO_DMA_LOG)) {
> tf.command = ATA_CMD_READ_LOG_DMA_EXT;
> tf.protocol = ATA_PROT_DMA;
> dma = true;
> } else {
> tf.command = ATA_CMD_READ_LOG_EXT;
> tf.protocol = ATA_PROT_PIO;
> dma = false;
> }
> tf.lbal = log;
> tf.lbam = page;
> tf.nsect = sectors;
> tf.hob_nsect = sectors >> 8;
> tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_LBA48 | ATA_TFLAG_DEVICE;
>
> err_mask = ata_exec_internal(dev, &tf, NULL, DMA_FROM_DEVICE,
> buf, sectors * ATA_SECT_SIZE, 0);
>
> if (err_mask && dma) {
> dev->horkage |= ATA_HORKAGE_NO_DMA_LOG;
> ata_dev_warn(dev, "READ LOG DMA EXT failed, trying PIO\n");
> goto retry;
> }
>
> DPRINTK("EXIT, err_mask=%x\n", err_mask);
> return err_mask;

So when trying PIO, that also seems to fail, and the error mask 0x40 is
propagated up.

> enum ata_completion_errors {
> AC_ERR_DEV = (1 << 0), /* device reported error */
> AC_ERR_HSM = (1 << 1), /* host state machine violation */
> AC_ERR_TIMEOUT = (1 << 2), /* timeout */
> AC_ERR_MEDIA = (1 << 3), /* media error */
> AC_ERR_ATA_BUS = (1 << 4), /* ATA bus error */
> AC_ERR_HOST_BUS = (1 << 5), /* host bus error */
> AC_ERR_SYSTEM = (1 << 6), /* system error */
> AC_ERR_INVALID = (1 << 7), /* invalid argument */
> AC_ERR_OTHER = (1 << 8), /* unknown */
> AC_ERR_NODEV_HINT = (1 << 9), /* polling device detection hint */
> AC_ERR_NCQ = (1 << 10), /* marker for offending NCQ qc */
> };

So 0x40 is AC_ERR_HOST_BUS, right? Any idea, what might cause this?

Would it be helpful to enable the debug messages?

#undef ATA_DEBUG /* debugging output */


Kind regards,

Paul


> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=195895[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c?id=1daa0449d287a109b93c4516914eddeff4baff65#n2075
[3]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/libata.h?id=1daa0449d287a109b93c4516914eddeff4baff65#n508