2014-11-06 16:48:59

by Chris Friesen

Subject: absurdly high "optimal_io_size" on Seagate SAS disk

Hi,

I'm running a modified 3.4-stable on relatively recent X86 server-class
hardware.

I recently installed a Seagate ST900MM0026 (900GB 2.5in 10K SAS drive)
and it's reporting a value of 4294966784 for optimal_io_size. The other
parameters look normal though:

/sys/block/sda/queue/hw_sector_size:512
/sys/block/sda/queue/logical_block_size:512
/sys/block/sda/queue/max_segment_size:65536
/sys/block/sda/queue/minimum_io_size:512
/sys/block/sda/queue/optimal_io_size:4294966784

The other drives in the system look more like what I'd expect:

/sys/block/sdb/queue/hw_sector_size:512
/sys/block/sdb/queue/logical_block_size:512
/sys/block/sdb/queue/max_segment_size:65536
/sys/block/sdb/queue/minimum_io_size:4096
/sys/block/sdb/queue/optimal_io_size:0
/sys/block/sdb/queue/physical_block_size:4096

/sys/block/sdc/queue/hw_sector_size:512
/sys/block/sdc/queue/logical_block_size:512
/sys/block/sdc/queue/max_segment_size:65536
/sys/block/sdc/queue/minimum_io_size:4096
/sys/block/sdc/queue/optimal_io_size:0
/sys/block/sdc/queue/physical_block_size:4096

According to the manual, the ST900MM0026 has a 512 byte physical sector
size.

Is this a drive firmware bug? Or a bug in the SAS driver? Or is there
a valid reason for a single drive to report such a huge value?

Would it make sense for the kernel to do some sort of sanity checking on
this value?

Chris


2014-11-06 17:17:06

by Chris Friesen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 11/06/2014 10:47 AM, Chris Friesen wrote:
> Hi,
>
> I'm running a modified 3.4-stable on relatively recent X86 server-class
> hardware.
>
> I recently installed a Seagate ST900MM0026 (900GB 2.5in 10K SAS drive)
> and it's reporting a value of 4294966784 for optimal_io_size. The other
> parameters look normal though:
>
> /sys/block/sda/queue/hw_sector_size:512
> /sys/block/sda/queue/logical_block_size:512
> /sys/block/sda/queue/max_segment_size:65536
> /sys/block/sda/queue/minimum_io_size:512
> /sys/block/sda/queue/optimal_io_size:4294966784

<snip>

> According to the manual, the ST900MM0026 has a 512 byte physical sector
> size.
>
> Is this a drive firmware bug? Or a bug in the SAS driver? Or is there
> a valid reason for a single drive to report such a huge value?
>
> Would it make sense for the kernel to do some sort of sanity checking on
> this value?

Looks like this sort of thing has been seen before, in other drives (one
of which is from the same family as my drive):

http://www.spinics.net/lists/linux-scsi/msg65292.html

http://iamlinux.technoyard.in/blog/why-is-my-ssd-disk-not-reconized-by-the-rhel6-anaconda-installer/

Perhaps the ST900MM0026 should be blacklisted as well?

Or maybe the SCSI code should do a variation on Mike Snitzer's original
patch and just ignore any values above some reasonable threshold? (And
then we could remove the blacklist on the ST900MM0006.)

Chris

2014-11-06 17:34:25

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Chris" == Chris Friesen <[email protected]> writes:

Chris> Perhaps the ST900MM0026 should be blacklisted as well?

Sure. I'll widen the net a bit for that Seagate model.

commit 17f1ee2d16a6878269c4429306f6e678b7e61505
Author: Martin K. Petersen <[email protected]>
Date: Thu Nov 6 12:31:43 2014 -0500

SCSI: Blacklist ST900MM0026

Looks like this entire series of drives reports the wrong values in the
block limits VPD. Widen the blacklist.

Reported-by: Chris Friesen <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 49014a143c6a..9116531b415a 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -229,7 +229,7 @@ static struct {
 	{"SanDisk", "ImageMate CF-SD1", NULL, BLIST_FORCELUN},
 	{"SEAGATE", "ST34555N", "0930", BLIST_NOTQ},	/* Chokes on tagged INQUIRY */
 	{"SEAGATE", "ST3390N", "9546", BLIST_NOTQ},
-	{"SEAGATE", "ST900MM0006", NULL, BLIST_SKIP_VPD_PAGES},
+	{"SEAGATE", "ST900MM", NULL, BLIST_SKIP_VPD_PAGES},
 	{"SGI", "RAID3", "*", BLIST_SPARSELUN},
 	{"SGI", "RAID5", "*", BLIST_SPARSELUN},
 	{"SGI", "TP9100", "*", BLIST_REPORTLUN2},

2014-11-06 17:45:25

by Chris Friesen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 11/06/2014 11:34 AM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <[email protected]> writes:
>
> Chris> Perhaps the ST900MM0026 should be blacklisted as well?
>
> Sure. I'll widen the net a bit for that Seagate model.

That'd work, but is it the best way to go? I mean, I found one report
of a similar problem on an SSD (model number unknown). In that case it
was a near-UINT_MAX value as well.

The problem with the blacklist is that until someone patches it, the
drive is broken. And then it stays blacklisted even if the firmware
gets fixed.

I'm wondering if it might not be better to just ignore all values larger
than X (where X is whatever we think is the largest conceivable
reasonable value).

Chris

2014-11-06 18:12:30

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Chris" == Chris Friesen <[email protected]> writes:

Chris> That'd work, but is it the best way to go? I mean, I found one
Chris> report of a similar problem on an SSD (model number unknown). In
Chris> that case it was a near-UINT_MAX value as well.

My concern is still the same. Namely that this particular drive happens
to be returning UINT_MAX but it might as well be a value that's entirely
random. Or even a value that is small and innocuous looking but
completely wrong.

Chris> The problem with the blacklist is that until someone patches it,
Chris> the drive is broken. And then it stays blacklisted even if the
Chris> firmware gets fixed.

Well, you can manually blacklist in /proc/scsi/device_info.

Chris> I'm wondering if it might not be better to just ignore all values
Chris> larger than X (where X is whatever we think is the largest
Chris> conceivable reasonable value).

The problem is that finding that is not easy and it too will be a moving
target.

I'm willing to entertain the following, however...

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 95bfb7bfbb9d..75cc51a01860 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2593,7 +2593,8 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_min(sdkp->disk->queue,
 			 get_unaligned_be16(&buffer[6]) * sector_sz);
 	blk_queue_io_opt(sdkp->disk->queue,
-			 get_unaligned_be32(&buffer[12]) * sector_sz);
+			 min_t(u32, get_unaligned_be32(&buffer[12]),
+			       sdkp->capacity) * sector_sz);
 
 	if (buffer[3] == 0x3c) {
 		unsigned int lba_count, desc_count;

--
Martin K. Petersen Oracle Linux Engineering

2014-11-06 18:15:40

by Jens Axboe

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 2014-11-06 11:12, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <[email protected]> writes:
>
> Chris> That'd work, but is it the best way to go? I mean, I found one
> Chris> report of a similar problem on an SSD (model number unknown). In
> Chris> that case it was a near-UINT_MAX value as well.
>
> My concern is still the same. Namely that this particular drive happens
> to be returning UINT_MAX but it might as well be a value that's entirely
> random. Or even a value that is small and innocuous looking but
> completely wrong.
>
> Chris> The problem with the blacklist is that until someone patches it,
> Chris> the drive is broken. And then it stays blacklisted even if the
> Chris> firmware gets fixed.
>
> Well, you can manually blacklist in /proc/scsi/device_info.
>
> Chris> I'm wondering if it might not be better to just ignore all values
> Chris> larger than X (where X is whatever we think is the largest
> Chris> conceivable reasonable value).
>
> The problem is that finding that is not easy and it too will be a moving
> target.

Didn't check, but assuming the value is the upper 24 bits of 32. If so,
might not hurt to check for 0xfffffe00 as an invalid value.

--
Jens Axboe

2014-11-06 19:15:56

by Chris Friesen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 11/06/2014 12:12 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <[email protected]>
>>>>>> writes:
>
> Chris> That'd work, but is it the best way to go? I mean, I found one
> Chris> report of a similar problem on an SSD (model number unknown). In
> Chris> that case it was a near-UINT_MAX value as well.
>
> My concern is still the same. Namely that this particular drive
> happens to be returning UINT_MAX but it might as well be a value
> that's entirely random. Or even a value that is small and innocuous
> looking but completely wrong.
>
> Chris> The problem with the blacklist is that until someone patches it,
> Chris> the drive is broken. And then it stays blacklisted even if the
> Chris> firmware gets fixed.
>
> Well, you can manually blacklist in /proc/scsi/device_info.
>
> Chris> I'm wondering if it might not be better to just ignore all values
> Chris> larger than X (where X is whatever we think is the largest
> Chris> conceivable reasonable value).
>
> The problem is that finding that is not easy and it too will be a
> moving target.


Do we need to be perfect, or just "good enough"?

For a RAID card I expect it would be related to chunk size or stripe
width or something...but even then I would expect to be able to cap it
at 100MB or so. Or are there storage systems on really fast interfaces
that could legitimately want a hundred meg of data at a time?

On 11/06/2014 12:15 PM, Jens Axboe wrote:
> Didn't check, but assuming the value is the upper 24 bits of 32. If
> so, might not hurt to check for 0xfffffe00 as an invalid value.

Yep, in all three wonky cases so far "optimal_io_size" ended up as
4294966784, which is 0xfffffe00. Does something mask out the lower bits?
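
Just to sanity-check the arithmetic (a standalone sketch, not anything
taken from the sd code): if a drive reports an optimal transfer length of
0xffffffff blocks and that gets multiplied by a 512-byte sector size in a
32-bit variable, the product wraps and the low nine bits come out as zero,
giving exactly 0xfffffe00:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t otl_blocks = 0xffffffff;	/* hypothetical OPTIMAL TRANSFER LENGTH */
	uint32_t sector_sz = 512;

	/* 0xffffffff * 512 = 0x1fffffffe00, truncated to 32 bits = 0xfffffe00 */
	uint32_t io_opt = otl_blocks * sector_sz;

	printf("%u (0x%x)\n", io_opt, io_opt);	/* prints 4294966784 (0xfffffe00) */
	return 0;
}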

Chris

2014-11-07 01:56:26

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Chris" == Chris Friesen <[email protected]> writes:

Chris,

Chris> For a RAID card I expect it would be related to chunk size or
Chris> stripe width or something...but even then I would expect to be
Chris> able to cap it at 100MB or so. Or are there storage systems on
Chris> really fast interfaces that could legitimately want a hundred meg
Chris> of data at a time?

Well, there are several devices that report their capacity to indicate
that they don't suffer any performance (RMW) penalties for large
commands regardless of size. I would personally prefer them to report 0
in that case.

Chris> Yep, in all three wonky cases so far "optimal_io_size" ended up
Chris> as 4294966784, which is 0xfffffe00. Does something mask out the
Chris> lower bits?

Ignoring reported values of UINT_MAX and 0xfffffe00 only works until
the next spec-dyslexic firmware writer comes along.

I also think that singling out the OPTIMAL TRANSFER LENGTH is a bit of a
red herring. A vendor could mess up any value in that VPD and it would
still cause us grief. There's no rational explanation for why OTL would
be more prone to being filled out incorrectly than any of the other
parameters in that page.

I do concur, though, that io_opt is problematic by virtue of being 32
bits and getting multiplied by the sector size. So things can easily get
out of whack for fdisk and friends (by comparison, the value that we use
for io_min is only 16 bits).

I'm still partial to just blacklisting that entire Seagate family. We
don't have any details on the alleged SSD having the same problem. For
all we know it could be the same SAS disk drive and not an SSD at all.

If there are compelling arguments or other supporting data for sanity
checking OTL I'd suggest the following patch that caps it at 1GB. I know
of a few devices that prefer alignment at that granularity.

--
Martin K. Petersen Oracle Linux Engineering

commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
Author: Martin K. Petersen <[email protected]>
Date: Thu Nov 6 12:31:43 2014 -0500

[SCSI] sd: Sanity check the optimal I/O size

We have come across a couple of devices that report crackpot values in
the optimal I/O size in the Block Limits VPD page. Since this is a
32-bit entity that gets multiplied by the logical block size we can get
disproportionately large values reported to the block layer.

Cap io_opt at 1 GB.

Reported-by: Chris Friesen <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Cc: [email protected]

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b041eca8955d..806e06c2575f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2591,7 +2591,8 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	blk_queue_io_min(sdkp->disk->queue,
 			 get_unaligned_be16(&buffer[6]) * sector_sz);
 	blk_queue_io_opt(sdkp->disk->queue,
-			 get_unaligned_be32(&buffer[12]) * sector_sz);
+			 min_t(unsigned int, SD_MAX_IO_OPT_BYTES,
+			       get_unaligned_be32(&buffer[12]) * sector_sz));
 
 	if (buffer[3] == 0x3c) {
 		unsigned int lba_count, desc_count;
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 63ba5ca7f9a1..3492779d9d3e 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -44,10 +44,11 @@ enum {
 };
 
 enum {
-	SD_DEF_XFER_BLOCKS = 0xffff,
-	SD_MAX_XFER_BLOCKS = 0xffffffff,
-	SD_MAX_WS10_BLOCKS = 0xffff,
-	SD_MAX_WS16_BLOCKS = 0x7fffff,
+	SD_DEF_XFER_BLOCKS = 0xffff,
+	SD_MAX_XFER_BLOCKS = 0xffffffff,
+	SD_MAX_WS10_BLOCKS = 0xffff,
+	SD_MAX_WS16_BLOCKS = 0x7fffff,
+	SD_MAX_IO_OPT_BYTES = 1024 * 1024 * 1024,
 };
 
 enum {

2014-11-07 05:36:16

by Chris Friesen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 11/06/2014 07:56 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <[email protected]> writes:
>
> Chris,
>
> Chris> For a RAID card I expect it would be related to chunk size or
> Chris> stripe width or something...but even then I would expect to be
> Chris> able to cap it at 100MB or so. Or are there storage systems on
> Chris> really fast interfaces that could legitimately want a hundred meg
> Chris> of data at a time?
>
> Well, there are several devices that report their capacity to indicate
> that they don't suffer any performance (RMW) penalties for large
> commands regardless of size. I would personally prefer them to report 0
> in that case.

I got curious and looked at the spec at
"http://www.13thmonkey.org/documentation/SCSI/sbc3r25.pdf". I'm now
wondering if maybe linux is misbehaving.

I think there is actually some justification for putting a huge value in
the "optimal transfer length" field. That field is described as "the
optimal transfer length in blocks for a single...command", but then
later it has "If a device server receives a request with a transfer
length exceeding this value, then a significant delay in processing the
request may be incurred." As written, it is ambiguous.

Looking at "ftp://ftp.t10.org/t10/document.03/03-028r2.pdf" it appears
that originally that field was the "optimal maximum transfer length",
not the "optimal transfer length". It appears that the intent was that
the device was able to take requests up to the "maximum transfer
length", but there would be a performance penalty if you went over the
"optimum maximum transfer length".

Section E.4 in "sbc3r25.pdf" talks about optimizing transfers. They
suggest using a transfer length that is a multiple of "optimal transfer
length granularity", up to a max of either the max or optimal transfer
lengths depending on the size of the penalty if you exceed the optimal
transfer length. This reinforces the idea that the "optimal transfer
length" is actually the optimal *maximum* length, but any multiple of
the optimal granularity is fine.
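
In pseudo-C, roughly how I read that recommendation (just a sketch; the
names are mine, not the spec's):

#include <stdint.h>

/*
 * Rough sketch of the E.4 guidance as I read it: use a transfer length
 * that is a multiple of the optimal transfer length granularity, capped
 * at the optimal ("optimal maximum") transfer length -- or at the maximum
 * transfer length, if the penalty for exceeding the optimal one is small.
 */
static uint32_t pick_transfer_blocks(uint32_t wanted,
				     uint32_t opt_gran,	/* OPTIMAL TRANSFER LENGTH GRANULARITY */
				     uint32_t opt_len)	/* OPTIMAL TRANSFER LENGTH, 0 = unlimited */
{
	uint32_t len = wanted;

	if (opt_len && len > opt_len)
		len = opt_len;			/* exceeding this may incur a delay */

	if (opt_gran > 1 && len > opt_gran)
		len -= len % opt_gran;		/* round down to a granularity multiple */

	return len;
}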

Based on that, I think it would have been clearer if it had been called
"/sys/block/sdb/queue/optimal_max_io_size".

Also, I think it's wrong for filesystems and userspace to use it for
alignment. In E.4 and E.5 in the "sbc3r25.pdf" doc, it looks like they
use the optimal granularity field for alignment, not the optimal
transfer length.


So for the ST900MM0006, it had:

# sg_inq --vpd --page=0xb0 /dev/sdb
VPD INQUIRY: Block limits page (SBC)
Optimal transfer length granularity: 1 blocks
Maximum transfer length: 0 blocks
Optimal transfer length: 4294967295 blocks

In this case I think the drive is trying to say that it doesn't require
any special granularity (can handle alignment on 512-byte blocks), and
that it can handle any size of transfer without performance penalty.

Chris

2014-11-07 15:27:01

by worley

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

> From: Chris Friesen <[email protected]>

> Also, I think it's wrong for filesystems and userspace to use it for
> alignment. In E.4 and E.5 in the "sbc3r25.pdf" doc, it looks like they
> use the optimal granularity field for alignment, not the optimal
> transfer length.

Everything you say suggests that "optimal transfer length" means
"there is a penalty for doing transfers *larger* than this", but
people have been treating it as "there is a penalty for doing
transfers *smaller* than this". But the latter is the "optimal
transfer length granularity".

Dale

2014-11-07 16:25:39

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Chris" == Chris Friesen <[email protected]> writes:

Chris,

Chris> Also, I think it's wrong for filesystems and userspace to use it
Chris> for alignment. In E.4 and E.5 in the "sbc3r25.pdf" doc, it looks
Chris> like they use the optimal granularity field for alignment, not
Chris> the optimal transfer length.

The original rationale behind the OTLG and OTL values was to be able to
express stripe chunk size and stripe width. And to encourage aligned,
full stripe writes but nothing bigger than that. Obviously the wording
went through the usual standards body process to be vague/generic enough
to be used for anything. It has changed several times since sbc3r25,
btw.

The kernel really isn't using io_opt. The value is merely stacked and
communicated to userspace. The reason the partitioning tools blow up
with weird values is that they try to align partition beginnings to the
stripe width. Which is the right thing to do as far as I'm concerned.
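
Roughly what the tools do with the two hints is something like this
(illustrative only, not lifted from any particular tool):

#include <stdint.h>

/*
 * Illustration only, not code from any particular partitioning tool:
 * align the start of a partition to io_opt (the stripe width) when it
 * is set, otherwise fall back to io_min.
 */
static uint64_t align_partition_start(uint64_t start, uint32_t io_min,
				      uint32_t io_opt)
{
	uint32_t granularity = io_opt ? io_opt : io_min;

	if (granularity <= 1)
		return start;

	return ((start + granularity - 1) / granularity) * granularity;
}

With an io_opt of 4294966784 the first aligned offset lands almost 4GB
into the disk, which is the sort of thing that throws off fdisk and
friends.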

I have worked with many, many partners in the storage industry to make
sure they report sensible values in the Block Limits VPD. I have no
reason to believe that the SAS drive issue in question is anything but a
simple typo. I know there was a bug open with Seagate. I assume it has
been fixed in their latest firmware. To my knowledge it is not a problem
in any of their other drive models. Certainly isn't in any of the ones
we are shipping.

The unfortunate thing with disk drives is that firmware updates are much
harder to deal with. And you rarely end up having access to an updated
firmware unless your drive was procured through a vendor like Dell, HP
or Oracle. That's why I originally opted to quirk this model in
Linux. Otherwise I would just have said "update your firmware".

If we had devices from many different vendors showing up with values
that constantly threw off our tooling I would have more reason to be
concerned. But we haven't. And this code has been in the kernel since
2.6.32 or so.

--
Martin K. Petersen Oracle Linux Engineering

by Elliott, Robert (Server Storage)

Subject: RE: absurdly high "optimal_io_size" on Seagate SAS disk

> commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
> Author: Martin K. Petersen <[email protected]>
> Date: Thu Nov 6 12:31:43 2014 -0500
>
> [SCSI] sd: Sanity check the optimal I/O size
>
> We have come across a couple of devices that report crackpot
> values in the optimal I/O size in the Block Limits VPD page.
> Since this is a 32-bit entity that gets multiplied by the
> logical block size we can get
> disproportionately large values reported to the block layer.
>
> Cap io_opt at 1 GB.

Another reasonable cap is the maximum transfer size.
There are lots of them:

* the block layer BIO_MAX_PAGES value of 256 limits IOs
to a maximum of 1 MiB
* SCSI LLDs report their maximum transfer size in
/sys/block/sdNN/queue/max_hw_sectors_kb
* the SCSI midlayer maximum transfer size is set/reported
in /sys/block/sdNN/queue/max_sectors_kb
and the default is 512 KiB
* the SCSI LLD maximum number of scatter gather entries
reported in /sys/block/sdNN/queue/max_segments and
/sys/block/sdNN/queue/max_segment_size creates a
limit based on how fragmented the data buffer is
in virtual memory
* the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
indicates the maximum transfer size for one command over
the SCSI transport protocol supported by the drive itself

It is risky to use transfer sizes larger than linux and
Windows can generate, since drives are probably tested in
those environments.
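
Conceptually (this is not actual midlayer code, just the idea), the
transfer size a command can actually use ends up being the minimum of
those limits:

#include <stdint.h>

/*
 * Not actual midlayer code, just the idea: the usable transfer size is
 * bounded by the smallest of the limits listed above (the scatter-gather
 * limit is left out here since it depends on how fragmented the buffer is).
 */
static uint64_t effective_max_bytes(uint64_t bio_max_bytes,
				    uint64_t max_hw_sectors_kb,
				    uint64_t max_sectors_kb,
				    uint64_t vpd_max_xfer_blocks,
				    uint64_t sector_sz)
{
	uint64_t limit = bio_max_bytes;

	if (max_hw_sectors_kb * 1024 < limit)
		limit = max_hw_sectors_kb * 1024;
	if (max_sectors_kb * 1024 < limit)
		limit = max_sectors_kb * 1024;
	if (vpd_max_xfer_blocks && vpd_max_xfer_blocks * sector_sz < limit)
		limit = vpd_max_xfer_blocks * sector_sz;

	return limit;
}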

---
Rob Elliott HP Server Storage


2014-11-07 17:41:11

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Rob" == Elliott, Robert (Server Storage) <[email protected]> writes:

Rob,

Rob> * the block layer BIO_MAX_PAGES value of 256 limits IOs
Rob> to a maximum of 1 MiB

We do support scatterlist chaining, though.

Rob> * SCSI LLDs report their maximum transfer size in
Rob> /sys/block/sdNN/queue/max_hw_sectors_kb
Rob> * the SCSI midlayer maximum transfer size is set/reported
Rob> in /sys/block/sdNN/queue/max_sectors_kb and the default is 512
Rob> KiB
Rob> * the SCSI LLD maximum number of scatter gather entries
Rob> reported in /sys/block/sdNN/queue/max_segments and
Rob> /sys/block/sdNN/queue/max_segment_size creates a limit based on
Rob> how fragmented the data buffer is in virtual memory
Rob> * the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
Rob> indicates the maximum transfer size for one command over the SCSI
Rob> transport protocol supported by the drive itself

Yep. We're already capping the actual max I/O size based on all of the
above. However, the purpose of exposing io_opt was to be able to report
stripe size to partitioning tools and filesystems for alignment
purposes. And although they would ideally be the same it was always
anticipated that stripe size could be bigger than the max I/O size.

--
Martin K. Petersen Oracle Linux Engineering

2014-11-07 17:42:40

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Martin" == Martin K Petersen <[email protected]> writes:

Martin> I know there was a bug open with Seagate. I assume it has been
Martin> fixed in their latest firmware.

Seagate confirms that this issue was fixed about a year ago. Will
provide more data when I have it.

--
Martin K. Petersen Oracle Linux Engineering

2014-11-07 17:51:50

by Chris Friesen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 11/07/2014 11:42 AM, Martin K. Petersen wrote:
>>>>>> "Martin" == Martin K Petersen <[email protected]> writes:
>
> Martin> I know there was a bug open with Seagate. I assume it has been
> Martin> fixed in their latest firmware.
>
> Seagate confirms that this issue was fixed about a year ago. Will
> provide more data when I have it.

Okay, thanks for the clarification (for this and the spec itself).

Apparently there's a new firmware available, dated Oct 13 but with no
release notes. We just tried updating the firmware on one of the drives
in question and it failed from two different versions of linux, while
Windows won't install because it doesn't like our SSD apparently. Joy.

Chris

2014-11-07 18:03:59

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Chris" == Chris Friesen <[email protected]> writes:

Chris> Apparently there's a new firmware available, dated Oct 13 but
Chris> with no release notes. We just tried updating the firmware on
Chris> one of the drives in question and it failed from two different
Chris> versions of linux,

Did you use sg_write_buffer or some special firmware update tool?

--
Martin K. Petersen Oracle Linux Engineering

2014-11-07 18:48:39

by Chris Friesen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 11/07/2014 10:25 AM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <[email protected]> writes:
>
> Chris,
>
> Chris> Also, I think it's wrong for filesystems and userspace to use it
> Chris> for alignment. In E.4 and E.5 in the "sbc3r25.pdf" doc, it looks
> Chris> like they use the optimal granularity field for alignment, not
> Chris> the optimal transfer length.
>
> The original rationale behind the OTLG and OTL values was to be able to
> express stripe chunk size and stripe width. And to encourage aligned,
> full stripe writes but nothing bigger than that. Obviously the wording
> went through the usual standards body process to be vague/generic enough
> to be used for anything. It has changed several times since sbc3r25,
> btw.

You've obviously been involved in this area a lot more closely than me,
so I'll defer to your experience. :)

I think that if that's the intended use case, then the spec wording could
be improved. Looking at "sbc3r36.pdf", it still only explicitly mentions
performance penalties for transfers that are larger than the "optimal
transfer length", not for transfers that are smaller.



On 11/07/2014 12:03 PM, Martin K. Petersen wrote:
>>>>>> "Chris" == Chris Friesen <[email protected]> writes:
>
> Chris> Apparently there's a new firmware available, dated Oct 13 but
> Chris> with no release notes. We just tried updating the firmware on
> Chris> one of the drives in question and it failed from two different
> Chris> versions of linux,
>
> Did you use sg_write_buffer or some special firmware update tool?

Both. I didn't do it myself, but the guy who did sent me the following:


localhost:~$ ./dl_sea_fw-0.2.3_64 -m ST900MM0026 -d /dev/sda -f Lightningbug10K6-SED-0003.LOD
================================================================================
Seagate Firmware Download Utility v0.2.3 Build Date: Jan 9 2013
Copyright (c) 2012 Seagate Technology LLC, All Rights Reserved
Fri Nov 7 14:51:21 2014
================================================================================
Downloading file Lightningbug10K6-SED-0003.LOD to /dev/sda
send_io: Input/output error
send_io: Input/output error
!
FW Download FAILED


This log is from a different system running Debian:

root@bricklane-2:/home/cgcs# sg_write_buffer -vvv --in=Lightningbug10K6-SED-0003.LOD --length=1752576 --mode=5 /dev/sdb
open /dev/sdb with flags=0x802
sending single write buffer, mode=0x5, mpsec=0, id=0, offset=0, len=1752576
Write buffer cmd: 3b 05 00 00 00 00 1a be 00 00
Write buffer parameter list (first 256 bytes):
e7 1a 0e 59 01 00 02 00 00 00 00 00 00 00 19 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 be 1a 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 d5 cd
00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 00
80 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 bc 1a 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 5f 42
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ioctl(SG_IO v3) failed: Invalid argument (errno=22)
write buffer: pass through os error: Invalid argument
Write buffer failed: Sense category: -1, try '-v' option for more information


Apparently the "hdparm -I" command is giving bogus data as well.
I've seen that happen if the drive is on a RAID controller--I assume
that could cause problems with firmware updates too?

Chris

2014-11-07 19:17:47

by Martin K. Petersen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

>>>>> "Chris" == Chris Friesen <[email protected]> writes:

Chris> Apparently the "hdparm -I" command is giving bogus data as well.
Chris> I've seen that happen if the drive is on a RAID controller--I
Chris> assume that could cause problems with firmware updates too?

I'd suggest trying /dev/sgN instead.

But yes, some RAID controllers require you to use their tooling and
won't allow direct passthrough.

--
Martin K. Petersen Oracle Linux Engineering

2014-11-07 20:15:20

by Douglas Gilbert

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 14-11-07 12:10 PM, Elliott, Robert (Server Storage) wrote:
>> commit 87c0103ea3f96615b8a9816b8aee8a7ccdf55d50
>> Author: Martin K. Petersen <[email protected]>
>> Date: Thu Nov 6 12:31:43 2014 -0500
>>
>> [SCSI] sd: Sanity check the optimal I/O size
>>
>> We have come across a couple of devices that report crackpot
>> values in the optimal I/O size in the Block Limits VPD page.
>> Since this is a 32-bit entity that gets multiplied by the
>> logical block size we can get
>> disproportionately large values reported to the block layer.
>>
>> Cap io_opt at 1 GB.
>
> Another reasonable cap is the maximum transfer size.
> There are lots of them:
>
> * the block layer BIO_MAX_PAGES value of 256 limits IOs
> to a maximum of 1 MiB
> * SCSI LLDs report their maximum transfer size in
> /sys/block/sdNN/queue/max_hw_sectors_kb
> * the SCSI midlayer maximum transfer size is set/reported
> in /sys/block/sdNN/queue/max_sectors_kb
> and the default is 512 KiB
> * the SCSI LLD maximum number of scatter gather entries
> reported in /sys/block/sdNN/queue/max_segments and
> /sys/block/sdNN/queue/max_segment_size creates a
> limit based on how fragmented the data buffer is
> in virtual memory
> * the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
> indicates the maximum transfer size for one command over
> the SCSI transport protocol supported by the drive itself
>
> It is risky to use transfer sizes larger than linux and
> Windows can generate, since drives are probably tested in
> those environments.

After being burnt by a (virtual) SCSI disk recently, my
utilities now take a more aggressive approach to the data-in
buffer received from INQUIRY, MODE SENSE and LOG SENSE (and
probably should add a few more):

At a low level, after the command is completed, the data-in
buffer is post-filled with zeros following the last valid
byte as indicated by resid, until the end of that buffer.
Then it is passed back for higher level processing of the
command including its data-in buffer.
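
In outline it is something like this (a simplified sketch, not the
actual sg3_utils code):

#include <string.h>

/*
 * Simplified sketch, not the actual sg3_utils code: zero the tail of the
 * data-in buffer that the device did not fill, as indicated by resid.
 */
static void zero_unfilled_din(unsigned char *din, int din_len, int resid)
{
	if (resid <= 0)
		return;			/* device filled the whole buffer */
	if (resid > din_len)
		resid = din_len;	/* defensive: bogus resid from the HBA */
	memset(din + (din_len - resid), 0, resid);
}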

Pre-filling the data-in buffer with zeros has been in place
for a long time, but I don't think it helps much.


So if there are any HBA drivers that set resid higher than it
should be, expect some pain soon.

Doug Gilbert

2014-11-07 21:04:38

by Chris Friesen

Subject: Re: absurdly high "optimal_io_size" on Seagate SAS disk

On 11/07/2014 01:17 PM, Martin K. Petersen wrote:

> I'd suggest trying /dev/sgN instead.

That seems to work. Much appreciated.

And it's now showing an "optimal_io_size" of 0, so I think the issue is
dealt with.

Thanks for all the help, it's been educational. :)

Chris