2015-07-20 20:18:55

by Jeff Moyer

Subject: [patch] Revert "block: remove artifical max_hw_sectors cap"


<resent with Jens' email address fixed>

Hi,

This reverts commit 34b48db66e08, which caused significant iozone
performance regressions and uncovered a silent data corruption
bug in at least one disk.

For SAN storage, we've seen initial write and re-write performance drop
25-50% across all I/O sizes. On locally attached storage, we've seen
regressions of 40% for all I/O types, but only for I/O sizes larger than
1MB.

In addition to the performance issues, we've also seen data corruption
on one disk/hba combination. See
http://marc.info/?l=linux-ide&m=143680539400526&w=2

Signed-off-by: Jeff Moyer <[email protected]>

---

I'm open to other suggestions on how to address this, but the
performance regression is seen across a number of our different test
platforms, and the silent data corruption really scares me. Did we
ever get performance numbers that showed a benefit to lifting the
max_sectors_kb cap?

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 12600bf..b160f89 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -257,7 +257,9 @@ void blk_limits_max_hw_sectors(struct queue_limits *limits, unsigned int max_hw_
__func__, max_hw_sectors);
}

- limits->max_sectors = limits->max_hw_sectors = max_hw_sectors;
+ limits->max_hw_sectors = max_hw_sectors;
+ limits->max_sectors = min_t(unsigned int, max_hw_sectors,
+ BLK_DEF_MAX_SECTORS);
}
EXPORT_SYMBOL(blk_limits_max_hw_sectors);

diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 46c282f..dd73e1f 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -395,7 +395,7 @@ aoeblk_gdalloc(void *vp)
WARN_ON(d->flags & DEVFL_TKILL);
WARN_ON(d->gd);
WARN_ON(d->flags & DEVFL_UP);
- blk_queue_max_hw_sectors(q, 1024);
+ blk_queue_max_hw_sectors(q, BLK_DEF_MAX_SECTORS);
q->backing_dev_info.name = "aoe";
q->backing_dev_info.ra_pages = READ_AHEAD / PAGE_CACHE_SIZE;
d->bufpool = mp;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d4068c1..1fd459e1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1138,6 +1138,7 @@ extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm);
enum blk_default_limits {
BLK_MAX_SEGMENTS = 128,
BLK_SAFE_MAX_SECTORS = 255,
+ BLK_DEF_MAX_SECTORS = 1024,
BLK_MAX_SEGMENT_SIZE = 65536,
BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL,
};


2015-07-20 20:24:37

by Jens Axboe

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

On 07/20/2015 01:17 PM, Jeff Moyer wrote:
>
> <resent with Jens' email address fixed>
>
> Hi,
>
> This reverts commit 34b48db66e08, which caused significant iozone
> performance regressions and uncovered a silent data corruption
> bug in at least one disk.
>
> For SAN storage, we've seen initial write and re-write performance drop
> 25-50% across all I/O sizes. On locally attached storage, we've seen
> regressions of 40% for all I/O types, but only for I/O sizes larger than
> 1MB.

Do we have any understanding of where this regression is coming from?
Even just basic info like iostats from a run would be useful.

> In addition to the performance issues, we've also seen data corruption
> on one disk/hba combination. See
> http://marc.info/?l=linux-ide&m=143680539400526&w=2

That's just sucky hardware... That said, it is indeed one of the risks.
We had basically the same transition from 255 as max sectors, since we
depended on ATA treating 0 == 256 sectors (as per spec).

--
Jens Axboe

2015-07-20 20:44:31

by Jeff Moyer

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

Jens Axboe <[email protected]> writes:

> On 07/20/2015 01:17 PM, Jeff Moyer wrote:
>>
>> <resent with Jens' email address fixed>
>>
>> Hi,
>>
>> This reverts commit 34b48db66e08, which caused significant iozone
>> performance regressions and uncovered a silent data corruption
>> bug in at least one disk.
>>
>> For SAN storage, we've seen initial write and re-write performance drop
>> 25-50% across all I/O sizes. On locally attached storage, we've seen
>> regressions of 40% for all I/O types, but only for I/O sizes larger than
>> 1MB.
>
> Do we have any understanding of where this regression is coming from?
> Even just basic info like iostats from a run would be useful.

I'll request this information and get back to you. Sorry, I should have
done more digging first, but this seemed somewhat urgent to me.

>> In addition to the performance issues, we've also seen data corruption
>> on one disk/hba combination. See
>> http://marc.info/?l=linux-ide&m=143680539400526&w=2
>
> That's just sucky hardware... That said, it is indeed one of the
> risks. We had basically the same transition from 255 as max sectors,
> since we depended on ATA treating 0 == 256 sectors (as per spec).

Sure, the hardware sucks. I still don't like foisting silent data
corruption on users. Besides, given that this patch went in without any
performance numbers attached, I'd say the risk/reward ratio right now is
in favor of the revert.

Cheers,
Jeff

2015-07-21 13:02:48

by Christoph Hellwig

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

On Mon, Jul 20, 2015 at 03:17:07PM -0400, Jeff Moyer wrote:
> For SAN storage, we've seen initial write and re-write performance drop
> 25-50% across all I/O sizes. On locally attached storage, we've seen
> regressions of 40% for all I/O types, but only for I/O sizes larger than
> 1MB.

Workload and hardware, please. And only mainline numbers, not some old
hacked vendor kernel, please.

2015-07-29 16:52:44

by Jeff Moyer

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

[global]
ioengine=sync
direct=1
filename=/dev/DEVNAME
size=1G
rw=write

[1M]
stonewall
blocksize=1M

[2M]
stonewall
blocksize=2M

[4M]
stonewall
blocksize=4M
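To reproduce with the job file above, the limit under test is toggled through sysfs between runs. A minimal sketch of doing that programmatically (device name and value are placeholders); it is equivalent to echoing the number into /sys/block/sdX/queue/max_sectors_kb:

#include <stdio.h>
#include <stdlib.h>

/* Write a new max_sectors_kb value for a block device (needs root). */
static int set_max_sectors_kb(const char *dev, unsigned int kb)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/max_sectors_kb", dev);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%u\n", kb);
	return fclose(f);
}

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sdX";	/* placeholder */
	unsigned int kb = argc > 2 ? atoi(argv[2]) : 1024;

	if (set_max_sectors_kb(dev, kb)) {
		perror("max_sectors_kb");
		return 1;
	}
	return 0;
}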


Attachments:
seqwrite.fio (158.00 B)

by Robert Elliott

Subject: RE: [patch] Revert "block: remove artifical max_hw_sectors cap"


> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Jeff Moyer
> Sent: Wednesday, July 29, 2015 11:53 AM
> To: Christoph Hellwig <[email protected]>
> Cc: Jens Axboe <[email protected]>; [email protected];
> [email protected]

Adding linux-scsi...

> Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"
>
> Christoph Hellwig <[email protected]> writes:
>
> > On Mon, Jul 20, 2015 at 03:17:07PM -0400, Jeff Moyer wrote:
> >> For SAN storage, we've seen initial write and re-write performance drop
> >> 25-50% across all I/O sizes. On locally attached storage, we've seen
> >> regressions of 40% for all I/O types, but only for I/O sizes larger
> than
> >> 1MB.
> >
> > Workload and hardware, please. And only mainline numbers, not some old
> > hacked vendor kernel, please.
>
> I've attached a simple fio config that reproduces the problem. It just
> does sequential, O_DIRECT write I/O with I/O sizes of 1M, 2M and 4M. So
> far I've tested it on an HP HSV400 and an IBM XIV SAN array connected
> via a qlogic adapter, a nearline sata drive and a WD Red (NAS) sata disk
> connected via an intel ich9r sata controller. The kernel I tested was
> 4.2.0-rc3, and the testing was done across 3 different hosts (just
> because I don't have all the hardware connected to a single box). I did
> 10 runs using max_sectors_kb set to 1024, and 10 runs with it set to
> 32767. Results compare the averages of those 10 runs. In no cases did
> I see a performance gain. In two cases, there is a performance hit.
>
> In addition to my testing, our performance teams have seen performance
> regressions running iozone on fibre channel-attached HP MSA1000 storage,
> as well as on an SSD hidden behind a megaraid controller. I was not
> able to get the exact details on the SSD. iozone configurations can be
> provided, but I think I've nailed the underlying problem with this test
> case.
>
> But, don't take my word for it. Run the fio script on your own
> hardware. All you have to do is echo a couple of values into
> /sys/block/sdX/queue/max_sectors_kb to test, no kernel rebuilding
> required.
>
> In the tables below, concentrate on the BW/IOPS numbers under the WRITE
> column. Negative numbers indicate that max_sectors_kb of 32767 shows a
> performance regression of the indicated percentage when compared with a
> setting of 1024.
>
> Christoph, did you have some hardware where a higher max_sectors_kb
> improved performance?

I don't still have performance numbers, but the old default of
512 KiB was interfering with building large writes that RAID
controllers can treat as full stripe writes (avoiding the need
to read the old parity).

With the SCSI LLD value bumped up, some other limits remain:
* the block layer BIO_MAX_PAGES value of 256 limits IOs
to a maximum of 1 MiB (bio chaining affects this too)
* the SCSI LLD maximum number of scatter gather entries
reported in /sys/block/sdNN/queue/max_segments and
/sys/block/sdNN/queue/max_segment_size creates a
limit based on how fragmented the data buffer is
in virtual memory
* the Block Limits VPD page MAXIMUM TRANSFER LENGTH field
indicates the maximum transfer size for one command over
the SCSI transport protocol supported by the drive itself

The patch let 1 MiB IOs flow through the stack, which
is a better fit for modern strip sizes than 512 KiB.
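As a rough illustration of that stacking (example values only, not read from a real device or the kernel's actual code path), the effective per-command size falls out as the minimum of those limits:

#include <stdio.h>

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int page_size      = 4096;
	unsigned int bio_max_pages  = 256;		/* BIO_MAX_PAGES */
	unsigned int max_segments   = 128;		/* queue/max_segments */
	unsigned int max_seg_size   = 65536;		/* queue/max_segment_size */
	unsigned int vpd_max_xfer   = 65535U * 512;	/* Block Limits VPD, bytes */
	unsigned int max_sectors_kb = 32767;		/* queue/max_sectors_kb */

	/* bio size cap: 256 pages of 4 KiB = 1 MiB */
	unsigned int limit = bio_max_pages * page_size;
	/* upper bound only; the real segment limit depends on how
	 * fragmented the data buffer is in virtual memory */
	limit = min_u(limit, max_segments * max_seg_size);
	limit = min_u(limit, vpd_max_xfer);
	limit = min_u(limit, max_sectors_kb * 1024);

	printf("effective cap: %u KiB\n", limit / 1024);
	return 0;
}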

Software using large IOs must be prepared for long
latencies in exchange for the potential bandwidth gains,
and must use a low (but greater than 1) queue depth
to keep the IOs flowing back-to-back.

Are you finding real software generating such IOs
but relying on the storage stack to break them up
for decent performance?

Your fio script is using the sync IO engine, which
means no queuing. This forces a turnaround time
between IOs, preventing the device from looking ahead
to see what's next (for sequential IOs, probably
continuing data transfers with minimal delay).

If the storage stack breaks up large sync IOs, the
drive might be better at detecting that the access
pattern is sequential (e.g., the gaps are between
every set of 2 IOs rather than every IO). This is
very drive-specific.

If we have to go back to that artificial limit, then
modern drivers (e.g., blk-mq capable drivers) need a
way to raise the default; relying on users to change
the sysfs settings means they're usually not changed.

---
Robert Elliott, HP Server Storage

2015-07-30 13:51:37

by Christoph Hellwig

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

Hi Jeff,

thanks for the detailed numbers!

The bigger I/O size makes a drastic impact for Linux software RAID
setups, which were the driver for this change. For the RAID5/6 over
SATA disk setups that I was benchmarking, it gives between 20 and 40%
better sequential read and write numbers.

Besides those, I've tested it on various SSDs where it didn't make any
difference, e.g. with your fio script on the laptop with a Samsung 840
Evo that I'm currently travelling with.

First line max_sectors_kb = 512, second line
max_sectors_kb = max_hw_sectors_kb (32767)

Run status group 0 (all jobs):
WRITE: io=1024.0MB, aggrb=471270KB/s, minb=471270KB/s, maxb=471270KB/s, mint=2225msec, maxt=2225msec
WRITE: io=1024.0MB, aggrb=475976KB/s, minb=475976KB/s, maxb=475976KB/s, mint=2203msec, maxt=2203msec

Run status group 1 (all jobs):
WRITE: io=1024.0MB, aggrb=477276KB/s, minb=477276KB/s, maxb=477276KB/s, mint=2197msec, maxt=2197msec
WRITE: io=1024.0MB, aggrb=479677KB/s, minb=479677KB/s, maxb=479677KB/s, mint=2186msec, maxt=2186msec

Run status group 2 (all jobs):
WRITE: io=1024.0MB, aggrb=488618KB/s, minb=488618KB/s, maxb=488618KB/s, mint=2146msec, maxt=2146msec
WRITE: io=1024.0MB, aggrb=489302KB/s, minb=489302KB/s, maxb=489302KB/s, mint=2143msec, maxt=2143msec

2015-07-30 14:03:38

by Jeff Moyer

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

"Elliott, Robert (Server Storage)" <[email protected]> writes:

>> Christoph, did you have some hardware where a higher max_sectors_kb
>> improved performance?
>
> I don't still have performance numbers, but the old default of
> 512 KiB was interfering with building large writes that RAID
> controllers can treat as full stripe writes (avoiding the need
> to read the old parity).

Too bad you don't still have data. Does this mean you never posted the
data to the list? What type of performance gains were there? 1%? 5%?
100%?

> The patch let 1 MiB IOs flow through the stack, which is a better fit
> for modern strip sizes than 512 KiB.

I agree in principle. I'd love to see the numbers to back it up,
though. And, keep in mind that the patch in question doesn't bump the
limit to 1MB. It bumps it up to max_hw_sectors_kb, which is 32767 on
the hardware I tested. I wouldn't be against raising the limit to 1MB,
or even 1280k to accommodate entire RAID stripe writes/reads. The
numbers I posted didn't really seem to regress until I/Os got larger
than 1MB (though I didn't test anything between 1MB and 2MB).
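For context on where a 1280k cap would come from, a quick sketch of the full-stripe arithmetic, assuming the 10-data-disk layouts with 32k and 128k chunks that come up later in this thread:

#include <stdio.h>

int main(void)
{
	/* A full RAID5 stripe write spans data_disks * chunk_size. */
	unsigned int data_disks = 10;
	unsigned int chunk_kb[] = { 32, 128 };

	for (int i = 0; i < 2; i++)
		printf("chunk %3u KiB -> full stripe %4u KiB\n",
		       chunk_kb[i], data_disks * chunk_kb[i]);
	return 0;
}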

> Software using large IOs must be prepared for long latencies in
> exchange for the potential bandwidth gains, and must use a low (but
> greater than 1) queue depth to keep the IOs flowing back-to-back.

> Are you finding real software generating such IOs but relying on the
> storage stack to break them up for decent performance?

As I stated at the beginning of this thread, the regression was reported
when running iozone. I would be surprised, however, if there were no
real workloads that issued streaming I/Os through the page cache.
Writeback performance matters. If you don't believe me, I'll CC Dave
Chinner, and he'll bore you into submission with details and data.

> Your fio script is using the sync IO engine, which means no queuing.
> This forces a turnaround time between IOs, preventing the device from
> looking ahead to see what's next (for sequential IOs, probably
> continuing data transfers with minimal delay).

I used the sync I/O engine with direct=1 because I was trying to
highlight the problem. If I used direct=0, then we would get a lot of
caching, and even larger I/Os would be sent down, and I wouldn't be able
to tell whether 1M, 2M or 4M I/Os regressed.

> If the storage stack breaks up large sync IOs, the drive might be
> better at detecting that the access pattern is sequential (e.g., the
> gaps are between every set of 2 IOs rather than every IO). This is
> very drive-specific.

Of course it's drive-specific! I just showed you four drives, and only
two regressed. The point I'm making is that I can't find a single
device that performs better. Even the HP enterprise storage arrays
perform worse in this configuration. But, by all means, PROVE ME
WRONG. It's simple. I showed you how; all you have to do is run the
tests and report the data.

> If we have to go back to that artificial limit, then modern drivers
> (e.g., blk-mq capable drivers) need a way to raise the default;
> relying on users to change the sysfs settings means they're usually
> not changed.

Do we? I don't think anyone has shown the real need for this. And it's
dead simple to show the need, which is the frustrating part. Run your
favorite workload on your favorite storage with two different values of
max_sectors_kb. Enough with the hand-waving, show me the data!

Cheers,
Jeff

2015-07-30 14:08:21

by Jeff Moyer

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

Christoph Hellwig <[email protected]> writes:

> Hi Jeff,
>
> thanks for the detailed numbers!
>
> The bigger I/O size makes a drastic impact for Linux software RAID
> setups, which were the driver for this change. For the RAID5/6 over
> SATA disk setups that I was benchmarking, it gives between 20 and 40%
> better sequential read and write numbers.

OK, thanks for the details on the setup. Did you try with max_sectors_kb
values between 512 and 32767? I wonder if we can find a happy middle
ground, like 1280 or even 2048.

I'll try to set up a software RAID 5 with 10 disks and get back to you.
If you have a preference for the exact geometry, please speak up.

Thanks!
Jeff

2015-08-10 17:23:51

by Jeff Moyer

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

Christoph Hellwig <[email protected]> writes:

> Hi Jeff,
>
> thanks for the detailed numbers!
>
> The bigger I/O size makes a drastic impact for Linux software RAID
> setups, which were the driver for this change. For the RAID5/6 over
> SATA disk setups that I was benchmarking, it gives between 20 and 40%
> better sequential read and write numbers.

Hi, Christoph,

Unfortunately, I'm unable to reproduce your results (though my test
setup uses SAS disks, not SATA). I tried with a 10 data disk md RAID5,
with 32k and 128k chunk sizes. I modified the fio program to read/write
multiples of the stripe width, and I also used aio-stress over a range
of queue depth and I/O sizes for read, write, random read and random
write. I didn't see any measurable performance difference. Do you
still have access to your test setup?

What do you think about reinstituting the artificial max_sectors_kb cap,
but bumping the default up from 512KB to 1280KB? I had our performance
team run numbers on their test setup (which is a 12 disk raid0 with 32k
chunk size, fwiw) with max_sectors_kb set to 1280 and, aside from one
odd data point, things looked good.

Cheers,
Jeff

2015-08-11 14:28:15

by Christoph Hellwig

Subject: Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

On Mon, Aug 10, 2015 at 01:23:48PM -0400, Jeff Moyer wrote:
> Unfortunately, I'm unable to reproduce your results (though my test
> setup uses SAS disks, not SATA). I tried with a 10 data disk md RAID5,
> with 32k and 128k chunk sizes. I modified the fio program to read/write
> multiples of the stripe width, and I also used aio-stress over a range
> of queue depth and I/O sizes for read, write, random read and random
> write. I didn't see any measurable performance difference. Do you
> still have access to your test setup?

Unfortunately, I don't have access to a large RAID setup at the moment.

> What do you think about reinstituting the artificial max_sectors_kb cap,
> but bumping the default up from 512KB to 1280KB? I had our performance
> team run numbers on their test setup (which is a 12 disk raid0 with 32k
> chunk size, fwiw) with max_sectors_kb set to 1280 and, aside from one
> odd data point, things looked good.

Let's settle on that for 4.2 and then look into it in more detail.

>
> Cheers,
> Jeff
---end quoted text---