2013-05-10 14:04:50

by David Oostdyk

Subject: high-speed disk I/O is CPU-bound?

Hello,

I have a few relatively high-end systems with hardware RAIDs which are
being used for recording systems, and I'm trying to get a better
understanding of contiguous write performance.

The hardware that I've tested with includes two high-end Intel E5-2600
and E5-4600 (~3GHz) series systems, as well as a slightly older Xeon
5600 system. The JBODs include a 45x3.5" JBOD, a 28x3.5" JBOD (with
either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD with 10kRPM
drives. I've tried LSI controllers (9285-8e, 9266-8i, as well as the
integrated Intel LSI controllers) as well as Adaptec Series 7 RAID
controllers (72405 and 71685).

Normally I'll set up the RAIDs as RAID60 and format them as XFS, but the
exact RAID level, filesystem type, and even RAID hardware don't seem to
matter very much from my observations (but I'm willing to try any
suggestions). As a basic benchmark, I have an application that simply
writes the same buffer (say, 128MB) to disk repeatedly. Alternatively
you could use the "dd" utility. (For these benchmarks, I set
/proc/sys/vm/dirty_bytes to 512M or lower, since these systems have a
lot of RAM.)
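
For anyone reproducing this, here is a minimal sketch of the buffered benchmark setup described above. It only prints the commands rather than running them, since the first one needs root and the second overwrites its target. The path /mnt/raid/testfile and the helper names are hypothetical; the sketch also assumes /proc/sys/vm/dirty_bytes takes a plain byte count, so "512M" has to be written out.

```shell
#!/bin/sh
# Convert a size in MiB to bytes.
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

# Print (do not run) the commands for one buffered streaming-write run.
# Args: target path (hypothetical), dirty limit in MiB, number of 128MB writes.
gen_buffered_run() {
    echo "echo $(mib_to_bytes "$2") > /proc/sys/vm/dirty_bytes"
    echo "dd if=/dev/zero of=$1 bs=128M count=$3"
}

gen_buffered_run /mnt/raid/testfile 512 100
```

Piping the output to a root shell would apply the setting and start the write; without oflag=direct this is the buffered path that the observations below refer to.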

The basic observations are:

1. "single-threaded" writes, either a file on the mounted filesystem or
with a "dd" to the raw RAID device, seem to be limited to
1200-1400MB/sec. These numbers vary slightly based on whether
TurboBoost is affecting the writing process or not. "top" will show
this process running at 100% CPU.

2. With two benchmarks running on the same device, I see aggregate
write speeds of up to ~2.4GB/sec, which is closer to what I'd expect the
drives to be able to deliver. This can either be with two
applications writing to separate files on the same mounted file system,
or two separate "dd" applications writing to distinct locations on the
raw device. (Increasing the number of writers beyond two does not seem
to increase aggregate performance; "top" will show both processes
running at perhaps 80% CPU).

3. I haven't been able to find any tricks (lio_listio, multiple threads
writing to distinct file offsets, etc) that seem to deliver higher write
speeds when writing to a single file. (This might be xfs-specific, though)

4. Cheap tricks like making a software RAID0 of two hardware RAID
devices do not deliver any improved performance for single-threaded
writes. (I have not thoroughly tested this configuration with
multiple writers, though.)

5. Similar hardware on Windows seems to be able to deliver >3GB/sec
write speeds on single-threaded writes, and the trick of making a
software RAID0 of two hardware RAIDs does deliver increased write
speeds. (I only point this out to say that I think the hardware is not
necessarily the bottleneck.)

The question is, is it possible that high-speed I/O to these hardware
RAIDs could actually be CPU-bound above ~1400MB/sec?

It seems to be the only explanation of the benchmarks that I've been
seeing, but I don't know where to start looking to really determine the
bottleneck. I'm certainly open to suggestions for running different
configurations or benchmarks.

Thanks for any help/advice!
Dave O.


2013-05-11 00:19:09

by Eric Wong

Subject: Re: high-speed disk I/O is CPU-bound?

Cc-ing Jens

David Oostdyk wrote:
> Hello,
>
> I have a few relatively high-end systems with hardware RAIDs which
> are being used for recording systems, and I'm trying to get a better
> understanding of contiguous write performance.
>
> The hardware that I've tested with includes two high-end Intel
> E5-2600 and E5-4600 (~3GHz) series systems, as well as a slightly
> older Xeon 5600 system. The JBODs include a 45x3.5" JBOD, a 28x3.5"
> JBOD (with either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD
> with 10kRPM drives. I've tried LSI controllers (9285-8e, 9266-8i,
> as well as the integrated Intel LSI controllers) as well as Adaptec
> Series 7 RAID controllers (72405 and 71685).

Which I/O scheduler are you using? noop (or deadline) may improve
things with hardware RAID.
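
A minimal sketch of checking which scheduler is active, assuming the usual sysfs layout in which the scheduler file lists the available schedulers and marks the active one with square brackets. The device name sdX is a placeholder and the helper name is made up for illustration.

```shell
#!/bin/sh
# Print the active I/O scheduler from a sysfs scheduler line such as
# "noop deadline [cfq]" (the active entry is the bracketed one).
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example with a literal line; on a real system read it from sysfs instead:
#   active_sched "$(cat /sys/block/sdX/queue/scheduler)"
active_sched "noop deadline [cfq]"

# Switching requires root and takes effect immediately, e.g.:
#   echo deadline > /sys/block/sdX/queue/scheduler
```

The example call prints cfq. The setting is per block device, so it would need to be applied to each RAID volume under test.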

2013-05-12 16:53:43

by Rob Landley

Subject: Re: high-speed disk I/O is CPU-bound?

On 05/10/2013 09:04:44 AM, David Oostdyk wrote:
> Hello,
>
> I have a few relatively high-end systems with hardware RAIDs which
> are being used for recording systems, and I'm trying to get a better
> understanding of contiguous write performance.
...
> The question is, is it possible that high-speed I/O to these hardware
> RAIDs could
> actually be CPU-bound above ~1400MB/sec?

In some setups your processor is calculating CRCs for the data. It's a
fairly cheap operation, but a cheap operation on gigabytes of data can
still saturate your memory bus.

Rob

2013-05-13 14:58:42

by David Oostdyk

Subject: Re: high-speed disk I/O is CPU-bound?

On 05/10/13 20:19, Eric Wong wrote:
> Cc-ing Jens
>
> David Oostdyk wrote:
>> Hello,
>>
>> I have a few relatively high-end systems with hardware RAIDs which
>> are being used for recording systems, and I'm trying to get a better
>> understanding of contiguous write performance.
>>
>> The hardware that I've tested with includes two high-end Intel
>> E5-2600 and E5-4600 (~3GHz) series systems, as well as a slightly
>> older Xeon 5600 system. The JBODs include a 45x3.5" JBOD, a 28x3.5"
>> JBOD (with either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD
>> with 10kRPM drives. I've tried LSI controllers (9285-8e, 9266-8i,
>> as well as the integrated Intel LSI controllers) as well as Adaptec
>> Series 7 RAID controllers (72405 and 71685).
> Which I/O scheduler are you using? noop (or deadline) may improve
> things with hardware RAID.

I was using cfq, but I gave noop and deadline a try and don't see any
significant difference in my testing. Thanks for the suggestion! I had
not thought to test this yet.

2013-05-13 15:18:26

by David Oostdyk

Subject: Re: high-speed disk I/O is CPU-bound?

On 05/12/13 12:53, Rob Landley wrote:
> On 05/10/2013 09:04:44 AM, David Oostdyk wrote:
>> Hello,
>>
>> I have a few relatively high-end systems with hardware RAIDs which
>> are being used for recording systems, and I'm trying to get a better
>> understanding of contiguous write performance.
> ...
>> The question is, is it possible that high-speed I/O to these hardware
>> RAIDs could
>> actually be CPU-bound above ~1400MB/sec?
> In some setups your processor is calculating CRCs for the data. It's a
> fairly cheap operation, but a cheap operation on gigabytes of data can
> still saturate your memory bus.
>
> Rob

At what level would you say this calculation is being applied? Somewhere
in the block/filesystem layer, or in the device driver, or at the
hardware level? I'm seeing write speeds that are about 1/4 the memory
bandwidth of a single thread, which would suggest at least one
"additional" pass through the data before it gets DMA'd out.


2013-05-16 00:59:23

by Dave Chinner

Subject: Re: high-speed disk I/O is CPU-bound?

[cc xfs list, seeing as that's where all the people who use XFS in
these sorts of configurations hang out. ]

On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
> Hello,
>
> I have a few relatively high-end systems with hardware RAIDs which
> are being used for recording systems, and I'm trying to get a better
> understanding of contiguous write performance.
>
> The hardware that I've tested with includes two high-end Intel
> E5-2600 and E5-4600 (~3GHz) series systems, as well as a slightly
> older Xeon 5600 system. The JBODs include a 45x3.5" JBOD, a 28x3.5"
> JBOD (with either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD
> with 10kRPM drives. I've tried LSI controllers (9285-8e, 9266-8i,
> as well as the integrated Intel LSI controllers) as well as Adaptec
> Series 7 RAID controllers (72405 and 71685).
>
> Normally I'll setup the RAIDs as RAID60 and format them as XFS, but
> the exact RAID level, filesystem type, and even RAID hardware don't
> seem to matter very much from my observations (but I'm willing to
> try any suggestions).

Document them. There are many ways to screw them up and get bad
performance.

> As a basic benchmark, I have an application
> that simply writes the same buffer (say, 128MB) to disk repeatedly.
> Alternatively you could use the "dd" utility. (For these
> benchmarks, I set /proc/sys/vm/dirty_bytes to 512M or lower, since
> these systems have a lot of RAM.)
>
> The basic observations are:
>
> 1. "single-threaded" writes, either a file on the mounted
> filesystem or with a "dd" to the raw RAID device, seem to be limited
> to 1200-1400MB/sec. These numbers vary slightly based on whether
> TurboBoost is affecting the writing process or not. "top" will show
> this process running at 100% CPU.

Expected. You are using buffered IO. Write speed is limited by the
rate at which your user process can memcpy data into the page cache.

> 2. With two benchmarks running on the same device, I see aggregate
> write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
> the drives to be able to deliver. This can either be with two
> applications writing to separate files on the same mounted file
> system, or two separate "dd" applications writing to distinct
> locations on the raw device. (Increasing the number of writers
> beyond two does not seem to increase aggregate performance; "top"
> will show both processes running at perhaps 80% CPU).

Still using buffered IO, which means you are typically limited by
the rate at which the flusher thread can do writeback.

> 3. I haven't been able to find any tricks (lio_listio, multiple
> threads writing to distinct file offsets, etc) that seem to deliver
> higher write speeds when writing to a single file. (This might be
> xfs-specific, though)

How about using direct IO? Single threaded direct IO will be slower
than buffered IO, but throughput should scale linearly with the
number of threads if the IO size is large enough (e.g. 32MB).
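
As a rough sketch of that suggestion, the helper below prints one dd command per writer, each covering its own region of a single file with O_DIRECT and 32MB IOs. The file name, writer count, and helper name are assumptions for illustration; nothing is written until the output is piped to a shell. It relies on GNU dd's oflag=direct, conv=notrunc, and seek= options.

```shell
#!/bin/sh
# Emit one dd command per writer; each writer covers its own region of a
# single file using O_DIRECT and 32MB IOs.
# Args: target file, number of writers, 32MB blocks per writer.
gen_direct_writers() {
    file=$1 writers=$2 blocks=$3
    i=0
    while [ "$i" -lt "$writers" ]; do
        # seek= is in units of bs, so writer i starts at block i*blocks;
        # conv=notrunc keeps dd from truncating what other writers wrote.
        echo "dd if=/dev/zero of=$file oflag=direct conv=notrunc bs=32M count=$blocks seek=$((i * blocks)) &"
        i=$((i + 1))
    done
    echo "wait"
}

gen_direct_writers /XFS_dir/testfile 4 64
# To run for real: gen_direct_writers /XFS_dir/testfile 4 64 | sh
```

With 4 writers of 64 blocks each, this covers 8GB of one file in four non-overlapping 2GB regions. Varying the writer count is a simple way to see whether throughput scales with the number of threads.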

> 4. Cheap tricks like making a software RAID0 of two hardware RAID
> devices does not deliver any improved performance for
> single-threaded writes. (Have not thoroughly tested this
> configuration fully with multiple writers, though.)

Of course not - you are CPU bound and nothing you do to the storage
will change that.

Cheers,

Dave.
--
Dave Chinner

2013-05-16 11:46:01

by Stan Hoeppner

Subject: Re: high-speed disk I/O is CPU-bound?

On 5/15/2013 7:59 PM, Dave Chinner wrote:
> [cc xfs list, seeing as that's where all the people who use XFS in
> these sorts of configurations hang out. ]
>
> On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
>> Hello,
>>
>> I have a few relatively high-end systems with hardware RAIDs which
>> are being used for recording systems, and I'm trying to get a better
>> understanding of contiguous write performance.
>>
>> The hardware that I've tested with includes two high-end Intel
>> E5-2600 and E5-4600 (~3GHz) series systems, as well as a slightly
>> older Xeon 5600 system. The JBODs include a 45x3.5" JBOD, a 28x3.5"
>> JBOD (with either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD
>> with 10kRPM drives. I've tried LSI controllers (9285-8e, 9266-8i,
>> as well as the integrated Intel LSI controllers) as well as Adaptec
>> Series 7 RAID controllers (72405 and 71685).

So, you have something like the following raw aggregate drive b/w,
assuming average outer-inner track 120MB/s streaming write throughput
per drive:

45 drives ~5.4 GB/s
28 drives ~3.4 GB/s
24 drives ~2.8 GB/s

The two LSI HBAs you mention are PCIe 2.0 devices. Note that PCIe 2.0
x8 is limited to ~4GB/s each way. If those 45 drives are connected to
the 9285-8e via all 8 SAS lanes, you are still losing about 1/3rd of the
aggregate drive b/w. If they're connected to the 71685 via 8 lanes and
this HBA is in a PCIe 3.0 slot then you're only losing about 600MB/s.
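
The arithmetic behind these figures can be sketched as follows. The 120 MB/s per-drive rate and the ~4GB/s PCIe 2.0 x8 limit come from the text; the 600 MB/s per 6G SAS lane is an assumption derived from the 2.4GB/s quad-lane figure, and the helper names are made up for illustration.

```shell
#!/bin/sh
# Aggregate streaming bandwidth in MB/s for N drives at a per-drive rate.
aggregate_mbs() { echo $(( $1 * $2 )); }

# Bandwidth lost (MB/s) when the aggregate exceeds a link limit; 0 otherwise.
lost_mbs() {
    if [ "$1" -gt "$2" ]; then echo $(( $1 - $2 )); else echo 0; fi
}

PER_DRIVE=120      # MB/s, the average assumed in the text
PCIE2_X8=4000      # MB/s, approximate PCIe 2.0 x8 limit each way
SAS_8LANE=4800     # MB/s, 8 lanes of 6G SAS at ~600 MB/s each (assumption)

for n in 45 28 24; do
    agg=$(aggregate_mbs "$n" "$PER_DRIVE")
    echo "$n drives: $agg MB/s, over PCIe 2.0 x8 by $(lost_mbs "$agg" "$PCIE2_X8"), over 8 SAS lanes by $(lost_mbs "$agg" "$SAS_8LANE")"
done
```

Only the 45-drive case exceeds either limit: by 1400 MB/s against PCIe 2.0 x8, and by 600 MB/s against eight SAS lanes.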

>> Normally I'll setup the RAIDs as RAID60 and format them as XFS, but
>> the exact RAID level, filesystem type, and even RAID hardware don't
>> seem to matter very much from my observations (but I'm willing to
>> try any suggestions).

Lack of performance variability here tends to suggest your workloads are
all streaming in nature, and/or your application profile isn't taking
full advantage of the software stack and the hardware, i.e. insufficient
parallelism, overlapping IOs, etc. Or, see down below for another
possibility.

These are all current generation HBAs with fast multi-core ASICs and big
write cache. RAID6 parity writes even with high drive counts shouldn't
significantly degrade large streaming write performance. RMW workloads
will still suffer substantially as usual due to rotational latencies.
Fast ASICs can't solve this problem.

> Document them. There's many ways to screw them up and get bad
> performance.

More detailed info always helps.

>> As a basic benchmark, I have an application
>> that simply writes the same buffer (say, 128MB) to disk repeatedly.
>> Alternatively you could use the "dd" utility. (For these
>> benchmarks, I set /proc/sys/vm/dirty_bytes to 512M or lower, since
>> these systems have a lot of RAM.)
>>
>> The basic observations are:
>>
>> 1. "single-threaded" writes, either a file on the mounted
>> filesystem or with a "dd" to the raw RAID device, seem to be limited
>> to 1200-1400MB/sec. These numbers vary slightly based on whether
>> TurboBoost is affecting the writing process or not. "top" will show
>> this process running at 100% CPU.
>
> Expected. You are using buffered IO. Write speed is limited by the
> rate at which your user process can memcpy data into the page cache.
>
>> 2. With two benchmarks running on the same device, I see aggregate
>> write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
>> the drives to be able to deliver. This can either be with two
>> applications writing to separate files on the same mounted file
>> system, or two separate "dd" applications writing to distinct
>> locations on the raw device.

2.4GB/s is the interface limit of quad lane 6G SAS. Coincidence? If
you've daisy chained the SAS expander backplanes within a server chassis
(9266-8i/72405), or between external enclosures (9285-8e/71685), and
have a single 4 lane cable (SFF-8087/8088/8643/8644) connected to your
RAID card, this would fully explain the 2.4GB/s wall, regardless of how
many parallel processes are writing, or any other software factor.

But surely you already know this, and you're using more than one 4 lane
cable. Just covering all the bases here, due to seeing 2.4 GB/s as the
stated wall. This number is just too coincidental to ignore.

>> (Increasing the number of writers
>> beyond two does not seem to increase aggregate performance; "top"
>> will show both processes running at perhaps 80% CPU).

So you're not referring to dd processes when you say "writers beyond
two". Otherwise you'd say "four" or "eight" instead of "both" processes.

> Still using buffered IO, which means you are typically limited by
> the rate at which the flusher thread can do writeback.
>
>> 3. I haven't been able to find any tricks (lio_listio, multiple
>> threads writing to distinct file offsets, etc) that seem to deliver
>> higher write speeds when writing to a single file. (This might be
>> xfs-specific, though)
>
> How about using direct IO? Single threaded direct IO will be slower
> than buffered IO, but throughput should scale linearly with the
> number of threads if the IO size is large enough (e.g. 32MB).

Try this quick/dirty parallel write test using dd with O_DIRECT file
based output using 1MB IOs. It fires up 16 dd processes writing 16
files in parallel, 4GB each. This test should give a fairly accurate
representation of real hardware throughput. Sum the MB/s figures from
all dd processes for the aggregate b/w.

#!/bin/bash
# {1..16} is bash brace expansion, so this needs bash rather than plain sh
for i in {1..16}
do
    dd if=/dev/zero of=/XFS_dir/file.$i oflag=direct bs=1M count=4k &
done
wait

>> 4. Cheap tricks like making a software RAID0 of two hardware RAID
>> devices does not deliver any improved performance for
>> single-threaded writes.

As Dave C points out, you'll never reach peak throughput with single
threaded buffered IO. You'd think it would be easy to hit peak write
speed with a single 7.2k SATA drive using a single write thread. Here's
a salient demonstration of why this may not be the case.

$ dd if=/dev/zero of=/XFS-mount/one-thread bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 17.8513 s, 58.7 MB/s

Now a 4 thread variant of the script mentioned above:

#!/bin/bash
# Four parallel O_DIRECT writers, 512MB each
for i in {1..4}
do
    dd if=/dev/zero of=/XFS-mount/file.$i oflag=direct bs=1M count=512 &
done
wait

$ test.sh
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 20.3012 s, 26.4 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 20.3006 s, 26.4 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 20.3204 s, 26.4 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 20.324 s, 26.4 MB/s

Single thread buffered write: 59 MB/s
Quad thread O_DIRECT write: 105 MB/s
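
Summing the per-process figures by hand gets tedious with 16 writers, so here is a small sketch that totals the trailing rate from dd's summary lines. It assumes every writer reports in MB/s (GNU dd may switch units for very slow or very fast runs); the function name is made up for illustration.

```shell
#!/bin/sh
# Sum the trailing "<rate> MB/s" field of dd's summary lines read on stdin.
# Lines that are not summaries, or that use another unit, are ignored.
sum_dd_mbs() {
    awk '/copied/ && $NF == "MB/s" { total += $(NF - 1) } END { printf "%.1f\n", total }'
}

# The four summary lines from the run above:
sum_dd_mbs <<'EOF'
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 20.3012 s, 26.4 MB/s
536870912 bytes (537 MB) copied, 20.3006 s, 26.4 MB/s
536870912 bytes (537 MB) copied, 20.3204 s, 26.4 MB/s
536870912 bytes (537 MB) copied, 20.324 s, 26.4 MB/s
EOF
```

This prints 105.6, which matches the 105 MB/s aggregate quoted above. Since dd writes its summary to stderr, a live run would need something like `./test.sh 2>&1 | sum_dd_mbs`.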

Again both targeting a single SATA disk. I just ran these tests on a 13
year old machine with dual 550MHz Celeron CPUs and 384MB of PC100 DRAM,
vanilla kernel 3.2.6, deadline elevator. The WD SATA disk is attached
via a $20 USD Silicon Image 3512 SATA 150 32 bit PCI card lacking NCQ
support. The system bus is 33MHz/32 bit PCI only, 132MB/s peak, tested
at 115MB/s net after PCI 2.1 protocol overhead. I keep this system
around for such demonstrations. Note that the SATA card and drive are
10 years newer than the core system, acquired in 2009.

On this machine the single thread buffered IO dd run reaches only some
51% of the net PCI throughput and eats 98% of one of the two 550MHz
CPUs. This is due to a number of factors including, but not limited to,
memcpy as Dave C points out, tiny 128KB L2 cache, no L3, the fact that
this platform performs snooping on the P6 bus, and other inefficiencies
of the 440BX chipset.

Now for the kicker. Quad parallel dd direct IO reaches 92% of net PCI
throughput with each dd process eating only 14% CPU, or 28% of each CPU
total. Its aggregate file write throughput into XFS is some 78% higher
than single thread dd using buffered IO.

>> (Have not thoroughly tested this
>> configuration fully with multiple writers, though.)

You may not see a 78% bump with parallel O_DIRECT, but it should be
substantial nonetheless.

> Of course not - you are CPU bound and nothing you do to the storage
> will change that.

I'd agree 100% with Chinner if not for that pesky coincidental 2.4GB/s
number reported as the "brick wall". A little more info should clear
this up.

--
Stan

2013-05-16 15:35:19

by David Oostdyk

Subject: Re: high-speed disk I/O is CPU-bound?

On 05/16/13 07:36, Stan Hoeppner wrote:
> On 5/15/2013 7:59 PM, Dave Chinner wrote:
>> [cc xfs list, seeing as that's where all the people who use XFS in
>> these sorts of configurations hang out. ]
>>
>> On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
>>> Hello,
>>>
>>> I have a few relatively high-end systems with hardware RAIDs which
>>> are being used for recording systems, and I'm trying to get a better
>>> understanding of contiguous write performance.
>>>
>>> The hardware that I've tested with includes two high-end Intel
>>> E5-2600 and E5-4600 (~3GHz) series systems, as well as a slightly
>>> older Xeon 5600 system. The JBODs include a 45x3.5" JBOD, a 28x3.5"
>>> JBOD (with either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD
>>> with 10kRPM drives. I've tried LSI controllers (9285-8e, 9266-8i,
>>> as well as the integrated Intel LSI controllers) as well as Adaptec
>>> Series 7 RAID controllers (72405 and 71685).
> So, you have something like the following raw aggregate drive b/w,
> assuming average outer-inner track 120MB/s streaming write throughput
> per drive:
>
> 45 drives ~5.4 GB/s
> 28 drives ~3.4 GB/s
> 24 drives ~2.8 GB/s
>
> The two LSI HBAs you mention are PCIe 2.0 devices. Note that PCIe 2.0
> x8 is limited to ~4GB/s each way. If those 45 drives are connected to
> the 9285-8e via all 8 SAS lanes, you are still losing about 1/3rd of the
> aggregate drive b/w. If they're connected to the 71685 via 8 lanes and
> this HBA is in a PCIe 3.0 slot then you're only losing about 600MB/s.
>
>>> Normally I'll setup the RAIDs as RAID60 and format them as XFS, but
>>> the exact RAID level, filesystem type, and even RAID hardware don't
>>> seem to matter very much from my observations (but I'm willing to
>>> try any suggestions).
> Lack of performance variability here tends to suggest your workloads are
> all streaming in nature, and/or your application profile isn't taking
> full advantage of the software stack and the hardware, i.e. insufficient
> parallelism, overlapping IOs, etc. Or, see down below for another
> possibility.
>
> These are all current generation HBAs with fast multi-core ASICs and big
> write cache. RAID6 parity writes even with high drive counts shouldn't
> significantly degrade large streaming write performance. RMW workloads
> will still suffer substantially as usual due to rotational latencies.
> Fast ASICs can't solve this problem.
>
>> Document them. There's many ways to screw them up and get bad
>> performance.
> More detailed info always helps.
>
>>> As a basic benchmark, I have an application
>>> that simply writes the same buffer (say, 128MB) to disk repeatedly.
>>> Alternatively you could use the "dd" utility. (For these
>>> benchmarks, I set /proc/sys/vm/dirty_bytes to 512M or lower, since
>>> these systems have a lot of RAM.)
>>>
>>> The basic observations are:
>>>
>>> 1. "single-threaded" writes, either a file on the mounted
>>> filesystem or with a "dd" to the raw RAID device, seem to be limited
>>> to 1200-1400MB/sec. These numbers vary slightly based on whether
>>> TurboBoost is affecting the writing process or not. "top" will show
>>> this process running at 100% CPU.
>> Expected. You are using buffered IO. Write speed is limited by the
>> rate at which your user process can memcpy data into the page cache.
>>
>>> 2. With two benchmarks running on the same device, I see aggregate
>>> write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
>>> the drives to be able to deliver. This can either be with two
>>> applications writing to separate files on the same mounted file
>>> system, or two separate "dd" applications writing to distinct
>>> locations on the raw device.
> 2.4GB/s is the interface limit of quad lane 6G SAS. Coincidence? If
> you've daisy chained the SAS expander backplanes within a server chassis
> (9266-8i/72405), or between external enclosures (9285-8e/71685), and
> have a single 4 lane cable (SFF-8087/8088/8643/8644) connected to your
> RAID card, this would fully explain the 2.4GB/s wall, regardless of how
> many parallel processes are writing, or any other software factor.
>
> But surely you already know this, and you're using more than one 4 lane
> cable. Just covering all the bases here, due to seeing 2.4 GB/s as the
> stated wall. This number is just too coincidental to ignore.

We definitely have two 4-lane cables being used, but this is an
interesting coincidence. I'd be surprised if anyone could really
achieve the theoretical throughput on one cable, though. We have one
JBOD that only takes a single 4-lane cable, and we seem to cap out at
closer to 1450MB/sec on that unit. (This is just a single point of
reference, and I don't have many tests where only one 4-lane cable was
in use.)

>>> (Increasing the number of writers
>>> beyond two does not seem to increase aggregate performance; "top"
>>> will show both processes running at perhaps 80% CPU).
> So you're not referring to dd processes when you say "writers beyond
> two". Otherwise you'd say "four" or "eight" instead of "both" processes.
>
>> Still using buffered IO, which means you are typically limited by
>> the rate at which the flusher thread can do writeback.
>>
>>> 3. I haven't been able to find any tricks (lio_listio, multiple
>>> threads writing to distinct file offsets, etc) that seem to deliver
>>> higher write speeds when writing to a single file. (This might be
>>> xfs-specific, though)
>> How about using direct IO? Single threaded direct IO will be slower
>> than buffered IO, but throughput should scale linearly with the
>> number of threads if the IO size is large enough (e.g. 32MB).
> Try this quick/dirty parallel write test using dd with O_DIRECT file
> based output using 1MB IOs. It fires up 16 dd processes writing 16
> files in parallel, 4GB each. This test should give a fairly accurate
> representation of real hardware throughput. Sum the MB/s figures from
> all dd processes for the aggregate b/w.
>
> #!/bin/bash
> # {1..16} is bash brace expansion, so this needs bash rather than plain sh
> for i in {1..16}
> do
>     dd if=/dev/zero of=/XFS_dir/file.$i oflag=direct bs=1M count=4k &
> done
> wait
>
>>> 4. Cheap tricks like making a software RAID0 of two hardware RAID
>>> devices does not deliver any improved performance for
>>> single-threaded writes.
> As Dave C points out, you'll never reach peak throughput with single
> threaded buffered IO. You'd think it would be easy to hit peak write
> speed with a single 7.2k SATA drive using a single write thread. Here's
> a salient demonstration of why this may not be the case.
>
> $ dd if=/dev/zero of=/XFS-mount/one-thread bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 17.8513 s, 58.7 MB/s
>
> Now a 4 thread variant of the script mentioned above:
>
> #!/bin/bash
> # Four parallel O_DIRECT writers, 512MB each
> for i in {1..4}
> do
>     dd if=/dev/zero of=/XFS-mount/file.$i oflag=direct bs=1M count=512 &
> done
> wait
>
> $ test.sh
> 512+0 records in
> 512+0 records out
> 536870912 bytes (537 MB) copied, 20.3012 s, 26.4 MB/s
> 512+0 records in
> 512+0 records out
> 536870912 bytes (537 MB) copied, 20.3006 s, 26.4 MB/s
> 512+0 records in
> 512+0 records out
> 536870912 bytes (537 MB) copied, 20.3204 s, 26.4 MB/s
> 512+0 records in
> 512+0 records out
> 536870912 bytes (537 MB) copied, 20.324 s, 26.4 MB/s
>
> Single thread buffered write: 59 MB/s
> Quad thread O_DIRECT write: 105 MB/s
>
> Again both targeting a single SATA disk. I just ran these tests on a 13
> year old machine with dual 550MHz Celeron CPUs and 384MB of PC100 DRAM,
> vanilla kernel 3.2.6, deadline elevator. The WD SATA disk is attached
> via a $20 USD Silicon Image 3512 SATA 150 32 bit PCI card lacking NCQ
> support. The system bus is 33MHz/32 bit PCI only, 132MB/s peak, tested
> at 115MB/s net after PCI 2.1 protocol overhead. I keep this system
> around for such demonstrations. Note that the SATA card and drive are
> 10 years newer than the core system, acquired in 2009.
>
> On this machine the single thread buffered IO dd run reaches only some
> 51% of the net PCI throughput and eats 98% of one of the two 550MHz
> CPUs. This is due to a number of factors including, but not limited to,
> memcpy as Dave C points out, tiny 128KB L2 cache, no L3, the fact that
> this platform performs snooping on the P6 bus, and other inefficiencies
> of the 440BX chipset.
>
> Now for the kicker. Quad parallel dd direct IO reaches 92% of net PCI
> throughput with each dd process eating only 14% CPU, or 28% of each CPU
> total. Its aggregate file write throughput into XFS is some 78% higher
> than single thread dd using buffered IO.
>
>>> (Have not thoroughly tested this
>>> configuration fully with multiple writers, though.)
> You may not see a 78% bump with parallel O_DIRECT, but it should be
> substantial nonetheless.
>
>> Of course not - you are CPU bound and nothing you do to the storage
>> will change that.
> I'd agree 100% with Chinner if not for that pesky coincidental 2.4GB/s
> number reported as the "brick wall". A little more info should clear
> this up.
>

You guys hit the nail on the head! With O_DIRECT I can use a single
writer thread and easily see the same throughput that I _ever_ saw in
the multiple-writer case (~2.4GB/sec), and "top" shows the writer at 10%
CPU usage. I've modified my application to use O_DIRECT and it makes a
world of difference.
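
A rough way to reproduce the buffered versus O_DIRECT comparison with a single writer, using GNU dd instead of a custom application. This is only a sketch: TARGET_DIR and COUNT are placeholder names that do not come from the thread, and the default size is deliberately tiny, so a real measurement would need a multi-GB write on the RAID filesystem itself.

```sh
#!/bin/sh
# Assumed placeholders: point TARGET_DIR at the filesystem under test and
# raise COUNT (number of 1 MiB blocks) well above the default for real runs.
TARGET_DIR=${TARGET_DIR:-$(mktemp -d)}
COUNT=${COUNT:-16}

# Buffered write: data is copied into the page cache first.
# conv=fsync makes dd flush to disk before it reports the rate.
echo "buffered: $(dd if=/dev/zero of="$TARGET_DIR/buffered.bin" \
    bs=1M count="$COUNT" conv=fsync 2>&1 | tail -n 1)"

# Direct write: oflag=direct opens the file with O_DIRECT and bypasses
# the page cache. Some filesystems (tmpfs, for example) reject O_DIRECT.
if out=$(dd if=/dev/zero of="$TARGET_DIR/direct.bin" \
    bs=1M count="$COUNT" oflag=direct 2>&1); then
    echo "direct:   $(echo "$out" | tail -n 1)"
else
    echo "direct:   O_DIRECT not supported on $TARGET_DIR"
fi
```

Watching "top" while each dd runs should show the difference in CPU usage described above. The test files are left in TARGET_DIR and need to be removed by hand.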

[It's interesting that you see performance benefits for O_DIRECT even
with a single SATA drive. The reason it took me so long to test
O_DIRECT in this case is that I never saw any significant benefit from
using it in the past. But that was when I didn't have such fast storage,
so I probably wasn't hitting the bottleneck with buffered I/O?]

So I have two systems, one with an LSI controller and one with an
Adaptec 71685. Each has two 4-lane cables going to 24 and 28 disks
respectively, and both are hitting about 2.4GB/sec. I'm interested
in testing the Adaptec 72405, which is PCIe 3.0 x8 and can connect six
4-lane cables directly to 24 drives. That might shed some light on
whether the 2.4GB/sec "limit" is due to cable throughput, and I will
follow up if that test proves interesting.

Thank you both for the suggestions!

- Dave O.




2013-05-16 22:57:07

by Dave Chinner

Subject: Re: high-speed disk I/O is CPU-bound?

On Thu, May 16, 2013 at 11:35:08AM -0400, David Oostdyk wrote:
> On 05/16/13 07:36, Stan Hoeppner wrote:
> >On 5/15/2013 7:59 PM, Dave Chinner wrote:
> >>[cc xfs list, seeing as that's where all the people who use XFS in
> >>these sorts of configurations hang out. ]
> >>
> >>On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
> >>>As a basic benchmark, I have an application
> >>>that simply writes the same buffer (say, 128MB) to disk repeatedly.
> >>>Alternatively you could use the "dd" utility. (For these
> >>>benchmarks, I set /proc/sys/vm/dirty_bytes to 512M or lower, since
> >>>these systems have a lot of RAM.)
> >>>
> >>>The basic observations are:
> >>>
> >>>1. "single-threaded" writes, either a file on the mounted
> >>>filesystem or with a "dd" to the raw RAID device, seem to be limited
> >>>to 1200-1400MB/sec. These numbers vary slightly based on whether
> >>>TurboBoost is affecting the writing process or not. "top" will show
> >>>this process running at 100% CPU.
> >>Expected. You are using buffered IO. Write speed is limited by the
> >>rate at which your user process can memcpy data into the page cache.
> >>
> >>>2. With two benchmarks running on the same device, I see aggregate
> >>>write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
> >>>the drives of being able to deliver. This can either be with two
> >>>applications writing to separate files on the same mounted file
> >>>system, or two separate "dd" applications writing to distinct
> >>>locations on the raw device.
> >2.4GB/s is the interface limit of quad lane 6G SAS. Coincidence? If
> >you've daisy chained the SAS expander backplanes within a server chassis
> >(9266-8i/72405), or between external enclosures (9285-8e/71685), and
> >have a single 4 lane cable (SFF-8087/8088/8643/8644) connected to your
> >RAID card, this would fully explain the 2.4GB/s wall, regardless of how
> >many parallel processes are writing, or any other software factor.
> >
> >But surely you already know this, and you're using more than one 4 lane
> >cable. Just covering all the bases here, due to seeing 2.4 GB/s as the
> >stated wall. This number is just too coincidental to ignore.
>
> We definitely have two 4-lane cables being used, but this is an
> interesting coincidence. I'd be surprised if anyone could really
> achieve the theoretical throughput on one cable, though. We have
> one JBOD that only takes a single 4-lane cable, and we seem to cap
> out at closer to 1450MB/sec on that unit. (This is just a single
> point of reference, and I don't have many tests where only one
> 4-lane cable was in use.)

You can get pretty close to the theoretical limit on the back end
SAS cables - just like you can with FC.
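
For reference, the 2.4GB/s figure quoted above follows from the 6G SAS line rate, assuming 8b/10b encoding (10 bits on the wire for every 8 bits of data). A minimal sketch of that arithmetic; the function name sas_mb_per_sec is made up for illustration:

```sh
#!/bin/sh
# Usable MB/s for a SAS wide port:
#   lanes * line rate (Gb/s) * 1000 Mb/Gb * 8/10 (8b/10b) / 8 bits per byte
sas_mb_per_sec() {
    lanes=$1
    gbps=$2
    echo $(( lanes * gbps * 1000 * 8 / 10 / 8 ))
}

echo "one 4-lane 6G cable:  $(sas_mb_per_sec 4 6) MB/s"   # 2400
echo "two 4-lane 6G cables: $(sas_mb_per_sec 8 6) MB/s"   # 4800
```

These are link-level ceilings only. Protocol overhead, the expander, and the controller will take a share, so measured numbers land somewhat below them.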

What I'd suggest you do is look at the RAID card configuration -
often they default to active/passive failover configurations when
there are multiple channels to the same storage. Then they only use
one of the cables for all traffic. Some RAID cards offer
active/active or "load balanced" options where all back end paths are
used in redundant configurations rather than just one....

> You guys hit the nail on the head! With O_DIRECT I can use a single
> writer thread and easily see the same throughput that I _ever_ saw
> in the multiple-writer case (~2.4GB/sec), and "top" shows the writer
> at 10% CPU usage. I've modified my application to use O_DIRECT and
> it makes a world of difference.

Be aware that O_DIRECT is not a magic bullet. It can make your IO
go a lot slower on some workloads and storage configs....

> [It's interesting that you see performance benefits for O_DIRECT
> even with a single SATA drive. The reason it took me so long to
> test O_DIRECT in this case, is that I never saw any significant
> benefit from using it in the past. But that is when I didn't have
> such fast storage, so I probably wasn't hitting the bottleneck with
> buffered I/O?]

Right - for applications not designed to use direct IO from the
ground up, this is typically the case - buffered IO is faster right
up to the point where you run out of CPU....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-05-17 11:56:20

by Stan Hoeppner

Subject: Re: high-speed disk I/O is CPU-bound?

On 5/16/2013 5:56 PM, Dave Chinner wrote:
> On Thu, May 16, 2013 at 11:35:08AM -0400, David Oostdyk wrote:
>> On 05/16/13 07:36, Stan Hoeppner wrote:
>>> On 5/15/2013 7:59 PM, Dave Chinner wrote:
>>>> [cc xfs list, seeing as that's where all the people who use XFS in
>>>> these sorts of configurations hang out. ]
>>>>
>>>> On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
>>>>> As a basic benchmark, I have an application
>>>>> that simply writes the same buffer (say, 128MB) to disk repeatedly.
>>>>> Alternatively you could use the "dd" utility. (For these
>>>>> benchmarks, I set /proc/sys/vm/dirty_bytes to 512M or lower, since
>>>>> these systems have a lot of RAM.)
>>>>>
>>>>> The basic observations are:
>>>>>
>>>>> 1. "single-threaded" writes, either a file on the mounted
>>>>> filesystem or with a "dd" to the raw RAID device, seem to be limited
>>>>> to 1200-1400MB/sec. These numbers vary slightly based on whether
>>>>> TurboBoost is affecting the writing process or not. "top" will show
>>>>> this process running at 100% CPU.
>>>> Expected. You are using buffered IO. Write speed is limited by the
>>>> rate at which your user process can memcpy data into the page cache.
>>>>
>>>>> 2. With two benchmarks running on the same device, I see aggregate
>>>>> write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
>>>>> the drives of being able to deliver. This can either be with two
>>>>> applications writing to separate files on the same mounted file
>>>>> system, or two separate "dd" applications writing to distinct
>>>>> locations on the raw device.
>>> 2.4GB/s is the interface limit of quad lane 6G SAS. Coincidence? If
>>> you've daisy chained the SAS expander backplanes within a server chassis
>>> (9266-8i/72405), or between external enclosures (9285-8e/71685), and
>>> have a single 4 lane cable (SFF-8087/8088/8643/8644) connected to your
>>> RAID card, this would fully explain the 2.4GB/s wall, regardless of how
>>> many parallel processes are writing, or any other software factor.
>>>
>>> But surely you already know this, and you're using more than one 4 lane
>>> cable. Just covering all the bases here, due to seeing 2.4 GB/s as the
>>> stated wall. This number is just too coincidental to ignore.
>>
>> We definitely have two 4-lane cables being used, but this is an
>> interesting coincidence. I'd be surprised if anyone could really
>> achieve the theoretical throughput on one cable, though. We have
>> one JBOD that only takes a single 4-lane cable, and we seem to cap
>> out at closer to 1450MB/sec on that unit. (This is just a single
>> point of reference, and I don't have many tests where only one
>> 4-lane cable was in use.)
>
> You can get pretty close to the theoretical limit on the back end
> SAS cables - just like you can with FC.

Yep.

> What I'd suggest you do is look at the RAID card configuration -
> often they default to active/passive failover configurations when
> there are multiple channels to the same storage. Then they only use
> one of the cables for all traffic. Some RAID cards offer
> active/active or "load balanced" options where all back end paths are
> used in redundant configurations rather than just one....

Also read the docs for your JBOD chassis. Some have a single expander
module with 2 host ports while some have two such expanders for
redundancy and have 4 total host ports. The latter requires dual ported
drives. In this config you'd use one host port on each expander and
configure the RAID HBA for multipathing. (It may be possible to use all
4 host ports in this setup but this requires a RAID HBA with 4 external
4 lane connectors. I'm not aware of any at this time, only two port
models. So you'd have to use two non-RAID HBAs each with two 4 lane
ports, SCSI multipath, and Linux md/RAID.)

Most JBODs that use the LSI 2x36 expander ASIC will give you full b/w
over two host ports in a single expander single chassis config. Other
JBODs may direct wire one of the two host ports to the expansion port so
you may only get full 8 lane host bandwidth with an expansion unit
attached. There are likely other configurations I'm not aware of.

>> You guys hit the nail on the head! With O_DIRECT I can use a single
>> writer thread and easily see the same throughput that I _ever_ saw
>> in the multiple-writer case (~2.4GB/sec), and "top" shows the writer
>> at 10% CPU usage. I've modified my application to use O_DIRECT and
>> it makes a world of difference.
>
> Be aware that O_DIRECT is not a magic bullet. It can make your IO
> go a lot slower on some workloads and storage configs....
>
>> [It's interesting that you see performance benefits for O_DIRECT
>> even with a single SATA drive.

The single SATA drive has little to do with it actually. It's the
limited CPU/RAM bus b/w of the box. The reason O_DIRECT shows a 78%
improvement in disk throughput is a direct result of dramatically
decreased memory pressure, allowing full speed DMA from RAM to the HBA
over the PCI bus. The pressure caused by the mem-mem copying of
buffered IO causes every read in the CPU to be a cache miss, further
exacerbating the load on the CPU/RAM buses. All the memory reads cause
extra CPU bus snooping to update the L2s. The constant cache misses and
resulting waits on memory reads are what drive the CPU to 98% utilization.
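
For anyone wanting to check the numbers, here is a small sketch of how the aggregate and the 78% figure fall out of the dd summary lines quoted earlier in the thread. The variable names are only illustrative. The 115MB/s net PCI figure and the rounded 59 and 105 MB/s values are the ones stated in the test description.

```sh
#!/bin/sh
LC_ALL=C; export LC_ALL   # keep "." as the decimal separator in awk

# dd summary lines from the four parallel O_DIRECT writers in the thread.
dd_output='536870912 bytes (537 MB) copied, 20.3012 s, 26.4 MB/s
536870912 bytes (537 MB) copied, 20.3006 s, 26.4 MB/s
536870912 bytes (537 MB) copied, 20.3204 s, 26.4 MB/s
536870912 bytes (537 MB) copied, 20.324 s, 26.4 MB/s'

# Sum the rate field of every line whose last field is MB/s.
aggregate=$(printf '%s\n' "$dd_output" |
    awk '$NF == "MB/s" { sum += $(NF-1) } END { printf "%.1f", sum }')

# Share of the 115MB/s net PCI throughput reached by the parallel run.
pci_share=$(awk -v a="$aggregate" -v pci=115 \
    'BEGIN { printf "%.0f", a / pci * 100 }')

# Gain of O_DIRECT (105 MB/s) over buffered (59 MB/s), rounded as quoted.
gain=$(awk -v d=105 -v b=59 'BEGIN { printf "%.0f", (d / b - 1) * 100 }')

echo "aggregate O_DIRECT: $aggregate MB/s"   # 105.6
echo "share of net PCI:   ${pci_share}%"     # 92
echo "gain over buffered: ${gain}%"          # 78
```

The same awk one-liner can be pointed at the stderr of a real parallel dd run, provided every dd reports in MB/s. Newer dd versions switch units (GB/s) at higher rates, which this sketch does not handle.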

>> The reason it took me so long to
>> test O_DIRECT in this case, is that I never saw any significant
>> benefit from using it in the past. But that is when I didn't have
>> such fast storage, so I probably wasn't hitting the bottleneck with
>> buffered I/O?]
>
> Right - for applications not designed to use direct IO from the
> ground up, this is typically the case - buffered IO is faster right
> up to the point where you run out of CPU....

Or memory bandwidth, which in turn runs you out of CPU.

--
Stan