2004-10-26 01:19:04

by Jesse Barnes

Subject: Buffered write slowness

I've been doing some simple disk I/O benchmarking with an eye towards
improving large, striped volume bandwidth. I ran some tests on individual
disks and filesystems to establish a baseline and found that things generally
scale quite well:

o one thread/disk using O_DIRECT on the block device
read avg: 2784.81 MB/s
write avg: 2585.60 MB/s

o one thread/disk using O_DIRECT + filesystem
read avg: 2635.98 MB/s
write avg: 2573.39 MB/s

o one thread/disk using buffered I/O + filesystem
read w/default (128) block/*/queue/read_ahead_kb avg: 2626.25 MB/s
read w/max (4096) block/*/queue/read_ahead_kb avg: 2652.62 MB/s
write avg: 1394.99 MB/s

Configuration:
o 8p sn2 ia64 box
o 8GB memory
o 58 disks across 16 controllers
(4 disks for 10 of them and 3 for the other 6)
o aggregate I/O bw available is about 2.8GB/s

Test:
o one I/O thread per disk, round robined across the 8 CPUs
o each thread did ~450MB of I/O depending on the test (ran for 10s)
Note: the total was > 8GB so in the buffered read case not everything
could be cached
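
Each worker boiled down to roughly the following (a simplified sketch of the
write case; the device path, request size, and error handling are placeholders
rather than the exact harness, and the CPU round-robining isn't shown):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define REQ_SIZE (4 * 1024 * 1024)      /* request size, placeholder */
#define RUN_SECS 10                     /* each test ran for ~10s */

/* One worker: stream large aligned O_DIRECT writes to one disk for
 * RUN_SECS.  The real harness also timed the I/O and round-robined
 * the threads across CPUs; none of that is shown here. */
static void io_worker(const char *dev)
{
        void *buf = NULL;
        int fd = open(dev, O_WRONLY | O_DIRECT);
        time_t end = time(NULL) + RUN_SECS;

        if (fd < 0)
                return;
        /* O_DIRECT needs sector-aligned buffers; page alignment is safe */
        if (posix_memalign(&buf, 4096, REQ_SIZE)) {
                close(fd);
                return;
        }
        memset(buf, 0, REQ_SIZE);

        while (time(NULL) < end)
                if (write(fd, buf, REQ_SIZE) != REQ_SIZE)
                        break;

        free(buf);
        close(fd);
}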

As you can see, for a test that does one thread/disk things look really good
(very close to the available bandwidth in the system) with the exception of
buffered writes. I've attached the vmstat and profile from that run in case
anyone's interested. It seems that there was some spinlock contention in
that run that wasn't present in other runs.

Preliminary runs on a large volume showed that a single thread reading from a
striped volume w/O_DIRECT performed poorly, while a single thread writing to
a volume the same way was able to get slightly over 1GB/s. Using multiple
read threads against the volume increased the bandwidth to near 1GB/s, but
multiple threads writing slightly slowed performance. My tests and the
system configuration have changed slightly though, so don't put much stock in
these numbers until I rerun them (and collect profiles and such).

Thanks,
Jesse

P.S. The 'dev-fs' in the filenames doesn't mean I was using devfs (I wasn't,
not that it should matter), just that I was running per-dev tests with a
filesystem. :)


Attachments:
profile-buffered-write-dev-fs.txt (1.67 kB)
vmstat-buffered-write-dev-fs.txt (6.72 kB)

2004-10-29 17:48:29

by Jesse Barnes

Subject: Re: Buffered I/O slowness

On Monday, October 25, 2004 6:14 pm, Jesse Barnes wrote:
> I've been doing some simple disk I/O benchmarking with an eye towards
> improving large, striped volume bandwidth. I ran some tests on individual
> disks and filesystems to establish a baseline and found that things
> generally scale quite well:
>
> o one thread/disk using O_DIRECT on the block device
> read avg: 2784.81 MB/s
> write avg: 2585.60 MB/s
>
> o one thread/disk using O_DIRECT + filesystem
> read avg: 2635.98 MB/s
> write avg: 2573.39 MB/s
>
> o one thread/disk using buffered I/O + filesystem
> read w/default (128) block/*/queue/read_ahead_kb avg: 2626.25 MB/s
> read w/max (4096) block/*/queue/read_ahead_kb avg: 2652.62 MB/s
> write avg: 1394.99 MB/s
>
> Configuration:
> o 8p sn2 ia64 box
> o 8GB memory
> o 58 disks across 16 controllers
> (4 disks for 10 of them and 3 for the other 6)
> o aggregate I/O bw available is about 2.8GB/s
>
> Test:
> o one I/O thread per disk, round robined across the 8 CPUs
> o each thread did ~450MB of I/O depending on the test (ran for 10s)
> Note: the total was > 8GB so in the buffered read case not everything
> could be cached

More results here. I've run some tests on a large dm striped volume formatted
with XFS. It had 64 disks with a 64k stripe unit (XFS was made aware of this
at format time), and I explicitly set the readahead using blockdev to 524288
blocks. The results aren't as bad as my previous runs, but they're still much
slower than I think they ought to be, given the direct I/O results above.
This is after a fresh mount, so the pagecache was empty when I started the
tests.
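
For reference, blockdev --setra boils down to the BLKRASET ioctl, which takes
the readahead size in 512-byte sectors (so 524288 works out to 256MB).  A
minimal sketch of the equivalent call, with the device path as a placeholder:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

/* Set a block device's readahead, like `blockdev --setra N dev`.
 * BLKRASET takes the value in 512-byte sectors, so 524288 = 256MB.
 * The device path passed in is a placeholder, e.g. the dm volume node. */
static int set_readahead(const char *dev, unsigned long sectors)
{
        int fd = open(dev, O_RDONLY);
        int ret;

        if (fd < 0)
                return -1;
        ret = ioctl(fd, BLKRASET, sectors);
        close(fd);
        return ret;
}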

o one thread on one large volume using buffered I/O + filesystem
read (1 thread, one volume, 131072 blocks/request) avg: ~931 MB/s
write (1 thread, one volume, 131072 blocks/request) avg: ~908 MB/s

I'm intentionally issuing very large reads and writes here to take advantage
of the striping, but it looks like both the readahead and regular buffered
I/O code will split the I/O into page sized chunks? The call chain is pretty
long, but it looks to me like do_generic_mapping_read() will split the reads
up by page and issue them independently to the lower levels. In the direct
I/O case, up to 64 pages are issued at a time, which seems like it would help
throughput quite a bit. The profile seems to confirm this. Unfortunately I
didn't save the vmstat output for this run (and now the fc switch is
misbehaving so I have to fix that before I run again), but iirc the system
time was pretty high given that only one thread was issuing I/O.

So maybe a few things need to be done:
o set readahead to larger values by default for dm volumes at setup time
(the default was very small)
o maybe bypass readahead for very large requests?
if the process is doing a huge request, chances are that readahead won't
benefit it as much as a process doing small requests
o not sure about writes yet, I haven't looked at that call chain much yet

Does any of this sound reasonable at all? What else could be done to make the
buffered I/O layer friendlier to large requests?

Thanks,
Jesse


Attachments:
vol-buffered-read-profile.txt (1.67 kB)

2004-10-29 23:13:10

by Andrew Morton

Subject: Re: Buffered I/O slowness

Jesse Barnes <[email protected]> wrote:
>
> ...
> o one thread on one large volume using buffered I/O + filesystem
> read (1 thread, one volume, 131072 blocks/request) avg: ~931 MB/s
> write (1 thread, one volume, 131072 blocks/request) avg: ~908 MB/s
>
> I'm intentionally issuing very large reads and writes here to take advantage
> of the striping, but it looks like both the readahead and regular buffered
> I/O code will split the I/O into page sized chunks?

No, the readahead code will assemble single BIOs up to the size of the
readahead window. So the single-page-reads in do_generic_mapping_read()
should never happen, because the pages are in cache from the readahead.

> The call chain is pretty
> long, but it looks to me like do_generic_mapping_read() will split the reads
> up by page and issue them independently to the lower levels. In the direct
> I/O case, up to 64 pages are issued at a time, which seems like it would help
> throughput quite a bit. The profile seems to confirm this. Unfortunately I
> didn't save the vmstat output for this run (and now the fc switch is
> misbehaving so I have to fix that before I run again), but iirc the system
> time was pretty high given that only one thread was issuing I/O.
>
> So maybe a few things need to be done:
> o set readahead to larger values by default for dm volumes at setup time
> (the default was very small)

Well possibly. dm has control of queue->backing_dev_info and is free to
tune the queue's default readahead.

> o maybe bypass readahead for very large requests?
> if the process is doing a huge request, chances are that readahead won't
> benefit it as much as a process doing small requests

Maybe - but bear in mind that this is all pinned memory when the I/O is in
flight, so some upper bound has to remain.

> o not sure about writes yet, I haven't looked at that call chain much yet
>
> Does any of this sound reasonable at all? What else could be done to make the
> buffered I/O layer friendlier to large requests?

I'm not sure that we know what's going on yet. I certainly don't. The
above numbers look good, so what's the problem???

Suggest you get geared up to monitor the BIOs going into submit_bio().
Look at their bi_sector and bi_size. Make sure that buffered I/O is doing
the right thing.

2004-10-30 00:22:41

by Jesse Barnes

Subject: Re: Buffered I/O slowness

On Friday, October 29, 2004 4:08 pm, Andrew Morton wrote:
> > I'm intentionally issuing very large reads and writes here to take
> > advantage of the striping, but it looks like both the readahead and
> > regular buffered I/O code will split the I/O into page sized chunks?
>
> No, the readahead code will assemble single BIOs up to the size of the
> readahead window. So the single-page-reads in do_generic_mapping_read()
> should never happen, because the pages are in cache from the readahead.

Yeah, I realized that after I sent the message. The readahead looks like it
might be ok.

> > So maybe a few things need to be done:
> > o set readahead to larger values by default for dm volumes at setup
> > time (the default was very small)
>
> Well possibly. dm has control of queue->backing_dev_info and is free to
> tune the queue's default readahead.

Yep, I'll give that a try and see if I can come up with a reasonable default
(something more in line with the stripe unit seems like a start).
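
Roughly what I have in mind, as a hand-waving sketch rather than a real dm
patch (stripe_width_bytes is a made-up parameter for illustration):

#include <linux/blkdev.h>
#include <linux/pagemap.h>

/* Hand-waving sketch, not a real dm patch: when the mapped device's queue
 * is set up, bump the default readahead so it covers at least one full
 * stripe.  stripe_width_bytes is a made-up parameter for illustration. */
static void dm_bump_default_readahead(struct request_queue *q,
                                      unsigned long stripe_width_bytes)
{
        unsigned long ra_pages = stripe_width_bytes >> PAGE_CACHE_SHIFT;

        if (q->backing_dev_info.ra_pages < ra_pages)
                q->backing_dev_info.ra_pages = ra_pages;
}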

> > o maybe bypass readahead for very large requests?
> > if the process is doing a huge request, chances are that readahead
> > won't benefit it as much as a process doing small requests
>
> Maybe - but bear in mind that this is all pinned memory when the I/O is in
> flight, so some upper bound has to remain.

Right, for the direct I/O case, it looks like things are limited to 64 pages
at a time.

>
> > o not sure about writes yet, I haven't looked at that call chain much
> > yet
> >
> > Does any of this sound reasonable at all? What else could be done to
> > make the buffered I/O layer friendlier to large requests?
>
> I'm not sure that we know what's going on yet. I certainly don't. The
> above numbers look good, so what's the problem???

The numbers are ~1/3 of what the machine is capable of with direct I/O. That
seems much lower than it should be to me. Cache-cold reads into the page
cache seem like they should be nearly as fast as direct reads (at least on a
CPU where the extra data copying overhead isn't getting in the way).

> Suggest you get geared up to monitor the BIOs going into submit_bio().
> Look at their bi_sector and bi_size. Make sure that buffered I/O is doing
> the right thing.

Ok, I'll give that a try.
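Probably something like a throwaway helper called at the top of submit_bio();
a rough sketch (2.6 field names, not a tested patch):

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/kernel.h>

/* Rough sketch, not a tested patch: a helper to call at the top of
 * submit_bio() (drivers/block/ll_rw_blk.c in 2.6) that dumps each bio's
 * starting sector, size and segment count.  On a box pushing this much
 * I/O it would need rate limiting or per-device filtering to stay usable. */
static void trace_bio(int rw, struct bio *bio)
{
        printk(KERN_DEBUG "submit_bio: %s sector %llu bytes %u vecs %u\n",
               (rw & WRITE) ? "W" : "R",
               (unsigned long long)bio->bi_sector,
               bio->bi_size, (unsigned int)bio->bi_vcnt);
}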

Thanks,
Jesse

2004-10-30 00:33:13

by Andrew Morton

Subject: Re: Buffered I/O slowness

Jesse Barnes <[email protected]> wrote:
>
> > I'm not sure that we know what's going on yet. I certainly don't. The
> > above numbers look good, so what's the problem???
>
> The numbers are ~1/3 of what the machine is capable of with direct I/O.

Are there CPU cycles to spare? If you have just one CPU copying 1GB/sec
out of pagecache, maybe it is pegged?

2004-11-01 18:47:13

by Jesse Barnes

Subject: Re: Buffered I/O slowness

On Friday, October 29, 2004 5:30 pm, Andrew Morton wrote:
> Jesse Barnes <[email protected]> wrote:
> > > I'm not sure that we know what's going on yet. I certainly don't. The
> > > above numbers look good, so what's the problem???
> >
> > The numbers are ~1/3 of what the machine is capable of with direct I/O.
>
> Are there CPU cycles to spare? If you have just one CPU copying 1GB/sec
> out of pagecache, maybe it is pegged?

Hm, I thought I had more CPU to spare, but when I set the readahead to a large
value, I'm taking ~100% of the CPU time on the CPU doing the read. ~98% of
that is system time. When I run 8 copies (this is an 8 CPU system), I get
~4GB/s and all the CPUs are near fully busy. I guess things aren't as bad as
I initially thought.

Thanks,
Jesse

2004-11-01 19:05:50

by Jesse Barnes

Subject: Re: Buffered I/O slowness

On Monday, November 1, 2004 10:26 am, Jesse Barnes wrote:
> On Friday, October 29, 2004 5:30 pm, Andrew Morton wrote:
> > Jesse Barnes <[email protected]> wrote:
> > > > I'm not sure that we know what's going on yet. I certainly don't.
> > > > The above numbers look good, so what's the problem???
> > >
> > > The numbers are ~1/3 of what the machine is capable of with direct I/O.
> >
> > Are there CPU cycles to spare? If you have just one CPU copying 1GB/sec
> > out of pagecache, maybe it is pegged?
>
> Hm, I thought I had more CPU to spare, but when I set the readahead to a
> large value, I'm taking ~100% of the CPU time on the CPU doing the read.
> ~98% of that is system time. When I run 8 copies (this is an 8 CPU
> system), I get ~4GB/s and all the CPUs are near fully busy. I guess things
> aren't as bad as I initially thought.

OTOH, if I run 8 copies against 8 separate files (the test above was 8 I/O
threads on the same file), I'm seeing ~16% CPU for each CPU in the machine
and only about 700 MB/s of I/O throughput, so this case *does* look like a
problem. Here's the profile (this is 2.6.10-rc1-mm2).

Jesse

mgr Aggregate throughput: 6241.204239 MB in 10.183594s; 612.868541 MB/s
116885 total 0.0162
50577 ia64_pal_call_static 263.4219
42784 default_idle 95.5000
6148 ia64_save_scratch_fpregs 96.0625
5908 ia64_load_scratch_fpregs 92.3125
4738 __copy_user 2.0008
2079 _spin_unlock_irq 12.9938
926 _spin_unlock_irqrestore 4.8229
374 sn_dma_flush 0.2997
192 generic_make_request 0.1250
177 clone_endio 0.2634
149 _read_unlock_irq 0.9313
135 dm_table_unplug_all 0.4688
128 buffered_rmqueue 0.0597
122 mptscsih_io_done 0.0428
117 clear_page 0.7312
96 __end_that_request_first 0.0811
94 _spin_lock_irqsave 0.2670
92 mempool_alloc 0.0927
88 handle_IRQ_event 0.3056
80 _write_unlock_irq 0.3571
80 mpage_end_io_read 0.1471
61 kmem_cache_alloc 0.2383
59 xfs_iomap 0.0181
59 xfs_bmapi 0.0038
59 do_mpage_readpage 0.0249
55 dm_table_any_congested 0.1719
53 pcibr_dma_unmap 0.3312
51 scsi_io_completion 0.0228
47 kmem_cache_free 0.1224