2002-01-30 22:57:00

by Jeffrey W. Baker

Subject: 2.4.17: pwrite destroys block I/O throughput

Hi there,

I've never heard of pwrite and pread before, but htdig apparently makes
very heavy use of it. On my 2.4.17 SMP 2GB HIGHMEM + ext3 system,
running htdig destroys input rates for every other process. htdig's
output proceeds at approximately 2MB/s, but input for the entire system
runs only at about 4KB/s (YES, KB!). If I suspend htdig, system block
input increases to the normal rate of 10-50MB/s. Output still works, as
a dd from /dev/zero to a 400MB file runs at about 25MB/s.

Is linux's pwrite() just horribly broken? Is htdig the only program
that uses it?

Here's a little snapshot of htdig's syscalls, strace -s 0:

pwrite(6, ""..., 8192, 20717568) = 8192
pread(6, ""..., 8192, 138395648) = 8192
pwrite(6, ""..., 8192, 127918080) = 8192
pread(6, ""..., 8192, 179281920) = 8192
pwrite(6, ""..., 8192, 79732736) = 8192
pread(6, ""..., 8192, 137633792) = 8192
pwrite(6, ""..., 8192, 28409856) = 8192
pread(6, ""..., 8192, 157958144) = 8192
pwrite(6, ""..., 8192, 96583680) = 8192
pread(6, ""..., 8192, 141254656) = 8192
pwrite(6, ""..., 8192, 131031040) = 8192
pread(6, ""..., 8192, 19095552) = 8192
pwrite(6, ""..., 8192, 82698240) = 8192
pread(6, ""..., 8192, 170573824) = 8192
pwrite(6, ""..., 8192, 152616960) = 8192
pread(6, ""..., 8192, 207298560) = 8192
pwrite(6, ""..., 8192, 148635648) = 8192
pread(6, ""..., 8192, 202768384) = 8192
pwrite(6, ""..., 8192, 174055424) = 8192

It's seeking all over the place. Maybe pwrite/pread bypass the elevator
and proper I/O scheduling.
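
For anyone who wants to reproduce this pattern without htdig, here is a
minimal sketch in C. It assumes only what the strace shows: alternating
8192-byte pwrite/pread calls at scattered, 8192-aligned offsets in one file.
The function name, the use of rand(), and the file size are made up for
illustration, and it does not open the file O_SYNC or touch db2 at all.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { PAGE = 8192 };   /* transfer size seen in the strace output */

/*
 * Issue `rounds` pwrite/pread pairs at pseudo-random PAGE-aligned offsets
 * within the first `npages` pages of fd. The offsets are an assumption
 * (plain rand()), not htdig's real access pattern.
 * Returns 0 on success, -1 on error.
 */
int scatter_io(int fd, long npages, int rounds, unsigned seed)
{
    char buf[PAGE];

    if (npages <= 0)
        return -1;
    memset(buf, 'x', sizeof buf);

    /* Size the file up front so every pread lands inside it. */
    if (ftruncate(fd, (off_t)npages * PAGE) != 0)
        return -1;

    srand(seed);
    for (int i = 0; i < rounds; i++) {
        off_t woff = (off_t)(rand() % npages) * PAGE;
        off_t roff = (off_t)(rand() % npages) * PAGE;

        if (pwrite(fd, buf, PAGE, woff) != PAGE)
            return -1;
        if (pread(fd, buf, PAGE, roff) != PAGE)
            return -1;
    }
    return 0;
}
```

Calling scatter_io(fd, 25600, 100000, 1) on a freshly opened file would
cover about 200MB, but to put real pressure on the disk the file has to be
larger than what the page cache will absorb.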

-jwb


2002-01-30 23:25:50

by Andreas Dilger

Subject: Re: 2.4.17: pwrite destroys block I/O throughput

On Jan 30, 2002 14:55 -0800, Jeffrey W. Baker wrote:
> I've never heard of pwrite and pread before, but htdig apparently makes
> very heavy use of it.

Me neither, but always something new to learn. The man page says they
are "read/write without changing the offset of the file descriptor".

> On my 2.4.17 SMP 2GB HIGHMEM + ext3 system, running htdig destroys
> input rates for every other process.
>
> Is linux's pwrite() just horribly broken? Is htdig the only program
> that uses it?

Well, the sys_read() and sys_pread() code is functionally identical,
with the exception that the latter uses the passed-in offset instead
of file->f_pos. The same is true for sys_write() and sys_pwrite().
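
Seen from userspace, that difference looks roughly like the sketch below.
It only uses standard POSIX calls; the helper names and the scratch file
are made up for illustration, and this says nothing about how the kernel
implements either path.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len bytes at offset off with pread() and report whether the
 * descriptor's own offset (the one plain read() would use) stayed put.
 * Returns 1 if unchanged, 0 if it moved, -1 on error.
 */
int pread_keeps_offset(int fd, void *buf, size_t len, off_t off)
{
    off_t before = lseek(fd, 0, SEEK_CUR);
    if (before == (off_t)-1)
        return -1;
    if (pread(fd, buf, len, off) < 0)
        return -1;
    off_t after = lseek(fd, 0, SEEK_CUR);
    if (after == (off_t)-1)
        return -1;
    return before == after;
}

/* Self-check on a scratch file; returns 1 when it behaves as described. */
int demo_pread(void)
{
    char path[] = "/tmp/pread-demo-XXXXXX";
    char buf[4] = {0};
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives until close() */

    if (write(fd, "0123456789", 10) != 10) {
        close(fd);
        return -1;
    }
    /* The descriptor offset is now 10; reading at offset 4 must not move it. */
    int kept = pread_keeps_offset(fd, buf, sizeof buf, 4);
    int ok = (kept == 1) && memcmp(buf, "4567", 4) == 0;

    close(fd);
    return ok;
}
```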

> Here's a little snapshot of htdig's syscalls, strace -s 0:
>
> pwrite(6, ""..., 8192, 20717568) = 8192
> pread(6, ""..., 8192, 138395648) = 8192
> pwrite(6, ""..., 8192, 127918080) = 8192
> pread(6, ""..., 8192, 179281920) = 8192
> pwrite(6, ""..., 8192, 79732736) = 8192
> pread(6, ""..., 8192, 137633792) = 8192
> pwrite(6, ""..., 8192, 28409856) = 8192
> pread(6, ""..., 8192, 157958144) = 8192
> pwrite(6, ""..., 8192, 96583680) = 8192
> pread(6, ""..., 8192, 141254656) = 8192
> pwrite(6, ""..., 8192, 131031040) = 8192
> pread(6, ""..., 8192, 19095552) = 8192
> pwrite(6, ""..., 8192, 82698240) = 8192
> pread(6, ""..., 8192, 170573824) = 8192
> pwrite(6, ""..., 8192, 152616960) = 8192
> pread(6, ""..., 8192, 207298560) = 8192
> pwrite(6, ""..., 8192, 148635648) = 8192
> pread(6, ""..., 8192, 202768384) = 8192
> pwrite(6, ""..., 8192, 174055424) = 8192
>
> It's seeking all over the place. Maybe pwrite/pread bypass the elevator
> and proper I/O scheduling.

No, it appears that htdig is just a stupidly coded program, randomly
reading and writing all over this huge file. It is basically impossible
for the kernel to deal with this intelligently (i.e. readahead doesn't
work at all), and given that the file is 200MB in size it is probably
thrashing the disk to death with seeks.

Granted, you _should_ get some benefit from write buffering, but it is
possible that the file is opened in O_SYNC mode, which will kill you.
Maybe you can fix that a bit by using the ext3 "data=journal" mode
for that filesystem, if it is really important to you, because that
will make all of the sync writes one nice contiguous write (avoiding
most seeks), at the expense of halving your write performance.
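
A program can check whether one of its own descriptors was opened O_SYNC
with fcntl(). This is a minimal sketch; is_osync() is a made-up helper
name, and it only works from inside the process that owns the descriptor,
so it is no substitute for looking at the open() calls in strace output.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if fd was opened with O_SYNC, 0 if not, -1 on error. */
int is_osync(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    /*
     * On current Linux/glibc O_SYNC is a multi-bit mask that includes
     * O_DSYNC, so compare against the whole mask instead of testing
     * for a non-zero result.
     */
    return (flags & O_SYNC) == O_SYNC;
}
```

A descriptor from open(path, O_WRONLY | O_SYNC) should report 1, and one
opened without the flag should report 0.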

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-01-30 23:35:24

by Andrew Morton

Subject: Re: 2.4.17: pwrite destroys block I/O throughput

"Jeffrey W. Baker" wrote:
>
> Hi there,
>
> I've never heard of pwrite and pread before, but htdig apparently makes
> very heavy use of it.

pwrite() is nice. There's nothing special about it from a kernel
point of view. It's equivalent to lseek+write to lower layers.
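
Spelled out in C, the lseek+write equivalent looks roughly like this. It
is a sketch: pwrite_by_hand() is a made-up name, and unlike the real
pwrite() it is not atomic with respect to other users of the same
descriptor and has to put the file offset back itself.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* What pwrite() amounts to, done by hand with lseek() + write(). */
ssize_t pwrite_by_hand(int fd, const void *buf, size_t len, off_t off)
{
    off_t saved = lseek(fd, 0, SEEK_CUR);
    if (saved == (off_t)-1)
        return -1;
    if (lseek(fd, off, SEEK_SET) == (off_t)-1)
        return -1;
    ssize_t n = write(fd, buf, len);
    /* Restore the offset so the caller sees the same state pwrite() leaves. */
    if (lseek(fd, saved, SEEK_SET) == (off_t)-1)
        return -1;
    return n;
}

/* Write the same data both ways into two scratch files and compare them. */
int demo_pwrite(void)
{
    char pa[] = "/tmp/pw-a-XXXXXX", pb[] = "/tmp/pw-b-XXXXXX";
    char ra[16], rb[16];
    int a, b, ok;

    a = mkstemp(pa);
    if (a < 0)
        return -1;
    unlink(pa);
    b = mkstemp(pb);
    if (b < 0) {
        close(a);
        return -1;
    }
    unlink(pb);

    /* 5 bytes at offset 8 leaves a 13-byte file with 8 leading zero bytes. */
    ok = pwrite(a, "htdig", 5, 8) == 5
      && pwrite_by_hand(b, "htdig", 5, 8) == 5
      && pread(a, ra, 13, 0) == 13
      && pread(b, rb, 13, 0) == 13
      && memcmp(ra, rb, 13) == 0
      && lseek(a, 0, SEEK_CUR) == 0
      && lseek(b, 0, SEEK_CUR) == 0;

    close(a);
    close(b);
    return ok;
}
```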

> Is linux's pwrite() just horribly broken? Is htdig the only program
> that uses it?

Anything which does lots of discontiguous writes can do this.
Probably the recent shortening of the request queue made
it a little worse, but without the ability to perform
write merging at the buffercache LRU list level, we don't
really have a fix.

The reason why it makes your *read* throughput so bad is
that the writes are asynchronous. So htdig can cheerfully
fill the request queue with 128 writes (and 128 seeks!) but
processes which are doing reads cannot do this asynchronously
(apart from readahead, which doesn't help much here).

So the readers get stuck on a queue behind 127 write seeks.
Eventually their read hits the head of the queue and gets
serviced. Then they request another read. And they go
to the back of the queue (or maybe the middle, if they get
lucky - depends what block they're trying to read).
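
A back-of-envelope version of that, as a sketch. The 127 queued writes
come from the description above; the 8 ms per seek and the 4096-byte read
are assumptions picked for illustration, not measurements from the
reporter's machine, and transfer time is ignored.

```c
/*
 * Bytes per second seen by a reader that gets one request of `read_bytes`
 * serviced per trip through a queue holding `queued_writes` writes, each
 * costing `seek_ms` milliseconds of seek time.
 */
double starved_read_rate(int queued_writes, double seek_ms, double read_bytes)
{
    double wait_s = queued_writes * seek_ms / 1000.0;

    return wait_s > 0.0 ? read_bytes / wait_s : 0.0;
}
```

With those assumed numbers, starved_read_rate(127, 8.0, 4096.0) comes to
about 1 second of waiting per read, or roughly 4KB/s. That is in the same
range as the reported input rate, which is suggestive but not proof that
this model is the whole story.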

> Here's a little snapshot of htdig's syscalls, strace -s 0:
>
> pwrite(6, ""..., 8192, 20717568) = 8192
> pread(6, ""..., 8192, 138395648) = 8192
> pwrite(6, ""..., 8192, 127918080) = 8192
> ...
>

ug. So we do have a real-world case.

> It's seeking all over the place. Maybe pwrite/pread bypass the elevator
> and proper I/O scheduling.

Nope. It's just a pathological case.

You'll get much, much better behaviour with

http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre7/read-latency2.patch

Because it

a) boosts the priority of readers and
b) increases the request queue size a lot, so write merges will
   be more common.

Long-term, the only fix for this is to perform the write-merging
at a much higher level - to give it visibility of all the writable
data in the machine.

-

2002-01-30 23:38:31

by Jeffrey W. Baker

Subject: Re: 2.4.17: pwrite destroys block I/O throughput

On Wed, 2002-01-30 at 14:55, Jeffrey W. Baker wrote:
> Hi there,
>
> I've never heard of pwrite and pread before, but htdig apparently makes
> very heavy use of it. On my 2.4.17 SMP 2GB HIGHMEM + ext3 system,
> running htdig destroys input rates for every other process. htdig's
> output proceeds at approximately 2MB/s, but input for the entire system
> runs only at about 4KB/s (YES, KB!). If I suspend htdig, system block
> input increases to the normal rate of 10-50MB/s. Output still works, as
> a dd from /dev/zero to a 400MB file runs at about 25MB/s.

I looked around a little more and it seems that htdig is doing SYNC I/O
on its db2 files. This may be the problem. If so, it seems that sync
I/O is able to DoS all other I/O consumers on the machine.

-jwb