2006-02-24 20:23:10

by Marr

Subject: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Greetings,

*** Please CC: me on replies -- I'm not subscribed.

Short Problem Description/Question:

When switching from kernel 2.4.31 to 2.6.13 (with everything else the same),
there is a drastic increase in the time required to perform 'fseek()' on
larger files (e.g. 4.3 MB, using ReiserFS [in case it matters], in my test
case).

It seems that any seeks in a range larger than 128KB (regardless of the file
size or the position within the file) cause the performance to drop
precipitously. As near as I can determine, this happens because the virtual
memory manager (VMM) in 2.6.13 is not caching the full 4.3 MB file. In fact,
at most a 128KB segment of the file seems to be cached.

Can anyone please explain this change in behavior and/or recommend a 2.6.x VM
setting to revert to the old (_much_ faster) 'fseek()' behavior from 2.4.x
kernels?

-----------------------------------

More Details:

I'm running Slackware 10.2 (2.4.31 and 2.6.13 stock kernels) on a 400 MHz AMD
K6-2 laptop with 192MB of RAM.

I have an application that does many (20,000 - 50,000) 'fseek()' calls on the
same large file. In 2.4.31 (and other earlier 2.4.x kernels), it runs very
fast, even on large files (e.g. 4.3 MB).

I culled the problem down to a C code sample (see below).

Some timing tests with 20,000 'fseek()' calls:

Kernel 2.4.31: 1st run -- 0m8.0s; 2nd run 0m0.6s;

Kernel 2.6.13: 1st run -- 0m32.0s; 2nd run 0m29.0s;

Some timing tests with 200,000 'fseek()' calls:

Kernel 2.4.31: 6.0s

Kernel 2.6.13: 4m50s

Clearly, the 2.4.31 results are speedy because the whole 4MB file has been
cached.

What I cannot figure out is this: what has changed in 2.6.x kernels to cause
the performance to degrade so drastically?!?

Assuming it's somehow related to the 2.6.x VMM code, I've read everything I
could in the '/usr/src/linux-2.6.13/Documentation/vm/' directory and I've run
'vmstat' and dumped the various '/proc/sys/vm/*' settings. I've tried
tweaking settings (some [most?] of which I don't fully understand [e.g.
'/proc/sys/vm/lowmem_reserve_ratio']). I've tried scanning the VM code for
clues but, not being a Virtual Memory guru, I've come up empty. I've searched
the web and LKML to no avail.

I'm completely at a loss -- any suggestions would be much welcomed!

-----------------------------------

Here's a quick 'n' dirty test routine I wrote which demonstrates the problem
on a 4MB file generated with this command:

dd if=/dev/zero of=/tmp/fseek-4MB bs=1024 count=4096

Compile:

gcc -o fseek-test fseek-test.c

Run (1st parm [required] is filename; 2nd parm [optional, 20K is default] is
loop count):

fseek-test /tmp/fseek-4MB 20000

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

int main (int argc, char *argv[])
{
    if (argc < 2) {
        printf("You must specify the filename!\n");
    }
    else {
        FILE *inp_fh;
        if ((inp_fh = fopen(argv[1], "rb")) == 0) {
            printf("Error ('%s') opening data file ('%s') for input!\n",
                   strerror(errno), argv[1]);
        }
        else {
            int j, pos;
            int max_calls = 20000;      /* default loop count */
            if (argc > 2) {
                max_calls = atoi(argv[2]);
                if (max_calls < 100) max_calls = 100;
                if (max_calls > 999999) max_calls = 999999;
            }
            printf("Performing %d calls to 'fseek()' on file '%s'...\n",
                   max_calls, argv[1]);
            for (j = 0; j < max_calls; j++) {
                /* Pseudo-random offset somewhere in the first 4,000,000 bytes. */
                pos = (int)(((double)random() / (double)RAND_MAX) * 4000000.0);
                if (fseek(inp_fh, pos, SEEK_SET)) {
                    printf("Error ('%s') seeking to position %d!\n",
                           strerror(errno), pos);
                }
            }
            fclose(inp_fh);
        }
    }
    exit(0);
}

-----------------------------------

Any advice is much appreciated... TIA!

*** Please CC: me on replies -- I'm not subscribed.

Bill Marr


2006-02-24 23:32:25

by Robert Hancock

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Marr wrote:
> Clearly, the 2.4.31 results are speedy because the whole 4MB file has been
> cached.

I don't think this is clear at all. The entire file should always be
cached; not doing so would be insane.

> What I cannot figure out is this: what has changed in 2.6.x kernels to cause
> the performance to degrade so drastically?!?

fseek() is a C library call, not a system call itself - there may be
something that glibc is doing differently. Are you using the same glibc
version with both kernels?

Just from this program, it could be something else entirely that explains
the difference in speed, like the random number generator.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2006-02-25 05:17:57

by Andrew Morton

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Marr <[email protected]> wrote:
>
> ..
>
> When switching from kernel 2.4.31 to 2.6.13 (with everything else the same),
> there is a drastic increase in the time required to perform 'fseek()' on
> larger files (e.g. 4.3 MB, using ReiserFS [in case it matters], in my test
> case).
>
> It seems that any seeks in a range larger than 128KB (regardless of the file
> size or the position within the file) cause the performance to drop
> precipitously.
>

Interesting.

What's happening is that glibc does a read from the file within each
fseek(). Which might seem a bit silly because the app could seek somewhere
else without doing any IO. But then the app would be silly too.

Also, glibc is using the value returned in struct stat's blksize (a hint as
to this file's preferred read chunk size) as, umm, a hint as to this file's
preferred read size.

Most filesystems return 4k in stat.blksize. But in 2.6, reiserfs bumped
that to 128k to get good I/O patterns. Consequently this:

> for (j=0; j < max_calls; j++) {
>     pos = (int)(((double)random() / (double)RAND_MAX) * 4000000.0);
>     if (fseek(inp_fh, pos, SEEK_SET)) {
>         printf("Error ('%s') seeking to position %d!\n",
>                strerror(errno), pos);
>     }
> }

runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k read
on every fseek.
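
A minimal sketch to confirm the hint for a given file (untested, for
illustration only; the path is just the test file from earlier in the thread):

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    /* Print the preferred-I/O-size hint that stdio picks up from stat(). */
    if (stat("/tmp/fseek-4MB", &st) == 0)
        printf("st_blksize = %ld\n", (long)st.st_blksize);
    else
        perror("stat");
    return 0;
}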

- There may be a libc stdio function which allows you to tune this
behaviour.

- libc should probably be a bit more defensive about this anyway -
plainly the filesystem is being silly.

- You can alter the filesystem's behaviour by mounting with the
`nolargeio=1' option. That sets stat.blksize back to 4k.

This will alter the behaviour of every reiserfs filesystem in the
machine. Even the already mounted ones.

`mount -o remount,nolargeio=1' can probably also be used. But that
won't affect inodes which are already in cache - a umount/mount cycle may
be needed.

If you like, you can just mount and unmount a different reiserfs
filesystem to switch this reiserfs filesystem's behaviour. IOW: the
reiserfs guys were lazy and went and made this a global variable :(

- fseek is a pretty dumb function anyway - you're better off with
stateless functions like pread() - half the number of syscalls, don't
have to track where the file pointer is at. I don't know if there's a
pread()-like function in stdio though?

No happy answers there, sorry. But a workaround.
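
For illustration, a rough sketch of the first and last suggestions above
(untested; whether setvbuf() really overrides the st_blksize-derived buffer
size is an assumption, and the helper names are only placeholders):

#include <stdio.h>
#include <unistd.h>

static char small_buf[4096];

void tune_stream(FILE *fp)
{
    /* Must be called right after fopen(), before any read or seek. */
    setvbuf(fp, small_buf, _IOFBF, sizeof(small_buf));
}

ssize_t read_at(int fd, void *buf, size_t len, off_t pos)
{
    /* Stateless positioned read: one syscall, no stdio buffer refill. */
    return pread(fd, buf, len, pos);
}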

2006-02-26 13:08:11

by Ingo Oeser

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

On Saturday, 25. February 2006 06:16, Andrew Morton wrote:
> runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k read
> on every fseek.

That's the bug. If I seek, I never want a read to be issued.
seek should just return whether the result is a valid offset
in the underlying object.

It is perfectly valid to have a real time device which produces data
very fast and where you are allowed to skip without reading anything.

This device could be a pipe, which just allows forward seeking for exactly
this (implemented by me some years ago).

> - fseek is a pretty dumb function anyway - you're better off with
> stateless functions like pread() - half the number of syscalls, don't
> have to track where the file pointer is at. I don't know if there's a
> pread()-like function in stdio though?

pread and anything else not using RELATIVE descriptor offsets are not
very useful for pipe-like interfaces that can seek, but only forward.

There are even cases where you can seek forward and backward, but
only ever with relative offsets, because you have a circular buffer indexed by time.
If you want the last N minutes, the relative index is always stable,
but the absolute offset jumps.

So I hope glibc will fix fseek to work as advertised.

But for the simple file case all your answers are valid.

Regards

Ingo Oeser



2006-02-26 13:50:47

by Nick Piggin

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Ingo Oeser wrote:
> On Saturday, 25. February 2006 06:16, Andrew Morton wrote:
>
>>runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k read
>>on every fseek.
>
>
> That's the bug. If I seek, I never want a read to be issued.
> seek should just return whether the result is a valid offset
> in the underlying object.
>
> It is perfectly valid to have a real time device which produces data
> very fast and where you are allowed to skip without reading anything.
>
> This device could be a pipe, which just allows forward seeking for exactly
> this (implemented by me some years ago).
>
>
>>- fseek is a pretty dumb function anyway - you're better off with
>> stateless functions like pread() - half the number of syscalls, don't
>> have to track where the file pointer is at. I don't know if there's a
>> pread()-like function in stdio though?
>
>
> pread and anything else not using RELATIVE descriptor offsets are not
> very useful for pipe like interfaces that can seek, but just forward.
>
> There are even cases, where you can seek forward and backward, but
> only with relative offsets ever, because you have a circular buffer indexed by time.
> If you like to get the last N minutes, the relative index is always stable,
> but the absolute offset jumps.
>
> So I hope glibc will fix fseek to work as advertised.
>
> But for the simple file case all your answers are valid.
>

Not really. The app is not silly if it does an fseek() then a _write_.
Writing page sized and aligned chunks should not require previously
uptodate pagecache, so doing a pre-read like this is a complete waste.

Actually glibc tries to turn this pre-read off if the seek is to a page
aligned offset, presumably to handle this case. However a big write
would only have to RMW the first and last partial pages, so pre-reading
128KB in this case is wrong.

And I would also say a 4K read is wrong as well, because a big read will
be less efficient due to the extra syscall and small IO.
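
A minimal sketch of that seek-then-write case, done with a positioned write
on the raw descriptor so that no stdio buffer gets filled first (untested;
the path and offset are placeholders, and the offset is assumed page-aligned):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int write_page_at(const char *path, off_t pos)
{
    char page[4096];
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    memset(page, 0, sizeof(page));
    /* Whole aligned page: no read-modify-write of partial pages needed. */
    if (pwrite(fd, page, sizeof(page), pos) != (ssize_t)sizeof(page)) {
        close(fd);
        return -1;
    }
    return close(fd);
}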

--
SUSE Labs, Novell Inc.

2006-02-26 14:11:27

by Arjan van de Ven

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

On Mon, 2006-02-27 at 00:50 +1100, Nick Piggin wrote:
>
> Not really. The app is not silly if it does an fseek() then a _write_.
> Writing page sized and aligned chunks should not require previously
> uptodate pagecache, so doing a pre-read like this is a complete waste.
>
> Actually glibc tries to turn this pre-read off if the seek is to a page
> aligned offset, presumably to handle this case. However a big write
> would only have to RMW the first and last partial pages, so pre-reading
> 128KB in this case is wrong.
>
> And I would also say a 4K read is wrong as well, because a big read will
> be less efficient due to the extra syscall and small IO.


I can very much see the point of issuing a sys_readahead instead.....
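
A minimal sketch of what that could look like (untested; assumes the glibc
readahead() wrapper, which needs _GNU_SOURCE):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static void hint_readahead(int fd, off_t pos, size_t len)
{
    /* Kick off asynchronous read-ahead and return immediately;
     * errors are ignored since this is only a hint. */
    (void)readahead(fd, pos, len);
}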


2006-02-27 20:25:05

by Marr

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

On Saturday 25 February 2006 12:16am, Andrew Morton wrote:
> Marr <[email protected]> wrote:
> > ..
> >
> > When switching from kernel 2.4.31 to 2.6.13 (with everything else the
> > same), there is a drastic increase in the time required to perform
> > 'fseek()' on larger files (e.g. 4.3 MB, using ReiserFS [in case it
> > matters], in my test case).
> >
> > It seems that any seeks in a range larger than 128KB (regardless of the
> > file size or the position within the file) cause the performance to drop
> > precipitously.
>
> Interesting.
>
> What's happening is that glibc does a read from the file within each
> fseek(). Which might seem a bit silly because the app could seek somewhere
> else without doing any IO. But then the app would be silly too.
>
> Also, glibc is using the value returned in struct stat's blksize (a hint as
> to this file's preferred read chunk size) as, umm, a hint as to this file's
> preferred read size.
>
> Most filesystems return 4k in stat.blksize. But in 2.6, reiserfs bumped
> that to 128k to get good I/O patterns. Consequently this:
> > for (j=0; j < max_calls; j++) {
> >     pos = (int)(((double)random() / (double)RAND_MAX) * 4000000.0);
> >     if (fseek(inp_fh, pos, SEEK_SET)) {
> >         printf("Error ('%s') seeking to position %d!\n",
> >                strerror(errno), pos);
> >     }
> > }
>
> runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k read
> on every fseek.

(...snip...)

> - You can alter the filesystem's behaviour by mounting with the
> `nolargeio=1' option. That sets stat.blksize back to 4k.

Greetings again,

*** Please CC: me on replies -- I'm not subscribed.

First off, many thanks to all who replied. A special "thank you" to Andrew
Morton for his valuable insight -- very much appreciated!

Apologies for my delay in replying. I wanted to do some proper testing in
order to have something intelligent to report.

Based on Andrew's excellent advice, I've re-tested. As before, I tested under
the stock (Slackware 10.2) 2.4.31 and 2.6.13 kernels. This time, I tested
ext2, ext3, and reiserfs (with and without the 'nolargeio=1' mount option)
filesystems.

Some notes on the testing:

(1) This is on a faster machine and a faster hard disk drive than the
testing from my initial email, so the absolute times are not meaningful in
comparison.

(2) I found (unsurprisingly) that ext2 and ext3 times were very similar, so
I'm reporting them as one here.

(3) I'm only reporting the times for the 2nd and subsequent runs of the
'fseek-test' program. In all cases (except for the 2.6.13 kernel with reiserfs
without the 'nolargeio=1' setting), the 1st run after mounting the filesystem
was predictably slower (uncached file content). The 2nd and subsequent runs
are all close enough to be considered identical.

(4) All tests were done on the same 4MB zero-filled file described in my
initial email.

Timing tests with 200,000 randomized 'fseek()' calls:

Kernel 2.4.31:

ext2/3: 2.8s
reiserfs (w/o 'nolargeio=1'): 2.8s

Kernel 2.6.13:

ext2/3: 3.0s
reiserfs (w/o 'nolargeio=1'): 2m12s (ouch!)
reiserfs (with 'nolargeio=1'): 3.0s

Basically, the "reiserfs without 'nolargeio=1' option on a 2.6.x kernel" is
the "problem child". Every run, from the 1st to the nth, takes the same
amount of time and is _incredibly_ slow for any application which is doing a
lot of file seeking outside of a 128KB window.

Clearly, however, there are 2 workarounds when using a 2.6.x kernel: (A) Use
ext2/ext3 or (B) use the 'nolargeio=1' mount option when using reiserfs.
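
In case it helps anyone else, workaround (B) looks something like this (the
device and mount point below are placeholders, of course):

mount -t reiserfs -o nolargeio=1 /dev/hda3 /home

or, for an already-mounted filesystem (though, per Andrew's note, a
umount/mount cycle may still be needed for inodes already in cache):

mount -o remount,nolargeio=1 /home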

Aside: For some reason, the 'nolargeio' option for the 'reiserfs' filesystem
is not mentioned on their page of such info:

http://www.namesys.com/mount-options.html

On Saturday 25 February 2006 12:16am, Andrew Morton wrote:
> No happy answers there, sorry. But a workaround.

Actually, 2 workarounds, both good ones. Thanks again, Andrew, for your
excellent advice!

*** Please CC: me on replies -- I'm not subscribed.

Bill Marr

2006-02-27 20:52:07

by Hans Reiser

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Sounds like the real problem is that glibc is doing filesystem
optimizations without making them conditional on the filesystem type.
Does anyone know the email address of the glibc guy so we can ask him
not to do that?

My entry for the ugliest thought of the day: I wonder if the kernel can
test the glibc version and.....

Hans

Nick Piggin wrote:

>
> Actually glibc tries to turn this pre-read off if the seek is to a page
> aligned offset, presumably to handle this case. However a big write
> would only have to RMW the first and last partial pages, so pre-reading
> 128KB in this case is wrong.
>
> And I would also say a 4K read is wrong as well, because a big read will
> be less efficient due to the extra syscall and small IO.
>

2006-02-27 21:53:41

by Hans Reiser

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Andrew Morton wrote:

>
>runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k read
>on every fseek.
>
>- There may be a libc stdio function which allows you to tune this
> behaviour.
>
>- libc should probably be a bit more defensive about this anyway -
> plainly the filesystem is being silly.
>
>
I really thank you for isolating the problem, but I don't see how you
can do other than blame glibc for this. The recommended IO size is only
relevant to uncached data, and glibc is using it regardless of whether
or not it is cached or uncached. Do I misunderstand something myself here?

>- You can alter the filesystem's behaviour by mounting with the
> `nolargeio=1' option. That sets stat.blksize back to 4k.
>
> This will alter the behaviour of every reiserfs filesystem in the
> machine. Even the already mounted ones.
>
> `mount -o remount,nolargeio=1' can probably also be used. But that
> won't affect inodes which are already in cache - a umount/mount cycle may
> be needed.
>
> If you like, you can just mount and unmount a different reiserfs
> filesystem to switch this reiserfs filesystem's behaviour. IOW: the
> reiserfs guys were lazy and went and made this a global variable :(
>
>- fseek is a pretty dumb function anyway - you're better off with
> stateless functions like pread() - half the number of syscalls, don't
> have to track where the file pointer is at. I don't know if there's a
> pread()-like function in stdio though?
>
>No happy answers there, sorry. But a workaround.
>
>
>
>

2006-02-28 00:15:01

by Bill Davidsen

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Hans Reiser wrote:
> Andrew Morton wrote:
>
>> runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k read
>> on every fseek.
>>
>> - There may be a libc stdio function which allows you to tune this
>> behaviour.
>>
>> - libc should probably be a bit more defensive about this anyway -
>> plainly the filesystem is being silly.
>>
>>
> I really thank you for isolating the problem, but I don't see how you
> can do other than blame glibc for this. The recommended IO size is only
> relevant to uncached data, and glibc is using it regardless of whether
> or not it is cached or uncached. Do I misunderstand something myself here?

I think the issue is not "blame" but what effect this behavior would
have on things like database loads, where seek-write would be common.
Good to get this info to users and admins.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2006-02-28 00:34:13

by Nick Piggin

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Hans Reiser wrote:

>Sounds like the real problem is that glibc is doing filesystem
>optimizations without making them conditional on the filesystem type.
>

I'm not sure that it should even be conditional on the filesystem type...
To me it seems silly to even bother doing it, although I guess there
is another level of buffering involved which might mean it makes more
sense.

>Does anyone know the email address of the glibc guy so we can ask him
>not to do that?
>
>

Ulrich Drepper I guess. But don't tell him I sent you ;)

>My entry for the ugliest thought of the day: I wonder if the kernel can
>test the glibc version and.....
>
>Hans
>
>Nick Piggin wrote:
>
>
>>Actually glibc tries to turn this pre-read off if the seek is to a page
>>aligned offset, presumably to handle this case. However a big write
>>would only have to RMW the first and last partial pages, so pre-reading
>>128KB in this case is wrong.
>>
>>And I would also say a 4K read is wrong as well, because a big read will
>>be less efficient due to the extra syscall and small IO.
>>
>>

2006-02-28 18:38:41

by Hans Reiser

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Bill Davidsen wrote:

> Hans Reiser wrote:
>
>> Andrew Morton wrote:
>>
>>> runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k
>>> read
>>> on every fseek.
>>>
>>> - There may be a libc stdio function which allows you to tune this
>>> behaviour.
>>>
>>> - libc should probably be a bit more defensive about this anyway -
>>> plainly the filesystem is being silly.
>>>
>>>
>> I really thank you for isolating the problem, but I don't see how you
>> can do other than blame glibc for this. The recommended IO size is only
>> relevant to uncached data, and glibc is using it regardless of whether
>> or not it is cached or uncached. Do I misunderstand something
>> myself here?
>
>
> I think the issue is not "blame" but what effect this behavior would
> have on things like database loads, where seek-write would be common.
> Good to get this info to users and admins.
>
Well, ok, let me phrase it as "this should be fixed in glibc". Does
anyone know who the maintainer for it is?

2006-02-28 18:42:13

by Hans Reiser

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Nick Piggin wrote:

> Hans Reiser wrote:
>
>> Sounds like the real problem is that glibc is doing filesystem
>> optimizations without making them conditional on the filesystem type.
>
>
> I'm not sure that it should even be conditional on the filesystem type...
> To me it seems silly to even bother doing it, although I guess there
> is another level of buffering involved which might mean it makes more
> sense.
>
I was not saying that filesystem optimizations should be done in glibc
rather than in the kernel, I was merely forgoing judgement on that
point. Actually, I rather doubt that they should be in glibc, but
maybe someday someone will come up with some legit example of where it
belongs in glibc. I cannot think of one myself though.

2006-02-28 18:51:17

by Hans Reiser

Subject: Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?

Ulrich, it seems that glibc is doing something that looks like some sort
of attempt at a filesystem optimization for fseek() which really ought
to be in the filesystems instead of glibc. Could you comment, and
assuming you agree, fix it for us?

It particularly affects ReiserFS V3 performance in a highly negative
way, because we set stat.blksize to 128k. stat.blksize is intended to
hint what the preferred IO size is for an FS.

Could you read this thread and contribute to it?

Hans

The most important part of the thread to read was:

Marr <[email protected]> wrote:


>>
>> ..
>>
>> When switching from kernel 2.4.31 to 2.6.13 (with everything else the same),
>> there is a drastic increase in the time required to perform 'fseek()' on
>> larger files (e.g. 4.3 MB, using ReiserFS [in case it matters], in my test
>> case).
>>
>> It seems that any seeks in a range larger than 128KB (regardless of the file
>> size or the position within the file) cause the performance to drop
>> precipitously.
>>
>
>

Interesting.

What's happening is that glibc does a read from the file within each
fseek(). Which might seem a bit silly because the app could seek somewhere
else without doing any IO. But then the app would be silly too.

Also, glibc is using the value returned in struct stat's blksize (a hint as
to this file's preferred read chunk size) as, umm, a hint as to this file's
preferred read size.

Most filesystems return 4k in stat.blksize. But in 2.6, reiserfs bumped
that to 128k to get good I/O patterns. Consequently this:



>> for (j=0; j < max_calls; j++) {
>>     pos = (int)(((double)random() / (double)RAND_MAX) * 4000000.0);
>>     if (fseek(inp_fh, pos, SEEK_SET)) {
>>         printf("Error ('%s') seeking to position %d!\n",
>>                strerror(errno), pos);
>>     }
>> }
>
>

runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k read
on every fseek.



Nick Piggin wrote:

> Hans Reiser wrote:
>
>> Sounds like the real problem is that glibc is doing filesystem
>> optimizations without making them conditional on the filesystem type.
>
>
> I'm not sure that it should even be conditional on the filesystem type...
> To me it seems silly to even bother doing it, although I guess there
> is another level of buffering involved which might mean it makes more
> sense.
>
>
>> My entry for the ugliest thought of the day: I wonder if the kernel can
>> test the glibc version and.....
>>
>> Hans
>>
>> Nick Piggin wrote:
>>
>>
>>> Actually glibc tries to turn this pre-read off if the seek is to a page
>>> aligned offset, presumably to handle this case. However a big write
>>> would only have to RMW the first and last partial pages, so pre-reading
>>> 128KB in this case is wrong.
>>>
>>> And I would also say a 4K read is wrong as well, because a big read
>>> will
>>> be less efficient due to the extra syscall and small IO.
>>>

2006-03-05 23:03:09

by L A Walsh

Subject: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

Does this happen with a seek call as well, or is this limited
to fseek?

If you look at "hdparm's" idea of read-ahead, what does it say
for the device? I.e.:

hdparm /dev/hda:

There is a line entitled "readahead". What does it say?

I noticed that this seems to default to "256" sectors, or 128K
in 2.6.

This may be unrelated, but what does the kernel do with
this number? I seem to remember this being set to ~8-16 (4-8K)
in 2.4. I thought it was the number of sectors to read ahead, by
default, when a read was done, but I haven't noticed a performance
degradation like I would expect for such a large read-ahead value.

On the other hand: you do seem to be experiencing something consistent
with that setting. I'm not sure under what circumstances the kernel
uses the "readahead" value as a number of sectors to read ahead...

Have the disk read routines changed with respect to this value?

-linda
< bottom or top posting is a personal preference somewhat based
on the email tool one uses. In a GUI, bottom posting often means
you can't see what the person wrote without skipping to the end
of message. When dealing with Chronological information, it
often makes more sense to put the most recent information _first>

Bill Davidsen wrote:
> Hans Reiser wrote:
>> Andrew Morton wrote:
>>> runs like a dog on 2.6's reiserfs. libc is doing a (probably) 128k
>>> read
>>> on every fseek.
>>>
>>> - There may be a libc stdio function which allows you to tune this
>>> behaviour.
>>>
>>> - libc should probably be a bit more defensive about this anyway -
>>> plainly the filesystem is being silly.
>> I really thank you for isolating the problem, but I don't see how you
>> can do other than blame glibc for this. The recommended IO size is only
>> relevant to uncached data, and glibc is using it regardless of whether
>> or not it is cached or uncached. Do I misunderstand something
>> myself here?
> I think the issue is not "blame" but what effect this behavior would
> have on things like database loads, where seek-write would be common.
> Good to get this info to users and admins.
>

2006-03-07 19:56:11

by Marr

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

On Sunday 05 March 2006 6:02pm, Linda Walsh wrote:
> Does this happen with a seek call as well, or is this limited
> to fseek?
>
> if you look at "hdparm's" idea of read-ahead, what does it say
> for the device?. I.e.:
>
> hdparm /dev/hda:
>
> There is a line entitled "readahead". What does it say?

Linda,

I don't know (based on your email addressing) if you were directing this
question at me, but since I'm the guy who originally reported this issue,
here are my 'hdparm' results on my (standard Slackware 10.2) ReiserFS
filesystem:

2.6.13 (with 'nolargeio=1' for reiserfs mount):
readahead = 256 (on)

2.6.13 (without 'nolargeio=1' for reiserfs mount):
readahead = 256 (on)

2.4.31 ('nolargeio' option irrelevant/unavailable for 2.4.x):
readahead = 8 (on)

*** Please CC: me on replies -- I'm not subscribed.

Regards,
Bill Marr

> I noticed that this seems to default to "256" sectors, or 128K
> in 2.6.
>
> This may be unrelated, but what does the kernel do with
> this number? I seem to remember this being set to ~8-16 (4-8K)
> in 2.4. I thought it was the number of sectors to read ahead, by
> default, when a read was done, but I haven't noticed a performance
> degradation like I would expect for such a large read-ahead value.
>
> On the other hand: you do seem to be experiencing something consistent
> with that setting. I'm not sure under what circumstances the kernel
> uses the "readahead" value as a number of sectors to read ahead...
>
> Have the disk read routines changed with respect to this value?
>
> -linda

2006-03-07 21:16:11

by L A Walsh

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

Marr wrote:
> On Sunday 05 March 2006 6:02pm, Linda Walsh wrote:
>> Does this happen with a seek call as well, or is this limited
>> to fseek?
>>
>> if you look at "hdparm's" idea of read-ahead, what does it say
>> for the device?. I.e.:
>>
>> hdparm /dev/hda:
>>
>> There is a line entitled "readahead". What does it say?
>
> Linda,
>
> I don't know (based on your email addressing) if you were directing this
> question at me, but since I'm the guy who originally reported this issue,
> here are my 'hdparm' results on my (standard Slackware 10.2) ReiserFS
> filesystem:
>
> 2.6.13 (with 'nolargeio=1' for reiserfs mount):
> readahead = 256 (on)
>
> 2.6.13 (without 'nolargeio=1' for reiserfs mount):
> readahead = 256 (on)
>
> 2.4.31 ('nolargeio' option irrelevant/unavailable for 2.4.x):
> readahead = 8 (on)
>
> *** Please CC: me on replies -- I'm not subscribed.
>
> Regards,
> Bill Marr
--------
Could you retry your test with read-ahead set to a smaller
value? Say the same as in 2.4 (8) or 16 and see if that changes
anything?

hdparm -a8 /dev/hdx
or
hdparm -a16 /dev/hdx



2006-03-12 21:55:30

by Marr

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

On Tuesday 07 March 2006 4:15pm, Linda Walsh wrote:
> Marr wrote:
> > On Sunday 05 March 2006 6:02pm, Linda Walsh wrote:
> >> Does this happen with a seek call as well, or is this limited
> >> to fseek?
> >>
> >> if you look at "hdparm's" idea of read-ahead, what does it say
> >> for the device?. I.e.:
> >>
> >> hdparm /dev/hda:
> >>
> >> There is a line entitled "readahead". What does it say?
> >
> > Linda,
> >
> > I don't know (based on your email addressing) if you were directing this
> > question at me, but since I'm the guy who originally reported this issue,
> > here are my 'hdparm' results on my (standard Slackware 10.2) ReiserFS
> > filesystem:
> >
> > 2.6.13 (with 'nolargeio=1' for reiserfs mount):
> > readahead = 256 (on)
> >
> > 2.6.13 (without 'nolargeio=1' for reiserfs mount):
> > readahead = 256 (on)
> >
> > 2.4.31 ('nolargeio' option irrelevant/unavailable for 2.4.x):
> > readahead = 8 (on)
> >
> > *** Please CC: me on replies -- I'm not subscribed.
> >
> > Regards,
> > Bill Marr
>
> --------
> Could you retry your test with read-ahead set to a smaller
> value? Say the same as in 2.4 (8) or 16 and see if that changes
> anything?
>
> hdparm -a8 /dev/hdx
> or
> hdparm -a16 /dev/hdx

Linda (et al),

Sorry for the delayed reply. I finally got a chance to run another test (but
on a different machine than the last time, so don't try to compare old timing
numbers with these numbers).

I went ahead and tried all permutations, just to be sure. As before, these
reported times are all for 200,000 random 'fseek()' calls on the same
zero-filled 4MB file on a standard Slackware 10.2 ReiserFS partition and
kernels.

(Values shown for 'readahead' are set by 'hdparm -a## /dev/hda' command.)

-----------------------------------
Timing Results:

On 2.6.13, *without* 'nolargeio=1': 4m35s (ouch!) for _all_ variants (256, 16,
8) of 'readahead'

On 2.6.13, _with_ 'nolargeio=1': 0m6s for _all_ variants (256, 16, 8) of
'readahead'

On 2.4.31: 0m6s for _all_ variants (128 [256 is illegal -- 'BLKRASET failed:
Invalid argument'], 16, 8) of 'readahead'

-----------------------------------

I half-expected to see improvement for the '2.6.13 without nolargeio=1' case
when lowering the read-ahead from 256 sectors to 16 or 8 sectors, but there
clearly was no improvement whatsoever.

I tried turning 'readahead' off entirely ('hdparm -A0 /dev/hda') and, although
it correctly reported "setting drive read-lookahead to 0 (off)", an immediate
follow-on query ('hdparm /dev/hda') showed that it was still ON ("readahead =
256 (on)")! I went ahead and ran the test again anyway and (unsurprisingly)
got the same excessive times (4m35s) for 200K seeks.

Confused, but still (for now) happily using the 'nolargeio=1' workaround with
all my 2.6.13 kernels with ReiserFS.... :^/

*** Please CC: me on replies -- I'm not subscribed.

Regards,
Bill Marr

2006-03-12 22:15:44

by Mark Lord

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

Marr wrote:
>
> I tried turning 'readahead' off entirely ('hdparm -A0 /dev/hda') and, although

No, that should be "hdparm -a0 /dev/hda" (lowercase "-a").
And the same "-a" for all of your other test variants.

If you did it all with "-A", then the results are invalid,
and need to be redone.

The hdparm manpage explains this, but in a nutshell, "-A" is the
low-level drive firmware "look-ahead" mechanism, whereas "-a" is
the Linux kernel "read-ahead" scheme.

In general, most uppercase hdparm flags are drive *firmware* settings.

Cheers

2006-03-13 04:38:25

by Marr

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

On Sunday 12 March 2006 5:15pm, Mark Lord wrote:
> Marr wrote:
> > I tried turning 'readahead' off entirely ('hdparm -A0 /dev/hda') and,
> > although
>
> No, that should be "hdparm -a0 /dev/hda" (lowercase "-a").

Aha, you're right! Thanks for the clarification.

> And the same "-a" for all of your other test variants.
>
> If you did it all with "-A", then the results are invalid,
> and need to be redone.

Actually, that's impossible to do ('hdparm' won't take such settings with
'-A'). And, as my original email stated:

(Values shown for 'readahead' are set by 'hdparm -a## /dev/hda' command.)

In other words, the important tests were done correctly. Sorry I didn't make
it clearer, but that last test with '-A0' was a complete afterthought (based
on what I saw on a quick look at the 'man hdparm' page) and in no way negates
the results from the first part of the tests.

> The hdparm manpage explains this, but in a nutshell, "-A" is the
> low-level drive firmware "look-ahead" mechanism, whereas "-a" is
> the Linux kernel "read-ahead" scheme.

You are, of course, correct. Unfortunately, my 'man hdparm' page ("Version 6.1
April 2005") doesn't make this as clear as it could be. The distinction is
subtle. To quote the '-a'/'-A' part:

-a Get/set sector count for filesystem read-ahead. This is used to
improve performance in sequential reads of large files, by
prefetching additional blocks in anticipation of them being
needed by the running task. In the current kernel version
(2.0.10) this has a default setting of 8 sectors (4KB). This
value seems good for most purposes, but in a system where most
file accesses are random seeks, a smaller setting might provide
better performance. Also, many IDE drives also have a separate
built-in read-ahead function, which alleviates the need for a
filesystem read-ahead in many situations.

-A Disable/enable the IDE drive's read-lookahead feature (usually
ON by default). Usage: -A0 (disable) or -A1 (enable).

A bad interpretation on my part. Thanks again for setting me straight.

Anyway, not that it really matters, but I re-did the testing with '-a0' and it
didn't help one iota. The 2.6.13 kernel on ReiserFS (without using
'nolargeio=1' as a mount option) still takes about 4m35s to fseek 200,000
times on that 4MB file, even with 'hdparm -a0 /dev/hda' in effect.

*** Please CC: me on replies -- I'm not subscribed.

Regards,
Bill Marr

2006-03-13 14:42:23

by Mark Lord

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

Marr wrote:
>
> Anyway, not that it really matters, but I re-did the testing with '-a0' and it
> didn't help one iota. The 2.6.13 kernel on ReiserFS (without using
> 'nolargeio=1' as a mount option) still takes about 4m35s to fseek 200,000
> times on that 4MB file, even with 'hdparm -a0 /dev/hda' in effect.

Does it make a difference when done on the filesystem *partition*
rather than the base drive? At one time, this mattered, and it may
still work that way today.

Eg. hdparm -a0 /dev/hda3 rather than hdparm -a0 /dev/hda

??

2006-03-13 18:15:37

by Hans Reiser

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

Ulrich, what are your plans regarding fixing this? Are you just going
to ignore it or?

Hans

Mark Lord wrote:

> Marr wrote:
>
>>
>> Anyway, not that it really matters, but I re-did the testing with
>> '-a0' and it didn't help one iota. The 2.6.13 kernel on ReiserFS
>> (without using 'nolargeio=1' as a mount option) still takes about
>> 4m35s to fseek 200,000 times on that 4MB file, even with 'hdparm -a0
>> /dev/hda' in effect.
>
>
> Does it make a difference when done on the filesystem *partition*
> rather than the base drive? At one time, this mattered, and it may
> still work that way today.
>
> Eg. hdparm -a0 /dev/hda3 rather than hdparm -a0 /dev/hda
>
> ??
>
>

2006-03-13 20:01:39

by Marr

Subject: Re: Readahead value 128K? (was Re: Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?)

On Monday 13 March 2006 9:41am, Mark Lord wrote:
> Marr wrote:
> > Anyway, not that it really matters, but I re-did the testing with '-a0'
> > and it didn't help one iota. The 2.6.13 kernel on ReiserFS (without using
> > 'nolargeio=1' as a mount option) still takes about 4m35s to fseek 200,000
> > times on that 4MB file, even with 'hdparm -a0 /dev/hda' in effect.
>
> Does it make a difference when done on the filesystem *partition*
> rather than the base drive? At one time, this mattered, and it may
> still work that way today.
>
> Eg. hdparm -a0 /dev/hda3 rather than hdparm -a0 /dev/hda
>
> ??

Unfortunately, it makes no difference. That is, after successfully setting
'-a0' on the partition in question (instead of the whole HDD device itself),
the 200,000 random 'fseek()' calls still take about 4m35s on ReiserFS
(without using 'nolargeio=1' as a mount option) under kernel 2.6.13.

P.S. I've CC:ed you and the others on my reply to Al Boldi's request for the
'hdparm -I /dev/hda' information, in case it helps at all.

Thanks for your inputs, Mark -- much appreciated!

*** Please CC: me on replies -- I'm not subscribed.

Regards,
Bill Marr