2009-09-18 19:32:43

by Jim Meyering

[permalink] [raw]
Subject: efficient access to "rotational"; new fcntl?

About a year ago, I fixed GNU rm to avoid quadratic readdir/stat seek
penalties on ext3 and ext4 that became overwhelming on directories with
many entries (1M entries started to look like an infinite loop or a DoS).

From coreutils-7.0 NEWS:

chgrp, chmod, chown, chcon, du, rm: now all display linear performance,
even when operating on million-entry directories on ext3 and ext4 file
systems. Before, they would exhibit O(N^2) performance, due to linear
per-entry seek time cost when operating on entries in readdir order.
Rm was improved directly, while the others inherit the improvement
from the newer version of fts in gnulib.

To do that efficiently, I changed the core readdir loop (mainly via
gnulib's fts.c) to preprocess entries, when needed, by sorting them
on inode, and *then* processing them. This optimization is enabled
only when the affected file system is of a type likely to benefit,
and when the number of directory entries is large enough to matter.
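
Roughly, the idea is the following (a minimal, self-contained sketch;
the real gnulib fts.c keeps far more state and only does this when the
heuristics above say it is worth it):

/* sort-on-inode sketch: read all entries, sort by inode, then process.  */
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct ent { ino_t ino; char *name; };

static int
cmp_ino (const void *a, const void *b)
{
  const struct ent *x = a, *y = b;
  return x->ino < y->ino ? -1 : x->ino > y->ino;
}

int
main (int argc, char **argv)
{
  if (argc != 2)
    { fprintf (stderr, "usage: %s DIR\n", argv[0]); return 1; }
  DIR *d = opendir (argv[1]);
  if (!d)
    { perror ("opendir"); return 1; }

  struct ent *v = NULL;
  size_t n = 0, cap = 0;
  struct dirent *e;
  while ((e = readdir (d)))
    {
      if (!strcmp (e->d_name, ".") || !strcmp (e->d_name, ".."))
        continue;
      if (n == cap)
        {
          cap = cap ? 2 * cap : 1024;
          v = realloc (v, cap * sizeof *v);
          if (!v)
            abort ();
        }
      v[n].ino = e->d_ino;
      v[n].name = strdup (e->d_name);
      n++;
    }
  closedir (d);

  /* Sorting by inode makes the subsequent per-entry work (stat, unlink,
     chown, ...) touch the on-disk inode table roughly sequentially,
     instead of seeking all over the disk in readdir order.  */
  qsort (v, n, sizeof *v, cmp_ino);

  for (size_t i = 0; i < n; i++)
    printf ("%ju %s\n", (uintmax_t) v[i].ino, v[i].name);
  return 0;
}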

However, with e.g., an ext4 partition on non-rotational hardware like
an SSD, that preprocessing is unnecessary and in fact wasted effort.
I'd like to avoid the waste by querying the equivalent of
/sys/.../rotational, via a syscall like fcntl or statvfs,
given a file descriptor.

Is there an efficient way to get that single bit?

IMHO, locating and opening-then-reading a file like
/sys/devices/pci0000:00/0000:00:1f.2/host4/target4:0:0/4:0:0:0/block/sdc/queue/rotational
is not appropriate in the context of a low-level library function like fts.
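
For reference, here is roughly what a library would have to do today to
dig that bit out of sysfs, given only a file descriptor. It is just a
sketch: it assumes /sys/dev/block exists (Linux 2.6.27 or newer) and makes
no attempt to handle stacked devices like LVM, md or loop, which is part
of why this feels out of place in fts:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if the device backing FD is rotational, 0 if not, -1 if unknown.
   Sketch only: assumes /sys/dev/block/MAJ:MIN exists and ignores stacked
   or virtual devices entirely.  */
static int
fd_is_rotational (int fd)
{
  struct stat st;
  if (fstat (fd, &st) != 0)
    return -1;

  /* For a partition, the queue/ directory lives one level up, under the
     whole-disk device, hence the second, "../"-prefixed attempt.  */
  static char const *suffix[] = { "queue/rotational", "../queue/rotational" };
  for (int i = 0; i < 2; i++)
    {
      char path[128];
      snprintf (path, sizeof path, "/sys/dev/block/%u:%u/%s",
                major (st.st_dev), minor (st.st_dev), suffix[i]);
      FILE *f = fopen (path, "r");
      if (!f)
        continue;
      int c = fgetc (f);
      fclose (f);
      if (c == '0' || c == '1')
        return c - '0';
    }
  return -1;
}

int
main (int argc, char **argv)
{
  int fd = open (argc > 1 ? argv[1] : ".", O_RDONLY);
  if (fd < 0)
    { perror ("open"); return 1; }
  printf ("rotational: %d\n", fd_is_rotational (fd));
  close (fd);
  return 0;
}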

Jim


2009-09-18 22:17:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

On Fri, Sep 18, 2009 at 09:31:50PM +0200, Jim Meyering wrote:
> chgrp, chmod, chown, chcon, du, rm: now all display linear performance,
> even when operating on million-entry directories on ext3 and ext4 file
> systems. Before, they would exhibit O(N^2) performance, due to linear
> per-entry seek time cost when operating on entries in readdir order.
> Rm was improved directly, while the others inherit the improvement
> from the newer version of fts in gnulib.

Excellent! I didn't know that (since my userspace is still Ubuntu
9.04, which is still using coreutils 6.10).

> However, with e.g., an ext4 partition on non-rotational hardware like
> an SSD, that preprocessing is unnecessary and in fact wasted effort.
> I'd like to avoid the waste by querying the equivalent of
> /sys/.../rotational, via a syscall like fcntl or statvfs,
> given a file descriptor.

Have you benchmarked it both ways? The preprocessing will cost some
extra CPU time, sure, but for a sufficiently large directory, or if
the user is deleting a very large directory hierarchy, such that "rm
-rf" spans multiple journal transactions, deleting the files in inode
order will still avoid some filesystem metadata blocks getting written
multiple times (which, for SSDs, especially the crappier ones with
nasty write-amplification factors, could show a performance impact).

> Is there an efficient way to get that single bit?

Not today; if it's really useful, we could add it, of course. But how
much overhead are you trying to avoid by avoiding the pre-processing
before unlinking the files?

Regards,

- Ted

2009-09-19 08:23:56

by Jim Meyering

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

Theodore Tso wrote:
> On Fri, Sep 18, 2009 at 09:31:50PM +0200, Jim Meyering wrote:
>> chgrp, chmod, chown, chcon, du, rm: now all display linear performance,
>> even when operating on million-entry directories on ext3 and ext4 file
>> systems. Before, they would exhibit O(N^2) performance, due to linear
>> per-entry seek time cost when operating on entries in readdir order.
>> Rm was improved directly, while the others inherit the improvement
>> from the newer version of fts in gnulib.
>
> Excellent! I didn't know that (since my userspace is still Ubuntu
> 9.04, which is still using coreutils 6.10).

Heh. Time to upgrade.
With the upcoming coreutils-7.7, I've removed a quadratic
component in rm -r (without -f) and rewritten it to give
rm -rf an additional 4-5x speed-up in some nasty cases.

>> However, with e.g., an ext4 partition on non-rotational hardware like
>> an SSD, that preprocessing is unnecessary and in fact wasted effort.
>> I'd like to avoid the waste by querying the equivalent of
>> /sys/.../rotational, via a syscall like fcntl or statvfs,
>> given a file descriptor.
>
> Have you benchmarked it both ways? The preprocessing will cost some
> extra CPU time, sure, but for a sufficiently large directory, or if
> the user is deleting a very large directory hierarchy, such that "rm
> -rf" spans multiple journal transactions, deleting the files in inode
> order will still avoid some filesystem metadata blocks getting written
> multiple times (which, for SSDs, especially the crappier ones with
> nasty write-amplification factors, could show a performance impact).

Yeah, I mentioned I should do exactly that on IRC yesterday.
I've just run some tests, and see that at least with one SSD (OCZ Summit
120GB), the 0.5s cost of sorting pays off handsomely with a 12-x speed-up,
saving 5.5 minutes, when removing a 1-million-empty-file directory.

----------------------------------------
Timing rm -rf of a million-file directory on ext4, on a 120GB OCZ Summit, Fedora 11.
This uses the very latest rm/remove.c from coreutils.git,
the one rewritten to use fts.

Creation took about 63 seconds:
mkdir d;(cd d && seq 1000000|xargs touch)

Removal with inode-sort preprocessing (the 0.543s is sort duration):
$ env time ./rm -rf d
0.543050295
1.62user 20.13system 0:28.25elapsed 77%CPU (0avgtext+0avgdata 0maxresident)k
9968inputs+8outputs (0major+74445minor)pagefaults 0swaps

2nd trial: (create million-file dir)
$ mkdir d;(cd d && seq 1000000|env time xargs touch)
0.63user 62.14system 1:06.49elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
40inputs+16outputs (1major+19701minor)pagefaults 0swaps

Remove it:
$ env time ./rm -rf d
0.570515343
1.72user 18.49system 0:26.45elapsed 76%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+8outputs (0major+74445minor)pagefaults 0swaps
---------------------------------------------

Repeating, but with fts' sort-on-inode disabled:
ouch. It would have taken about 6 minutes.
I killed it after ~3 minutes, when it had removed half of the entries.

Conclusion:
Even on an SSD, this sort-on-inode preprocessing gives more
than a 10x speed-up when removing a 1-million-empty-file directory.
Hence, fts does not need access to the "rotational" bit after all.

2009-09-19 08:31:47

by Arjan van de Ven

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

On Sat, 19 Sep 2009 10:01:51 +0200
Jim Meyering <[email protected]> wrote:
> Yeah, I mentioned I should do exactly that on IRC yesterday.
> I've just run some tests, and see that at least with one SSD (OCZ
> Summit 120GB), the 0.5s cost of sorting pays off handsomely with a
> 12-x speed-up, saving 5.5 minutes, when removing a
> 1-million-empty-file directory.
>

likely because you actually reduce the amount of IO; inodes share
disk blocks; repeated unlinks in random order likely write the same
block multiple times....


btw have you given thought about using threads as part of rm -r[f] ?
(would make the unlinks of unrelated directories/files asynchronous)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-19 09:07:31

by Jim Meyering

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

Arjan van de Ven wrote:
> On Sat, 19 Sep 2009 10:01:51 +0200
> Jim Meyering <[email protected]> wrote:
>> Yeah, I mentioned I should do exactly that on IRC yesterday.
>> I've just run some tests, and see that at least with one SSD (OCZ
>> Summit 120GB), the 0.5s cost of sorting pays off handsomely with a
>> 12-x speed-up, saving 5.5 minutes, when removing a
>> 1-million-empty-file directory.
>>
>
> likely because you actually reduce the amount of IO; inodes share
> disk blocks; repeated unlinks in random order likely write the same
> block multiple times....

That makes sense.
Maybe cache effects, too?

> btw have you given thought about using threads as part of rm -r[f] ?
> (would make the unlinks of unrelated directories/files asynchronous)

While it is certainly a nicely parallelizable process,
rm usually runs so quickly that I doubt it'd be worthwhile.
If you know in advance that parallelizing a particular recursive
removal would give a significant benefit, it's probably best to do it
via e.g., xargs --max-procs=N.

However, sort *would* benefit, and some UCLA students implemented that
for a term project. Unfortunately, the project is stalled because the
implementation was not efficient enough, and no one has found the
time to improve it since.

2009-09-19 09:19:08

by Arjan van de Ven

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

On Sat, 19 Sep 2009 11:07:21 +0200
Jim Meyering <[email protected]> wrote:

> > btw have you given thought about using threads as part of rm -r[f] ?
> > (would make the unlinks of unrelated directories/files asynchronous)
>
> While it is certainly a nicely parallelizable process,
> rm usually runs so quickly that I doubt it'd be worthwhile.
> If you know in advance that parallelizing a particular recursive
> removal would give a significant benefit, it's probably best to do it
> via e.g., xargs --max-procs=N.

deleting large files has a cost of several seeks (small files are
cheap), at least on ext3. I guess with btrfs being the future it's
indeed not worth doing in userspace.


> However, sort *would* benefit, and some UCLA students implemented that
> for a term project. Unfortunately, the project is stalled because the
> implementation was not efficient enough, and no one has found the
> time to improve it since.

parallel sort... call me skeptical. My gut feeling is that you'll get
killed by communication overhead.
(sort seems to be more communication than raw cpu use)



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-19 11:11:50

by Avi Kivity

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

On 09/19/2009 12:19 PM, Arjan van de Ven wrote:
>> However, sort *would* benefit, and some UCLA students implemented that
>> for a term project. Unfortunately, the project is stalled because the
>> implementation was not efficient enough, and no one has found the
>> time to improve it since.
>>
> parallel sort... call me skeptical. My gut feeling is that you'll get
> killed by communication overhead.
> (sort seems to be more communication than raw cpu use)
>
>

Why? A sort that fits in memory is purely CPU and memory access.

Instead of one O(N log N) sort you'd get K parallel sorts of O(N/K log N/K)
each, followed by an O(N) merge. For large N and small K, that's a speedup
of roughly K (since the O(N) merge is dominated by the preceding sort).
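
Concretely, the scheme is something like this toy sketch (nothing to do
with the stalled UCLA code, and certainly not coreutils sort): split the
array into K chunks, qsort each chunk in its own thread, then do a single
merge pass. Build with -pthread.

/* Toy parallel sort: K chunks qsort'ed in parallel threads, then merged.  */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define K 4  /* number of worker threads; an illustrative choice */

struct chunk { int *base; size_t n; };

static int
cmp_int (const void *a, const void *b)
{
  int x = *(const int *) a, y = *(const int *) b;
  return (x > y) - (x < y);
}

static void *
sort_chunk (void *arg)
{
  struct chunk *c = arg;
  qsort (c->base, c->n, sizeof *c->base, cmp_int);
  return NULL;
}

int
main (void)
{
  size_t n = 1u << 22;
  int *a = malloc (n * sizeof *a);
  int *out = malloc (n * sizeof *out);
  for (size_t i = 0; i < n; i++)
    a[i] = rand ();

  /* Each thread sorts N/K elements: O((N/K) log (N/K)) apiece, in parallel.  */
  pthread_t tid[K];
  struct chunk c[K];
  size_t per = n / K;
  for (int i = 0; i < K; i++)
    {
      c[i].base = a + i * per;
      c[i].n = (i == K - 1) ? n - i * per : per;
      pthread_create (&tid[i], NULL, sort_chunk, &c[i]);
    }
  for (int i = 0; i < K; i++)
    pthread_join (tid[i], NULL);

  /* Single-threaded K-way merge: O(N*K) with this naive scan, which is
     still linear in N for a small, fixed K.  */
  size_t pos[K] = { 0 };
  for (size_t o = 0; o < n; o++)
    {
      int best = -1;
      for (int i = 0; i < K; i++)
        if (pos[i] < c[i].n
            && (best < 0 || c[i].base[pos[i]] < c[best].base[pos[best]]))
          best = i;
      out[o] = c[best].base[pos[best]++];
    }

  printf ("min=%d max=%d\n", out[0], out[n - 1]);
  free (a);
  free (out);
  return 0;
}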


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-09-19 11:25:30

by Willy Tarreau

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

On Fri, Sep 18, 2009 at 09:31:50PM +0200, Jim Meyering wrote:
> About a year ago, I fixed GNU rm to avoid quadratic readdir/stat seek
> penalties on ext3 and ext4 that became overwhelming on directories with
> many entries (1M entries started to look like an infinite loop or a DoS).
>
> From coreutils-7.0 NEWS:
>
> chgrp, chmod, chown, chcon, du, rm: now all display linear performance,
> even when operating on million-entry directories on ext3 and ext4 file
> systems. Before, they would exhibit O(N^2) performance, due to linear
> per-entry seek time cost when operating on entries in readdir order.
> Rm was improved directly, while the others inherit the improvement
> from the newer version of fts in gnulib.
>
> To do that efficiently, I changed the core readdir loop (mainly via
> gnulib's fts.c) to preprocess entries, when needed, by sorting them
> on inode, and *then* processing them. This optimization is enabled
> only when the affected file system is of a type likely to benefit,
> and when the number of directory entries is large enough to matter.

This is excellent. I'm used to playing with variations of
'find -printf "%i %p\n" | sort -n | cut -d" " -f2- | xargs <cmd>' to sort file
operations by inode to speed them up, especially when rm'ing a
kernel tree or when prefetching a tree so that a build runs from a hot cache.

I think you might notice excellent results on CDs, where seeks cost a
lot. That's where I first used the find|sort method above. I'm really
happy to see that such mechanisms are now implemented inside my
everyday tools, and I can't wait to upgrade!

Regards,
Willy

2009-09-19 11:30:12

by Arjan van de Ven

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

On Sat, 19 Sep 2009 14:11:38 +0300
Avi Kivity <[email protected]> wrote:

> On 09/19/2009 12:19 PM, Arjan van de Ven wrote:
> >> However, sort *would* benefit, and some UCLA students implemented
> >> that for a term project. Unfortunately, the project is stalled
> >> because the implementation was not efficient enough, and no one
> >> has found the time to improve it since.
> >>
> > parallel sort... call me skeptical. My gut feeling is that you'll
> > get killed by communication overhead.
> > (sort seems to be more communication than raw cpu use)
> >
> >
>
> Why? a sort that fits in memory is purely cpu and memory access.

memory access == communication due to cache line bounces....

>
> Instead of one O(N log N) sort you'd get K parallel sorts of
> O(N/K log N/K) each, followed by an O(N) merge. For large N and small
> K, that's a speedup of roughly K (since the O(N) merge is dominated by
> the preceding sort).

doing big-O arithmetic and then adding constant divisors/multipliers is
rather... dangerous ;)

You actually get K * C1 * O(N/K log N/K) + C2 * O(N),
where C1 and C2 are the actual costs of the extra inter-CPU communication.
(And for most sorting, I suspect the communication costs dominate over
the CPU cost.)

I'd be happy to be proven wrong...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-19 11:40:32

by Avi Kivity

[permalink] [raw]
Subject: Re: efficient access to "rotational"; new fcntl?

On 09/19/2009 02:30 PM, Arjan van de Ven wrote:
> On Sat, 19 Sep 2009 14:11:38 +0300
> Avi Kivity<[email protected]> wrote:
>
>
>> On 09/19/2009 12:19 PM, Arjan van de Ven wrote:
>>
>>>> However, sort *would* benefit, and some UCLA students implemented
>>>> that for a term project. Unfortunately, the project is stalled
>>>> because the implementation was not efficient enough, and no one
>>>> has found the time to improve it since.
>>>>
>>>>
>>> parallel sort... call me skeptical. My gut feeling is that you'll
>>> get killed by communication overhead.
>>> (sort seems to be more communication than raw cpu use)
>>>
>>>
>>>
>> Why? a sort that fits in memory is purely cpu and memory access.
>>
> memory access == communication due to cache line bounces....
>


For a large sort, cache-line bounces will be negligible. First, memory
size greatly exceeds cache size. Second, every cache line will be
accessed multiple times.

If you're sorting 2MB, then yes, threads aren't needed.

>> Instead of one O(N log N) sort you'd get K parallel sorts of
>> O(N/K log N/K) each, followed by an O(N) merge. For large N and small
>> K, that's a speedup of roughly K (since the O(N) merge is dominated by
>> the preceding sort).
>>
> doing big-O arithmetic and then adding constant divisors/multipliers is
> rather... dangerous ;)
>

I'm wearing my seat belt.

> You actually get K * C1 * O(N/K log N/K) + C2 * O(N),
> where C1 and C2 are the actual costs of the extra inter-CPU communication.
> (And for most sorting, I suspect the communication costs dominate over
> the CPU cost.)
>
> I'd be happy to be proven wrong...
>

Again, if the dataset is large, only a small fraction of the cache lines
will be in another CPU's cache; the majority will be in main memory.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.