2006-11-22 21:44:36

by Wendy Cheng

Subject: [PATCH] prune_icache_sb

There seems to be a need to prune inode cache entries for specific
mount points (per vfs superblock) due to performance issues found after
some io-intensive commands ("rsync", for example). The problem is
particularly serious for one of our kernel modules, which caches its
(cluster) locks based on the vfs inode implementation. These locks are
created by the inode creation call and get purged when s_op->clear_inode()
is invoked. With larger servers equipped with plenty of memory, the
page dirty ratio may not pass the threshold to trigger VM reclaim logic,
but the accumulated inode count (and its associated cluster locks)
could cause unacceptable performance degradation for latency-sensitive
applications.
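
For illustration only, a minimal sketch of that pattern follows; the cluster_lock structure and the two helpers are hypothetical stand-ins for the module's real internals, and only the s_op->clear_inode() hook itself comes from the description above.

#include <linux/fs.h>

struct cluster_lock;					/* hypothetical DLM lock handle */
struct cluster_lock *inode_to_cluster_lock(struct inode *inode);
void release_cluster_lock(struct cluster_lock *lk);

/*
 * The lock is attached to the VFS inode at lookup/create time and is only
 * handed back to the DLM when the VFS finally clears the inode.
 */
static void example_clear_inode(struct inode *inode)
{
	struct cluster_lock *lk = inode_to_cluster_lock(inode);

	if (lk)
		release_cluster_lock(lk);
}

static struct super_operations example_super_ops = {
	/* ... read_inode, put_super, etc. ... */
	.clear_inode	= example_clear_inode,
};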

After adding the uploaded inode trimming patch, together with
shrink_dcache_sb(), we are able to keep the latency for one real-world
application within a satisfactory bound (consistently staying within 5
seconds, compared to the original fluctuation between 5 and 16 seconds).
The calls are placed in one of our kernel daemons, which wakes up at a
tunable interval to do the trimming work, as shown in the following code
segment. We would appreciate it if this patch could be accepted into the
mainline kernel.

i_percent = sdp->sd_tune.gt_inoded_purge;
if (i_percent) {
	if (i_percent > 100) i_percent = 100;
	a_count = atomic_read(&sdp->sd_inode_count);
	i_count = a_count * i_percent / 100;
	(void) shrink_dcache_sb(sdp->sd_vfs);
	(void) prune_icache_sb(i_count, sdp->sd_vfs);
}

-- Wendy





Attachments:
inode_prune_sb.patch (2.09 kB)

2006-11-22 23:36:24

by Andrew Morton

Subject: Re: [PATCH] prune_icache_sb

On Wed, 22 Nov 2006 16:35:07 -0500
Wendy Cheng <[email protected]> wrote:

> There seems to be a need to prune inode cache entries for specific
> mount points (per vfs superblock) due to performance issues found after
> some io-intensive commands ("rsync", for example). The problem is
> particularly serious for one of our kernel modules, which caches its
> (cluster) locks based on the vfs inode implementation. These locks are
> created by the inode creation call and get purged when s_op->clear_inode()
> is invoked. With larger servers equipped with plenty of memory, the
> page dirty ratio may not pass the threshold to trigger VM reclaim logic,
> but the accumulated inode count (and its associated cluster locks)
> could cause unacceptable performance degradation for latency-sensitive
> applications.
>
> After adding the uploaded inode trimming patch, together with
> shrink_dcache_sb(), we are able to keep the latency for one real-world
> application within a satisfactory bound (consistently staying within 5
> seconds, compared to the original fluctuation between 5 and 16 seconds).
> The calls are placed in one of our kernel daemons, which wakes up at a
> tunable interval to do the trimming work, as shown in the following code
> segment. We would appreciate it if this patch could be accepted into the
> mainline kernel.
>
> i_percent = sdp->sd_tune.gt_inoded_purge;
> if (i_percent) {
> 	if (i_percent > 100) i_percent = 100;
> 	a_count = atomic_read(&sdp->sd_inode_count);
> 	i_count = a_count * i_percent / 100;
> 	(void) shrink_dcache_sb(sdp->sd_vfs);
> 	(void) prune_icache_sb(i_count, sdp->sd_vfs);
> }
>
>...
>
> --- linux-2.6.18/include/linux/fs.h 2006-09-19 23:42:06.000000000 -0400
> +++ ups-kernel/include/linux/fs.h 2006-11-22 13:55:55.000000000 -0500
> @@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
> static inline void insert_inode_hash(struct inode *inode) {
> __insert_inode_hash(inode, inode->i_ino);
> }
> -
> +extern void prune_icache_sb(int nr_to_scan, struct super_block *sb);
> +
> extern struct file * get_empty_filp(void);
> extern void file_move(struct file *f, struct list_head *list);
> extern void file_kill(struct file *f);
> --- linux-2.6.18/fs/inode.c 2006-09-19 23:42:06.000000000 -0400
> +++ ups-kernel/fs/inode.c 2006-11-22 14:12:28.000000000 -0500
> @@ -411,7 +411,7 @@ static int can_unuse(struct inode *inode
> * If the inode has metadata buffers attached to mapping->private_list then
> * try to remove them.
> */
> -static void prune_icache(int nr_to_scan)
> +static void __prune_icache(int nr_to_scan, struct super_block *sb)
> {
> LIST_HEAD(freeable);
> int nr_pruned = 0;
> @@ -428,7 +428,8 @@ static void prune_icache(int nr_to_scan)
>
> inode = list_entry(inode_unused.prev, struct inode, i_list);
>
> - if (inode->i_state || atomic_read(&inode->i_count)) {
> + if (inode->i_state || atomic_read(&inode->i_count)
> + || (sb && (inode->i_sb != sb))) {
> list_move(&inode->i_list, &inode_unused);
> continue;

This search is potentially inefficient. It would be better to walk
sb->s_inodes.
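
For illustration, a rough sketch of that direction, modeled on the 2.6.18 invalidate_list()/prune_icache() code; iprune_mutex, the nr_to_scan feedback into the shrinker and the inode-still-has-pages case are all omitted, and dispose_list() refers to the existing static helper in fs/inode.c:

/*
 * Rough sketch only: walk the per-superblock inode list instead of the
 * global inode_unused LRU.  Not a tested patch.
 */
static void prune_icache_sb_sketch(int nr_to_scan, struct super_block *sb)
{
	LIST_HEAD(freeable);
	struct inode *inode, *next;

	spin_lock(&inode_lock);
	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
		if (nr_to_scan-- <= 0)
			break;
		if (inode->i_state || atomic_read(&inode->i_count))
			continue;
		/* Unused: unhook it from the global LRU, queue for disposal. */
		list_move(&inode->i_list, &freeable);
		inode->i_state |= I_FREEING;
		inodes_stat.nr_unused--;
	}
	spin_unlock(&inode_lock);

	dispose_list(&freeable);	/* static helper in fs/inode.c */
}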

2006-11-28 00:03:20

by Wendy Cheng

Subject: Re: [PATCH] prune_icache_sb

Andrew Morton wrote:
> This search is potentially inefficient. It would be better to walk
> sb->s_inodes.
>
>
Not sure about walking through sb->s_inodes, for several reasons....

1. The changes made are mostly for file-server setups with large
fs sizes - the entry count in sb->s_inodes may not be shorter than the
inode_unused list.
2. Different from calls such as drop_pagecache_sb() (which doesn't do
list entry removal), we're walking through the list to dispose of the entries.
This implies we are walking through one list (sb->s_inodes) to remove the
other list's entries (inode_unused). This feels awkward.
3. The new code will be very similar to the current prune_icache() with a few
differences - e.g., we really don't want to list_move() within the
sb->s_inodes list itself (as done in prune_icache(), which moves the
examined entry to the tail of the inode_unused list). We would have to either
duplicate the code or clutter the current prune_icache() routine.

Pruning based on sb->s_inodes *does* have its advantages, but a simple and
plain patch as shown in the previous post (which has been well tested in
two large-scale production systems) could be equally effective. Does this
make sense?

-- Wendy

2006-11-28 00:53:30

by Andrew Morton

Subject: Re: [PATCH] prune_icache_sb

On Mon, 27 Nov 2006 18:52:58 -0500
Wendy Cheng <[email protected]> wrote:

> Andrew Morton wrote:
> > This search is potentially inefficient. It would be better to walk
> > sb->s_inodes.
> >
> >
> Not sure about walking through sb->s_inodes, for several reasons....
>
> 1. The changes made are mostly for file-server setups with large
> fs sizes - the entry count in sb->s_inodes may not be shorter than the
> inode_unused list.

umm, that's the best case. We also care about the worst case. Think:
1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000 are
from the being-unmounted filesystem. The code as proposed will do 100x more
work than it needs to do. All under a global spinlock.

> 2. Different from calls such as drop_pagecache_sb() (which doesn't do
> list entry removal), we're walking through the list to dispose of the entries.
> This implies we are walking through one list (sb->s_inodes) to remove the
> other list's entries (inode_unused). This feels awkward.
> 3. The new code will be very similar to the current prune_icache() with a few
> differences - e.g., we really don't want to list_move() within the
> sb->s_inodes list itself (as done in prune_icache(), which moves the
> examined entry to the tail of the inode_unused list). We would have to either
> duplicate the code or clutter the current prune_icache() routine.
>
> Pruning based on sb->s_inodes *does* have its advantages, but a simple and
> plain patch as shown in the previous post (which has been well tested in
> two large-scale production systems) could be equally effective. Does this
> make sense?
>

I also worry about the whole thing:

> There seems to be a need to prune inode cache entries for specific mount
> points (per vfs superblock) due to performance issues found after some
> io-intensive commands ("rsync", for example). The problem is particularly
> serious for one of our kernel modules, which caches its (cluster) locks
> based on the vfs inode implementation. These locks are created by the inode
> creation call and get purged when s_op->clear_inode() is invoked. With
> larger servers equipped with plenty of memory, the page dirty ratio
> may not pass the threshold to trigger VM reclaim logic, but the accumulated
> inode count (and its associated cluster locks) could cause unacceptable
> performance degradation for latency-sensitive applications.

What's this about "the page dirty ratio may not pass the threshold to
trigger VM reclaim logic"? Page reclaim isn't triggered by a dirty page
ratio.

Page reclaim is triggered by a shortage of free pages. And page reclaim is
supposed to reclaim unused inodes in an orderly and balanced fashion. It
appears that it's not doing so in your case and we'd need to see more
details (please) so we can understand why it is not working.

You're proposing that we not do any of that and that the filesytem be able
to call into a VM memory reclaim function for not-clearly-understood
reasons. This is a workaround.

Please help us to understand what has gone wrong with inode reclaim. And
please see if you can find time to help us with this rather than adding
some RH-specific fix and then forgetting about it (sensible though that
approach would be...)

Thanks.

2006-11-28 21:51:28

by Wendy Cheng

Subject: Re: [PATCH] prune_icache_sb

Andrew Morton wrote:
> On Mon, 27 Nov 2006 18:52:58 -0500
> Wendy Cheng <[email protected]> wrote:
>
>
>> Not sure about walking through sb->s_inodes, for several reasons....
>>
>> 1. The changes made are mostly for file-server setups with large
>> fs sizes - the entry count in sb->s_inodes may not be shorter than the
>> inode_unused list.
>>
>
> umm, that's the best case. We also care about the worst case. Think:
> 1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000 are
> from the being-unmounted filesystem. The code as proposed will do 100x more
> work than it needs to do. All under a global spinlock.
>
By walking through sb->s_inodes, we would also need to take inode_lock and
iprune_mutex (?), since we're purging the inodes from the system - or,
specifically, removing them from the inode_unused list. There is really not
much difference from the current prune_icache() logic. What's being
proposed here is simply *exporting* the prune_icache() kernel code to
allow filesystems to trim (purge a small percentage of) their
potentially unused per-mount inodes for *latency* considerations.

I made a mistake by using the "page dirty ratio" to explain the problem
(sorry! I was not thinking well in the previous write-up), which could
mislead you into thinking this is a VM issue. This is not so much about a
low-on-free-pages (and/or memory fragmentation) issue (though
fragmentation is normally part of the symptoms). What the (external)
kernel module does is tie its cluster-wide file lock to the in-memory
inode that is obtained at file look-up time. The lock is removed
from the machine when

1. the lock is granted to another (cluster) machine; or
2. the in-memory inode is purged from the system.

One of the clusters that has this latency issue is an IP/TV application
where it runs "rsync" against the main station server (over a long
geographical distance) every 15 minutes. This subsequently (and constantly)
generates a large number of inodes (and locks) that hang around. When other
nodes within the same cluster, serving as FTP servers, are serving the files,
the DLM has to wade through a huge number of lock entries to know whether
lock requests can be granted. That's where this latency issue pops up.
Our profiling data show that when the cluster performance has dropped
into unacceptable ranges, the DLM can hog 40% of the CPU cycles in its
lock-searching logic. From the VM's point of view, the system does not
have a memory shortage, so it has no need to kick off the prune_icache()
call.

This issue could also be fixed in several different ways - maybe by a
better DLM hash function, maybe by asking IT people to umount the
filesystem after each rsync, which unconditionally purges *all* per-mount
inodes (but that defeats the purpose of caching the inodes and, in our
case, the locks), etc. But I do think the proposed patch is the most
sensible way to fix this issue, and I believe it is one of those functions
that, if you export it, people will find a good use for. It helps with
memory fragmentation and/or shortage *before* it becomes a problem as
well. I certainly understand and respect a maintainer's daunting job of
deciding whether to take or reject a patch - let me know what you think so
I can start to work on other solutions if required.

-- Wendy

2006-11-29 00:22:06

by Andrew Morton

Subject: Re: [PATCH] prune_icache_sb

On Tue, 28 Nov 2006 16:41:07 -0500
Wendy Cheng <[email protected]> wrote:

> Andrew Morton wrote:
> > On Mon, 27 Nov 2006 18:52:58 -0500
> > Wendy Cheng <[email protected]> wrote:
> >
> >
> >> Not sure about walking through sb->s_inodes, for several reasons....
> >>
> >> 1. The changes made are mostly for file-server setups with large
> >> fs sizes - the entry count in sb->s_inodes may not be shorter than the
> >> inode_unused list.
> >>
> >
> > umm, that's the best case. We also care about the worst case. Think:
> > 1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000 are
> > from the being-unmounted filesystem. The code as proposed will do 100x more
> > work than it needs to do. All under a global spinlock.
> >
> By walking through sb->s_inodes, we would also need to take inode_lock and
> iprune_mutex (?), since we're purging the inodes from the system - or,
> specifically, removing them from the inode_unused list. There is really not
> much difference from the current prune_icache() logic.

There's quite a bit of difference. The change you're proposing will
perform poorly if it is used in the scenario which I describe above. It
will waste CPU cycles and will destroy the inode_unused LRU ordering (for
what that's worth, which isn't much).

Trust me, every single time we've had an inefficient search in core kernel,
someone has gone and done something which hits it and causes general
meltdown in their workload. So we've had to make significant changes to
remove the O(n) or higher search complexity.

And in this case we *already have* the data structures in place to make it
O(1).

> What's being
> proposed here is simply *exporting* the prune_icache() kernel code to
> allow filesystems to trim (purge a small percentage of) their
> potentially unused per-mount inodes for *latency* considerations.

It just happens to work in your setup. If you have a large machine with
two filesystems and you run rsync on both filesystems and run FTP against
one of them, it might not work so well. Because the proposed
prune_icache_sb() might need to chew through 500,000 inodes from the wrong
superblock before reclaiming any of the inodes which you want to reclaim.
Or something like that.

> I made a mistake by using the "page dirty ratio" to explain the problem
> (sorry! I was not thinking well in the previous write-up), which could
> mislead you into thinking this is a VM issue. This is not so much about a
> low-on-free-pages (and/or memory fragmentation) issue (though
> fragmentation is normally part of the symptoms). What the (external)
> kernel module does is tie its cluster-wide file lock to the in-memory
> inode that is obtained at file look-up time. The lock is removed
> from the machine when
>
> 1. the lock is granted to another (cluster) machine; or
> 2. the in-memory inode is purged from the system.

It seems peculiar to be tying the lifetime of a DLM lock to the system's
memory size and current memory pressure?

> One of the clusters that has this latency issue is an IP/TV application
> where it runs "rsync" against the main station server (over a long
> geographical distance) every 15 minutes. This subsequently (and constantly)
> generates a large number of inodes (and locks) that hang around. When other
> nodes within the same cluster, serving as FTP servers, are serving the files,
> the DLM has to wade through a huge number of lock entries to know whether
> lock requests can be granted. That's where this latency issue pops up.
> Our profiling data show that when the cluster performance has dropped
> into unacceptable ranges, the DLM can hog 40% of the CPU cycles in its
> lock-searching logic. From the VM's point of view, the system does not have
> a memory shortage, so it has no need to kick off the prune_icache() call.

OK..

> This issue could also be fixed in several different ways - maybe by a
> better DLM hash function,

It does sound like the lock lookup is broken.

I assume there's some reason for keeping these things floating about in
memory, so there must be a downside to artificially pruning them in
this manner? If so, a (much) faster lookup would seem to be the best fix.

> maybe by asking IT people to umount the
> filesystem after each rsync, which unconditionally purges *all* per-mount
> inodes (but that defeats the purpose of caching the inodes and, in our
> case, the locks), etc. But I do think the proposed patch is the most
> sensible way to fix this issue, and I believe it is one of those functions
> that, if you export it, people will find a good use for. It helps with
> memory fragmentation and/or shortage *before* it becomes a problem as
> well. I certainly understand and respect a maintainer's daunting job of
> deciding whether to take or reject a patch - let me know what you think so
> I can start to work on other solutions if required.

We shouldn't export this particular implementation to modules because it
has bad failure modes. There might be a case for exposing an
i_sb_list-based API or, perhaps better, a max-unused-inodes mount option.


2006-11-29 05:54:46

by Wendy Cheng

Subject: Re: [PATCH] prune_icache_sb

Andrew Morton wrote:

>We shouldn't export this particular implementation to modules because it
>has bad failure modes. There might be a case for exposing an
>i_sb_list-based API or, perhaps better, a max-unused-inodes mount option.
>
>
>
>
Ok, thanks for looking into this - it is appreciated. I'll try to figure
out something else.

-- Wendy

2006-11-30 16:16:10

by Wendy Cheng

Subject: Re: [PATCH] prune_icache_sb

How about a simple and plain change with this uploaded patch ....

The idea is, instead of unconditionally dropping every buffer associated
with the particular mount point (which defeats the purpose of page
caching), the base kernel exports the "drop_pagecache_sb()" call, which allows
the page cache to be trimmed. More importantly, it is changed to offer the
choice of not randomly purging any buffer but only the ones that seem to be
unused (i_state is NULL and i_count is zero). This will encourage
filesystems to proactively respond to vm memory shortage if they
choose to do so.

From our end (cluster locks are expensive - that's why we cache them),
one of our kernel daemons will invoke this newly exported call based on
a set of pre-defined tunables. This is then followed by lock-reclaim
logic that trims the locks by checking the page cache associated with the
inode (that the cluster lock was created for). If nothing is attached to
the inode (based on the i_mapping->nrpages count), we know it is a good
candidate for trimming and will subsequently drop the lock (instead of
waiting until the end of the vfs inode life cycle).
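
A rough sketch of that reclaim pass might look like the following; the lock structure, list and drop call are made-up names for this sketch, and only the i_mapping->nrpages test comes from the description above.

#include <linux/fs.h>
#include <linux/list.h>

struct example_glock {				/* hypothetical layout */
	struct list_head e_list;
	struct inode *e_inode;			/* VFS inode this lock protects */
	/* ... DLM lock id, state, ... */
};

void example_glock_drop(struct example_glock *gl);	/* hands the lock back to the DLM */

static void trim_idle_locks(struct list_head *lock_list)
{
	struct example_glock *gl, *next;

	list_for_each_entry_safe(gl, next, lock_list, e_list) {
		struct inode *inode = gl->e_inode;

		/*
		 * Nothing is cached under this inode: a good candidate to
		 * drop now rather than waiting for ->clear_inode().
		 */
		if (inode && inode->i_mapping->nrpages == 0)
			example_glock_drop(gl);
	}
}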

Note that I could do invalidate_inode_pages() within our kernel modules
to accomplish what drop_pagecache_sb() does (without coming here to bug
people) but I don't have access to inode_lock as an external kernel
module. So either EXPORT_SYMBOL(inode_lock) or this patch ?

The end result (of this change) should encourage filesystems to actively
avoid depleting too much memory, and we'll encourage our applications to
understand clustering locality issues.

Haven't tested this out though - would appreciate some comments before
spending more effort in this direction.

-- Wendy



Attachments:
inode_prune_sb.patch (1.68 kB)

2006-11-30 19:31:41

by Nate Diller

Subject: Re: [PATCH] prune_icache_sb

On 11/30/06, Wendy Cheng <[email protected]> wrote:
> How about a simple and plain change with this uploaded patch ....
>
> The idea is, instead of unconditionally dropping every buffer associated
> with the particular mount point (that defeats the purpose of page
> caching), base kernel exports the "drop_pagecache_sb()" call that allows
> page cache to be trimmed. More importantly, it is changed to offer the
> choice of not randomly purging any buffer but the ones that seem to be
> unused (i_state is NULL and i_count is zero). This will encourage
> filesystems to proactively respond to vm memory shortage if they
> choose to do so.
>
> From our end (cluster locks are expensive - that's why we cache them),
> one of our kernel daemons will invoke this newly exported call based on
> a set of pre-defined tunables. It is then followed by a lock reclaim
> logic to trim the locks by checking the page cache associated with the
> inode (that this cluster lock is created for). If nothing is attached to
> the inode (based on i_mapping->nrpages count), we know it is a good
> candidate for trimming and will subsequently drop this lock (instead of
> waiting until the end of vfs inode life cycle).

I have a patch that is a more comprehensive version of this idea, but
it is not fully debugged, and has suffered some bitrot in the past
couple months. This turns out to be a good performance improvement in
the general case too, but is more complex than your idea because there
are real locking changes needed to avoid deadlocks. I can send you a
copy of the patch if you are interested.

> Note that I could do invalidate_inode_pages() within our kernel modules
> to accomplish what drop_pagecache_sb() does (without coming here to bug
> people) but I don't have access to inode_lock as an external kernel
> module. So either EXPORT_SYMBOL(inode_lock) or this patch ?

Like I said above, you have to be careful when touching inode_lock,
dcache_lock, and the mapping's tree_lock, because of potential
deadlocks. The mapping's lock can be taken from softirq context, but
the inode and dcache locks cannot.

NATE

2006-12-01 21:23:39

by Andrew Morton

Subject: Re: [PATCH] prune_icache_sb

On Thu, 30 Nov 2006 11:05:32 -0500
Wendy Cheng <[email protected]> wrote:

> How about a simple and plain change with this uploaded patch ....
>
> The idea is, instead of unconditionally dropping every buffer associated
> with the particular mount point (that defeats the purpose of page
> caching), base kernel exports the "drop_pagecache_sb()" call that allows
> page cache to be trimmed. More importantly, it is changed to offer the
> choice of not randomly purging any buffer but the ones that seem to be
> unused (i_state is NULL and i_count is zero). This will encourage
> filesystems to proactively respond to vm memory shortage if they
> choose to do so.

argh.

In Linux a filesystem is a dumb layer which sits between the VFS and the
I/O layer and provides dumb services such as reading/writing inodes,
reading/writing directory entries, mapping pagecache offsets to disk
blocks, etc. (This model is to varying degrees incorrect for every
post-ext2 filesystem, but that's the way it is).

We do not want to go "encouraging" filesystems to play games tuning and
trimming VFS caches and things like that. If a patch doing that were to
turn up it would be heartily shouted at and the originator would be asked
to go off and implement the functionality in core VFS so that a) all
filesystems can immediately utilise it and b) other filesystems aren't
tempted to go off and privately implement something similar.

So please bear this philosophy in mind, and think about this feature from
that perspective.

One approach might be to add a per-superblock upper-bound on the number of
cached dentries and/or inodes. Typically that would be controlled by a
(re)mount option. Although we could perhaps discuss a sysfs representation
of this (and, presumably, other mount options).

But I'd expect such a proposal to have a hard time, because we'd need to
know why such a thing is needed: we prefer auto-tuning, and that's what we
have now, so what's gone wrong with it and how can we fix it, rather than
adding a manual override?

> From our end (cluster locks are expensive - that's why we cache them),
> one of our kernel daemons will invoke this newly exported call based on
> a set of pre-defined tunables. It is then followed by a lock reclaim
> logic to trim the locks by checking the page cache associated with the
> inode (that this cluster lock is created for). If nothing is attached to
> the inode (based on i_mapping->nrpages count), we know it is a good
> candidate for trimming and will subsequently drop this lock (instead of
> waiting until the end of vfs inode life cycle).

Again, I don't understand why you're tying the lifetime of these locks to
the VFS inode reclaim mechanisms. Seems odd.

If you want to put an upper bound on the number of in-core locks, why not
string them on a list and throw away the old ones when the upper bound is
reached?
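
For concreteness, a sketch of that suggestion; the lock and cache structures and drop_lock() below are made-up names for the example, not an existing API.

#include <linux/list.h>

struct cluster_lock {			/* hypothetical */
	struct list_head lru;		/* most recently used at the head */
	/* ... DLM lock id, associated inode, ... */
};

struct lock_cache {			/* hypothetical */
	struct list_head lru;
	unsigned long nr_locks;
	unsigned long max_locks;	/* the tunable upper bound */
};

void drop_lock(struct cluster_lock *lk);	/* hands the lock back to the DLM */

/* Throw away the oldest cached locks once the cap is exceeded. */
static void enforce_lock_cap(struct lock_cache *cache)
{
	while (cache->nr_locks > cache->max_locks) {
		struct cluster_lock *lk;

		lk = list_entry(cache->lru.prev, struct cluster_lock, lru);
		list_del_init(&lk->lru);
		cache->nr_locks--;
		drop_lock(lk);
	}
}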

Did you look at improving that lock-lookup algorithm, btw? Core kernel has
no problem maintaining millions of cached VFS objects - is there any reason
why your lock lookup cannot be similarly efficient?

> Note that I could do invalidate_inode_pages() within our kernel modules
> to accomplish what drop_pagecache_sb() does (without coming here to bug
> people) but I don't have access to inode_lock as an external kernel
> module. So either EXPORT_SYMBOL(inode_lock) or this patch ?
>
> The end result (of this change) should encourage filesystem to actively
> avoid depleting too much memory

That is _not_ a filesystem responsibility! The inode cache is owned and
maintained by the VFS.

> and we'll encourage our applications to
> understand clustering locality issues.

?

> Haven't tested this out though - would appreciate some comments before
> spending more efforts on this direction.
>
> -- Wendy
>
>
>

2006-12-03 17:42:18

by Wendy Cheng

Subject: Re: [PATCH] prune_icache_sb

Andrew Morton wrote:

>On Thu, 30 Nov 2006 11:05:32 -0500
>Wendy Cheng <[email protected]> wrote:
>
>
>
>>
>>The idea is, instead of unconditionally dropping every buffer associated
>>with the particular mount point (that defeats the purpose of page
>>caching), base kernel exports the "drop_pagecache_sb()" call that allows
>>page cache to be trimmed. More importantly, it is changed to offer the
>>choice of not randomly purging any buffer but the ones that seem to be
>>unused (i_state is NULL and i_count is zero). This will encourage
>>filesystems to proactively respond to vm memory shortage if they
>>choose to do so.
>>
>>
>
>argh.
>
>
I read this as "It is ok to give system admin(s) commands (which is what
this "drop_pagecache_sb()" call is all about) to drop the page cache. It is,
however, not ok to give filesystem developer(s) this very same function
to trim their own page cache if the filesystems choose to do so"?

>In Linux a filesystem is a dumb layer which sits between the VFS and the
>I/O layer and provides dumb services such as reading/writing inodes,
>reading/writing directory entries, mapping pagecache offsets to disk
>blocks, etc. (This model is to varying degrees incorrect for every
>post-ext2 filesystem, but that's the way it is).
>
>
The Linux kernel, particularly the VFS layer, is starting to show signs of
inadequacy as the software components built upon it keep growing. I have
doubts that it can keep up and handle this complexity with a development
policy like the one you just described (the filesystem is a dumb layer?).
Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect
example of why trying to do too many things inside the vfs layer for so many
filesystems is a bad idea? By the way, since we're on this subject,
could we discuss a little bit about the vfs rename call (or I can start
another new discussion thread)?

Note that Linux's do_rename() starts with the usual lookup logic, followed
by "lock_rename", then a final round of dentry lookups, and finally comes
to the filesystem's i_op->rename call. Since lock_rename() only takes
vfs-layer locks that are local to this particular machine, for a cluster
filesystem there exists a huge window between the final lookup and the
filesystem's i_op->rename call in which the file could get deleted
from another node before the fs can do anything about it. Is it possible
that we could get a new function pointer (lock_rename) in the
inode_operations structure so a cluster filesystem can do proper locking?
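
For concreteness, roughly the shape of the hook being asked for; nothing below exists in the mainline VFS, the names are purely illustrative of the proposal.

#include <linux/fs.h>

/* Purely illustrative - these hooks do not exist in inode_operations today. */
struct cluster_rename_ops {
	int  (*lock_rename)(struct inode *old_dir, struct inode *new_dir);
	void (*unlock_rename)(struct inode *old_dir, struct inode *new_dir);
};

/*
 * The idea is that do_rename() would call ->lock_rename() right after the
 * VFS-local lock_rename(), so a cluster filesystem could take its
 * cluster-wide locks before the final dentry lookup and ->rename(),
 * closing the delete-from-another-node window described above.
 */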

>>From our end (cluster locks are expensive - that's why we cache them),
>>one of our kernel daemons will invoke this newly exported call based on
>>a set of pre-defined tunables. It is then followed by a lock reclaim
>>logic to trim the locks by checking the page cache associated with the
>>inode (that this cluster lock is created for). If nothing is attached to
>>the inode (based on i_mapping->nrpages count), we know it is a good
>>candidate for trimming and will subsequently drop this lock (instead of
>>waiting until the end of vfs inode life cycle).
>>
>>
>
>Again, I don't understand why you're tying the lifetime of these locks to
>the VFS inode reclaim mechanisms. Seems odd.
>
>
Cluster locks are expensive because:

1. Every node in the cluster has to agree about it upon granting the
request (communication overhead).
2. It involves disk flushing if the lock bounces between nodes. Say one node
requests a read lock after another node's write... before the read lock
can be granted, the write node needs to flush the data to the disk (disk
io overhead).

For optimization purposes, we want to hold off the disk flush after writes
and hope (and encourage) the next requester of the lock to be on
the very same node (to take advantage of the OS write-back logic).
That's why the locks are cached on the very same node. They will not get
removed unless necessary.
What could be better than building the lock caching on top of the existing
inode cache logic - since these are the objects that the cluster locks
are created for in the first place?

>If you want to put an upper bound on the number of in-core locks, why not
>string them on a list and throw away the old ones when the upper bound is
>reached?
>
>
Don't get me wrong. The DLM *has* a tunable to set the max lock count. We
do drop the locks, but to drop the right locks we need a little bit of help
from the VFS layer. The latency requirement is difficult to manage.

>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
>no problem maintaining millions of cached VFS objects - is there any reason
>why your lock lookup cannot be similarly efficient?
>
>
Don't be so confident. I did see some complaints from ext3-based mail
servers in the past - when the storage size was large enough, people had
to explicitly umount the filesystem from time to time to rescue their
performance. I don't recall the details at this moment though.

For us, with this particular customer, it is 15TB of storage.

-- Wendy

2006-12-03 20:48:05

by Andrew Morton

Subject: Re: [PATCH] prune_icache_sb

On Sun, 03 Dec 2006 12:49:42 -0500
Wendy Cheng <[email protected]> wrote:

> Andrew Morton wrote:
>
> >On Thu, 30 Nov 2006 11:05:32 -0500
> >Wendy Cheng <[email protected]> wrote:
> >
> >
> >
> >>
> >>The idea is, instead of unconditionally dropping every buffer associated
> >>with the particular mount point (that defeats the purpose of page
> >>caching), base kernel exports the "drop_pagecache_sb()" call that allows
> >>page cache to be trimmed. More importantly, it is changed to offer the
> >>choice of not randomly purging any buffer but the ones that seem to be
> >>unused (i_state is NULL and i_count is zero). This will encourage
> >>filesystems to proactively respond to vm memory shortage if they
> >>choose to do so.
> >>
> >>
> >
> >argh.
> >
> >
> I read this as "It is ok to give system admin(s) commands (that this
> "drop_pagecache_sb() call" is all about) to drop page cache. It is,
> however, not ok to give filesystem developer(s) this very same function
> to trim their own page cache if the filesystems choose to do so" ?

If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for
administrators - it's a convenience thing for developers, to get repeatable
benchmarks. Attempts to make it a per-numa-node control for admin purposes have
been rejected.

> >In Linux a filesystem is a dumb layer which sits between the VFS and the
> >I/O layer and provides dumb services such as reading/writing inodes,
> >reading/writing directory entries, mapping pagecache offsets to disk
> >blocks, etc. (This model is to varying degrees incorrect for every
> >post-ext2 filesystem, but that's the way it is).
> >
> >
> Linux kernel, particularly the VFS layer, is starting to show signs of
> inadequacy as the software components built upon it keep growing. I have
> doubts that it can keep up and handle this complexity with a development
> policy like you just described (filesystem is a dumb layer ?). Aren't
> these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect
> example why trying to do too many things inside vfs layer for so many
> filesystems is a bad idea ?

That's not a very well-chosen example, but yes, the old ext2-based model has
needed to be extended as new filesystems come along.

> By the way, since we're on this subject,
> could we discuss a little bit about vfs rename call (or I can start
> another new discussion thread) ?
>
> Note that linux do_rename() starts with the usual lookup logic, followed
> by "lock_rename", then a final round of dentry lookup, and finally comes
> to filesystem's i_op->rename call. Since lock_rename() only calls for
> vfs layer locks that are local to this particular machine, for a cluster
> filesystem, there exists a huge window between the final lookup and
> filesystem's i_op->rename calls such that the file could get deleted
> from another node before fs can do anything about it. Is it possible
> that we could get a new function pointer (lock_rename) in
> inode_operations structure so a cluster filesystem can do proper locking ?

That would need a new thread, and probably (at least pseudo-) code, and
cc's to the appropriate maintainers (although that part of the kernel isn't
really maintained any more - it has fallen into the patch-and-run model).

> >>From our end (cluster locks are expensive - that's why we cache them),
> >>one of our kernel daemons will invoke this newly exported call based on
> >>a set of pre-defined tunables. It is then followed by a lock reclaim
> >>logic to trim the locks by checking the page cache associated with the
> >>inode (that this cluster lock is created for). If nothing is attached to
> >>the inode (based on i_mapping->nrpages count), we know it is a good
> >>candidate for trimming and will subsequently drop this lock (instead of
> >>waiting until the end of vfs inode life cycle).
> >>
> >>
> >
> >Again, I don't understand why you're tying the lifetime of these locks to
> >the VFS inode reclaim mechanisms. Seems odd.
> >
> >
> Cluster locks are expensive because:
>
> 1. Every node in the cluster has to agree about it upon granting the
> request (communication overhead).
> 2. It involves disk flushing if bouncing between nodes. Say one node
> requests a read lock after another node's write... before the read lock
> can be granted, the write node needs to flush the data to the disk (disk
> io overhead).
>
> For optimization purpose, we want to refrain the disk flush after writes
> and hope (and encourage) the next person who requests the lock to be on
> the very same node (to take the advantage of OS write-back logic).
> That's why the locks are cached on the very same node. It will not get
> removed unless necessary.
> What would be better to build the lock caching on top of the existing
> inode cache logic - since these are the objects that the cluster locks
> are created for in the first place.

hmm, I suppose that makes sense.

Are there dentries associated with these locks?

> >If you want to put an upper bound on the number of in-core locks, why not
> >string them on a list and throw away the old ones when the upper bound is
> >reached?
> >
> >
> Don't take me wrong. DLM *has* a tunable to set the max lock counts. We
> do drop the locks but to drop the right locks, we need a little bit help
> from VFS layer. Latency requirement is difficult to manage.
>
> >Did you look at improving that lock-lookup algorithm, btw? Core kernel has
> >no problem maintaining millions of cached VFS objects - is there any reason
> >why your lock lookup cannot be similarly efficient?
> >
> >
> Don't be so confident. I did see some complaints from ext3 based mail
> servers in the past - when the storage size was large enough, people had
> to explicitly umount the filesystem from time to time to rescue their
> performance. I don't recall the details at this moment though.

People have had plenty of problems with oversized inode-caches in the past,
but I think they were due to memory consumption, not to lookup inefficiency.

My question _still_ remains unanswered. Third time: is it possible to
speed up this lock-lookup code?

Perhaps others can take a look at it - where is it?

2006-12-04 05:50:25

by Wendy Cheng

Subject: Re: [PATCH] prune_icache_sb

Andrew Morton wrote:

>On Sun, 03 Dec 2006 12:49:42 -0500
>Wendy Cheng <[email protected]> wrote:
>
>
>
>>I read this as "It is ok to give system admin(s) commands (that this
>>"drop_pagecache_sb() call" is all about) to drop page cache. It is,
>>however, not ok to give filesystem developer(s) this very same function
>>to trim their own page cache if the filesystems choose to do so" ?
>>
>>
>
>If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for
>administrators - it's a convenience thing for developers, to get repeatable
>benchmarks. Attempts to make it a per-numa-node control for admin purposes have
>been rejected.
>
>
Just saw that you suggested the "next door" LKML thread ("la la la la ...
swappiness") use "-o sync"? Well, now I do see you're determined ...
anyway, I think I got my stuff working without this kernel patch ...
it's still under testing though.

The rename post will be done first thing tomorrow morning.

>>[snip] .......
>>
>>
>hmm, I suppose that makes sense.
>
>Are there dentries associated with these locks?
>
>
Yes, dentries are part of the logic (during lookup time), but the
book-keeping (reference counting, reclaim, delete, etc.) is all done
through the inode structures.

>
>
>>>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
>>>no problem maintaining millions of cached VFS objects - is there any reason
>>>why your lock lookup cannot be similarly efficient?
>>>
>>>
>>>
Yes, I just found that the new DLM uses the "jhash" call
(include/linux/jhash.h). I'm on an older version of DLM that uses the FNV
hash algorithm (http://www.isthe.com/chongo/tech/comp/fnv/). I will do some
performance test runs to compare the two methods.
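
For reference, the two kinds of hash being compared look roughly like this for bucket selection; the lock-name layout and table size are made up for the example, jhash() is the real helper from include/linux/jhash.h, and the hand-written function is plain FNV-1a.

#include <linux/types.h>
#include <linux/jhash.h>

#define NR_BUCKETS	4096			/* hypothetical table size */

struct lock_name {				/* hypothetical */
	unsigned char name[24];
};

static unsigned int bucket_jhash(const struct lock_name *ln)
{
	return jhash(ln->name, sizeof(ln->name), 0) % NR_BUCKETS;
}

static unsigned int bucket_fnv1a(const struct lock_name *ln)
{
	u32 hash = 2166136261u;			/* FNV offset basis */
	unsigned int i;

	for (i = 0; i < sizeof(ln->name); i++) {
		hash ^= ln->name[i];
		hash *= 16777619u;		/* FNV prime */
	}
	return hash % NR_BUCKETS;
}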

A final note on this subject - I may not agree with you (about various
things), but your comments and amazingly quick responses are very much
appreciated!

-- Wendy

2006-12-04 06:28:28

by Andrew Morton

Subject: Re: [PATCH] prune_icache_sb

On Mon, 04 Dec 2006 00:57:50 -0500
Wendy Cheng <[email protected]> wrote:

> >
> >
> >>>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
> >>>no problem maintaining millions of cached VFS objects - is there any reason
> >>>why your lock lookup cannot be similarly efficient?
> >>>
> >>>
> >>>
> Yes, just found the new DLM uses "jhash" call (include/linux/jhash.h).
> I'm on an older version of DLM that uses FNV hash algorithm
> (http://www.isthe.com/chongo/tech/comp/fnv/). Will do some performance
> test runs to compare these two methods.

I'd be surprised if the choice of hash algorithm itself makes much difference.
But we can't say much about it unless we can see the code (ie: your code).

Is it a simple old hash-to-find-the-bucket-then-walk-a-list implementation?
If so, what does the bucket count distribution look like? What is the average
walk length? Does it use a single lock, or hashed locking, or a lock-per-bucket?

etc.

2006-12-04 16:41:38

by Eric Dumazet

Subject: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

--- linux-2.6.19/mm/slab.c 2006-12-04 11:50:19.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c 2006-12-04 17:25:02.000000000 +0100
@@ -371,6 +371,19 @@ static void kmem_list3_init(struct kmem_
} while (0)

/*
+ * Define the reciprocal value of B so that
+ * ((u32)A / (u32)B) can be replaced by :
+ * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
+ * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply
+ */
+static inline u32 reciprocal_value(unsigned int k)
+{
+ u64 val = (1LL << 32) + (k - 1);
+ do_div(val, k);
+ return (u32)val;
+}
+
+/*
* struct kmem_cache
*
* manages a cache.
@@ -385,6 +398,7 @@ struct kmem_cache {
unsigned int shared;

unsigned int buffer_size;
+ unsigned int reciprocal_buffer_size;
/* 3) touched by every alloc & free from the backend */
struct kmem_list3 *nodelists[MAX_NUMNODES];

@@ -626,10 +640,17 @@ static inline void *index_to_obj(struct
return slab->s_mem + cache->buffer_size * idx;
}

+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * ((u64)offset * cache->reciprocal_buffer_size) >> 32
+ */
static inline unsigned int obj_to_index(struct kmem_cache *cache,
struct slab *slab, void *obj)
{
- return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+ unsigned int offset = (obj - slab->s_mem);
+ return (u32)(((u64)offset * cache->reciprocal_buffer_size) >> 32);
}

/*
@@ -1400,6 +1421,8 @@ void __init kmem_cache_init(void)

cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
cache_line_size());
+ cache_cache.reciprocal_buffer_size =
+ reciprocal_value(cache_cache.buffer_size);

for (order = 0; order < MAX_ORDER; order++) {
cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2320,7 @@ kmem_cache_create (const char *name, siz
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
cachep->buffer_size = size;
+ cachep->reciprocal_buffer_size = reciprocal_value(size);

if (flags & CFLGS_OFF_SLAB) {
cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);


Attachments:
slab_avoid_divides.patch (2.11 kB)

2006-12-04 16:52:07

by Russell Cattelan

Subject: Re: [PATCH] prune_icache_sb

Wendy Cheng wrote:

> Andrew Morton wrote:
>
>> On Thu, 30 Nov 2006 11:05:32 -0500
>> Wendy Cheng <[email protected]> wrote:
>>
>>
>>
>>>
>>> The idea is, instead of unconditionally dropping every buffer
>>> associated with the particular mount point (that defeats the purpose
>>> of page caching), base kernel exports the "drop_pagecache_sb()" call
>>> that allows page cache to be trimmed. More importantly, it is
>>> changed to offer the choice of not randomly purging any buffer but
>>> the ones that seem to be unused (i_state is NULL and i_count is
>>> zero). This will encourage filesystems to proactively respond to
>>> vm memory shortage if they choose to do so.
>>>
>>
>>
>> argh.
>>
>>
> I read this as "It is ok to give system admin(s) commands (that this
> "drop_pagecache_sb() call" is all about) to drop page cache. It is,
> however, not ok to give filesystem developer(s) this very same
> function to trim their own page cache if the filesystems choose to do
> so" ?
>
>> In Linux a filesystem is a dumb layer which sits between the VFS and the
>> I/O layer and provides dumb services such as reading/writing inodes,
>> reading/writing directory entries, mapping pagecache offsets to disk
>> blocks, etc. (This model is to varying degrees incorrect for every
>> post-ext2 filesystem, but that's the way it is).
>>
>>
> Linux kernel, particularly the VFS layer, is starting to show signs of
> inadequacy as the software components built upon it keep growing. I
> have doubts that it can keep up and handle this complexity with a
> development policy like you just described (filesystem is a dumb layer
> ?). Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a
> perfect example why trying to do too many things inside vfs layer for
> so many filesystems is a bad idea ? By the way, since we're on this
> subject, could we discuss a little bit about vfs rename call (or I can
> start another new discussion thread) ?
>
> Note that linux do_rename() starts with the usual lookup logic,
> followed by "lock_rename", then a final round of dentry lookup, and
> finally comes to filesystem's i_op->rename call. Since lock_rename()
> only calls for vfs layer locks that are local to this particular
> machine, for a cluster filesystem, there exists a huge window between
> the final lookup and filesystem's i_op->rename calls such that the
> file could get deleted from another node before fs can do anything
> about it. Is it possible that we could get a new function pointer
> (lock_rename) in inode_operations structure so a cluster filesystem
> can do proper locking ?

It looks like the ocfs2 guys have a similar problem?

http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt

Does this change help fix the gfs lock-ordering problem as well?


-Russell Cattelan
[email protected]

2006-12-04 16:56:00

by Christoph Lameter

Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

On Mon, 4 Dec 2006, Eric Dumazet wrote:

> Doing some math, we can use a reciprocal multiplication instead of a divide.

Could you generalize the reciprocal thingy so that the division
can be used from other parts of the kernel as well? It would be useful to
separately get some cycle counts on a regular division compared with your
division. If that shows benefit then we may think about using it in the
kernel. I am a bit surprised that this is still an issue on modern cpus.
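
For what it's worth, a throwaway userspace harness along the lines below is one way to get such cycle/time comparisons; the divisor, loop counts and timing method are arbitrary choices, and note that this rounding of the reciprocal is only exact while the dividend stays well below 2^32/B, which does hold for slab offsets.

/* gcc -O2 recip_bench.c -o recip_bench   (userspace, throwaway) */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static uint32_t reciprocal_value(uint32_t b)
{
	return (uint32_t)(((1ULL << 32) + b - 1) / b);
}

static uint32_t reciprocal_divide(uint32_t a, uint32_t r)
{
	return (uint32_t)(((uint64_t)a * r) >> 32);
}

static double now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
	/* Runtime divisor so the compiler cannot strength-reduce the divide. */
	uint32_t b = (argc > 1) ? (uint32_t)atoi(argv[1]) : 192;
	uint32_t r, i;
	volatile uint32_t sink = 0;
	double t;

	if (b == 0)
		return 1;
	r = reciprocal_value(b);

	/* Sanity check over slab-sized dividends only: the trick is exact
	 * while the dividend stays well below 2^32 / b. */
	for (i = 0; i < (1u << 20); i++) {
		if (i / b != reciprocal_divide(i, r)) {
			printf("mismatch at %u\n", i);
			return 1;
		}
	}

	t = now();
	for (i = 0; i < 1000000000u; i++)
		sink += i / b;
	printf("divide:     %.2f s\n", now() - t);

	t = now();
	for (i = 0; i < 1000000000u; i++)
		sink += reciprocal_divide(i, r);
	printf("reciprocal: %.2f s\n", now() - t);

	(void)sink;
	return 0;
}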

2006-12-04 18:18:15

by Eric Dumazet

Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

--- linux-2.6.19/include/linux/reciprocal_div.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/include/linux/reciprocal_div.h 2006-12-04 19:01:44.000000000 +0100
@@ -0,0 +1,30 @@
+#ifndef _LINUX_RECIPROCAL_DIV_H
+#define _LINUX_RECIPROCAL_DIV_H
+
+/*
+ * Define the reciprocal value of B so that
+ * ((u32)A / (u32)B) can be replaced by :
+ * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
+ * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply
+ */
+#define RECIPROCAL_VALUE(B) (u32)(((1LL << 32) + ((B) - 1))/ (B))
+
+static inline u32 reciprocal_value(unsigned int k)
+{
+ if (__builtin_constant_p(k))
+ return RECIPROCAL_VALUE(k);
+ else {
+ u64 val = (1LL << 32) + (k - 1);
+ do_div(val, k);
+ return (u32)val;
+ }
+}
+
+/*
+ * We want to avoid an expensive divide : (A / B)
+ * If B is known in advance, its reciprocal R can be precalculated/stored.
+ * then (A / B) = (u32)(((u64)(A) * (R)) >> 32)
+ */
+#define RECIPROCAL_DIVIDE(A, R) (u32)(((u64)(A) * (R)) >> 32)
+
+#endif
--- linux-2.6.19/mm/slab.c 2006-12-04 11:50:19.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c 2006-12-04 19:03:42.000000000 +0100
@@ -107,6 +107,7 @@
#include <linux/mempolicy.h>
#include <linux/mutex.h>
#include <linux/rtmutex.h>
+#include <linux/reciprocal_div.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -385,6 +386,7 @@ struct kmem_cache {
unsigned int shared;

unsigned int buffer_size;
+ unsigned int reciprocal_buffer_size;
/* 3) touched by every alloc & free from the backend */
struct kmem_list3 *nodelists[MAX_NUMNODES];

@@ -626,10 +628,17 @@ static inline void *index_to_obj(struct
return slab->s_mem + cache->buffer_size * idx;
}

-static inline unsigned int obj_to_index(struct kmem_cache *cache,
- struct slab *slab, void *obj)
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size)
+ */
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+ const struct slab *slab, void *obj)
{
- return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+ unsigned int offset = (obj - slab->s_mem);
+ return RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size);
}

/*
@@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void)

cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
cache_line_size());
+ cache_cache.reciprocal_buffer_size =
+ reciprocal_value(cache_cache.buffer_size);

for (order = 0; order < MAX_ORDER; order++) {
cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
cachep->buffer_size = size;
+ cachep->reciprocal_buffer_size = reciprocal_value(size);

if (flags & CFLGS_OFF_SLAB) {
cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);


Attachments:
slab_avoid_divides.patch (2.94 kB)

2006-12-04 19:50:20

by Andrew Morton

Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

On Mon, 4 Dec 2006 19:18:29 +0100
Eric Dumazet <[email protected]> wrote:

> On Monday 04 December 2006 17:55, Christoph Lameter wrote:
> > Could you generalize the reciprocal thingy so that the division
> > can be used from other parts of the kernel as well? It would be useful to
> > separately get some cycle counts on a regular division compared with your
> > division. If that shows benefit then we may think about using it in the
> > kernel. I am a bit surprised that this is still an issue on modern cpus.
>
> OK I added a new include file, I am not sure it is the best way...
>
> Well, AFAIK this particular divide is the only one that hurts performance on
> my machines.
>
> Do you have in mind another spot in kernel where we could use this reciprocal
> divide as well ?
>
> Yes divide complexity is still an issue with modern CPUS :
> elapsed time for 10^9 loops on Pentium M 1.6 Ghz
> 24 s for the version using divides
> 3.8 s for the version using multiplies
>
> [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
>
> When some objects are allocated by one CPU but freed by another CPU, we can
> consume a lot of cycles doing divides in obj_to_index().
>
> (Typical load on a dual processor machine where network interrupts are handled
> by one particular CPU (allocating skbufs), and the other CPU is running the
> application (consuming and freeing skbufs))
>
> Here on one production server (dual-core AMD Opteron 285), I noticed this
> divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opteron are
> quite modern cpus and the divide is much more expensive on oldest
> architectures :

Yes, I've seen that divide hurting too.

I suspect it was with unusual everything-in-cache workloads, but whatever.

> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.
>
> Doing some math, we can use a reciprocal multiplication instead of a divide.
>
> If we want to compute V = (A / B)  (A and B being u32 quantities)
> we can instead use :
>
> V = ((u64)A * RECIPROCAL(B)) >> 32 ;
>
> where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B
>
> Note :
>
> I wrote pure C code for clarity. gcc output for i386 is not optimal but
> acceptable :
>
> 	mull	0x14(%ebx)
> 	mov	%edx,%eax	// part of the >> 32
> 	xor	%edx,%edx	// useless
> 	mov	%eax,(%esp)	// could be avoided
> 	mov	%edx,0x4(%esp)	// useless
> 	mov	(%esp),%ebx
>
> Signed-off-by: Eric Dumazet <[email protected]>
>

> --- linux-2.6.19/include/linux/reciprocal_div.h 1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.19-ed/include/linux/reciprocal_div.h 2006-12-04 19:01:44.000000000 +0100
> @@ -0,0 +1,30 @@
> +#ifndef _LINUX_RECIPROCAL_DIV_H
> +#define _LINUX_RECIPROCAL_DIV_H
> +
> +/*
> + * Define the reciprocal value of B so that
> + * ((u32)A / (u32)B) can be replaced by :
> + * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
> + * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply

"to a multiply".


> + */
> +#define RECIPROCAL_VALUE(B) (u32)(((1LL << 32) + ((B) - 1))/ (B))

Does this have to be a macro?

I worry that people might try to throw random types at this code and it'll
fail. Are you prepared to support s8, u8, s16, u16, s32, u32, s64 and u64?
I think not, so perhaps we should be documenting what we _do_ accept, and
adding typecheck() calls in there somewhere to enforce that. (In which case
it would need to be a macro).


> +static inline u32 reciprocal_value(unsigned int k)
> +{
> + if (__builtin_constant_p(k))
> + return RECIPROCAL_VALUE(k);
> + else {
> + u64 val = (1LL << 32) + (k - 1);
> + do_div(val, k);
> + return (u32)val;
> + }
> +}

We should clearly document that this function is for once-off setup
operations - we'd hate for people to call this with any frequency.

It should be uninlined if poss, too.

> +/*
> + * We want to avoid an expensive divide : (A / B)
> + * If B is known in advance, its reciprocal R can be precalculated/stored.
> + * then (A / B) = (u32)(((u64)(A) * (R)) >> 32)
> + */
> +#define RECIPROCAL_DIVIDE(A, R) (u32)(((u64)(A) * (R)) >> 32)

And again, depending upon our decision regarding what types this code will
support, this perhaps should become an inlined C function.

> +#endif
> --- linux-2.6.19/mm/slab.c 2006-12-04 11:50:19.000000000 +0100
> +++ linux-2.6.19-ed/mm/slab.c 2006-12-04 19:03:42.000000000 +0100
> @@ -107,6 +107,7 @@
> #include <linux/mempolicy.h>
> #include <linux/mutex.h>
> #include <linux/rtmutex.h>
> +#include <linux/reciprocal_div.h>
>
> #include <asm/uaccess.h>
> #include <asm/cacheflush.h>
> @@ -385,6 +386,7 @@ struct kmem_cache {
> unsigned int shared;
>
> unsigned int buffer_size;
> + unsigned int reciprocal_buffer_size;

If we decide to support only u32 for this operation, this should become
u32, for clarity.

> /* 3) touched by every alloc & free from the backend */
> struct kmem_list3 *nodelists[MAX_NUMNODES];
>
> @@ -626,10 +628,17 @@ static inline void *index_to_obj(struct
> return slab->s_mem + cache->buffer_size * idx;
> }
>
> -static inline unsigned int obj_to_index(struct kmem_cache *cache,
> - struct slab *slab, void *obj)
> +/*
> + * We want to avoid an expensive divide : (offset / cache->buffer_size)
> + * Using the fact that buffer_size is a constant for a particular cache,
> + * we can replace (offset / cache->buffer_size) by
> + * RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size)
> + */
> +static inline unsigned int obj_to_index(const struct kmem_cache *cache,
> + const struct slab *slab, void *obj)
> {
> - return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
> + unsigned int offset = (obj - slab->s_mem);
> + return RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size);
> }
>
> /*
> @@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void)
>
> cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
> cache_line_size());
> + cache_cache.reciprocal_buffer_size =
> + reciprocal_value(cache_cache.buffer_size);
>
> for (order = 0; order < MAX_ORDER; order++) {
> cache_estimate(order, cache_cache.buffer_size,
> @@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz
> if (flags & SLAB_CACHE_DMA)
> cachep->gfpflags |= GFP_DMA;
> cachep->buffer_size = size;
> + cachep->reciprocal_buffer_size = reciprocal_value(size);
>
> if (flags & CFLGS_OFF_SLAB) {
> cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
>
>
>

2006-12-04 19:56:09

by Christoph Lameter

Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

This is similar stuff to asm-generic/div64.h, right? The divide overhead
depends on the platform? Maybe it would be better to place it in
asm-generic/div.h and then have platform-specific functions?

2006-12-04 20:57:38

by Wendy Cheng

Subject: Re: [PATCH] prune_icache_sb

Russell Cattelan wrote:
> Wendy Cheng wrote:
>
>> The Linux kernel, particularly the VFS layer, is starting to show signs
>> of inadequacy as the software components built upon it keep growing.
>> I have doubts that it can keep up and handle this complexity with a
>> development policy like the one you just described (the filesystem is a
>> dumb layer?). Aren't these DIO_xxx_LOCKING flags inside
>> __blockdev_direct_IO() a perfect example of why trying to do too many
>> things inside the VFS layer for so many filesystems is a bad idea? By
>> the way, since we're on this subject, could we discuss a little bit
>> about the VFS rename call (or I can start another new discussion thread)?
>>
>> Note that Linux's do_rename() starts with the usual lookup logic,
>> followed by lock_rename(), then a final round of dentry lookups, and
>> finally reaches the filesystem's i_op->rename call. Since lock_rename()
>> only takes VFS-layer locks that are local to this particular machine,
>> for a cluster filesystem there is a huge window between the final
>> lookup and the filesystem's i_op->rename call in which the file could
>> get deleted by another node before the filesystem can do anything
>> about it. Is it possible that we could get a new function pointer
>> (lock_rename) in the inode_operations structure so a cluster filesystem
>> can do proper locking?
>
> It looks like the ocfs2 guys have a similar problem?
>
> http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt
>
>
>

Thanks for the pointer. Same as ocfs2, under the current VFS code both
GFS1 and GFS2 also need the FS_ODD_RENAME flag for the rename problem - I
have an ugly ~200-line draft patch ready for GFS1 (and am looking into
GFS2 at the moment). The issue is that, for GFS, if VFS lock_rename()
could call into us, this complication could be greatly reduced. I will
start another thread to see whether the wish can be granted.
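
Roughly the shape of the hook being asked for (purely illustrative - no
such method exists in inode_operations today):

#include <linux/fs.h>

/*
 * Illustrative sketch only.  The wish is for the VFS to call a method
 * of this shape from lock_rename(), while the VFS rename locks are
 * held and before the final lookups, so a cluster filesystem can take
 * cluster-wide locks on both parent directories and close the window
 * in which another node can remove the source file.
 */
struct cluster_rename_ops {
	int (*lock_rename)(struct inode *old_dir, struct inode *new_dir);
};

static int vfs_fs_lock_rename(const struct cluster_rename_ops *ops,
			      struct inode *old_dir, struct inode *new_dir)
{
	if (!ops || !ops->lock_rename)
		return 0;	/* local filesystems: nothing extra to do */
	return ops->lock_rename(old_dir, new_dir);
}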

-- Wendy

2006-12-04 21:45:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

--- linux-2.6.19/include/linux/reciprocal_div.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/include/linux/reciprocal_div.h 2006-12-04 23:12:34.000000000 +0100
@@ -0,0 +1,30 @@
+#ifndef _LINUX_RECIPROCAL_DIV_H
+#define _LINUX_RECIPROCAL_DIV_H
+
+/*
+ * This file describes reciprocal division.
+ *
+ * This optimizes the (A/B) problem, when A and B are two u32
+ * and B is a known value (but not known at compile time)
+ *
+ * The math principle used is :
+ * Let R = reciprocal_value(B) = (((1LL << 32) + (B - 1)) / B)
+ * Then A / B = (u32)(((u64)A * R) >> 32)
+ *
+ * This replaces a divide by a multiply (and a shift), and
+ * is generally less expensive in CPU cycles.
+ */
+
+/*
+ * Computes the reciprocal value (R) for the value B of the divisor.
+ * It should be called once at setup time and the result cached;
+ * recomputing R before each reciprocal_divide() would be slower
+ * than a plain divide.
+ */
+extern u32 reciprocal_value(u32 B);
+
+
+static inline u32 reciprocal_divide(u32 A, u32 R)
+{
+ return (u32)(((u64)A * R) >> 32);
+}
+#endif
--- linux-2.6.19/lib/reciprocal_div.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/lib/reciprocal_div.c 2006-12-04 23:18:54.000000000 +0100
@@ -0,0 +1,10 @@
+#include <linux/types.h>
+#include <asm/div64.h>
+#include <linux/reciprocal_div.h>
+
+u32 reciprocal_value(u32 k)
+{
+ u64 val = (1LL << 32) + (k - 1);
+ do_div(val, k);
+ return (u32)val;
+}
--- linux-2.6.19/lib/Makefile 2006-12-04 23:12:30.000000000 +0100
+++ linux-2.6.19-ed/lib/Makefile 2006-12-04 23:15:14.000000000 +0100
@@ -5,7 +5,7 @@
lib-y := ctype.o string.o vsprintf.o cmdline.o \
bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
idr.o div64.o int_sqrt.o bitmap.o extable.o prio_tree.o \
- sha1.o irq_regs.o
+ sha1.o irq_regs.o reciprocal_div.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
--- linux-2.6.19/mm/slab.c 2006-12-04 22:51:42.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c 2006-12-04 23:13:42.000000000 +0100
@@ -107,6 +107,7 @@
#include <linux/mempolicy.h>
#include <linux/mutex.h>
#include <linux/rtmutex.h>
+#include <linux/reciprocal_div.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -385,6 +386,7 @@ struct kmem_cache {
unsigned int shared;

unsigned int buffer_size;
+ u32 reciprocal_buffer_size;
/* 3) touched by every alloc & free from the backend */
struct kmem_list3 *nodelists[MAX_NUMNODES];

@@ -626,10 +628,17 @@ static inline void *index_to_obj(struct
return slab->s_mem + cache->buffer_size * idx;
}

-static inline unsigned int obj_to_index(struct kmem_cache *cache,
- struct slab *slab, void *obj)
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * reciprocal_divide(offset, cache->reciprocal_buffer_size)
+ */
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+ const struct slab *slab, void *obj)
{
- return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+ unsigned int offset = (obj - slab->s_mem);
+ return reciprocal_divide(offset, cache->reciprocal_buffer_size);
}

/*
@@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void)

cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
cache_line_size());
+ cache_cache.reciprocal_buffer_size =
+ reciprocal_value(cache_cache.buffer_size);

for (order = 0; order < MAX_ORDER; order++) {
cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
cachep->buffer_size = size;
+ cachep->reciprocal_buffer_size = reciprocal_value(size);

if (flags & CFLGS_OFF_SLAB) {
cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
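
The identity, as it is used in obj_to_index() (the offset handed in is
always an exact multiple of the object size), is easy to sanity-check in
user space. A quick sketch, with do_div() replaced by a plain 64-bit
divide and the object sizes picked arbitrarily:

#include <stdint.h>
#include <stdio.h>

/* user-space copies of the helpers from the patch; do_div() is
 * replaced by a plain 64-bit divide here */
static uint32_t reciprocal_value(uint32_t B)
{
	uint64_t val = (1ULL << 32) + (B - 1);

	return (uint32_t)(val / B);
}

static inline uint32_t reciprocal_divide(uint32_t A, uint32_t R)
{
	return (uint32_t)(((uint64_t)A * R) >> 32);
}

int main(void)
{
	/* mimic the slab use: the offset is always idx * buffer_size,
	 * so the reciprocal divide must give back idx exactly */
	static const uint32_t sizes[] = { 32, 64, 96, 128, 192, 256, 1000, 4096 };
	unsigned int s, idx;

	for (s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
		uint32_t R = reciprocal_value(sizes[s]);

		for (idx = 0; idx < 100000; idx++) {
			uint32_t offset = idx * sizes[s];

			if (reciprocal_divide(offset, R) != idx) {
				printf("mismatch: idx=%u size=%u\n",
				       idx, (unsigned int)sizes[s]);
				return 1;
			}
		}
	}
	printf("ok\n");
	return 0;
}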


Attachments:
reciprocal_division.patch (3.74 kB)

2006-12-04 21:56:24

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

From: Eric Dumazet <[email protected]>
Date: Mon, 04 Dec 2006 22:34:29 +0100

> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.

For UltraSPARC I and II (which is what this 200MHz machine probably is),
it's 4 cycle latency for a multiply (32-bit or 64-bit) and 68 cycles
for a 64-bit divide (32-bit divide is 37 cycles).

UltraSPARC-III and IV are worse, 6 cycles for multiply and 40/71
cycles (32/64-bit) for integer divides.

Niagara is even worse :-) 11 cycle integer multiply and a 72 cycle
integer divide (regardless of 32-bit or 64-bit).

(more details in gcc/config/sparc/sparc.c:{ultrasparc,ultrasparc3,niagara}_cost).

So this change has tons of merit for sparc64 chips at least :-)

Also, the multiply can parallelize with other operations but it
seems that integer divide stalls the pipe for most of the duration
of the calculation. So this makes the divide even worse.

2006-12-04 22:45:14

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Mon, 04 Dec 2006 22:34:29 +0100
>
>> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
>> for a multiply.
>
> For UltraSPARC I and II (which is what this 200mhz guy probably is),
> it's 4 cycle latency for a multiply (32-bit or 64-bit) and 68 cycles
> for a 64-bit divide (32-bit divide is 37 cycles).

I must have an Ultra-2 (running Solaris :( )

for (ui = 0 ; ui < 100000000 ; ui++)
val += reciprocal_divide(ui, reciproc);

100000cb0: 83 31 20 00 srl %g4, 0, %g1
100000cb4: 82 48 40 12 mulx %g1, %l2, %g1
100000cb8: 83 30 70 20 srlx %g1, 0x20, %g1
100000cbc: 88 01 20 01 inc %g4
100000cc0: 80 a1 00 05 cmp %g4, %g5
100000cc4: 08 4f ff fb bleu %icc, 100000cb0
100000cc8: b0 06 00 01 add %i0, %g1, %i0

I confirm that this block takes 20 cycles/iteration,
while the next one takes 72 cycles/iteration

for (ui = 0 ; ui < 100000000 ; ui++)
val += ui / value;

100000ca8: 83 31 20 00 srl %g4, 0, %g1
100000cac: 82 68 40 11 udivx %g1, %l1, %g1
100000cb0: 88 01 20 01 inc %g4
100000cb4: 80 a1 00 05 cmp %g4, %g5
100000cb8: 08 4f ff fc bleu %icc, 100000ca8
100000cbc: b0 06 00 01 add %i0, %g1, %i0
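
The comparison is easy to reproduce with a small user-space harness
along these lines (a sketch: the divisor value and the clock()-based
timing are arbitrary choices, not the exact test used above):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint32_t reciprocal_divide(uint32_t A, uint32_t R)
{
	return (uint32_t)(((uint64_t)A * R) >> 32);
}

int main(void)
{
	/* volatile so the compiler cannot strength-reduce the divide
	 * itself: the divisor must look like a run-time value */
	volatile uint32_t value = 85;
	uint32_t reciproc = (uint32_t)(((1ULL << 32) + value - 1) / value);
	uint64_t val = 0;	/* keeps the loops from being optimized away */
	uint32_t ui;
	clock_t t0, t1, t2;

	t0 = clock();
	for (ui = 0; ui < 100000000; ui++)
		val += reciprocal_divide(ui, reciproc);
	t1 = clock();
	for (ui = 0; ui < 100000000; ui++)
		val += ui / value;
	t2 = clock();

	printf("multiply: %.2fs  divide: %.2fs  (val=%llu)\n",
	       (double)(t1 - t0) / CLOCKS_PER_SEC,
	       (double)(t2 - t1) / CLOCKS_PER_SEC,
	       (unsigned long long)val);
	return 0;
}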


>
> UltraSPARC-III and IV are worse, 6 cycles for multiply and 40/71
> cycles (32/64-bit) for integer divides.
>
> Niagara is even worse :-) 11 cycle integer multiply and a 72 cycle
> integer divide (regardless of 32-bit or 64-bit).
>
> (more details in gcc/config/sparc/sparc.c:{ultrasparc,ultrasparc3,niagara}_cost).
>
> So this change has tons of merit for sparc64 chips at least :-)
>
> Also, the multiply can parallelize with other operations but it
> seems that integer divide stalls the pipe for most of the duration
> of the calculation. So this makes the divide even worse.


2006-12-05 15:12:49

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

Hi!

> When some objects are allocated by one CPU but freed by another CPU we can
> consume a lot of cycles doing divides in obj_to_index().
>
> (Typical load on a dual processor machine where network interrupts are handled
> by one particular CPU (allocating skbufs), and the other CPU is running the
> application (consuming and freeing skbufs))
>
> Here on one production server (a dual-core AMD Opteron 285), I noticed this
> divide took 1.20% of CPU_CLK_UNHALTED events in the kernel. But Opterons are
> quite modern CPUs and the divide is much more expensive on older
> architectures:
>
> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.
>
> Doing some math, we can use a reciprocal multiplication instead of a divide.
>
> If we want to compute V = (A / B) (A and B being u32 quantities)
> we can instead use :
>
> V = ((u64)A * RECIPROCAL(B)) >> 32 ;
>
> where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Well, I guess it should be gcc doing this optimization, not us doing it
by hand. And I believe gcc *is* smart enough to do it in some cases...
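
For a divisor that is a compile-time constant it certainly is; the case
in the patch is a divisor that is only known at run time
(cache->buffer_size). A quick sketch of the two situations:

/* gcc already turns this into shifts and a multiply on its own,
 * because the divisor is a compile-time constant */
unsigned int div_by_constant(unsigned int offset)
{
	return offset / 192;
}

/* here it has to emit a real divide instruction: the divisor is only
 * known at run time, which is exactly the obj_to_index() situation */
unsigned int div_by_runtime_value(unsigned int offset,
				  unsigned int buffer_size)
{
	return offset / buffer_size;
}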

pavel

--
Thanks for all the (sleeping) penguins.