LinuxLists.cc - Ocfs2 performance bugs of doom

[permalink] [raw]

Subject: Re: Ocfs2 performance bugs of doom

Hi Daniel,

On Fri, Mar 03, 2006 at 02:27:52PM -0800, Daniel Phillips wrote:
> Hi ocfs2 guys,
>
> Actually, ocfs2 is not doomed - far from it! I just wanted to grab a few
> more of those eyeballs that normally glaze over each time they see a
> cluster mail run by on lkml[1]. Today's news: you are still not finished
> with global lock systime badness, and I already have more performance bugs
> for you.
Thanks for continuing to profile this stuff. We've definitely got lots of
profiling to do, so any help is welcome.

> So the ocfs2 guys switched from double word to single word hash table heads
> and quadrupled the size of the hash table. Problem solved, right? No!
Heh, well if you remember from irc I told you that there was more to do -
specifically that I was looking at increasing the memory dedicated to the
hash.

> I included two patches below. The first vmallocs a megabyte hash table
> instead of a single page. This definitively squishes away the global
> lock hash table overhead so that systime stays about twice as high as
> ext3 instead of starting at three times and increasing to four times
> with the (new improved) 4K hash table.
I completely agree that we need to increase the memory dedicated to the
hash, but I fear that vmallocing a 1 megabyte table per domain (effectively
per mount) is going overboard. I will assume that this was a straw man patch
:)

> Of course the Right Think To Do is kmalloc a vector of pages and mask the
> hash into it, but as a lazy person I would rather sink the time into
> writing this deathless prose.
Right. That's one possible approach - I was originally thinking of
allocating a large global lockres hash table allocated at module init time.
This way we don't have to make large allocs for each domain.

Before anything else though, I'd really like to get an idea of how large we
want things. This might very well dictate the severity of the solution.

> The __dlm_lookup_lockres rewrite shows a measurable improvement in
> hash chain traversal overhead. Every little bit helps.
Yes, this patch seems much easier to swallow ;) Thanks for doing that work.
I have some comments about it below.

> I failed to quantify the improvement precisely because there are other
> glitchy things going on that interfere with accurate measurement. So now
> we get to...
Definitely if you get a chance to show how much just the lookup optimization
helps, I'd like to know. I'll also try to gather some numbers.

> Things to notice: with the still-teensy hash table, systime for global
> lock lookups is still way too high, eventually stabilizing when lock
> purging kicks in at about 4 times ext3's systime. Not good (but indeed,
> not as excruciatingly painful as two days ago).
>
> With the attached patches we kill the hash table overhead dead, dead,
> dead. But look at real time for the untar and sync!
Indeed. The real time numbers are certainly confusing. I actually saw real
time decrease on most of my tests (usually I hack things to increase the
hash allocation to something like a few pages). I want to do some more
consecutive untars though.

Btw, for what it's worth, here's some single untar numbers I have lying
around. Sync numbers are basicaly the same.

Before the hlist patch:
real 0m21.279s
user 0m3.940s
sys 0m18.884s

With the hlist patch:
real 0m15.717s
user 0m3.924s
sys 0m13.324s

> Ocfs2 starts little or no writeout during the untar itself. Ext3 does.
> Solution: check ext3 to see what secret incantations Andrew conjured up
> to fiddle it into doing something reasonable.
Good idea. As you probably already know, we've never been beyond checking to
see how ext3 handles a given problem :)

> Ocfs2 sometimes sits and gazes at its navel for minutes at a time, doing
> nothing at all. Timer bug? A quick glance at SysRq-t shows ocfs2 waiting
> in io_schedule. Waiting for io that never happens? This needs more
> investigation.
Did you notice this during the untar? If not, do you have any reproducible
test case?

> Delete performance remains horrible, even with a 256 meg journal[4] which
> is unconscionably big anyway. Compare to ext3, which deletes kernel trees
> at a steady 2 seconds per, with a much smaller journal. Ocfs2 takes more
> like a minute.
This doesn't surprise me - our unlink performance leaves much to be desired
at the moment. How many nodes did you have mounted when you ran that test?
Off the top of my head,the two things which I would guess are hurting delete
the most right now are node messaging and lack of directory read ahead. The
first will be solved by more intelligent use of the DLM so that the network
will only be hit for those nodes that actually care about a given inode --
unlink rename and mount/unmount are the only things left that still use the
slower 'vote' mechanism. Directory readahead is much easier to solve, it's
just that nobody has gotten around to fixing that yet :/ I bet there's more
to figure out with respect to unlink performance.

> I would chase down these glitches too, except that I really need to get
> back to my cluster block devices, without which we cannot fully bake
> our cluster cake. After all, there are six of you and only one of me, I
> had better go work on the thing nobody else is working on.
Hey, we appreciate the help so far - thanks.

> [2] It is high time we pried loose the ocfs2 design process from secret
> irc channels and dark tunnels running deep beneath Oracle headquarters,
> and started sharing the squishy goodness of filesystem clustering
> development with some more of the smart people lurking here.
We're not trying to hide anything from anyone :) I'm always happy to talk
about design. We've been in bugfix (and more recently, performance fix) mode
for a while now, so there hasn't been much new design yet.

> - bucket = &(dlm->lockres_hash[hash % DLM_HASH_BUCKETS]);
> -
> - /* check for pre-existing lock */
> - hlist_for_each(iter, bucket) {
> - tmpres = hlist_entry(iter, struct dlm_lock_resource,
> hash_node);
> - if (tmpres->lockname.len == len &&
> - memcmp(tmpres->lockname.name, name, len) == 0) {
> - dlm_lockres_get(tmpres);
> - break;
> - }
> -
> - tmpres = NULL;
> + if (likely(res->lockname.name[0] != name[0]))
Is the likely() here necessary? It actually surprises me that the check even
helps so much - if you check the way OCFS2 lock names are built, the first
few bytes are very regular - the first char is going to almost always be one
of M, D, or W. Hmm, I guess if you're hitting it 1/3 fewer times...

> + continue;
> + if (likely(res->lockname.len != len))
Also here, lengths of OCFS2 locks are also actually fairly regular, so
perhaps the likely() isn't right?
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-03-04 03:42:40

[permalink] [raw]

Subject: Re: Ocfs2 performance bugs of doom

Mark Fasheh wrote:
> I completely agree that we need to increase the memory dedicated to the
> hash, but I fear that vmallocing a 1 megabyte table per domain (effectively
> per mount) is going overboard. I will assume that this was a straw man patch
> :)

A meg's worth of vmalloc is a small price to pay for killing off the current
gross cpu wastage. After that you get into various fiddling that really does
not matter much compared to the next half dozen items on the sucks-most list.

>>Of course the Right Think To Do is kmalloc a vector of pages and mask the
>>hash into it, but as a lazy person I would rather sink the time into
>>writing this deathless prose.
>
> Right. That's one possible approach - I was originally thinking of
> allocating a large global lockres hash table allocated at module init time.
> This way we don't have to make large allocs for each domain.
>
> Before anything else though, I'd really like to get an idea of how large we
> want things. This might very well dictate the severity of the solution.

I made the table big enough to stomp the lookup overhead into insignificance
up to the point your lock purging kicks in. If you want to adjust one you
also want to adjust the other, they go together. As a rough rule of thumb,
handling about 100,000 cached locks efficiently seems right to me.
Eventually, somebody will want to handle way more or way less than that, but
right now that requirement is pretty theoretical.

>>I failed to quantify the improvement precisely because there are other
>>glitchy things going on that interfere with accurate measurement. So now
>>we get to...
>
> Definitely if you get a chance to show how much just the lookup optimization
> helps, I'd like to know. I'll also try to gather some numbers.

To test this I set the hash table size even lower than you originally had it
and I think the simple-minded loop optimization improved throughput by some
double digit's worth of percent. In other words, well worth the two line
additional test.

>> ...look at real time for the untar and sync!
>
> Indeed. The real time numbers are certainly confusing. I actually saw real
> time decrease on most of my tests (usually I hack things to increase the
> hash allocation to something like a few pages). I want to do some more
> consecutive untars though.

This is major badness, but who knows, it might not even be in ocfs2 itself.
A lazy person might wait for the next point release before investigating
further.

>>Ocfs2 sometimes sits and gazes at its navel for minutes at a time, doing
>>nothing at all. Timer bug? A quick glance at SysRq-t shows ocfs2 waiting
>>in io_schedule. Waiting for io that never happens? This needs more
>>investigation.
>
> Did you notice this during the untar? If not, do you have any reproducible
> test case?

During the untar. Very reproducible. Just untar a few kernel trees into
different directories. I will try another kernel release first though, just
in case this was a duh that has already been picked up.

>>Delete performance remains horrible, even with a 256 meg journal[4] which
>>is unconscionably big anyway. Compare to ext3, which deletes kernel trees
>>at a steady 2 seconds per, with a much smaller journal. Ocfs2 takes more
>>like a minute.
>
> This doesn't surprise me - our unlink performance leaves much to be desired
> at the moment. How many nodes did you have mounted when you ran that test?

One, on a local disk :-/

> Off the top of my head,the two things which I would guess are hurting delete
> the most right now are node messaging and lack of directory read ahead. The
> first will be solved by more intelligent use of the DLM so that the network
> will only be hit for those nodes that actually care about a given inode --
> unlink rename and mount/unmount are the only things left that still use the
> slower 'vote' mechanism. Directory readahead is much easier to solve, it's
> just that nobody has gotten around to fixing that yet :/ I bet there's more
> to figure out with respect to unlink performance.

I will watch you work on it with interest.

>>[2] It is high time we pried loose the ocfs2 design process from secret
>>irc channels and dark tunnels running deep beneath Oracle headquarters,
>>and started sharing the squishy goodness of filesystem clustering
>>development with some more of the smart people lurking here.
>
> We're not trying to hide anything from anyone :)

I wasn't implying that, but it's a fact that most of the ocfs2 discussion
happens offline, and I completely understand how easy it is to settle into
that pattern. But by so doing, we are really cheating the community of part
of the benefit of having an open source community in the first place, which
to be able to watch the development process in real time and learn from it.
Not to mention creating a better historical record and - need I mention it -
a googleable database of development strategies.

> I'm always happy to talk about design. We've been in bugfix (and more
> recently, performance fix) mode for a while now, so there hasn't been
> much new design yet.

Fortunately, it seems that the core design is pretty solid already. I'd
add that a main goal at this point should be to explain as much as
possible of the internals to outsiders, and I certainly include myself in
that at this point. Please use one syllable words where possible :-)

>>+ if (likely(res->lockname.name[0] != name[0]))
>
> Is the likely() here necessary? It actually surprises me that the check even
> helps so much - if you check the way OCFS2 lock names are built, the first
> few bytes are very regular - the first char is going to almost always be one
> of M, D, or W. Hmm, I guess if you're hitting it 1/3 fewer times...

Sure, you can improve the test by knowing more about the distribution,
which is prefectly fair because this function does not to be very general.
To tell the truth, I did not check to see which of my hacks sped things
up. I was more interested in just reorganizing it to make it easier to
optimize. You could check the second byte or first word
or whatever.

The likely itself is correct though: if the hash chains are short then
you don't care, and if they are long then the early fail is certainly
most likely. What I didn't check is if gcc is even smart enough to
unroll the loop the correct way to eliminate some jumps. It would be
easy to unroll this loop by hand a few iterations to test that.

>>+ continue;
>>+ if (likely(res->lockname.len != len))
>
> Also here, lengths of OCFS2 locks are also actually fairly regular, so
> perhaps the likely() isn't right?

Yes, that one is questionable. Maybe:

if (likely((res->lockname.name[0] ^ name[0]) | (res->lockname.len ^ len)))

which modern processors should parallelize very nicely.

At this point it might be a good idea to step back and let some
micro-optimization mavens lurking here have a go at it. Even a tiny
optimization in this inner loop is a major performance win.

Regards,

Daniel

2006-03-04 07:37:48

by Andrew Morton

[permalink] [raw]

Subject: Re: Ocfs2 performance bugs of doom

Daniel Phillips <[email protected]> wrote:
>
> @@ -103,31 +103,28 @@ struct dlm_lock_resource * __dlm_lookup_
> const char *name,
> unsigned int len)
> {
> - unsigned int hash;
> - struct hlist_node *iter;
> - struct dlm_lock_resource *tmpres=NULL;
> struct hlist_head *bucket;
> + struct hlist_node *list;
>
> mlog_entry("%.*s\n", len, name);
>
> assert_spin_locked(&dlm->spinlock);
> + bucket = dlm->lockres_hash + full_name_hash(name, len) % DLM_HASH_BUCKETS;
>
> - hash = full_name_hash(name, len);

err, you might want to calculate that hash outside the spinlock.

Maybe have a lock per bucket, too.

A 1MB hashtable is verging on comical. How may data are there in total?

2006-03-05 19:23:03

[permalink] [raw]

Subject: Re: Ocfs2 performance bugs of doom

On Fri, Mar 03, 2006 at 11:36:17PM -0800, Andrew Morton wrote:
> > - hash = full_name_hash(name, len);
>
> err, you might want to calculate that hash outside the spinlock.
Hmm, good catch.

> Maybe have a lock per bucket, too.
That could be an avenue to explore. I guess we probably want to optimize the
actual lookup and memory usage before looking at that.

> A 1MB hashtable is verging on comical. How may data are there in total?
Yes 1MB is much too big. We're looking right now at 3 lock resources per
inode, so total number of elements depends upon your file system I suppose.
Hopefully I'll have some time today to run some basic tests which may show
us what size strikes a better balance of performance versus memory overhead.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-03-06 01:28:32

[permalink] [raw]

Subject: Re: Ocfs2 performance bugs of doom

Andrew Morton wrote:
> Daniel Phillips <[email protected]> wrote:
>> assert_spin_locked(&dlm->spinlock);
>> + bucket = dlm->lockres_hash + full_name_hash(name, len) % DLM_HASH_BUCKETS;
>>
>> - hash = full_name_hash(name, len);
>
> err, you might want to calculate that hash outside the spinlock.

Yah.

> Maybe have a lock per bucket, too.

So the lock memory is as much as the hash table? ;-)

> A 1MB hashtable is verging on comical. How may data are there in total?

Even with the 256K entry hash table, __dlm_lookup_lockres is still the
top systime gobbler:

-------------
real 31.01
user 25.29
sys 3.09
-------------

CPU: P4 / Xeon, speed 2793.37 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 240000
samples % image name app name symbol name
-------------------------------------------------------------------------------
17071831 71.2700 libbz2.so.1.0.2 libbz2.so.1.0.2 (no symbols)
17071831 100.000 libbz2.so.1.0.2 libbz2.so.1.0.2 (no symbols) [self]
-------------------------------------------------------------------------------
2638066 11.0132 vmlinux vmlinux __dlm_lookup_lockres
2638066 100.000 vmlinux vmlinux __dlm_lookup_lockres [self]
-------------------------------------------------------------------------------
332683 1.3889 oprofiled oprofiled (no symbols)
332683 100.000 oprofiled oprofiled (no symbols) [self]
-------------------------------------------------------------------------------
254736 1.0634 vmlinux vmlinux ocfs2_local_alloc_count_bits
254736 100.000 vmlinux vmlinux ocfs2_local_alloc_count_bits [self]
-------------------------------------------------------------------------------
176794 0.7381 tar tar (no symbols)
176794 100.000 tar tar (no symbols) [self]
-------------------------------------------------------------------------------

Note, this is uniprocessor, single node on a local disk. Something
pretty badly broken all right. Tomorrow I will take a look at the hash
distribution and see what's up.

I guess there are about 250k symbols in the table before purging
finally kicks in, which happens 5th or 6th time I untar a kernel tree.
So, 20,000 names times 5-6 times the three locks per inode Mark
mentioned. I'll actually measure that tomorrow instead of inferring
it.

I think this table is per-ocfs2-mount, and really really, a meg is
nothing if it makes CPU cycles go away. That's .05% of the memory
on this box, which is a small box where clusters are concerned. But
there is also some gratuitous cpu suck still happening in there that
needs investigating. I would not be surprised at all to learn that
full_name_hash is a terrible hash function.

Regards,

Daniel

2006-03-06 02:58:14

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

On Sun, Mar 05, 2006 at 05:28:21PM -0800, Daniel Phillips wrote:
> Note, this is uniprocessor, single node on a local disk. Something
> pretty badly broken all right. Tomorrow I will take a look at the hash
> distribution and see what's up.
>
> I guess there are about 250k symbols in the table before purging
> finally kicks in, which happens 5th or 6th time I untar a kernel tree.
> So, 20,000 names times 5-6 times the three locks per inode Mark
> mentioned. I'll actually measure that tomorrow instead of inferring
> it.
>
> I think this table is per-ocfs2-mount, and really really, a meg is
> nothing if it makes CPU cycles go away. That's .05% of the memory
> on this box, which is a small box where clusters are concerned. But
> there is also some gratuitous cpu suck still happening in there that
> needs investigating. I would not be surprised at all to learn that
> full_name_hash is a terrible hash function.
Can you try the attached patch? Here's a sample OCFS2 lock name:

M0000000000000000036cc0458354c5

So as you can see, things at the beginning of the name are very regular -
something that I don't think full_name_hash() is handling very well. I
hacked a (barely) new hash function to avoid the first 7 bytes and it
reliably saves me 2 seconds in real time on my untars. I think we can
actually make it much better (this is still pretty dumb) than that, but I'm
just curious as to how much it helps on your tests.

Signed-off-by: Mark Fasheh <[email protected]>

diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
index 8f3a9e3..06cebb2 100644
--- a/fs/ocfs2/dlm/dlmdomain.c
+++ b/fs/ocfs2/dlm/dlmdomain.c
@@ -81,6 +81,15 @@ void __dlm_unhash_lockres(struct dlm_loc
dlm_lockres_put(lockres);
}

+static inline unsigned int
+dlm_name_hash(const unsigned char *name, unsigned int len)
+{
+ /* optimize for OCFS2 lock names */
+ if (len > 7)
+ return full_name_hash(&name[7], len - 7);
+ return full_name_hash(name, len);
+}
+
void __dlm_insert_lockres(struct dlm_ctxt *dlm,
struct dlm_lock_resource *res)
{
@@ -90,7 +99,7 @@ void __dlm_insert_lockres(struct dlm_ctx
assert_spin_locked(&dlm->spinlock);

q = &res->lockname;
- q->hash = full_name_hash(q->name, q->len);
+ q->hash = dlm_name_hash(q->name, q->len);
bucket = &(dlm->lockres_hash[q->hash % DLM_HASH_BUCKETS]);

/* get a reference for our hashtable */
@@ -112,7 +121,7 @@ struct dlm_lock_resource * __dlm_lookup_

assert_spin_locked(&dlm->spinlock);

- hash = full_name_hash(name, len);
+ hash = dlm_name_hash(name, len);

bucket = &(dlm->lockres_hash[hash % DLM_HASH_BUCKETS]);

2006-03-06 05:00:00

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

Mark Fasheh wrote:
> On Sun, Mar 05, 2006 at 05:28:21PM -0800, Daniel Phillips wrote:
>>I think this table is per-ocfs2-mount, and really really, a meg is
>>nothing if it makes CPU cycles go away. That's .05% of the memory
>>on this box, which is a small box where clusters are concerned. But
>>there is also some gratuitous cpu suck still happening in there that
>>needs investigating. I would not be surprised at all to learn that
>>full_name_hash is a terrible hash function.
>
> Can you try the attached patch?

With patch:

Real User Sys
31.93 23.72 3.69
29.39 24.18 4.01
53.36 24.25 4.80
30.24 24.40 4.76
63.60 24.09 5.03
29.95 24.18 4.91

Without patch:

Real User Sys
30.15 23.90 3.58
54.99 24.38 4.56
31.49 24.63 5.28
60.09 24.71 5.47
31.44 24.34 5.72
65.36 24.31 5.84

So that hack apparently improves the bucket distribution quite a bit, but
look, the bad old linear systime creep is still very obvious. For that
there is no substitute for lots of buckets.

The hash function can be improved way more. Your resource encoding is
also painfully wasteful.

As for optimizing the loop... most of the loop overhead comes from
loading cold cache lines no doubt. The struct qstr is an impediment
there because it adds another cache line to every loop. Speaking of
that, the qstr is so far away from the list.next pointer that you're
most probably getting an extra cache line load there too. Just move
the qstr up next to the hash list.

Notice, those wildly gyrating real times are still manifesting. Seen
them over there?

Regards,

Daniel

2006-03-06 19:51:50

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

On Sun, Mar 05, 2006 at 08:59:50PM -0800, Daniel Phillips wrote:
> So that hack apparently improves the bucket distribution quite a bit, but
> look, the bad old linear systime creep is still very obvious. For that
> there is no substitute for lots of buckets.
Yes, I think the way to go right now is to allocate an array of pages and
index into that. We can make the array size a mount option so that the
default can be something reasonable ;)

> Speaking of that, the qstr is so far away from the list.next pointer that
> you're most probably getting an extra cache line load there too. Just move
> the qstr up next to the hash list.
Ahh, good catch.

> Notice, those wildly gyrating real times are still manifesting. Seen
> them over there?
Yes - times from a test run are below.

Real User Sys
14.35 3.82 11.99
16.75 3.62 13.90
32.12 4.06 15.81
18.94 3.99 16.69
35.18 3.96 15.80
18.66 3.89 16.44
52.09 4.24 16.66
21.25 3.85 17.09
19.73 4.00 17.20
35.40 3.88 16.60

So this definitely reproduces elsewhere.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-03-07 03:34:16

by Andi Kleen

[permalink] [raw]

Subject: Re: Ocfs2 performance bugs of doom

Mark Fasheh <[email protected]> writes:

> On Sun, Mar 05, 2006 at 08:59:50PM -0800, Daniel Phillips wrote:
> > So that hack apparently improves the bucket distribution quite a bit, but
> > look, the bad old linear systime creep is still very obvious. For that
> > there is no substitute for lots of buckets.
> Yes, I think the way to go right now is to allocate an array of pages and
> index into that. We can make the array size a mount option so that the
> default can be something reasonable ;)

Did you actually do some statistics how long the hash chains are?
Just increasing hash tables blindly has other bad side effects, like
increasing cache misses.

-Andi

2006-03-07 04:58:56

[permalink] [raw]

Subject: Re: Ocfs2 performance bugs of doom

On Tue, Mar 07, 2006 at 04:34:12AM +0100, Andi Kleen wrote:
> Did you actually do some statistics how long the hash chains are?
> Just increasing hash tables blindly has other bad side effects, like
> increasing cache misses.
Yep, the gory details are at:

http://oss.oracle.com/~mfasheh/lock_distribution.csv

This measure was taken about 18,000 locks into a kernel untar. The only
change was that I switched things to only hash the last 18 characters of
lock resource names.

In short things aren't so bad that a larger hash table wouldn't help. We've
definitely got some peaks however. Our in-house laboratory of mathematicians
(read: Bill Irwin) is checking out methods by which we can smooth things out
a bit more :)
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-03-07 06:57:10

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

Mark Fasheh wrote:
> On Tue, Mar 07, 2006 at 04:34:12AM +0100, Andi Kleen wrote:
>
>>Did you actually do some statistics how long the hash chains are?
>>Just increasing hash tables blindly has other bad side effects, like
>>increasing cache misses.
>
> Yep, the gory details are at:
>
> http://oss.oracle.com/~mfasheh/lock_distribution.csv
>
> This measure was taken about 18,000 locks into a kernel untar. The only
> change was that I switched things to only hash the last 18 characters of
> lock resource names.

Eventually you will realize how much those bloated lock resource names cost
in CPU and change them to a binary encoding.

> In short things aren't so bad that a larger hash table wouldn't help. We've
> definitely got some peaks however. Our in-house laboratory of mathematicians
> (read: Bill Irwin) is checking out methods by which we can smooth things out
> a bit more :)

Twould be a shame to invest a lot of effort hashing those horrible strings
before finally realizing the strings themselves should be gotten rid of, and
have to optimize the hash all over again.

Since the hash table won't fit in cache except on the beefiest of machines,
the right hash chain length to aim for is one. That requires both good a
hash function and big hash table. A sane resource encoding would also help.

Chestnut for the day: don't get too obsessed about optimizing for the light
load case. A cluster filesystem is supposed to be a beast of burden. It
needs big feet.

Regards,

Daniel

2006-03-09 06:26:31

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

Mark Fasheh wrote:
> On Tue, Mar 07, 2006 at 04:34:12AM +0100, Andi Kleen wrote:
>>Did you actually do some statistics how long the hash chains are?
>>Just increasing hash tables blindly has other bad side effects, like
>>increasing cache misses.
>
> Yep, the gory details are at:
>
> http://oss.oracle.com/~mfasheh/lock_distribution.csv

Hi Mark,

I don't know how you got your statistics, but I went to the bother of
writing a proc interface to dump the bucket counts. Mostly to try out the
seq_file interface, though it turned into something more of a tour through
the slimy unberbelly of seq_read. (This patch includes surgery for one of
the several bunions I found there - seq_read failed to distinguish between
"buffer full" and "buffer overrun", expanding the buffer unnecessarily in
the former case). This also includes my simple-minded big hash table
code, with a 256K table size this time.

This proc interface always names itself /proc/ocfs2hash (a bug if you
create more than one lockspace). It gives the chain length as a uint32_t
per bucket.

Dump the buckets like this:

cat /proc/ocfs2hash | hexdump -e'32 " %1.1X" "\n"'

To total the buckets:

cat /proc/ocfs2hash | hexdump -v -e'"%1.1d\n"' | awk '{sum += $1} END {print "Total: ", sum;}'

After untarring four kernel trees, number of locks hits a peak of 128K.
With 64K buckets the hash, a typical region of the table looks like:

3 0 3 6 1 2 2 1 2 6 1 0 1 2 1 0 0 3 1 2 2 3 1 2 5 3 2 2 1 0 3 1
1 1 3 5 2 2 1 0 2 3 3 1 3 0 5 2 2 3 0 2 1 3 1 2 4 2 0 2 5 1 4 3
5 3 3 3 3 1 4 1 2 1 2 6 2 1 3 0 2 2 2 8 1 2 2 2 3 1 0 1 3 2 1 1
1 2 2 2 2 3 2 0 2 2 2 5 2 3 2 1 1 2 6 6 1 2 2 4 2 0 3 0 3 3 3 0
2 2 1 1 2 3 2 0 2 3 3 1 3 3 3 1 4 2 8 3 2 2 2 4 3 0 1 2 3 4 2 0
3 1 0 1 2 2 3 1 4 2 1 1 3 3 4 3 3 3 4 2 1 4 2 1 5 2 1 3 1 2 3 2
1 0 1 5 3 2 1 2 3 0 1 1 2 3 4 4 4 1 3 1 4 3 2 2 4 4 1 3 1 0 0 1
3 1 1 3 0 3 0 1 1 1 1 1 3 4 4 2 4 3 4 2 3 3 0 3 4 2 1 5 4 1 3 1
1 0 1 0 1 4 1 2 1 4 2 0 2 2 5 2 1 1 1 2 3 6 4 5 5 1 1 2 3 1 5 1
3 0 1 0 3 3 2 0 2 1 2 1 0 4 3 2 1 0 1 0 2 7 1 3 2 1 1 2 4 1 3 1
2 2 3 1 3 3 1 2 0 2 1 3 1 2 0 4 4 1 2 1 2 3 3 6 0 5 2 1 1 0 3 0
1 0 2 0 4 3 2 1 0 0 2 0 1 4 2 4 5 1 0 1 3 2 2 1 1 3 2 3 0 2 1 1
3 0 0 0 2 5 3 1 0 2 0 1 0 0 2 0 4 2 1 2 4 3 0 1 2 4 1 3 0 0 1 4

A poor distribution as you already noticed[1]. Even if it was a great
distribution, we would still average a little over two nodes per bucket
twice as many as we should allow unless you believe that people running
cluster filesystems have too much time on their hands and need to waste
some of it waiting for the computer to chew its way through millions
of cold cache lines.

So if we improve the hash function, 128K buckets (512K table size) is
the right number, given a steady-state number of hash resources around
128K. This is pretty much independent of load: if you run light loads
long enough, you will eventually fill up the hash table. Of course, if
we take a critical look at your locking strategy we might find some fat
to cut there too. Could I possibly interest you in writing up a tech
note on your global locking strategy?

[1] And this is our all-important dentry hash function. Oops.

Regards,

Daniel

diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmcommon.h 2.6.16/fs/ocfs2/dlm/dlmcommon.h
--- 2.6.16.clean/fs/ocfs2/dlm/dlmcommon.h 2006-03-02 19:05:13.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmcommon.h 2006-03-08 15:47:53.000000000 -0800
@@ -36,8 +36,8 @@
#define DLM_LOCK_RES_OWNER_UNKNOWN O2NM_MAX_NODES
#define DLM_THREAD_SHUFFLE_INTERVAL 5 // flush everything every 5 passes
#define DLM_THREAD_MS 200 // flush at least every 200 ms
-
-#define DLM_HASH_BUCKETS (PAGE_SIZE / sizeof(struct hlist_head))
+#define DLM_HASH_SIZE (1 << 18)
+#define DLM_HASH_BUCKETS (DLM_HASH_SIZE / sizeof(struct hlist_head))

enum dlm_ast_type {
DLM_AST = 0,
diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c 2.6.16/fs/ocfs2/dlm/dlmdomain.c
--- 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c 2006-03-02 16:47:47.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmdomain.c 2006-03-08 20:26:47.000000000 -0800
@@ -33,6 +33,7 @@
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/err.h>
+#include <linux/vmalloc.h>

#include "cluster/heartbeat.h"
#include "cluster/nodemanager.h"
@@ -49,6 +50,80 @@
#define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_DOMAIN)
#include "cluster/masklog.h"

+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+static void hashdump_stop(struct seq_file *seq, void *v)
+{
+};
+
+static void *hashdump_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ return *pos < DLM_HASH_SIZE ? (++*pos, pos) : NULL;
+};
+
+static void *hashdump_start(struct seq_file *seq, loff_t *pos)
+{
+ return *pos < DLM_HASH_SIZE ? pos : NULL;
+};
+
+int hashdump_show(struct seq_file *seq, void *v)
+{
+ loff_t *ppos = v, pos = *ppos;
+ int len = seq->size - seq->count, tail = DLM_HASH_SIZE - pos;
+
+ if (len > tail)
+ len = tail;
+ memcpy(seq->buf + seq->count, seq->private + pos, len);
+ seq->count += len;
+ *ppos += len - 1;
+ return 0;
+}
+
+static struct seq_operations hashdump_ops = {
+ .start = hashdump_start,
+ .next = hashdump_next,
+ .stop = hashdump_stop,
+ .show = hashdump_show
+};
+
+static int hashdump_open(struct inode *inode, struct file *file)
+{
+ int err = seq_open(file, &hashdump_ops);
+ if (!err) {
+ struct seq_file *seq = file->private_data;
+ int size = DLM_HASH_BUCKETS * sizeof(int), *vec = vmalloc(size);
+ if (vec) {
+ struct dlm_ctxt *dlm = PDE(inode)->data;
+ struct hlist_head *heads = dlm->lockres_hash;
+ struct hlist_node *node;
+ struct dlm_lock_resource *res;
+ int i;
+
+ memset(vec, 0, size);
+ for (i = 0; i < DLM_HASH_BUCKETS; i++)
+ hlist_for_each_entry(res, node, heads + i, hash_node)
+ vec[i]++;
+ seq->private = vec;
+ }
+ }
+ return err;
+};
+
+int hashdump_release(struct inode *inode, struct file *file)
+{
+ vfree(((struct seq_file *)file->private_data)->private);
+ return seq_release(inode, file);
+}
+
+static struct file_operations hashdump_file_ops = {
+ .owner = THIS_MODULE,
+ .open = hashdump_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = hashdump_release
+};
+
/*
*
* spinlock lock ordering: if multiple locks are needed, obey this ordering:
@@ -194,7 +269,7 @@ static int dlm_wait_on_domain_helper(con
static void dlm_free_ctxt_mem(struct dlm_ctxt *dlm)
{
if (dlm->lockres_hash)
- free_page((unsigned long) dlm->lockres_hash);
+ vfree(dlm->lockres_hash);

if (dlm->name)
kfree(dlm->name);
@@ -223,6 +298,7 @@ static void dlm_ctxt_release(struct kref

wake_up(&dlm_domain_events);

+ remove_proc_entry("ocfs2hash", NULL);
dlm_free_ctxt_mem(dlm);

spin_lock(&dlm_domain_lock);
@@ -1191,7 +1267,7 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
goto leave;
}

- dlm->lockres_hash = (struct hlist_head *) __get_free_page(GFP_KERNEL);
+ dlm->lockres_hash = (struct hlist_head *)vmalloc(DLM_HASH_SIZE);
if (!dlm->lockres_hash) {
mlog_errno(-ENOMEM);
kfree(dlm->name);
@@ -1203,6 +1279,14 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
for (i=0; i<DLM_HASH_BUCKETS; i++)
INIT_HLIST_HEAD(&dlm->lockres_hash[i]);

+ {
+ struct proc_dir_entry *entry = create_proc_entry("ocfs2hash", 0, NULL);
+ if (entry) {
+ entry->proc_fops = &hashdump_file_ops;
+ entry->data = dlm;
+ }
+ }
+
strcpy(dlm->name, domain);
dlm->key = key;
dlm->node_num = o2nm_this_node();
diff -urp 2.6.16.clean/fs/seq_file.c 2.6.16/fs/seq_file.c
--- 2.6.16.clean/fs/seq_file.c 2006-02-12 16:27:25.000000000 -0800
+++ 2.6.16/fs/seq_file.c 2006-03-08 16:59:27.000000000 -0800
@@ -106,18 +106,17 @@ ssize_t seq_read(struct file *file, char
if (!size)
goto Done;
}
+expand:
/* we need at least one record in buffer */
- while (1) {
- pos = m->index;
- p = m->op->start(m, &pos);
- err = PTR_ERR(p);
- if (!p || IS_ERR(p))
- break;
- err = m->op->show(m, p);
- if (err)
- break;
- if (m->count < m->size)
- goto Fill;
+ pos = m->index;
+ p = m->op->start(m, &pos);
+ err = PTR_ERR(p);
+ if (!p || IS_ERR(p))
+ goto Errstop;
+
+ err = m->op->show(m, p);
+
+ if (err == -1) {
m->op->stop(m, p);
kfree(m->buf);
m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
@@ -125,12 +124,12 @@ ssize_t seq_read(struct file *file, char
goto Enomem;
m->count = 0;
m->version = 0;
+ goto expand;
}
- m->op->stop(m, p);
- m->count = 0;
- goto Done;
-Fill:
- /* they want more? let's try to get some more */
+
+ if (err)
+ goto Errstop;
+
while (m->count < size) {
size_t offs = m->count;
loff_t next = pos;
@@ -172,6 +171,10 @@ Enomem:
Efault:
err = -EFAULT;
goto Done;
+Errstop:
+ m->op->stop(m, p);
+ m->count = 0;
+ goto Done;
}
EXPORT_SYMBOL(seq_read);

2006-03-09 07:26:19

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

Daniel Phillips wrote:

> After untarring four kernel trees, number of locks hits a peak of 128K.
> With 64K buckets the hash, a typical region of the table looks like:
>
> 3 0 3 6 1 2 2 1 2 6 1 0 1 2 1 0 0 3 1 2 2 3 1 2 5 3 2 2 1 0 3 1
> 1 1 3 5 2 2 1 0 2 3 3 1 3 0 5 2 2 3 0 2 1 3 1 2 4 2 0 2 5 1 4 3
> 5 3 3 3 3 1 4 1 2 1 2 6 2 1 3 0 2 2 2 8 1 2 2 2 3 1 0 1 3 2 1 1
> 1 2 2 2 2 3 2 0 2 2 2 5 2 3 2 1 1 2 6 6 1 2 2 4 2 0 3 0 3 3 3 0
> 2 2 1 1 2 3 2 0 2 3 3 1 3 3 3 1 4 2 8 3 2 2 2 4 3 0 1 2 3 4 2 0
> 3 1 0 1 2 2 3 1 4 2 1 1 3 3 4 3 3 3 4 2 1 4 2 1 5 2 1 3 1 2 3 2
> 1 0 1 5 3 2 1 2 3 0 1 1 2 3 4 4 4 1 3 1 4 3 2 2 4 4 1 3 1 0 0 1
> 3 1 1 3 0 3 0 1 1 1 1 1 3 4 4 2 4 3 4 2 3 3 0 3 4 2 1 5 4 1 3 1
> 1 0 1 0 1 4 1 2 1 4 2 0 2 2 5 2 1 1 1 2 3 6 4 5 5 1 1 2 3 1 5 1
> 3 0 1 0 3 3 2 0 2 1 2 1 0 4 3 2 1 0 1 0 2 7 1 3 2 1 1 2 4 1 3 1
> 2 2 3 1 3 3 1 2 0 2 1 3 1 2 0 4 4 1 2 1 2 3 3 6 0 5 2 1 1 0 3 0
> 1 0 2 0 4 3 2 1 0 0 2 0 1 4 2 4 5 1 0 1 3 2 2 1 1 3 2 3 0 2 1 1
> 3 0 0 0 2 5 3 1 0 2 0 1 0 0 2 0 4 2 1 2 4 3 0 1 2 4 1 3 0 0 1 4
>
> A poor distribution as you already noticed[1].
[...]
> [1] And this is our all-important dentry hash function. Oops.

Why do you say that? I don't think it is particularly bad. Your sample
has an average of 2.06, and a standard deviation of 1.49 while a random
assignment using glibc's random() here has a standard deviation of 1.44

I can't remember much stats, but I think that means in your sample of
over 400, you're likely to find several instances of "6" and not
unlikely to find some higher. Actually given that it is bound at zero
and so not a normal distribution, that is likely to move results a bit
further to the right [I'm sure there's a formula for that], which is
pretty consistent with what you have, and you would hardly notice a
difference using rand()

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-09 07:43:16

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

Daniel Phillips wrote:

> A poor distribution as you already noticed[1]. Even if it was a great
> distribution, we would still average a little over two nodes per bucket
> twice as many as we should allow unless you believe that people running
> cluster filesystems have too much time on their hands and need to waste
> some of it waiting for the computer to chew its way through millions
> of cold cache lines.
>

Just interested: do the locks have any sort of locality of lookup?
If so, then have you tried moving hot (ie. the one you've just found,
or newly inserted) hash entries to the head of the hash list?

In applications with really good locality you can sometimes get away
with small hash tables (10s even 100s of collisions on average) without
taking too big a hit this way, because your entries basically get sorted
LRU for you.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-09 11:47:00

by Andi Kleen

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

On Thursday 09 March 2006 08:43, Nick Piggin wrote:

> Just interested: do the locks have any sort of locality of lookup?
> If so, then have you tried moving hot (ie. the one you've just found,
> or newly inserted) hash entries to the head of the hash list?
>
> In applications with really good locality you can sometimes get away
> with small hash tables (10s even 100s of collisions on average) without
> taking too big a hit this way, because your entries basically get sorted
> LRU for you.

LRU hashes have really bad cache behaviour though if that is not the case
because you possibily need to bounce around the hash heads as DIRTY
cache lines instead of keeping them in SHARED state.
My feeling would be that scalability is more important for this, which would
discourage this.

-Andi

2006-03-09 12:34:57

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

Andi Kleen wrote:
> On Thursday 09 March 2006 08:43, Nick Piggin wrote:
>
>
>>Just interested: do the locks have any sort of locality of lookup?
>>If so, then have you tried moving hot (ie. the one you've just found,
>>or newly inserted) hash entries to the head of the hash list?
>>
>>In applications with really good locality you can sometimes get away
>>with small hash tables (10s even 100s of collisions on average) without
>>taking too big a hit this way, because your entries basically get sorted
>>LRU for you.
>
>
> LRU hashes have really bad cache behaviour though if that is not the case
> because you possibily need to bounce around the hash heads as DIRTY
> cache lines instead of keeping them in SHARED state.
> My feeling would be that scalability is more important for this, which would
> discourage this.
>

That's true, it would have to have very good locality of reference to
be of use. In that case it is not always going to dirty the cachelines
because you now only have to make your hash table size appropriate for
your _working set_ rather than the entire set - if the working set is
small enough and you make your hash say 4 times bigger than it, then
you might expect to often hit the right lock at the head of the list.

The other thing is: if the alternative is a 1MB hash, then that may
result in more capacity cache misses than invalidate misses. Not to
mention TLB misses if it is vmalloced :(

Anyway you're absolutely right -- this is only going to work under
select types of hash loads, so lots of testing would be required.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-10 00:21:39

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance

E-mail subject line change, because the old one has worn thin ;)

On Wed, Mar 08, 2006 at 10:26:09PM -0800, Daniel Phillips wrote:
> I don't know how you got your statistics, but I went to the bother of
> writing a proc interface to dump the bucket counts.
The initial set of numbers I posted was collected by hacking the dlm to dump
bucket lengths at about 20,000 locks. I have since written some user space
code to model the dlm hash and output chain lengths. This allowed me to test
hash distributions at a rapid rate. As a data set, I used debugfs.ocfs2 to
collect lock names from a file system with a fully untarred kernel tree. The
program outputs in a format suitable for import into gnuplot. If interested,
you can get the program and a sample data set from:

http://oss.oracle.com/~mfasheh/dlm_hash/distribution_modeling/

Standard disclaimers apply - that code is a hack and wasn't edited to be
particularly performant, error resistant or pretty. It's also clearly not
intended for testing the actual performance of the hash (just distribution
output). Anyway, it turns out that full_name_hash() didn't have the severe
spikes in distribution that my original hack to compare only the last parts
of a lockres name had. This makes it inappropriate for general use, even
though it resulted in a performance gain on our micro-benchmark.

> So if we improve the hash function, 128K buckets (512K table size) is
> the right number, given a steady-state number of hash resources around
> 128K. This is pretty much independent of load: if you run light loads
> long enough, you will eventually fill up the hash table.
Your hash sizes are still ridiculously large. All my data shows that we need
to increase the hash size by much much less. At the following location you
will find a series of files detailing the results of 10 untar runs with
various hash allocations. The short story is that we really only need
an allocation on the order of a few pages.

http://oss.oracle.com/~mfasheh/dlm_hash/untar_timings/

If you average up the untar times you'll get something close to this:
1 page: 23 seconds
2 pages: 18 seconds
4 pages: 16 seconds
6 pages: 15 seconds
8 pages: 14 seconds
16 pages: 14 seconds
32 pages: 14 seconds
64 pages: 14 seconds

PAGE_SIZE on this system is 4096

So our largest performance gain is by just adding a page, and things seem to
top out at 6-8 pages. I will likely have a patch in the next few days or so
which will allocate a small array of pages at dlm startup. The bottom line
is that the default is extremely likely to be on the order of a few pages.
Eventually this will also be user configurable.

> Of course, if we take a critical look at your locking strategy we might
> find some fat to cut there too. Could I possibly interest you in writing
> up a tech note on your global locking strategy?
Sure, but it'll take a while. I've already got one OCFS2 related paper to
write. Perhaps I'll be able to kill two birds with one stone.

By the way, an interesting thing happened when I recently switched disk
arrays - the fluctuations in untar times disappeared. The new array is much
nicer, while the old one was basically Just A Bunch Of Disks. Also, sync
times dropped dramatically.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-03-10 01:14:12

by be-news06

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance

Mark Fasheh <[email protected]> wrote:
> Your hash sizes are still ridiculously large.

How long are those entries in the buckets kept? I mean if I untar a tree the
files are only locked while extracted, afterwards they are owner-less... (I
must admint I dont understand ocfs2 very deeply, but maybe explaining why so
many active locks need to be cached might help to find an optimized way.

> By the way, an interesting thing happened when I recently switched disk
> arrays - the fluctuations in untar times disappeared. The new array is much
> nicer, while the old one was basically Just A Bunch Of Disks. Also, sync
> times dropped dramatically.

Writeback Cache?

Gruss
Bernd

2006-03-10 02:33:16

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

On Wed, Mar 08, 2006 at 10:26:09PM -0800, Daniel Phillips wrote:
> After untarring four kernel trees, number of locks hits a peak of 128K.
> With 64K buckets the hash, a typical region of the table looks like:
>
> 3 0 3 6 1 2 2 1 2 6 1 0 1 2 1 0 0 3 1 2 2 3 1 2 5 3 2 2 1 0 3 1
> 1 1 3 5 2 2 1 0 2 3 3 1 3 0 5 2 2 3 0 2 1 3 1 2 4 2 0 2 5 1 4 3
> 5 3 3 3 3 1 4 1 2 1 2 6 2 1 3 0 2 2 2 8 1 2 2 2 3 1 0 1 3 2 1 1
> 1 2 2 2 2 3 2 0 2 2 2 5 2 3 2 1 1 2 6 6 1 2 2 4 2 0 3 0 3 3 3 0
> 2 2 1 1 2 3 2 0 2 3 3 1 3 3 3 1 4 2 8 3 2 2 2 4 3 0 1 2 3 4 2 0
> 3 1 0 1 2 2 3 1 4 2 1 1 3 3 4 3 3 3 4 2 1 4 2 1 5 2 1 3 1 2 3 2
> 1 0 1 5 3 2 1 2 3 0 1 1 2 3 4 4 4 1 3 1 4 3 2 2 4 4 1 3 1 0 0 1
> 3 1 1 3 0 3 0 1 1 1 1 1 3 4 4 2 4 3 4 2 3 3 0 3 4 2 1 5 4 1 3 1
> 1 0 1 0 1 4 1 2 1 4 2 0 2 2 5 2 1 1 1 2 3 6 4 5 5 1 1 2 3 1 5 1
> 3 0 1 0 3 3 2 0 2 1 2 1 0 4 3 2 1 0 1 0 2 7 1 3 2 1 1 2 4 1 3 1
> 2 2 3 1 3 3 1 2 0 2 1 3 1 2 0 4 4 1 2 1 2 3 3 6 0 5 2 1 1 0 3 0
> 1 0 2 0 4 3 2 1 0 0 2 0 1 4 2 4 5 1 0 1 3 2 2 1 1 3 2 3 0 2 1 1
> 3 0 0 0 2 5 3 1 0 2 0 1 0 0 2 0 4 2 1 2 4 3 0 1 2 4 1 3 0 0 1 4
>
> A poor distribution as you already noticed[1].

How did you decide that? The distribution of bucket sizes is:

0: 58
1: 108
2: 105
3: 82
4: 37
5: 16
6: 7
7: 1
8: 2

which, without running any statistics, looks pretty close to a binomial
distribution:

0 52.884
1 109.21
2 112.63
3 77.349
4 39.793
5 16.358
6 5.5972
7 1.6397
8 0.41980

so it's probably about what you'd get if the hash function were choosing
buckets uniformly at random, which is the best I'd think you could do
without special knowledge of the inputs.

--b.

2006-03-10 05:14:36

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

Nick Piggin wrote:
> Andi Kleen wrote:
>
>> On Thursday 09 March 2006 08:43, Nick Piggin wrote:
>>
>>
>>> Just interested: do the locks have any sort of locality of lookup?
>>> If so, then have you tried moving hot (ie. the one you've just found,
>>> or newly inserted) hash entries to the head of the hash list?
>>>
>>> In applications with really good locality you can sometimes get away
>>> with small hash tables (10s even 100s of collisions on average) without
>>> taking too big a hit this way, because your entries basically get sorted
>>> LRU for you.
>>
>>
>>
>> LRU hashes have really bad cache behaviour though if that is not the case
>> because you possibily need to bounce around the hash heads as DIRTY
>> cache lines instead of keeping them in SHARED state.
>> My feeling would be that scalability is more important for this, which
>> would
>> discourage this.
>>
>
> That's true, it would have to have very good locality of reference to
> be of use. In that case it is not always going to dirty the cachelines
> because you now only have to make your hash table size appropriate for
> your _working set_ rather than the entire set - if the working set is
> small enough and you make your hash say 4 times bigger than it, then
> you might expect to often hit the right lock at the head of the list.
>

OTOH, I suspect it actually isn't all that bad. There is already a
shared lock there, which will definitely have its cacheline invalidated.

So adding an extra cacheline bounce is not like the bad problem of going
from perfect scalability (no shared cachelines) to a single shared cacheline.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-10 07:10:45

by Joel Becker

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance

Please don't drop the ocfs2-devel Cc: I'm bouncing this there.

On Fri, Mar 10, 2006 at 02:14:08AM +0100, Bernd Eckenfels wrote:
> Mark Fasheh <[email protected]> wrote:
> > Your hash sizes are still ridiculously large.
>
> How long are those entries in the buckets kept? I mean if I untar a tree the
> files are only locked while extracted, afterwards they are owner-less... (I
> must admint I dont understand ocfs2 very deeply, but maybe explaining why so
> many active locks need to be cached might help to find an optimized way.
>
> > By the way, an interesting thing happened when I recently switched disk
> > arrays - the fluctuations in untar times disappeared. The new array is much
> > nicer, while the old one was basically Just A Bunch Of Disks. Also, sync
> > times dropped dramatically.
>
> Writeback Cache?
>
> Gruss
> Bernd
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--

"Depend on the rabbit's foot if you will, but remember, it didn't
help the rabbit."
- R. E. Shay

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2006-03-10 10:27:26

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance bugs of doom

J. Bruce Fields wrote:
> On Wed, Mar 08, 2006 at 10:26:09PM -0800, Daniel Phillips wrote:
>>A poor distribution as you already noticed[1].
>
> How did you decide that?

I looked at it and jumped to the wrong conclusion :-)

When I actually simulated it I found that the distribution is in fact
not much different from what I get from rand. So much for trying to
pin the blame on the hash function. Next lets try to pin the blame
on vmalloc. (Spoiler: it's not vmalloc's fault either.)

Regards,

Daniel

2006-03-10 11:17:59

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance

Mark Fasheh wrote:
> Your hash sizes are still ridiculously large. All my data shows that we need
> to increase the hash size by much much less. At the following location you
> will find a series of files detailing the results of 10 untar runs with
> various hash allocations. The short story is that we really only need
> an allocation on the order of a few pages.
>
> http://oss.oracle.com/~mfasheh/dlm_hash/untar_timings/

There is a very obvious drop going from 16 to 32 pages, then you seem to
drop into the noise. So if you agree that 32 is a "few" pages I think
we're in the right ballpark. I've been testing with 64 pages lately, with
satisfactory results.

Funny your systimes are so much higher than mine. You don't seem to have
the massive failure to start writeout that I see here. Did you fix
something? Also, I don't see that wildly flucuating real time. Fixed
something else? I'm still running 2.6.16-pre3.

> If you average up the untar times you'll get something close to this:
> 1 page: 23 seconds
> 2 pages: 18 seconds
> 4 pages: 16 seconds
> 6 pages: 15 seconds
> 8 pages: 14 seconds
> 16 pages: 14 seconds
> 32 pages: 14 seconds
> 64 pages: 14 seconds

Ahem, we were talking about systime. Sure, I've noticed that systime often
gets buried by disk latency, but not always, and even when it does it still
generates heat and slows down other things that might be running at the same
time. A database for example :-)

> PAGE_SIZE on this system is 4096
>
> So our largest performance gain is by just adding a page, and things seem to
> top out at 6-8 pages. I will likely have a patch in the next few days or so
> which will allocate a small array of pages at dlm startup.

No need, here is a functioning patch, slice and dice to your heart's content.
I wrote this to test the theory I've often heard that vmallocs are more costly
to use than vectors full of pages because of tlb pressure. Balderdash, I say.

With hashvec, (64K buckets)

real user sys
29.62 23.87 3.02
28.80 24.16 3.07
50.95 24.11 3.04
28.17 23.95 3.20
61.21 23.89 3.16
28.61 23.88 3.30

With vmalloc (64K buckets)

real user sys
29.67 23.98 2.88
28.35 23.96 3.13
52.29 24.16 2.89
28.39 24.07 2.97
65.13 24.08 3.01
28.08 24.02 3.16

Pretty close race - vmalloc is slightly faster if anything. Note: I presume
that gcc has compiled shifts and masks for my table arithmetic but I did not
check. Anyway, this patch does eliminate the possibility of running around on
vmalloc fragmentation.

>>Of course, if we take a critical look at your locking strategy we might
>>find some fat to cut there too. Could I possibly interest you in writing
>>up a tech note on your global locking strategy?
>
> Sure, but it'll take a while. I've already got one OCFS2 related paper to
> write. Perhaps I'll be able to kill two birds with one stone.

We're not picky, how about some early notes?

> By the way, an interesting thing happened when I recently switched disk
> arrays - the fluctuations in untar times disappeared. The new array is much
> nicer, while the old one was basically Just A Bunch Of Disks. Also, sync
> times dropped dramatically.

Interesting, but maybe you'd better treat the bad old disk as a blessing
in disguise and let it help you find that bug.

Regards,

Daniel

diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmcommon.h 2.6.16/fs/ocfs2/dlm/dlmcommon.h
--- 2.6.16.clean/fs/ocfs2/dlm/dlmcommon.h 2006-03-02 19:05:13.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmcommon.h 2006-03-10 01:12:45.000000000 -0800
@@ -36,8 +36,10 @@
#define DLM_LOCK_RES_OWNER_UNKNOWN O2NM_MAX_NODES
#define DLM_THREAD_SHUFFLE_INTERVAL 5 // flush everything every 5 passes
#define DLM_THREAD_MS 200 // flush at least every 200 ms
-
-#define DLM_HASH_BUCKETS (PAGE_SIZE / sizeof(struct hlist_head))
+#define DLM_HASH_SIZE (1 << 18)
+#define DLM_HASH_PAGES (DLM_HASH_SIZE / PAGE_SIZE)
+#define DLM_BUCKETS_PER_PAGE (PAGE_SIZE / sizeof(struct hlist_head))
+#define DLM_HASH_BUCKETS (DLM_HASH_PAGES * DLM_BUCKETS_PER_PAGE)

enum dlm_ast_type {
DLM_AST = 0,
@@ -85,7 +87,7 @@ enum dlm_ctxt_state {
struct dlm_ctxt
{
struct list_head list;
- struct hlist_head *lockres_hash;
+ struct hlist_head **lockres_hash;
struct list_head dirty_list;
struct list_head purge_list;
struct list_head pending_asts;
@@ -132,6 +134,11 @@ struct dlm_ctxt
struct list_head dlm_eviction_callbacks;
};

+static inline struct hlist_head *lockres_hash(struct dlm_ctxt *dlm, unsigned i)
+{
+ return dlm->lockres_hash[(i / DLM_BUCKETS_PER_PAGE) % DLM_HASH_PAGES] + i % DLM_BUCKETS_PER_PAGE;
+}
+
/* these keventd work queue items are for less-frequently
* called functions that cannot be directly called from the
* net message handlers for some reason, usually because
diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmdebug.c 2.6.16/fs/ocfs2/dlm/dlmdebug.c
--- 2.6.16.clean/fs/ocfs2/dlm/dlmdebug.c 2006-03-02 16:47:47.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmdebug.c 2006-03-10 00:31:26.000000000 -0800
@@ -130,7 +130,7 @@ void dlm_dump_lock_resources(struct dlm_

spin_lock(&dlm->spinlock);
for (i=0; i<DLM_HASH_BUCKETS; i++) {
- bucket = &(dlm->lockres_hash[i]);
+ bucket = lockres_hash(dlm, i);
hlist_for_each_entry(res, iter, bucket, hash_node)
dlm_print_one_lock_resource(res);
}
diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c 2.6.16/fs/ocfs2/dlm/dlmdomain.c
--- 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c 2006-03-02 16:47:47.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmdomain.c 2006-03-10 01:08:46.000000000 -0800
@@ -33,6 +33,7 @@
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/err.h>
+#include <linux/vmalloc.h>

#include "cluster/heartbeat.h"
#include "cluster/nodemanager.h"
@@ -49,6 +50,33 @@
#define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_DOMAIN)
#include "cluster/masklog.h"

+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+
+void free_pagevec(void **vec, int pages)
+{
+ while (pages--)
+ free_page((unsigned long)vec[pages]);
+ kfree(vec);
+}
+
+void **alloc_pagevec(int pages)
+{
+ void **vec = kmalloc(pages * sizeof(void *), GFP_KERNEL);
+ int i;
+
+ if (!vec)
+ return NULL;
+
+ for (i = 0; i < pages; i++)
+ if (!(vec[i] = (void *)__get_free_page(GFP_KERNEL)))
+ goto lame;
+ return vec;
+lame:
+ free_pagevec(vec, i);
+ return NULL;
+}
+
/*
*
* spinlock lock ordering: if multiple locks are needed, obey this ordering:
@@ -91,7 +119,7 @@ void __dlm_insert_lockres(struct dlm_ctx

q = &res->lockname;
q->hash = full_name_hash(q->name, q->len);
- bucket = &(dlm->lockres_hash[q->hash % DLM_HASH_BUCKETS]);
+ bucket = lockres_hash(dlm, q->hash);

/* get a reference for our hashtable */
dlm_lockres_get(res);
@@ -114,7 +142,7 @@ struct dlm_lock_resource * __dlm_lookup_

hash = full_name_hash(name, len);

- bucket = &(dlm->lockres_hash[hash % DLM_HASH_BUCKETS]);
+ bucket = lockres_hash(dlm, hash);

/* check for pre-existing lock */
hlist_for_each(iter, bucket) {
@@ -194,7 +222,7 @@ static int dlm_wait_on_domain_helper(con
static void dlm_free_ctxt_mem(struct dlm_ctxt *dlm)
{
if (dlm->lockres_hash)
- free_page((unsigned long) dlm->lockres_hash);
+ free_pagevec((void **)dlm->lockres_hash, DLM_HASH_PAGES);

if (dlm->name)
kfree(dlm->name);
@@ -223,6 +251,7 @@ static void dlm_ctxt_release(struct kref

wake_up(&dlm_domain_events);

+ remove_proc_entry("ocfs2hash", NULL);
dlm_free_ctxt_mem(dlm);

spin_lock(&dlm_domain_lock);
@@ -304,8 +333,8 @@ static void dlm_migrate_all_locks(struct
restart:
spin_lock(&dlm->spinlock);
for (i = 0; i < DLM_HASH_BUCKETS; i++) {
- while (!hlist_empty(&dlm->lockres_hash[i])) {
- res = hlist_entry(dlm->lockres_hash[i].first,
+ while (!hlist_empty(lockres_hash(dlm, i))) {
+ res = hlist_entry(lockres_hash(dlm, i)->first,
struct dlm_lock_resource, hash_node);
/* need reference when manually grabbing lockres */
dlm_lockres_get(res);
@@ -1191,7 +1220,7 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
goto leave;
}

- dlm->lockres_hash = (struct hlist_head *) __get_free_page(GFP_KERNEL);
+ dlm->lockres_hash = (struct hlist_head **)alloc_pagevec(DLM_HASH_PAGES);
if (!dlm->lockres_hash) {
mlog_errno(-ENOMEM);
kfree(dlm->name);
@@ -1200,8 +1229,8 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
goto leave;
}

- for (i=0; i<DLM_HASH_BUCKETS; i++)
- INIT_HLIST_HEAD(&dlm->lockres_hash[i]);
+ for (i = 0; i < DLM_HASH_BUCKETS; i++)
+ INIT_HLIST_HEAD(lockres_hash(dlm, i));

strcpy(dlm->name, domain);
dlm->key = key;
diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmrecovery.c 2.6.16/fs/ocfs2/dlm/dlmrecovery.c
--- 2.6.16.clean/fs/ocfs2/dlm/dlmrecovery.c 2006-03-02 19:05:13.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmrecovery.c 2006-03-10 00:30:58.000000000 -0800
@@ -1721,7 +1721,7 @@ static void dlm_finish_local_lockres_rec
* the RECOVERING state and set the owner
* if necessary */
for (i = 0; i < DLM_HASH_BUCKETS; i++) {
- bucket = &(dlm->lockres_hash[i]);
+ bucket = lockres_hash(dlm, i);
hlist_for_each_entry(res, hash_iter, bucket, hash_node) {
if (res->state & DLM_LOCK_RES_RECOVERING) {
if (res->owner == dead_node) {
@@ -1879,7 +1879,7 @@ static void dlm_do_local_recovery_cleanu
* need to be fired as a result.
*/
for (i = 0; i < DLM_HASH_BUCKETS; i++) {
- bucket = &(dlm->lockres_hash[i]);
+ bucket = lockres_hash(dlm, i);
hlist_for_each_entry(res, iter, bucket, hash_node) {
/* always prune any $RECOVERY entries for dead nodes,
* otherwise hangs can occur during later recovery */

2006-03-10 18:23:58

by Zach Brown

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance

> Pretty close race - vmalloc is slightly faster if anything.

I don't think that test tells us anything interesting about the relative
load on the TLB. What would be interesting is seeing the effect
vmalloc()ed hashes have on a concurrently running load that puts heavy
pressure on the TLB.

- z

2006-03-10 21:13:29

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance

Going back to that mysterious idle time fluctuation problem, I just noticed
something interesting here:

> With hashvec, (64K buckets)
>
> real user sys
> 29.62 23.87 3.02
> 28.80 24.16 3.07
> 50.95 24.11 3.04
> 28.17 23.95 3.20
> 61.21 23.89 3.16
> 28.61 23.88 3.30
>
> With vmalloc (64K buckets)
>
> real user sys
> 29.67 23.98 2.88
> 28.35 23.96 3.13
> 52.29 24.16 2.89
> 28.39 24.07 2.97
> 65.13 24.08 3.01
> 28.08 24.02 3.16

Look, the real time artifacts show up in the same places in the two
different runs. There was a reboot before each run. There is a sync
before and after each trial. Each trial was initiated by hand, so this
isn't about something else going on in the machine at the same time.

Got to be a clue in there somewhere.

Regards,

Daniel

2006-03-10 21:13:46

[permalink] [raw]

Subject: Re: [Ocfs2-devel] Ocfs2 performance

Zach Brown wrote:
>>Pretty close race - vmalloc is slightly faster if anything.
>
> I don't think that test tells us anything interesting about the relative
> load on the TLB. What would be interesting is seeing the effect
> vmalloc()ed hashes have on a concurrently running load that puts heavy
> pressure on the TLB.

Bzip2 isn't that kind of load?

Regards,

Daniel

2006-03-11 01:09:20