I was actually discussing this with Linus a while back, and finally
got around to testing it out now that I have a modern CPU to measure
it on! CCing linux-arch because it would be interesting to know
whether your tuned functions do better than gcc or not (I would
suspect not).
BTW. patch and numbers are on top of my scaling series, just for
an idea of what it does, I just want to generate some interesting
discussion.
If people are interested in running benchmarks, I'll be pushing out
a new update soon, after some more testing and debugging here.
The standard memcmp function on a Westmere system shows up hot in
profiles in the `git diff` workload (both parallel and single threaded),
and it is likely due to the costs associated with trapping into
microcode, and little opportunity to improve memory access (dentry
name is not likely to take up more than a cacheline).
So replace it with an open-coded byte comparison. This increases code
size by 24 bytes in the critical __d_lookup_rcu function, but the
speedup is huge, averaging 10 runs of each:
git diff st user sys elapsed CPU
before 1.15 2.57 3.82 97.1
after 1.14 2.35 3.61 96.8
git diff mt user sys elapsed CPU
before 1.27 3.85 1.46 349
after 1.26 3.54 1.43 333
Elapsed time for single threaded git diff at 95.0% confidence:
-0.21 +/- 0.01
-5.45% +/- 0.24%
Parallel case doesn't gain much, but that's because userspace runs
out of work to feed it -- efficiency is way up, though.
Signed-off-by: Nick Piggin <[email protected]>
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2010-12-09 05:07:19.000000000 +1100
+++ linux-2.6/fs/dcache.c 2010-12-09 17:37:30.000000000 +1100
@@ -1423,7 +1423,7 @@ static struct dentry *__d_instantiate_un
 			goto next;
 		if (qstr->len != len)
 			goto next;
-		if (memcmp(qstr->name, name, len))
+		if (dentry_memcmp(qstr->name, name, len))
 			goto next;
 		__dget_dlock(alias);
 		spin_unlock(&alias->d_lock);
@@ -1771,7 +1771,7 @@ struct dentry *__d_lookup_rcu(struct den
 		} else {
 			if (tlen != len)
 				continue;
-			if (memcmp(tname, str, tlen))
+			if (dentry_memcmp(tname, str, tlen))
 				continue;
 		}
/*
@@ -1901,7 +1901,7 @@ struct dentry *__d_lookup(struct dentry
 		} else {
 			if (tlen != len)
 				goto next;
-			if (memcmp(tname, str, tlen))
+			if (dentry_memcmp(tname, str, tlen))
 				goto next;
 		}
Index: linux-2.6/include/linux/dcache.h
===================================================================
--- linux-2.6.orig/include/linux/dcache.h 2010-12-09 05:07:52.000000000 +1100
+++ linux-2.6/include/linux/dcache.h 2010-12-09 05:08:36.000000000 +1100
@@ -47,6 +47,20 @@ struct dentry_stat_t {
};
extern struct dentry_stat_t dentry_stat;
+static inline int dentry_memcmp(const unsigned char *cs,
+				const unsigned char *ct, size_t count)
+{
+	while (count) {
+		int ret = (*cs != *ct);
+		if (ret)
+			return ret;
+		cs++;
+		ct++;
+		count--;
+	}
+	return 0;
+}
+
/* Name hashing routines. Initial hash value */
/* Hash courtesy of the R5 hash in reiserfs modulo sign bits */
#define init_name_hash() 0
On Thu, Dec 09, 2010 at 06:09:38PM +1100, Nick Piggin wrote:
> I was actually discussing this with Linus a while back, and finally
> got around to testing it out now that I have a modern CPU to measure
> it on! CCing linux-arch because it would be interesting to know
> whether your tuned functions do better than gcc or not (I would
> suspect not).
>
> BTW. patch and numbers are on top of my scaling series, just for
> an idea of what it does, I just want to generate some interesting
> discussion.
>
> If people are interested in running benchmarks, I'll be pushing out
> a new update soon, after some more testing and debugging here.
>
> The standard memcmp function on a Westmere system shows up hot in
> profiles in the `git diff` workload (both parallel and single threaded),
> and it is likely due to the costs associated with trapping into
> microcode, and little opportunity to improve memory access (dentry
> name is not likely to take up more than a cacheline).
>
> So replace it with an open-coded byte comparison. This increases code
> size by 24 bytes in the critical __d_lookup_rcu function, but the
> speedup is huge, averaging 10 runs of each:
>
> git diff st user sys elapsed CPU
> before 1.15 2.57 3.82 97.1
> after 1.14 2.35 3.61 96.8
>
> git diff mt user sys elapsed CPU
> before 1.27 3.85 1.46 349
> after 1.26 3.54 1.43 333
>
> Elapsed time for single threaded git diff at 95.0% confidence:
> -0.21 +/- 0.01
> -5.45% +/- 0.24%
Nice.
[..]
> +static inline int dentry_memcmp(const unsigned char *cs,
> + const unsigned char *ct, size_t count)
> +{
> + while (count) {
> + int ret = (*cs != *ct);
> + if (ret)
> + return ret;
> + cs++;
> + ct++;
> + count--;
> + }
> + return 0;
> +}
we have a memcmp() in lib/string.c. Maybe reuse it from there?
--
Regards/Gruss,
Boris.
On Thu, Dec 09, 2010 at 02:37:09PM +0100, Borislav Petkov wrote:
> On Thu, Dec 09, 2010 at 06:09:38PM +1100, Nick Piggin wrote:
> > +static inline int dentry_memcmp(const unsigned char *cs,
> > + const unsigned char *ct, size_t count)
> > +{
> > + while (count) {
> > + int ret = (*cs != *ct);
> > + if (ret)
> > + return ret;
> > + cs++;
> > + ct++;
> > + count--;
> > + }
> > + return 0;
> > +}
>
> we have a memcmp() in lib/string.c. Maybe reuse it from there?
Well, I think I prefer it to be inline so it doesn't clobber registers,
and also so that it isn't required to carry an integer return value
(only a boolean), if that might help.
I might also change it a bit and take advantage of the fact we won't
have a zero sized count -- this could move the loop branch to after the
values are loaded, and would mean that the load will get underway
even if the branch is mispredicted...
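Roughly, the non-zero-count variant would look like this (untested
sketch, assuming callers never pass count == 0):

static inline int dentry_memcmp(const unsigned char *cs,
				const unsigned char *ct, size_t count)
{
	int ret;
	do {
		ret = (*cs != *ct);	/* load both bytes before testing count */
		if (ret)
			break;
		cs++;
		ct++;
		count--;
	} while (count);
	return ret;
}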
On Thu, Dec 09, 2010 at 06:09:38PM +1100, Nick Piggin wrote:
> So replace it with an open-coded byte comparison. This increases code
> size by 24 bytes in the critical __d_lookup_rcu function, but the
Actually, if the loop assumes len is non-zero (which is the case for the
dentry compare), then the bloat is only 8 bytes, so not a problem.
Also got numbers versus vanilla kernel, out of interest.
> speedup is huge, averaging 10 runs of each:
>
> git diff st user sys elapsed CPU
vanilla 1.19 3.21 4.47 98.0
> before 1.15 2.57 3.82 97.1
> after 1.14 2.35 3.61 96.8
>
> git diff mt user sys elapsed CPU
vanilla 1.57 45.75 3.60 1312
> before 1.27 3.85 1.46 349
> after 1.26 3.54 1.43 333
>
Single thread elapsed time improvement, vanilla vs vfs: 19.23%. Not quite
as big as the AMD fam10h speedup; that's probably because Westmere does
atomics so damn quickly.
Multi thread numbers are no surprise.
Nick Piggin:
> The standard memcmp function on a Westmere system shows up hot in
> profiles in the `git diff` workload (both parallel and single threaded),
> and it is likely due to the costs associated with trapping into
> microcode, and little opportunity to improve memory access (dentry
> name is not likely to take up more than a cacheline).
Let me make sure.
What you are pointing out is
- asm("repe; cmpsb") may grab CPU long time, and can be a hazard for
scaling.
- by breaking it into pieces, the chances to scale will increase.
Right?
Anyway, this approach of replacing the smallest code with larger but faster code
is interesting.
How about mixing 'unsigned char *' and 'unsigned long *' in referencing
the given strings?
For example,
int f(const unsigned char *cs, const unsigned char *ct, size_t count)
{
	int ret;
	union {
		const unsigned long *l;
		const unsigned char *c;
	} s, t;

	/* this macro is your dentry_memcmp() actually */
#define cmp(s, t, c, step) \
	do { \
		while ((c) >= (step)) { \
			ret = (*(s) != *(t)); \
			if (ret) \
				return ret; \
			(s)++; \
			(t)++; \
			(c) -= (step); \
		} \
	} while (0)

	s.c = cs;
	t.c = ct;
	cmp(s.l, t.l, count, sizeof(*s.l));
	cmp(s.c, t.c, count, sizeof(*s.c));
	return 0;
}
What I am thinking here is,
- in load and compare, there is no difference between 'char*' and
'long*', probably.
- obviously 'step by sizeof(long)' will reduce the number of repeats.
- but I am not sure whether the length of string is generally longer
than 4 (or 8) or not.
J. R. Okajima
On Fri, Dec 10, 2010 at 11:23:17PM +0900, J. R. Okajima wrote:
>
> Nick Piggin:
> > The standard memcmp function on a Westmere system shows up hot in
> > profiles in the `git diff` workload (both parallel and single threaded),
> > and it is likely due to the costs associated with trapping into
> > microcode, and little opportunity to improve memory access (dentry
> > name is not likely to take up more than a cacheline).
>
> Let me make sure.
> What you are pointing out is
> - asm("repe; cmpsb") may grab CPU long time, and can be a hazard for
> scaling.
> - by breaking it into pieces, the chances to scale will increase.
> Right?
It's not scaling but just single threaded performance. gcc turns memcmp
into rep cmp, which has quite a long latency, so it's not appropriate
for short strings.
> Anyway, this approach of replacing the smallest code with larger but faster code
> is interesting.
> How about mixing 'unsigned char *' and 'unsigned long *' in referencing
> the given strings?
> For example,
>
> int f(const unsigned char *cs, const unsigned char *ct, size_t count)
> {
> int ret;
> union {
> const unsigned long *l;
> const unsigned char *c;
> } s, t;
>
> /* this macro is your dentry_memcmp() actually */
> #define cmp(s, t, c, step) \
> do { \
> while ((c) >= (step)) { \
> ret = (*(s) != *(t)); \
> if (ret) \
> return ret; \
> (s)++; \
> (t)++; \
> (c) -= (step); \
> } \
> } while (0)
>
> s.c = cs;
> t.c = ct;
> cmp(s.l, t.l, count, sizeof(*s.l));
> cmp(s.c, t.c, count, sizeof(*s.c));
> return 0;
> }
>
> What I am thinking here is,
> - in load and compare, there is no difference between 'char*' and
> 'long*', probably.
> - obviously 'step by sizeof(long)' will reduce the number of repeats.
> - but I am not sure whether the length of string is generally longer
> than 4 (or 8) or not.
The comparison is no longer an issue, so I think the added complexity
is not going to be worth it -- think about average length of directory
entry name, the average is maybe 12? In the kernel tree it's 11.
If we really wanted to do this, we'd round name lengths up to the nearest
sizeof(long) (which should be the case already, but we'd do it
explicitly), zero-fill the last bytes, and do a long compare loop. I
doubt it would be noticeable, though.
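A minimal sketch of that long-compare loop, assuming both names are
zero-padded out to a multiple of sizeof(long) so no tail handling is
needed (dentry_memcmp_padded is just an illustrative name):

static inline int dentry_memcmp_padded(const unsigned long *ls,
				const unsigned long *lt, size_t count)
{
	/* count is in bytes, already rounded up to a multiple of sizeof(long) */
	while (count) {
		if (*ls != *lt)
			return 1;
		ls++;
		lt++;
		count -= sizeof(*ls);
	}
	return 0;
}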
Nick Piggin:
> It's not scaling but just single threaded performance. gcc turns memcmp
> into rep cmp, which has quite a long latency, so it's not appropriate
> for short strings.
Honestly speaking I doubt how this 'long *' approach is effective
(Of course it never means that your result (by 'char *') is doubtful).
But is the "rep cmp has quite a long latency" issue generic for all x86
architecture, or Westmere system specific?
J. R. Okajima
On Mon, Dec 13, 2010 at 6:29 PM, J. R. Okajima <[email protected]> wrote:
>
> Nick Piggin:
>> It's not scaling but just single threaded performance. gcc turns memcmp
>> into rep cmp, which has quite a long latency, so it's not appropriate
>> for short strings.
>
> Honestly speaking I doubt how this 'long *' approach is effective
> (Of course it never means that your result (by 'char *') is doubtful).
Well, let's see what turns up. We certainly can try the long *
approach. I suspect on architectures where byte loads are
very slow, gcc will block the loop into larger loads, so it should
be no worse than a normal memcmp call, but if we do explicit
padding we can avoid all the problems associated with tail
handling.
Doing name padding and long * comparison will be practically
free (because slab allocator will align out to sizeof(long long)
anyway), so if any architecture prefers to do the long loads, I'd
be interested to hear and we could whip up a patch.
> But is the "rep cmp has quite a long latency" issue generic for all x86
> architecture, or Westmere system specific?
I don't believe it is Westmere specific. Intel and AMD have
been improving these instructions in the past few years, so
Westmere is probably as good or better than any.
That said, rep cmp may not be as heavily optimized as the
set and copy string instructions.
In short, I think the change should be suitable for all x86 CPUs,
but I would like to hear more opinions or see numbers for other
cores.
Thanks,
Nick
Nick Piggin:
> Well, let's see what turns up. We certainly can try the long *
> approach. I suspect on architectures where byte loads are
> very slow, gcc will block the loop into larger loads, so it should
> be no worse than a normal memcmp call, but if we do explicit
> padding we can avoid all the problems associated with tail
> handling.
Thank you for your reply.
But unfortunately I am afraid that I cannot understand what you wrote
clearly due to my poor English. What I understood is,
- I suggested 'long *' approach
- You wrote "not bad and possible, but may not be worth"
- I agreed "the approach may not be effective"
And you gave deeper consideration, but the result is unchanged which
means "'long *' approach may not be worth". Am I right?
> In short, I think the change should be suitable for all x86 CPUs,
> but I would like to hear more opinions or see numbers for other
> cores.
I'd like to hear from other x86 experts too.
Also I noticed that memcmp for x86_32 is defined as __builtin_memcmp
(for x86_64 it is "rep cmp"). Why doesn't x86_64 use __builtin_memcmp?
Is it really worse?
J. R. Okajima
On Wed, Dec 15, 2010 at 6:01 AM, J. R. Okajima <[email protected]> wrote:
>
> Nick Piggin:
>> Well, let's see what turns up. We certainly can try the long *
>> approach. I suspect on architectures where byte loads are
>> very slow, gcc will block the loop into larger loads, so it should
>> be no worse than a normal memcmp call, but if we do explicit
>> padding we can avoid all the problems associated with tail
>> handling.
>
> Thank you for your reply.
> But unfortunately I am afraid that I cannot understand what you wrote
> clearly due to my poor English. What I understood is,
> - I suggested 'long *' approach
> - You wrote "not bad and possible, but may not be worth"
> - I agreed "the approach may not be effective"
> And you gave deeper consideration, but the result is unchanged which
> means "'long *' approach may not be worth". Am I right?
What I meant is that a "normal" memcmp that does long * memory
operations is not a good idea, because it requires code and branches
to handle the tail of the string. When average string lengths are less
than 16 bytes, it hardly seems worthwhile. It will just get more
mispredicts and bigger icache footprint.
However instead of a normal memcmp, we could actually pad dentry
names out to sizeof(long) with zeros, and take advantage of that with
a memcmp that does not have to handle tails -- it would operate
entirely with longs.
That would avoid icache and branch regressions, and might speed up
the operation on some architectures. I just doubted whether it would
show an improvement to be worth doing at all. If it does, I'm all for it.
>> In short, I think the change should be suitable for all x86 CPUs,
>> but I would like to hear more opinions or see numbers for other
>> cores.
>
> I'd like to hear from other x86 experts too.
> Also I noticed that memcmp for x86_32 is defined as __builtin_memcmp
> (for x86_64 it is "rep cmp"). Why doesn't x86_64 use __builtin_memcmp?
> Is it really worse?
>
>
> J. R. Okajima
>
On Mon, Dec 13, 2010 at 07:25:05PM +1100, Nick Piggin wrote:
>On Mon, Dec 13, 2010 at 6:29 PM, J. R. Okajima <[email protected]> wrote:
>> But is the "rep cmp has quite a long latency" issue generic for all x86
>> architecture, or Westmere system specific?
>
>I don't believe it is Westmere specific. Intel and AMD have
>been improving these instructions in the past few years, so
>Westmere is probably as good or better than any.
>
>That said, rep cmp may not be as heavily optimized as the
>set and copy string instructions.
>
>In short, I think the change should be suitable for all x86 CPUs,
>but I would like to hear more opinions or see numbers for other
>cores.
>
How about other archs? If this is only for x86, shouldn't it be
protected by CONFIG_X86?
Thanks.
On Wed, Dec 15, 2010 at 12:38:40PM +0800, Américo Wang wrote:
> On Mon, Dec 13, 2010 at 07:25:05PM +1100, Nick Piggin wrote:
> >On Mon, Dec 13, 2010 at 6:29 PM, J. R. Okajima <[email protected]> wrote:
> >> But is the "rep cmp has quite a long latency" issue generic for all x86
> >> architecture, or Westmere system specific?
> >
> >I don't believe it is Westmere specific. Intel and AMD have
> >been improving these instructions in the past few years, so
> >Westmere is probably as good or better than any.
> >
> >That said, rep cmp may not be as heavily optimized as the
> >set and copy string instructions.
> >
> >In short, I think the change should be suitable for all x86 CPUs,
> >but I would like to hear more opinions or see numbers for other
> >cores.
> >
>
> How about other archs? If this is only for x86, shouldn't it be
> protected by CONFIG_X86?
That's what I would like to know, but I suspect that for the very short
strings we are dealing with, the custom loop will be fine for
everybody.
Thank you for the explanation.
I understood padding the tail bytes with zeros as an improvement on, or a
variation of, the 'long *' approach, but I didn't write about it in my
previous mail. If I had, you might not have needed to describe it again.
I have also compared "repe; cmpsb" and __builtin_memcmp in userspace,
and __builtin_memcmp is faster; it looks like the pointer-mixing
approach. Although the code size grows, in the source it is just a
simple #define, and it is already done for x86_32. Do you (or any other
x86 experts) think it is worth trying?
J. R. Okajima
Nick Piggin:
> What I meant is that a "normal" memcmp that does long * memory
> operations is not a good idea, because it requires code and branches
> to handle the tail of the string. When average string lengths are less
> than 16 bytes, it hardly seems worthwhile. It will just get more
> mispredicts and bigger icache footprint.
>
> However instead of a normal memcmp, we could actually pad dentry
> names out to sizeof(long) with zeros, and take advantage of that with
> a memcmp that does not have to handle tails -- it would operate
> entirely with longs.
>
> That would avoid icache and branch regressions, and might speed up
> the operation on some architectures. I just doubted whether it would
> show an improvement to be worth doing at all. If it does, I'm all for it.
On Tue, Dec 14, 2010 at 9:54 PM, Nick Piggin <[email protected]> wrote:
>
> That's what I would like to know, but I suspect that for the very short
> strings we are dealing with, the custom loop will be fine for
> everybody.
Yeah, I can pretty much guarantee it for the common case of short
strings. Most path components are short enough that a "clever"
memcmp() is simply likely going to be slower than doing things
byte-per-byte.
That's especially true if we then in the future end up making it do a
long-by-long compare instead of the byte-by-byte one.
Linus
On 12/15/2010 06:06 AM, Nick Piggin wrote:
>
> However instead of a normal memcmp, we could actually pad dentry
> names out to sizeof(long) with zeros, and take advantage of that with
> a memcmp that does not have to handle tails -- it would operate
> entirely with longs.
>
> That would avoid icache and branch regressions, and might speed up
> the operation on some architectures. I just doubted whether it would
> show an improvement to be worth doing at all. If it does, I'm all for it.
>
I agree that the byte-compare or long-compare should give you very close
results in modern pipeline CPUs. But surely 12 increments-and-test should
show up against 3 (or even 2). I would say it must be a better plan.
BTW, with the long approach, if you assume that the beginning of the string
is long-aligned then it is only a matter of comparing the last byte
with a mask, no branches. But I'm not saying, just make sure they are
padded.
Just my $0.017
Thanks
Boaz
From: Boaz Harrosh <[email protected]>
Date: Wed, 15 Dec 2010 15:15:09 +0200
> I agree that the byte-compare or long-compare should give you very close
> results in modern pipeline CPUs. But surely 12 increments-and-test should
> show up against 3 (or even 2). I would say it must be a better plan.
For strings of these lengths the setup code necessary to initialize
the inner loop and the tail code to handle the sub-word ending cases
eliminate whatever gains there are.
I know this as I've been hacking on assembler optimized strcmp() and
memcmp() in my spare time over the past year or so.
On Wed, Dec 8, 2010 at 11:09 PM, Nick Piggin <[email protected]> wrote:
> +static inline int dentry_memcmp(const unsigned char *cs,
> +				const unsigned char *ct, size_t count)
> +{
> +	while (count) {
> +		int ret = (*cs != *ct);
> +		if (ret)
> +			return ret;
> +		cs++;
> +		ct++;
> +		count--;
> +	}
> +	return 0;
> +}
Since you are proposing a routine that only compares file
names - I wonder whether it would be faster to start at the
end and work backwards? If the filenames are the same,
it makes no difference - you have to look at all the bytes.
But if they are different you might bail out earlier. There
are many applications that stick a common prefix onto
the start of filenames (just look in "/lib" !), but I think it is
less common to add a suffix (longer than "." single letter).
-Tony
On Wed, Dec 15, 2010 at 03:09:26PM -0800, Tony Luck wrote:
> On Wed, Dec 8, 2010 at 11:09 PM, Nick Piggin <[email protected]> wrote:
> > +static inline int dentry_memcmp(const unsigned char *cs,
> > +				const unsigned char *ct, size_t count)
> > +{
> > +	while (count) {
> > +		int ret = (*cs != *ct);
> > +		if (ret)
> > +			return ret;
> > +		cs++;
> > +		ct++;
> > +		count--;
> > +	}
> > +	return 0;
> > +}
>
> Since you are proposing a routine that only compares file
> names - I wonder whether it would be faster to start at the
> end and work backwards? If the filenames are the same,
> it makes no difference - you have to look at all the bytes.
> But if they are different you might bail out earlier. There
> are many applications that stick a common prefix onto
> the start of filenames (just look in "/lib" !), but I think it is
> less common to add a suffix (longer than "." single letter).
That's true, and an interesting point. However I have managed to fit
the first 8 bytes of the name (in the case of shortnames) into the
same single dentry cacheline that is used for path walking. So that
might negate some of the benefits of walking backwards.
I would encourage anybody to grab the branch and try out any tweaks,
though...
On 12/15/2010 08:00 PM, David Miller wrote:
> From: Boaz Harrosh <[email protected]>
> Date: Wed, 15 Dec 2010 15:15:09 +0200
>
>> I agree that the byte-compare or long-compare should give you very close
>> results in modern pipeline CPUs. But surely 12 increments-and-test should
>> show up against 3 (or even 2). I would say it must be a better plan.
>
> For strings of these lengths the setup code necessary to initialize
> the inner loop and the tail code to handle the sub-word ending cases
> eliminate whatever gains there are.
>
You misunderstood me. I'm saying that we know the beginning of the
string is aligned and Nick offered to pad the last long, so surely
a shift by 2 (or 3) + the reduction of the 12 dec-and-test to 3
should give you an optimization?
> I know this as I've been hacking on assembler optimized strcmp() and
> memcmp() in my spare time over the past year or so.
Boaz
On Thu, Dec 16, 2010 at 8:53 PM, Boaz Harrosh <[email protected]> wrote:
> On 12/15/2010 08:00 PM, David Miller wrote:
>> From: Boaz Harrosh <[email protected]>
>> Date: Wed, 15 Dec 2010 15:15:09 +0200
>>
>>> I agree that the byte-compare or long-compare should give you very close
>>> results in modern pipeline CPUs. But surely 12 increments-and-test should
>>> show up against 3 (or even 2). I would say it must be a better plan.
>>
>> For strings of these lengths the setup code necessary to initialize
>> the inner loop and the tail code to handle the sub-word ending cases
>> eliminate whatever gains there are.
>>
>
> You misunderstood me. I'm saying that we know the beginning of the
> string is aligned and Nick offered to pad the last long, so surely
> a shift by 2 (or 3) + the reduction of the 12 dec-and-test to 3
> should give you an optimization?
Masking is still going to take a bit more code, but perhaps. We're talking
a handful of cycles here, so if you add a branch mispredict, or icache
miss, you'll kill your savings.
This is what I've got at the moment, which adds only 8 bytes over the
rep cmp variant, in the __d_lookup_rcu function.
static inline int dentry_memcmp(const unsigned char *cs,
				const unsigned char *ct, size_t count)
{
	int ret;
	do {
		ret = (*cs != *ct);
		if (ret)
			break;
		cs++;
		ct++;
		count--;
	} while (count);
	return ret;
}
Now if we pad and zero fill the dentry name, then we can compare with
the path string, but we still have to mask that guy (unfortunately, I
didn't consider that earlier) so it's not trivial and adds quite a bit of code
size and branches:
static inline int dentry_memcmp_long(const unsigned char *cs,
				const unsigned char *ct, ssize_t count)
{
	const unsigned long *ls = (const unsigned long *)cs;
	const unsigned long *lt = (const unsigned long *)ct;
	int ret;

	do {
		unsigned long t = *lt;
		unsigned long s = *ls;
		int shift = 0;

		if (count < 8)
			shift = (8 - count) * 8;
		t = t & (0xffffffffffffffff >> shift);
		ret = (s != t);
		if (ret)
			break;
		ls++;
		lt++;
		count-=8;
	} while (count > 0);
	return ret;
}
Something like this should work on little endian. You'd have to coax gcc to
generate a cmov to get rid of that branch I think, because it won't be
predictable for small string lengths. But then you have to think about
icache...
On 12/16/2010 03:13 PM, Nick Piggin wrote:
> On Thu, Dec 16, 2010 at 8:53 PM, Boaz Harrosh <[email protected]> wrote:
>> On 12/15/2010 08:00 PM, David Miller wrote:
>>> From: Boaz Harrosh <[email protected]>
>>> Date: Wed, 15 Dec 2010 15:15:09 +0200
>>>
>>>> I agree that the byte-compare or long-compare should give you very close
>>>> results in modern pipeline CPUs. But surely 12 increments-and-test should
>>>> show up against 3 (or even 2). I would say it must be a better plan.
>>>
>>> For strings of these lengths the setup code necessary to initialize
>>> the inner loop and the tail code to handle the sub-word ending cases
>>> eliminate whatever gains there are.
>>>
>>
>> You misunderstood me. I'm saying that we know the beginning of the
>> string is aligned and Nick offered to pad the last long, so surely
>> a shift by 2 (or 3) + the reduction of the 12 dec-and-test to 3
>> should give you an optimization?
>
> Masking is still going to take a bit more code, but perhaps. We're talking
> a handful of cycles here, so if you add a branch mispredict, or icache
> miss, you'll kill your savings.
>
> This is what I've got at the moment, which adds only 8 bytes over the
> rep cmp variant, in the __d_lookup_rcu function.
>
> static inline int dentry_memcmp(const unsigned char *cs,
> const unsigned char *ct, size_t count)
> {
> int ret;
> do {
> ret = (*cs != *ct);
> if (ret)
> break;
> cs++;
> ct++;
> count--;
> } while (count);
> return ret;
> }
>
> Now if we pad and zero fill the dentry name, then we can compare with
> the path string, but we still have to mask that guy (unfortunately, I
> didn't consider that earlier) so it's not trivial and adds quite a bit of code
> size and branches:
>
> static inline int dentry_memcmp_long(const unsigned char *cs,
> const unsigned char *ct, ssize_t count)
> {
> const unsigned long *ls = (const unsigned long *)cs;
> const unsigned long *lt = (const unsigned long *)ct;
>
> int ret;
> do {
> unsigned long t = *lt;
> unsigned long s = *ls;
> int shift = 0;
>
> if (count < 8)
> shift = (8 - count) * 8;
> t = t & (0xffffffffffffffff >> shift);
> ret = (s != t);
> if (ret)
> break;
> ls++;
> lt++;
> count-=8;
> } while (count > 0);
> return ret;
> }
>
> Something like this should work on little endian. You'd have to coax gcc to
> generate a cmov to get rid of that branch I think, because it won't be
> predictable for small string lengths. But then you have to think about
> icache...
static inline int dentry_memcmp_long(const unsigned char *cs,
				const unsigned char *ct, ssize_t count)
{
	int ret;
	const unsigned long *ls = (const unsigned long *)cs;
	const unsigned long *lt = (const unsigned long *)ct;

	while (count > 8) {
		ret = (*cs != *ct);
		if (ret)
			break;
		cs++;
		ct++;
		count-=8;
	}
	if (count) {
		unsigned long t = *ct & ((0xffffffffffffffff >> ((8 - count) * 8))
		ret = (*cs != t)
	}
	return ret;
}
Same as yours but just to avoid the branch inside the loop
and slightly smaller code.
BTW: On some ARCHs ++foo is faster than foo++
and Also, is there a reason not to write the above loop as:
while(c-- && (ret = (*cs++ != *ct++)))
;
Thanks
Boaz
On Fri, Dec 17, 2010 at 1:03 AM, Boaz Harrosh <[email protected]> wrote:
> On 12/16/2010 03:13 PM, Nick Piggin wrote:
> static inline int dentry_memcmp_long(const unsigned char *cs,
> 				const unsigned char *ct, ssize_t count)
> {
> 	int ret;
> 	const unsigned long *ls = (const unsigned long *)cs;
> 	const unsigned long *lt = (const unsigned long *)ct;
>
> 	while (count > 8) {
> 		ret = (*cs != *ct);
> 		if (ret)
> 			break;
> 		cs++;
> 		ct++;
> 		count-=8;
> 	}
> 	if (count) {
> 		unsigned long t = *ct & ((0xffffffffffffffff >> ((8 - count) * 8))
> 		ret = (*cs != t)
> 	}
>
> 	return ret;
> }
>
> Same as yours but just to avoid the branch inside the loop
> and slightly smaller code.
That's true, it should make the branch more predictable too. Well,
maybe. I'm going to leave it as-is for now, but I would welcome any
test results or ideas for improvements.
> BTW: On some ARCHs ++foo is faster than foo++
>     and Also, is there a reason not to write the above loop as:
>
>     while(c-- && (ret = (*cs++ != *ct++)))
>             ;
gcc should turn it into the same thing these days. I prefer avoiding
expressions with side effects in control statements (other than a
simple counter).
Thanks,
Nick
On Thu, Dec 16, 2010 at 1:53 AM, Boaz Harrosh <[email protected]> wrote:
>
> You misunderstood me. I'm saying that we know the beginning of the
> string is aligned and Nick offered to pad the last long, so surely
> a shift by 2 (or 3) + the reduction of the 12 dec-and-test to 3
> should give you an optimization?
Sadly, right now we don't know that the string is necessarily even aligned.
Yes, it's always aligned in a dentry, because it's either the inline
short string, or it's the longer string we explicitly allocated to the
dentry.
But when we do name compares in __d_lookup, only one part of that is a
dentry. The other is a qstr, and the name there is not aligned. In
fact, it's not even NUL-terminated. It's the data directly from the
path itself.
So we can certainly do compares a "long" at a time, but it's not
entirely trivial. And just making the dentries be aligned and
null-padded is not enough. Most likely, you'd have to make the dentry
name compare function do an unaligned load from the qstr part, and
then do the masking.
Which is likely still the best performance on something like x86 where
unaligned loads are cheap, but on other architectures it might be less
so.
Linus
From: Boaz Harrosh <[email protected]>
Date: Thu, 16 Dec 2010 11:53:09 +0200
> You misunderstood me. I'm saying that we know the beginning of the
> string is aligned and Nick offered to pad the last long, so surely
> a shift by 2 (or 3) + the reduction of the 12 dec-and-test to 3
> should give you an optimization?
Indeed, we could do something interesting in the case where both
strings are tail padded like that.
> static inline int dentry_memcmp_long(const unsigned char *cs,
> const unsigned char *ct, ssize_t count)
> {
> int ret;
> const unsigned long *ls = (const unsigned long *)cs;
> const unsigned long *lt = (const unsigned long *)ct;
>
> while (count > 8) {
> ret = (*cs != *ct);
> if (ret)
> break;
> cs++;
> ct++;
> count-=8;
> }
> if (count) {
> unsigned long t = *ct & ((0xffffffffffffffff >> ((8 - count) * 8))
> ret = (*cs != t)
> }
>
> return ret;
> }
First, let's get the code right, and use correct types, but also, there
are some tricks to reduce the masking cost.
As long as you have to mask one string, *and* don't have to worry about
running off the end of mapped memory, there's no additional cost to
masking both in the loop. Just test (a ^ b) & mask.
#if 1 /* Table lookup */
static unsigned long const dentry_memcmp_mask[8] = {
#if defined(__LITTLE_ENDIAN)
	0xff, 0xffff, 0xffffff, 0xffffffff,
#if sizeof (unsigned long) > 4
	0xffffffffff, 0xffffffffffff, 0xffffffffffffff, 0xffffffffffffffff
#endif
#elif defined(__BIG_ENDIAN)
#if sizeof (unsigned long) == 4
	0xff000000, 0xffff0000, 0xffffff00, 0xffffffff
#else
	0xff00000000000000, 0xffff000000000000,
	0xffffff0000000000, 0xffffffff00000000,
	0xffffffffff000000, 0xffffffffffff0000,
	0xffffffffffffff00, 0xffffffffffffffff
#endif
#else
#error undefined endianness
#endif
};
#define dentry_memcmp_mask(count) dentry_memcmp_mask[(count)-1]
#else /* In-line code */
#if defined(__LITTLE_ENDIAN)
#define dentry_memcmp_mask(count) (-1ul >> (sizeof 1ul - (count)) * 8)
#elif defined(__BIG_ENDIAN)
#define dentry_memcmp_mask(count) (-1ul << (sizeof 1ul - (count)) * 8)
#else
#error undefined endianness
#endif
static inline bool dentry_memcmp_long(unsigned char const *cs,
	unsigned char const *ct, ssize_t count)
{
	unsigned long const *ls = (unsigned long const *)cs;
	unsigned long const *lt = (unsigned long const *)ct;

	while (count > sizeof *cs) {
		if (*cs != *ct)
			return true;
		cs++;
		ct++;
		count -= sizeof *cs;
	}
	/* Last 1..8 bytes */
	return ((*ct ^ *cs) & dentry_memcmp_mask(count)) != 0;
}
If your string is at least 8 bytes long, and the processor has fast unaligned
loads, you can skip the mask entirely by redundantly comparing some bytes
(although the code bloat is probably undesirable for inline code):
static inline bool dentry_memcmp_long(const unsigned char *cs,
	const unsigned char *ct, ssize_t count)
{
	unsigned long const *ls = (unsigned long const *)cs;
	unsigned long const *lt = (unsigned long const *)ct;

	if (count < sizeof *cs)
		return ((*ct ^ *cs) & dentry_memcmp_mask(count)) != 0;

	while (count > sizeof *cs) {
		if (*cs != *ct)
			return true;
		cs++;
		ct++;
		count -= sizeof *cs;
	}
	cs = (unsigned long const *)((char const *)cs + count - sizeof *cs);
	ct = (unsigned long const *)((char const *)ct + count - sizeof *ct);
	return *cs != *ct;
}
On 12/19/2010 12:54 AM, George Spelvin wrote:
>> static inline int dentry_memcmp_long(const unsigned char *cs,
>> const unsigned char *ct, ssize_t count)
>> {
>> int ret;
>> const unsigned long *ls = (const unsigned long *)cs;
>> const unsigned long *lt = (const unsigned long *)ct;
>>
>> while (count > 8) {
>> ret = (*cs != *ct);
>> if (ret)
>> break;
>> cs++;
>> ct++;
>> count-=8;
>> }
>> if (count) {
>> unsigned long t = *ct & ((0xffffffffffffffff >> ((8 - count) * 8))
>> ret = (*cs != t)
>> }
>>
>> return ret;
>> }
>
> First, let's get the code right, and use correct types, but also, there
> are some tricks to reduce the masking cost.
>
> As long as you have to mask one string, *and* don't have to worry about
> running off the end of mapped memory, there's no additional cost to
> masking both in the loop. Just test (a ^ b) & mask.
>
> #if 1 /* Table lookup */
>
> static unsigned long const dentry_memcmp_mask[8] = {
> #if defined(__LITTLE_ENDIAN)
> 0xff, 0xffff, 0xffffff, 0xffffffff,
> #if sizeof (unsigned long) > 4
> 0xffffffffff, 0xffffffffffff, 0xffffffffffffff, 0xffffffffffffffff
> #endif
> #elif defined(__BIG_ENDIAN)
> #if sizeof (unsigned long) == 4
> 0xff000000, 0xffff0000, 0xffffff00, 0xffffffff
> #else
> 0xff00000000000000, 0xffff000000000000,
> 0xffffff0000000000, 0xffffffff00000000,
> 0xffffffffff000000, 0xffffffffffff0000,
> 0xffffffffffffff00, 0xffffffffffffffff
> #endif
> #else
> #error undefined endianness
> #endif
> };
>
> #define dentry_memcmp_mask(count) dentry_memcmp_mask[(count)-1]
>
> #else /* In-line code */
>
> #if defined(__LITTLE_ENDIAN)
> #define dentry_memcmp_mask(count) (-1ul >> (sizeof 1ul - (count)) * 8)
> #elif defined(__BIG_ENDIAN)
> #define dentry_memcmp_mask(count) (-1ul << (sizeof 1ul - (count)) * 8)
> #else
> #error undefined endianness
>
> #endif
>
Thanks, yes, that fixes the _ENDIANness problem, and the unsigned maths.
I was considering the table as well. The table might be better or not,
considering the pipeline and multiple instructions per clock; x86
especially likes operating on const values. It should be tested
with both #ifs.
> static inline bool dentry_memcmp_long(unsigned char const *cs,
> unsigned char const *ct, ssize_t count)
> {
> unsigned long const *ls = (unsigned long const *)cs;
> unsigned long const *lt = (unsigned long const *)ct;
>
> while (count > sizeof *cs) {
> if (*cs != *ct)
> return true;
> cs++;
> ct++;
> count -= sizeof *cs;
> }
> /* Last 1..8 bytes */
What if at this point count == 0? I think you still need the if (count).
> return ((*ct ^ *cs) & dentry_memcmp_mask(count)) != 0;
OK, I like it; my version could be equally fast if the compiler
would use cmov, but not all archs have it.
> }
>
> If your string is at least 8 bytes long, and the processor has fast unaligned
> loads, you can skip the mask entirely by redundantly comparing some bytes
> (although the code bloat is probably undesirable for inline code):
>
> static inline bool dentry_memcmp_long(const unsigned char *cs,
> const unsigned char *ct, ssize_t count)
> {
> unsigned long const *ls = (unsigned long const *)cs;
> unsigned long const *lt = (unsigned long const *)ct;
>
> if (count < sizeof *cs)
> return ((*ct ^ *cs) & dentry_memcmp_mask(count)) != 0;
>
> while (count > sizeof *cs) {
> if (*cs != *ct)
> return true;
> cs++;
> ct++;
> count -= sizeof *cs;
> }
> cs = (unsigned long const *)((char const *)cs + count - sizeof *cs);
> ct = (unsigned long const *)((char const *)ct + count - sizeof *ct);
> return *cs != *ct;
> }
As Linus said, all this is moot: the qstr part of the compare is not long-aligned
at the start. So I guess we come up losing.
Thanks
Boaz
On Sun, Dec 19, 2010 at 9:54 AM, George Spelvin <[email protected]> wrote:
>> static inline int dentry_memcmp_long(const unsigned char *cs,
>> 				const unsigned char *ct, ssize_t count)
>> {
>> 	int ret;
>> 	const unsigned long *ls = (const unsigned long *)cs;
>> 	const unsigned long *lt = (const unsigned long *)ct;
>>
>> 	while (count > 8) {
>> 		ret = (*cs != *ct);
>> 		if (ret)
>> 			break;
>> 		cs++;
>> 		ct++;
>> 		count-=8;
>> 	}
>> 	if (count) {
>> 		unsigned long t = *ct & ((0xffffffffffffffff >> ((8 - count) * 8))
>> 		ret = (*cs != t)
>> 	}
>>
>> 	return ret;
>> }
>
> First, let's get the code right, and use correct types, but also, there
You still used the wrong vars in the loop.
> are some tricks to reduce the masking cost.
>
> As long as you have to mask one string, *and* don't have to worry about
> running off the end of mapped memory, there's no additional cost to
> masking both in the loop. Just test (a ^ b) & mask.
I did consider using a lookup table, but maybe not well enough. It is
another cacheline, but common to all lookups. So it could well be
worth it; let's keep your code around...
The big problem for CPUs that don't do well on this type of code is
what the string goes through during the entire syscall.
First, a byte-by-byte strcpy_from_user of the whole name string to
kernel space. Then a byte-by-byte chunking and hashing component
paths according to '/'. Then a byte-by-byte memcmp against the
dentry name.
I'd love to do everything with 8 byte loads, do the component
separation and hashing at the same time as copy from user, and
have the padded and aligned component strings and their hash
available... but complexity.
On my Westmere system, time to do a stat is 640 cycles plus 10
cycles for every byte in the string (this cost holds perfectly
from 1 byte name up to 32 byte names in my test range).
`git diff` average path name strings are 31 bytes, although this
is much less cache friendly, and over several components (my
test is just a single component).
But still, even if the base cost were doubled, it may still
spend 20% or so of kernel cycles on name string handling.
This 8 byte memcpy takes my microbenchmark down to 8 cycles per
byte, so it may get several more % on git diff.
Careful thinking about the initial strcpy_from_user and the
hashing code could shave another few cycles off it. Well
worth investigating, I think.
> First, a byte-by-byte strcpy_from_user of the whole name string to
> kernel space. Then a byte-by-byte chunking and hashing component
> paths according to '/'. Then a byte-by-byte memcmp against the
> dentry name.
Well, you've put your finger on the obvious place to do the aligned copy,
if you want. Looking into it, I note that __d_lookup() does a 32-bit
hash compare before doing a string compare, so the memcmp should almost
always succeed. (Failures are due to hash collisions and rename races.)
> I'd love to do everything with 8 byte loads, do the component
> separation and hashing at the same time as copy from user, and
> have the padded and aligned component strings and their hash
> available... but complexity.
It actually doesn't seem that difficult to do in fs/namei.c:do_getname
via a heavily hacked strncpy_from_user. Call it strhash_from_user, and
it copies the string, while accumulating the hash and the length,
until it hits a nul or /. It has to return the length, the hash,
and the termination condition: nul, /, or space exhausted.
Then you could arrange for each component to be padded so that it
starts aligned.
(The hash and length can be stored directly in the qstr. The length is
only otherwise needed for components that end in /, so you could return
a magic negative length in those cases.)
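A very rough sketch of that idea (the name, return convention and error
handling here are all made up, just to show the shape):

/*
 * Hypothetical helper: copy one path component from userspace while
 * hashing it, stopping at NUL or '/'.  Returns the component length
 * (or -EFAULT on a fault) and stores the finished hash in *hashp.
 * Assumes the caller provides at least space + 1 bytes at dst.
 */
static long strhash_from_user(char *dst, const char __user *src,
				long space, u32 *hashp)
{
	u32 hash = 0;			/* init_name_hash() */
	long len = 0;

	while (len < space) {
		char c;

		if (get_user(c, src + len))
			return -EFAULT;
		if (c == '\0' || c == '/')
			break;
		dst[len++] = c;
		hash = partial_name_hash(c, hash);
	}
	dst[len] = '\0';
	*hashp = end_name_hash(hash);
	return len;	/* NUL vs '/' vs exhaustion left to the caller in this sketch */
}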
If it matters, Bob Jenkins wrote a "one-at-a-time" hash function
that is actually slightly less work per character than the current
partial_name_hash. (Two shifts, two adds and a *11 gets turned into
two shifts, two adds, and an XOR.)
static inline u32 __attribute__((const))
partial_name_hash(unsigned char c, u32 hash)
{
	hash += c;
	hash += hash << 10;
	hash ^= hash >> 6;
	return hash;
}

static inline u32 end_name_hash(u32 hash)
{
	hash += hash << 3;
	hash ^= hash >> 11;
	hash += hash << 15;
	return hash;
}
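(For reference, the per-byte and finalization steps combine the same way
the existing helpers are used today; component_hash is just an
illustrative name:)

static inline u32 component_hash(const unsigned char *name, unsigned int len)
{
	u32 hash = 0;			/* init_name_hash() */

	while (len--)
		hash = partial_name_hash(*name++, hash);
	return end_name_hash(hash);
}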
(I note that there's zero reason for the current partial_name_hash to
use an unsigned long intermediate value, since the high 32 bits have no
effect on the final result. It just forces unnecessary REX prefixes.)
The main problem with initializing all the qstr structures early is that
it can lead to an oddly dynamic PATH_MAX and/or a type of kernel
DOS by passing in paths with many short directory names that would need
maximal padding.
> On my Westmere system, time to do a stat is 640 cycles plus 10
> cycles for every byte in the string (this cost holds perfectly
> from 1 byte name up to 32 byte names in my test range).
> `git diff` average path name strings are 31 bytes, although this
> is much less cache friendly, and over several components (my
> test is just a single component).
Thank you for this useful real-world number!
I hadn't realized the base cost was so low.
On Sun, Dec 19, 2010 at 12:06:46PM -0500, George Spelvin wrote:
> > First, a byte-by-byte strcpy_from_user of the whole name string to
> > kernel space. Then a byte-by-byte chunking and hashing component
> > paths according to '/'. Then a byte-by-byte memcmp against the
> > dentry name.
>
> Well, you've put your finger on the obvious place to to the aligned copy,
> if you want.
Yes, this would be a nice place to align things and parse '/' components,
although in fact it doesn't strictly have to be a byte-by-byte copy
(although doing long loads would put a lot more complexity in fault
handling and parsing).
> Looking into it, I note that __d_lookup() does a 32-bit
> hash compare before doing a string compare, so the memcmp should almost
> always succeed. (Failures are due to hash collisions and rename races.)
>
> > I'd love to do everything with 8 byte loads, do the component
> > separation and hashing at the same time as copy from user, and
> > have the padded and aligned component strings and their hash
> > available... but complexity.
>
> It actually doesn't seem that difficult to do in fs/namei.c:do_getname
> via a heavily hacked strncpy_from_user. Call it strhash_from_user, and
> it copies the string, while accumulating the hash and the length,
> until it hits a nul or /. It has to return the length, the hash,
> and the termination condition: nul, /, or space exhausted.
>
> Then you could arrange for each component to be padded so that it
> starts aligned.
>
> (The hash and length can be stored directly in the qstr. The length is
> only otherwise needed for components that end in /, so you could return
> a magic negative length in those cases.)
Probably, due to bloat caused by the hash for small names, and that
some filesystems do their own hashing, we should still hash in the
same place. However we can do sizeof(long) at a time hashing, which
should be nice and fast too.
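As a sketch only (the mixing step below is a placeholder, not a proposed
hash, and it assumes the name is zero-padded out to a whole number of
longs):

static inline unsigned long hash_name_longs(const unsigned long *name,
				unsigned int nlongs)
{
	unsigned long hash = 0;

	while (nlongs--)
		hash = hash * 11 + *name++;	/* placeholder mixing step */
	return hash;
}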
> The main problem with initializing all the qstr structures early is that
> it can lead to an oddly dynamic PATH_MAX and/or a type of kernel
> DOS by passing in paths with many short directory names that would need
> maximal padding.
If the results are compelling enough, this could be worked around
some way. Simple realloc probably, seeing as it would be a rare case.