2010-11-01 05:36:44

by Hitoshi Mitake

Subject: Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy

On 2010-10-31 04:23, Ingo Molnar wrote:
>
> * Hitoshi Mitake<[email protected]> wrote:
>
>> This patch adds a new file, mem-memcpy-x86-64-asm.S,
>> for x86-64 specific memcpy() benchmarking.
>> The newly added benchmarks are:
>> x86-64-rep: memcpy() implemented with the rep instruction
>> x86-64-unrolled: unrolled memcpy()
>>
>> The original idea of including kernel source files
>> for benchmarking was suggested by Ingo Molnar.
>> This is more effective than write-once programs for quantitative
>> evaluation of small, in-kernel leaf functions that are called frequently,
>> because perf bench is in the kernel source tree and is easy to run
>> on various hardware, especially new CPU models.
>>
>> This approach can also be used for other kernel functions, e.g. checksum functions.
>>
>> Example of usage on Core i3 M330:
>>
>> | % ./perf bench mem memcpy -l 500MB
>> | # Running mem/memcpy benchmark...
>> | # Copying 500MB Bytes from 0x7f911f94c010 to 0x7f913ed4d010 ...
>> |
>> | 578.732506 MB/Sec
>> | % ./perf bench mem memcpy -l 500MB -r x86-64-rep
>> | # Running mem/memcpy benchmark...
>> | # Copying 500MB Bytes from 0x7fb4b6fe4010 to 0x7fb4d63e5010 ...
>> |
>> | 738.184980 MB/Sec
>> | % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
>> | # Running mem/memcpy benchmark...
>> | # Copying 500MB Bytes from 0x7f6f2e668010 to 0x7f6f4da69010 ...
>> |
>> | 767.483269 MB/Sec
>>
>> This clearly shows that the unrolled memcpy() is more efficient
>> than the rep version and glibc's one :)
>
> Hey, really cool output :-)
>
> Might also make sense to measure Ma Ling's patched version?

By "Ma Ling's patched version", do you mean memcpy() with the following patch applied?

http://marc.info/?l=linux-kernel&m=128652296500989&w=2

(It seems that this patch was written by Miao Xie.)

I'll include the results for the patched version in my next post.

>
>> # checkpatch.pl warns about two externs in bench/mem-memcpy.c
>> # added by this patch. But I don't think this is a problem.
>
> You should put these:
>
> +#ifdef ARCH_X86_64
> +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
> +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
> +#endif
>
> into a .h file - a new one if needed.
>
> That will make both checkpatch and me happier ;-)
>

OK, I'll move these declarations into a separate header file.

BTW, I found a really interesting evaluation result.
The current results of "perf bench mem memcpy" include
the overhead of page faults, because the measured memcpy()
is the first access to the allocated memory area.

I tested another version of perf bench mem memcpy,
which calls memcpy() once before the measured memcpy() to remove
the overhead caused by page faults.

And this is the result:

% ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...

4.608340 GB/Sec

% ./perf bench mem memcpy -l 500MB
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...

4.856442 GB/Sec

% ./perf bench mem memcpy -l 500MB -r x86-64-rep
# Running mem/memcpy benchmark...
# Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...

6.024445 GB/Sec

The ranking of the scores is reversed!
I cannot explain the cause of this result, and
it is a really interesting phenomenon.

So I'd like to add a new command line option,
such as "--pre-page-faults", to perf bench mem memcpy,
which calls memcpy() once before the measured memcpy().

What do you think about this idea?
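
The prefaulting idea described above can be sketched in a few lines of C. This is only a minimal illustration under stated assumptions, not the code of perf bench: the helper name copy_bytes_per_sec() is hypothetical, buffers come from calloc() rather than perf's zalloc(), and timing uses gettimeofday() as the benchmark does by default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

typedef void *(*memcpy_t)(void *, const void *, size_t);

/*
 * Hypothetical helper (not part of perf bench): copy `len` bytes once with
 * `fn` and return the throughput in bytes per second, or -1.0 if allocation
 * fails. With `prefault` set, an untimed copy runs first so that the timed
 * copy does not pay for first-touch page faults.
 */
static double copy_bytes_per_sec(memcpy_t fn, size_t len, bool prefault)
{
	struct timeval start, end;
	double secs;
	void *src, *dst;

	assert(fn != NULL && len > 0);

	src = calloc(1, len);
	dst = calloc(1, len);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}

	if (prefault)
		fn(dst, src, len);	/* untimed: touches every page once */

	gettimeofday(&start, NULL);
	fn(dst, src, len);		/* the measured copy */
	gettimeofday(&end, NULL);

	secs = (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_usec - start.tv_usec) / 1e6;

	free(src);
	free(dst);

	/* Very small copies can finish within one clock tick. */
	return secs > 0.0 ? (double)len / secs : 0.0;
}
```

Calling copy_bytes_per_sec(memcpy, len, false) and then copy_bytes_per_sec(memcpy, len, true) with a large len should show the same kind of gap as the MB/Sec and GB/Sec figures above, although the exact numbers depend on the machine, and an optimizing compiler may need extra care so that the copies are not elided.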

Thanks,


2010-11-01 09:03:26

by Ingo Molnar

Subject: Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy


* Hitoshi Mitake <[email protected]> wrote:

> On 2010-10-31 04:23, Ingo Molnar wrote:
> >
> >Hey, really cool output :-)
> >
> >Might also make sense to measure Ma Ling's patched version?
>
> By "Ma Ling's patched version", do you mean memcpy() with the following patch applied?
>
> http://marc.info/?l=linux-kernel&m=128652296500989&w=2
>
> (It seems that this patch was written by Miao Xie.)
>
> I'll include the results for the patched version in my next post.

(Indeed it is Miao Xie - sorry!)

> >># checkpatch.pl warns about two externs in bench/mem-memcpy.c
> >># added by this patch. But I think it is no problem.
> >
> >You should put these:
> >
> > +#ifdef ARCH_X86_64
> > +extern void *memcpy_x86_64_unrolled(void *to, const void *from, size_t len);
> > +extern void *memcpy_x86_64_rep(void *to, const void *from, size_t len);
> > +#endif
> >
> >into a .h file - a new one if needed.
> >
> >That will make both checkpatch and me happier ;-)
> >
>
> OK, I'll move these declarations into a separate header file.
>
> BTW, I found a really interesting evaluation result.
> The current results of "perf bench mem memcpy" include
> the overhead of page faults, because the measured memcpy()
> is the first access to the allocated memory area.
>
> I tested another version of perf bench mem memcpy,
> which calls memcpy() once before the measured memcpy() to remove
> the overhead caused by page faults.
>
> And this is the result:
>
> % ./perf bench mem memcpy -l 500MB -r x86-64-unrolled
> # Running mem/memcpy benchmark...
> # Copying 500MB Bytes from 0x7f19d488f010 to 0x7f19f3c90010 ...
>
> 4.608340 GB/Sec
>
> % ./perf bench mem memcpy -l 500MB
> # Running mem/memcpy benchmark...
> # Copying 500MB Bytes from 0x7f696c3cc010 to 0x7f698b7cd010 ...
>
> 4.856442 GB/Sec
>
> % ./perf bench mem memcpy -l 500MB -r x86-64-rep
> # Running mem/memcpy benchmark...
> # Copying 500MB Bytes from 0x7f45d6cff010 to 0x7f45f6100010 ...
>
> 6.024445 GB/Sec
>
> The ranking of the scores is reversed!
> I cannot explain the cause of this result, and
> it is a really interesting phenomenon.

Interesting indeed, and it would be nice to analyse that! (It should be possible,
using various PMU metrics in a clever way, to figure out what's happening inside the
CPU, right?)
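
Short of PMU counters, one cheap way to check the page-fault part of the explanation is the process's minor-fault count from getrusage(2). The sketch below is only an illustration under stated assumptions: minor_faults() and faults_during_copy() are hypothetical helper names, not part of perf, and the count says nothing about why the ranking of the routines reverses once the faults are gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Minor page faults taken by this process so far, or -1 on error. */
static long minor_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_minflt;
}

/*
 * Hypothetical helper: number of minor faults taken during one memcpy()
 * of `len` bytes between freshly allocated buffers. With `prefault` set,
 * an uncounted copy runs first. Returns -1 on error.
 */
static long faults_during_copy(size_t len, bool prefault)
{
	long before, after;
	void *src, *dst;

	assert(len > 0);

	src = calloc(1, len);
	dst = calloc(1, len);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1;
	}

	if (prefault)
		memcpy(dst, src, len);	/* not counted: touches the pages */

	before = minor_faults();
	memcpy(dst, src, len);		/* the copy being observed */
	after = minor_faults();

	free(src);
	free(dst);

	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

On a typical Linux system the count without prefaulting is expected to be far larger than the count with it, but the exact values depend on the allocator, the page size, and whether transparent huge pages are in use.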

> So I'd like to add a new command line option,
> such as "--pre-page-faults", to perf bench mem memcpy,
> which calls memcpy() once before the measured memcpy().
>
> What do you think about this idea?

Agreed. (Maybe name it --prefault, as 'prefaulting' is the term we generally use for
things like this.)

An even better solution would be to output _both_ results by default, so that people
can see both characteristics at a glance?

Thanks,

Ingo

2010-11-05 17:06:01

by Hitoshi Mitake

Subject: Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy

On 2010-11-01 18:02, Ingo Molnar wrote:
>
> * Hitoshi Mitake<[email protected]> wrote:
>
>> So I'd like to add a new command line option,
>> such as "--pre-page-faults", to perf bench mem memcpy,
>> which calls memcpy() once before the measured memcpy().
>>
>> What do you think about this idea?
>
> Agreed. (Maybe name it --prefault, as 'prefaulting' is the term we generally use for
> things like this.)
>
> An even better solution would be to output _both_ results by default, so that people
> can see both characteristics at a glance?

Outputting both the prefaulted and non-prefaulted results would be useful,
but it might not be good for use from scripts.
So I'll implement the --prefault option first. If there is a request
for outputting both, I'll consider modifying the default output.

# Please wait for the results of Miao Xie's patch;
# benchmarking memcpy() on unaligned memory areas is
# a little difficult.

Thanks,
Hitoshi

2010-11-10 09:12:43

by Ingo Molnar

Subject: Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy


* Hitoshi Mitake <[email protected]> wrote:

> > An even better solution would be to output _both_ results by default, so that
> > people can see both characteristics at a glance?
>
> Outputting both the prefaulted and non-prefaulted results would be useful, but it
> might not be good for use from scripts. So I'll implement the --prefault option
> first. If there is a request for outputting both, I'll consider modifying the
> default output.

Ok - it should definitely be easily scriptable. The default can be to have both flags
enabled and both results written to the output.

People will try 'perf bench x86' to see performance at a glance - so printing all
the tests we have is a good idea.

Thanks,

Ingo

2010-11-12 15:01:57

by Hitoshi Mitake

Subject: Re: [PATCH 2/2] perf bench: add x86-64 specific benchmarks to perf bench mem memcpy

On 2010-11-10 18:12, Ingo Molnar wrote:
>
> * Hitoshi Mitake<[email protected]> wrote:
>
>>> An even better solution would be to output _both_ results by default, so that
>>> people can see both characteristics at a glance?
>>
>> Outputting both the prefaulted and non-prefaulted results would be useful, but it
>> might not be good for use from scripts. So I'll implement the --prefault option
>> first. If there is a request for outputting both, I'll consider modifying the
>> default output.
>
> Ok - it should definitely be easily scriptable. The default can be to have both flags
> enabled and both results written to the output.
>
> People will try 'perf bench x86' to see performance at a glance - so printing all
> the tests we have is a good idea.

OK, I added --no-prefault and --only-prefault to perf bench mem memcpy.
As you said, printing both of them is convenient.

I'll send the updated patch later.
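
How the two options select which measurements run can be sketched as a small pure function. This is a hedged illustration only: struct prefault_plan and resolve_prefault() are hypothetical names and do not appear in the patch; the rule that giving both options behaves like giving neither follows the patch posted in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Which measurements to run; the names are illustrative only. */
struct prefault_plan {
	bool run_plain;		/* measure without prefaulting */
	bool run_prefault;	/* measure with prefaulting */
};

static struct prefault_plan resolve_prefault(bool only_prefault,
					     bool no_prefault)
{
	struct prefault_plan plan;

	/* Giving both options is treated the same as giving neither. */
	if (only_prefault && no_prefault)
		only_prefault = no_prefault = false;

	plan.run_plain = !only_prefault;
	plan.run_prefault = !no_prefault;

	/* At least one measurement always runs. */
	assert(plan.run_plain || plan.run_prefault);
	return plan;
}
```

With no options both measurements run, which matches the default output of two result lines; --only-prefault and --no-prefault each leave exactly one measurement, which is the form intended for scripts.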

Thanks,

2010-11-12 15:02:45

by Hitoshi Mitake

Subject: [PATCH] perf bench: print both of prefaulted and no prefaulted results

After applying this patch, perf bench mem memcpy prints
both the prefaulted and non-prefaulted scores of memcpy().

New options --no-prefault and --only-prefault are added
to print a single result, mainly for use in scripts.

Example of usage:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 634.969014 MB/Sec
| 4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <[email protected]>
Cc: Ma Ling <[email protected]>
Cc: Zhao Yakui <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
tools/perf/bench/mem-memcpy.c | 215 +++++++++++++++++++++++++++++------------
1 files changed, 152 insertions(+), 63 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index be31ddb..61b6ead 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -25,7 +25,8 @@ static const char *length_str = "1MB";
static const char *routine = "default";
static bool use_clock;
static int clock_fd;
-static bool prefault;
+static bool only_prefault;
+static bool no_prefault;

static const struct option options[] = {
OPT_STRING('l', "length", &length_str, "1MB",
@@ -35,15 +36,19 @@ static const struct option options[] = {
"Specify routine to copy"),
OPT_BOOLEAN('c', "clock", &use_clock,
"Use CPU clock for measuring"),
- OPT_BOOLEAN('p', "prefault", &prefault,
- "Cause page faults before memcpy()"),
+ OPT_BOOLEAN('o', "only-prefault", &only_prefault,
+ "Show only the result with page faults before memcpy()"),
+ OPT_BOOLEAN('n', "no-prefault", &no_prefault,
+ "Show only the result without page faults before memcpy()"),
OPT_END()
};

+typedef void *(*memcpy_t)(void *, const void *, size_t);
+
struct routine {
const char *name;
const char *desc;
- void * (*fn)(void *dst, const void *src, size_t len);
+ memcpy_t fn;
};

struct routine routines[] = {
@@ -92,29 +97,98 @@ static double timeval2double(struct timeval *ts)
(double)ts->tv_usec / (double)1000000;
}

+static void alloc_mem(void **dst, void **src, size_t length)
+{
+ *dst = zalloc(length);
+ if (!*dst)
+ die("memory allocation failed - maybe length is too large?\n");
+
+ *src = zalloc(length);
+ if (!*src)
+ die("memory allocation failed - maybe length is too large?\n");
+}
+
+static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
+{
+ u64 clock_start = 0ULL, clock_end = 0ULL;
+ void *src = NULL, *dst = NULL;
+
+ alloc_mem(&src, &dst, len);
+
+ if (prefault)
+ fn(dst, src, len);
+
+ clock_start = get_clock();
+ fn(dst, src, len);
+ clock_end = get_clock();
+
+ free(src);
+ free(dst);
+ return clock_end - clock_start;
+}
+
+static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
+{
+ struct timeval tv_start, tv_end, tv_diff;
+ void *src = NULL, *dst = NULL;
+
+ alloc_mem(&src, &dst, len);
+
+ if (prefault)
+ fn(dst, src, len);
+
+ BUG_ON(gettimeofday(&tv_start, NULL));
+ fn(dst, src, len);
+ BUG_ON(gettimeofday(&tv_end, NULL));
+
+ timersub(&tv_end, &tv_start, &tv_diff);
+
+ free(src);
+ free(dst);
+ return (double)((double)len / timeval2double(&tv_diff));
+}
+
+#define pf (no_prefault ? 0 : 1)
+
+#define print_bps(x) do { \
+ if (x < K) \
+ printf(" %14lf B/Sec", x); \
+ else if (x < K * K) \
+ printf(" %14lf KB/Sec", x / K); \
+ else if (x < K * K * K) \
+ printf(" %14lf MB/Sec", x / K / K); \
+ else \
+ printf(" %14lf GB/Sec", x / K / K / K); \
+ } while (0)
+
int bench_mem_memcpy(int argc, const char **argv,
const char *prefix __used)
{
int i;
- void *dst, *src;
- size_t length;
- double bps = 0.0;
- struct timeval tv_start, tv_end, tv_diff;
- u64 clock_start, clock_end, clock_diff;
+ size_t len;
+ double result_bps[2];
+ u64 result_clock[2];

- clock_start = clock_end = clock_diff = 0ULL;
argc = parse_options(argc, argv, options,
bench_mem_memcpy_usage, 0);

- tv_diff.tv_sec = 0;
- tv_diff.tv_usec = 0;
- length = (size_t)perf_atoll((char *)length_str);
+ if (use_clock)
+ init_clock();
+
+ len = (size_t)perf_atoll((char *)length_str);

- if ((s64)length <= 0) {
+ result_clock[0] = result_clock[1] = 0ULL;
+ result_bps[0] = result_bps[1] = 0.0;
+
+ if ((s64)len <= 0) {
fprintf(stderr, "Invalid length:%s\n", length_str);
return 1;
}

+ /* same to without specifying either of prefault and no-prefault */
+ if (only_prefault && no_prefault)
+ only_prefault = no_prefault = false;
+
for (i = 0; routines[i].name; i++) {
if (!strcmp(routines[i].name, routine))
break;
@@ -129,65 +203,80 @@ int bench_mem_memcpy(int argc, const char **argv,
return 1;
}

- dst = zalloc(length);
- if (!dst)
- die("memory allocation failed - maybe length is too large?\n");
-
- src = zalloc(length);
- if (!src)
- die("memory allocation failed - maybe length is too large?\n");
-
- if (bench_format == BENCH_FORMAT_DEFAULT) {
- printf("# Copying %s Bytes from %p to %p ...\n\n",
- length_str, src, dst);
- }
-
-
- if (prefault)
- routines[i].fn(dst, src, length);
-
- if (use_clock) {
- init_clock();
- clock_start = get_clock();
- } else {
- BUG_ON(gettimeofday(&tv_start, NULL));
- }
+ if (bench_format == BENCH_FORMAT_DEFAULT)
+ printf("# Copying %s Bytes ...\n\n", length_str);

- routines[i].fn(dst, src, length);
-
- if (use_clock) {
- clock_end = get_clock();
- clock_diff = clock_end - clock_start;
+ if (!only_prefault && !no_prefault) {
+ /* show both of results */
+ if (use_clock) {
+ result_clock[0] =
+ do_memcpy_clock(routines[i].fn, len, false);
+ result_clock[1] =
+ do_memcpy_clock(routines[i].fn, len, true);
+ } else {
+ result_bps[0] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, false);
+ result_bps[1] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, true);
+ }
} else {
- BUG_ON(gettimeofday(&tv_end, NULL));
- timersub(&tv_end, &tv_start, &tv_diff);
- bps = (double)((double)length / timeval2double(&tv_diff));
+ if (use_clock) {
+ result_clock[pf] =
+ do_memcpy_clock(routines[i].fn,
+ len, only_prefault);
+ } else {
+ result_bps[pf] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, only_prefault);
+ }
}

switch (bench_format) {
case BENCH_FORMAT_DEFAULT:
- if (use_clock) {
- printf(" %14lf Clock/Byte\n",
- (double)clock_diff / (double)length);
- } else {
- if (bps < K)
- printf(" %14lf B/Sec\n", bps);
- else if (bps < K * K)
- printf(" %14lfd KB/Sec\n", bps / 1024);
- else if (bps < K * K * K)
- printf(" %14lf MB/Sec\n", bps / 1024 / 1024);
- else {
- printf(" %14lf GB/Sec\n",
- bps / 1024 / 1024 / 1024);
+ if (!only_prefault && !no_prefault) {
+ if (use_clock) {
+ printf(" %14lf Clock/Byte\n",
+ (double)result_clock[0]
+ / (double)len);
+ printf(" %14lf Clock/Byte (with prefault)\n",
+ (double)result_clock[1]
+ / (double)len);
+ } else {
+ print_bps(result_bps[0]);
+ printf("\n");
+ print_bps(result_bps[1]);
+ printf(" (with prefault)\n");
}
+ } else {
+ if (use_clock) {
+ printf(" %14lf Clock/Byte",
+ (double)result_clock[pf]
+ / (double)len);
+ } else
+ print_bps(result_bps[pf]);
+
+ printf("%s\n", only_prefault ? " (with prefault)" : "");
}
break;
case BENCH_FORMAT_SIMPLE:
- if (use_clock) {
- printf("%14lf\n",
- (double)clock_diff / (double)length);
- } else
- printf("%lf\n", bps);
+ if (!only_prefault && !no_prefault) {
+ if (use_clock) {
+ printf("%lf %lf\n",
+ (double)result_clock[0] / (double)len,
+ (double)result_clock[1] / (double)len);
+ } else {
+ printf("%lf %lf\n",
+ result_bps[0], result_bps[1]);
+ }
+ } else {
+ if (use_clock) {
+ printf("%lf\n", (double)result_clock[pf]
+ / (double)len);
+ } else
+ printf("%lf\n", result_bps[pf]);
+ }
break;
default:
/* reaching this means there's some disaster: */
--
1.7.1.1
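
The unit scaling done by the print_bps() macro in the patch above can also be written as an ordinary function, which avoids evaluating a macro argument several times and is easier to test. The following is a sketch under stated assumptions, not a proposed replacement: format_bps() and kib are hypothetical names, and kib is taken to be 1024, in line with the literal 1024 divisors in the lines the patch removes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed to match K in mem-memcpy.c (1024). */
static const double kib = 1024.0;

/*
 * Format a bytes-per-second value into `buf` with a binary-scaled unit.
 * Returns what snprintf() returns: the number of characters that the
 * full string needs, excluding the terminating NUL.
 */
static int format_bps(char *buf, size_t size, double bps)
{
	assert(buf != NULL && size > 0);

	if (bps < kib)
		return snprintf(buf, size, "%14f B/Sec", bps);
	if (bps < kib * kib)
		return snprintf(buf, size, "%14f KB/Sec", bps / kib);
	if (bps < kib * kib * kib)
		return snprintf(buf, size, "%14f MB/Sec", bps / kib / kib);
	return snprintf(buf, size, "%14f GB/Sec", bps / kib / kib / kib);
}
```

A caller would format into a local buffer and print it followed by either a newline or the " (with prefault)" suffix, as the patch does with print_bps().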

2010-11-18 07:58:34

by Ingo Molnar

Subject: Re: [PATCH] perf bench: print both of prefaulted and no prefaulted results


* Hitoshi Mitake <[email protected]> wrote:

> After applying this patch, perf bench mem memcpy prints
> both the prefaulted and non-prefaulted scores of memcpy().
>
> New options --no-prefault and --only-prefault are added
> to print a single result, mainly for use in scripts.

Ok. Mind resending the whole series once all review feedback has been incorporated?

Thanks,

Ingo

2010-11-25 07:04:19

by Hitoshi Mitake

Subject: Re: [PATCH] perf bench: print both of prefaulted and no prefaulted results


Really sorry for my late reply..

On 11/18/10 16:58, Ingo Molnar wrote:
>
> * Hitoshi Mitake<[email protected]> wrote:
>
>> After applying this patch, perf bench mem memcpy prints
>> both the prefaulted and non-prefaulted scores of memcpy().
>>
>> New options --no-prefault and --only-prefault are added
>> to print a single result, mainly for use in scripts.
>
> Ok. Mind resending the whole series once all review feedback has been
> incorporated?
>

OK, I'll send the patch series for prefaulting and
porting memcpy_64.S to perf bench later.
This series does some dirty things, especially in perf's Makefile
and in defining ENTRY(), so I'd like to hear your comments.
Could you review it?

I also have another problem. I cannot refer to the
rep-prefix-based memcpy by name, because its symbol is ".Lmemcpy_c".
It seems that symbol names starting with "." cannot be seen
from other object files, so I have to find a way to
reference the rep memcpy...

Thanks,
Hitoshi

2010-11-25 07:05:13

by Hitoshi Mitake

Subject: [PATCH v2 1/2] perf bench: print both of prefaulted and no prefaulted results

After applying this patch, perf bench mem memcpy prints
both the prefaulted and non-prefaulted scores of memcpy().

New options --no-prefault and --only-prefault are added
to print a single result, mainly for use in scripts.

Example of usage:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 634.969014 MB/Sec
| 4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 642.725568 MB/Sec

Signed-off-by: Hitoshi Mitake <[email protected]>
Cc: Miao Xie <[email protected]>
Cc: Ma Ling <[email protected]>
Cc: Zhao Yakui <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
---
tools/perf/bench/mem-memcpy.c | 219 ++++++++++++++++++++++++++++++-----------
1 files changed, 162 insertions(+), 57 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 38dae74..db82021 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -12,6 +12,7 @@
#include "../util/parse-options.h"
#include "../util/header.h"
#include "bench.h"
+#include "mem-memcpy-arch.h"

#include <stdio.h>
#include <stdlib.h>
@@ -23,8 +24,10 @@

static const char *length_str = "1MB";
static const char *routine = "default";
-static bool use_clock = false;
+static bool use_clock;
static int clock_fd;
+static bool only_prefault;
+static bool no_prefault;

static const struct option options[] = {
OPT_STRING('l', "length", &length_str, "1MB",
@@ -34,19 +37,33 @@ static const struct option options[] = {
"Specify routine to copy"),
OPT_BOOLEAN('c', "clock", &use_clock,
"Use CPU clock for measuring"),
+ OPT_BOOLEAN('o', "only-prefault", &only_prefault,
+ "Show only the result with page faults before memcpy()"),
+ OPT_BOOLEAN('n', "no-prefault", &no_prefault,
+ "Show only the result without page faults before memcpy()"),
OPT_END()
};

+typedef void *(*memcpy_t)(void *, const void *, size_t);
+
struct routine {
const char *name;
const char *desc;
- void * (*fn)(void *dst, const void *src, size_t len);
+ memcpy_t fn;
};

struct routine routines[] = {
{ "default",
"Default memcpy() provided by glibc",
memcpy },
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc) { name, desc, fn },
+#include "mem-memcpy-x86-64-asm-def.h"
+#undef MEMCPY_FN
+
+#endif
+
{ NULL,
NULL,
NULL }
@@ -89,29 +106,98 @@ static double timeval2double(struct timeval *ts)
(double)ts->tv_usec / (double)1000000;
}

+static void alloc_mem(void **dst, void **src, size_t length)
+{
+ *dst = zalloc(length);
+ if (!*dst)
+ die("memory allocation failed - maybe length is too large?\n");
+
+ *src = zalloc(length);
+ if (!*src)
+ die("memory allocation failed - maybe length is too large?\n");
+}
+
+static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
+{
+ u64 clock_start = 0ULL, clock_end = 0ULL;
+ void *src = NULL, *dst = NULL;
+
+ alloc_mem(&src, &dst, len);
+
+ if (prefault)
+ fn(dst, src, len);
+
+ clock_start = get_clock();
+ fn(dst, src, len);
+ clock_end = get_clock();
+
+ free(src);
+ free(dst);
+ return clock_end - clock_start;
+}
+
+static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
+{
+ struct timeval tv_start, tv_end, tv_diff;
+ void *src = NULL, *dst = NULL;
+
+ alloc_mem(&src, &dst, len);
+
+ if (prefault)
+ fn(dst, src, len);
+
+ BUG_ON(gettimeofday(&tv_start, NULL));
+ fn(dst, src, len);
+ BUG_ON(gettimeofday(&tv_end, NULL));
+
+ timersub(&tv_end, &tv_start, &tv_diff);
+
+ free(src);
+ free(dst);
+ return (double)((double)len / timeval2double(&tv_diff));
+}
+
+#define pf (no_prefault ? 0 : 1)
+
+#define print_bps(x) do { \
+ if (x < K) \
+ printf(" %14lf B/Sec", x); \
+ else if (x < K * K) \
+ printf(" %14lf KB/Sec", x / K); \
+ else if (x < K * K * K) \
+ printf(" %14lf MB/Sec", x / K / K); \
+ else \
+ printf(" %14lf GB/Sec", x / K / K / K); \
+ } while (0)
+
int bench_mem_memcpy(int argc, const char **argv,
const char *prefix __used)
{
int i;
- void *dst, *src;
- size_t length;
- double bps = 0.0;
- struct timeval tv_start, tv_end, tv_diff;
- u64 clock_start, clock_end, clock_diff;
+ size_t len;
+ double result_bps[2];
+ u64 result_clock[2];

- clock_start = clock_end = clock_diff = 0ULL;
argc = parse_options(argc, argv, options,
bench_mem_memcpy_usage, 0);

- tv_diff.tv_sec = 0;
- tv_diff.tv_usec = 0;
- length = (size_t)perf_atoll((char *)length_str);
+ if (use_clock)
+ init_clock();
+
+ len = (size_t)perf_atoll((char *)length_str);

- if ((s64)length <= 0) {
+ result_clock[0] = result_clock[1] = 0ULL;
+ result_bps[0] = result_bps[1] = 0.0;
+
+ if ((s64)len <= 0) {
fprintf(stderr, "Invalid length:%s\n", length_str);
return 1;
}

+ /* same to without specifying either of prefault and no-prefault */
+ if (only_prefault && no_prefault)
+ only_prefault = no_prefault = false;
+
for (i = 0; routines[i].name; i++) {
if (!strcmp(routines[i].name, routine))
break;
@@ -126,61 +212,80 @@ int bench_mem_memcpy(int argc, const char **argv,
return 1;
}

- dst = zalloc(length);
- if (!dst)
- die("memory allocation failed - maybe length is too large?\n");
-
- src = zalloc(length);
- if (!src)
- die("memory allocation failed - maybe length is too large?\n");
-
- if (bench_format == BENCH_FORMAT_DEFAULT) {
- printf("# Copying %s Bytes from %p to %p ...\n\n",
- length_str, src, dst);
- }
-
- if (use_clock) {
- init_clock();
- clock_start = get_clock();
- } else {
- BUG_ON(gettimeofday(&tv_start, NULL));
- }
-
- routines[i].fn(dst, src, length);
+ if (bench_format == BENCH_FORMAT_DEFAULT)
+ printf("# Copying %s Bytes ...\n\n", length_str);

- if (use_clock) {
- clock_end = get_clock();
- clock_diff = clock_end - clock_start;
+ if (!only_prefault && !no_prefault) {
+ /* show both of results */
+ if (use_clock) {
+ result_clock[0] =
+ do_memcpy_clock(routines[i].fn, len, false);
+ result_clock[1] =
+ do_memcpy_clock(routines[i].fn, len, true);
+ } else {
+ result_bps[0] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, false);
+ result_bps[1] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, true);
+ }
} else {
- BUG_ON(gettimeofday(&tv_end, NULL));
- timersub(&tv_end, &tv_start, &tv_diff);
- bps = (double)((double)length / timeval2double(&tv_diff));
+ if (use_clock) {
+ result_clock[pf] =
+ do_memcpy_clock(routines[i].fn,
+ len, only_prefault);
+ } else {
+ result_bps[pf] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, only_prefault);
+ }
}

switch (bench_format) {
case BENCH_FORMAT_DEFAULT:
- if (use_clock) {
- printf(" %14lf Clock/Byte\n",
- (double)clock_diff / (double)length);
- } else {
- if (bps < K)
- printf(" %14lf B/Sec\n", bps);
- else if (bps < K * K)
- printf(" %14lfd KB/Sec\n", bps / 1024);
- else if (bps < K * K * K)
- printf(" %14lf MB/Sec\n", bps / 1024 / 1024);
- else {
- printf(" %14lf GB/Sec\n",
- bps / 1024 / 1024 / 1024);
+ if (!only_prefault && !no_prefault) {
+ if (use_clock) {
+ printf(" %14lf Clock/Byte\n",
+ (double)result_clock[0]
+ / (double)len);
+ printf(" %14lf Clock/Byte (with prefault)\n",
+ (double)result_clock[1]
+ / (double)len);
+ } else {
+ print_bps(result_bps[0]);
+ printf("\n");
+ print_bps(result_bps[1]);
+ printf(" (with prefault)\n");
}
+ } else {
+ if (use_clock) {
+ printf(" %14lf Clock/Byte",
+ (double)result_clock[pf]
+ / (double)len);
+ } else
+ print_bps(result_bps[pf]);
+
+ printf("%s\n", only_prefault ? " (with prefault)" : "");
}
break;
case BENCH_FORMAT_SIMPLE:
- if (use_clock) {
- printf("%14lf\n",
- (double)clock_diff / (double)length);
- } else
- printf("%lf\n", bps);
+ if (!only_prefault && !no_prefault) {
+ if (use_clock) {
+ printf("%lf %lf\n",
+ (double)result_clock[0] / (double)len,
+ (double)result_clock[1] / (double)len);
+ } else {
+ printf("%lf %lf\n",
+ result_bps[0], result_bps[1]);
+ }
+ } else {
+ if (use_clock) {
+ printf("%lf\n", (double)result_clock[pf]
+ / (double)len);
+ } else
+ printf("%lf\n", result_bps[pf]);
+ }
break;
default:
/* reaching this means there's some disaster: */
--
1.6.5.2

2010-11-25 07:05:11

by Hitoshi Mitake

Subject: [PATCH v2 2/2] perf bench: port arch/x86/lib/memcpy_64.S to perf bench mem memcpy

This patch ports arch/x86/lib/memcpy_64.S to perf bench mem memcpy
so that the kernel's memcpy() can be benchmarked in userland, in a
tricky and dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are dummy headers (they do only a little
work, e.g. defining ENTRY()) that allow memcpy_64.S to be included
without modification.

This makes checkpatch.pl angry like this:
#177: FILE: tools/perf/util/include/linux/linkage.h:7:
+#define ENTRY(name) \
+ .globl name; \
+ name:

WARNING: labels should not be indented
#179: FILE: tools/perf/util/include/linux/linkage.h:9:
+ name:

because checkpatch.pl treats this file as C source.
I think this can be forgiven, because the original include/linux/linkage.h
does a similar thing.

Signed-off-by: Hitoshi Mitake <[email protected]>
Cc: Miao Xie <[email protected]>
Cc: Ma Ling <[email protected]>
Cc: Zhao Yakui <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
---
tools/perf/Makefile | 11 +++++++++++
tools/perf/bench/mem-memcpy-arch.h | 12 ++++++++++++
tools/perf/bench/mem-memcpy-x86-64-asm-def.h | 4 ++++
tools/perf/bench/mem-memcpy-x86-64-asm.S | 2 ++
tools/perf/util/include/asm/cpufeature.h | 9 +++++++++
tools/perf/util/include/asm/dwarf2.h | 11 +++++++++++
tools/perf/util/include/linux/linkage.h | 13 +++++++++++++
7 files changed, 62 insertions(+), 0 deletions(-)
create mode 100644 tools/perf/bench/mem-memcpy-arch.h
create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm-def.h
create mode 100644 tools/perf/bench/mem-memcpy-x86-64-asm.S
create mode 100644 tools/perf/util/include/asm/cpufeature.h
create mode 100644 tools/perf/util/include/asm/dwarf2.h
create mode 100644 tools/perf/util/include/linux/linkage.h

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 2d414b3..b3e6bc6 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -185,7 +185,10 @@ ifeq ($(ARCH),i386)
ARCH := x86
endif
ifeq ($(ARCH),x86_64)
+ RAW_ARCH := x86_64
ARCH := x86
+ ARCH_CFLAGS := -DARCH_X86_64
+ ARCH_INCLUDE = ../../arch/x86/lib/memcpy_64.S
endif

# CFLAGS and LDFLAGS are for the users to override from the command line.
@@ -375,6 +378,7 @@ LIB_H += util/include/linux/prefetch.h
LIB_H += util/include/linux/rbtree.h
LIB_H += util/include/linux/string.h
LIB_H += util/include/linux/types.h
+LIB_H += util/include/linux/linkage.h
LIB_H += util/include/asm/asm-offsets.h
LIB_H += util/include/asm/bug.h
LIB_H += util/include/asm/byteorder.h
@@ -383,6 +387,8 @@ LIB_H += util/include/asm/swab.h
LIB_H += util/include/asm/system.h
LIB_H += util/include/asm/uaccess.h
LIB_H += util/include/dwarf-regs.h
+LIB_H += util/include/asm/dwarf2.h
+LIB_H += util/include/asm/cpufeature.h
LIB_H += perf.h
LIB_H += util/cache.h
LIB_H += util/callchain.h
@@ -417,6 +423,7 @@ LIB_H += util/probe-finder.h
LIB_H += util/probe-event.h
LIB_H += util/pstack.h
LIB_H += util/cpumap.h
+LIB_H += $(ARCH_INCLUDE)

LIB_OBJS += $(OUTPUT)util/abspath.o
LIB_OBJS += $(OUTPUT)util/alias.o
@@ -472,6 +479,9 @@ BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
# Benchmark modules
BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
+ifeq ($(RAW_ARCH),x86_64)
+BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
+endif
BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o

BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
@@ -909,6 +919,7 @@ BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
LIB_OBJS += $(COMPAT_OBJS)

ALL_CFLAGS += $(BASIC_CFLAGS)
+ALL_CFLAGS += $(ARCH_CFLAGS)
ALL_LDFLAGS += $(BASIC_LDFLAGS)

export TAR INSTALL DESTDIR SHELL_PATH
diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
new file mode 100644
index 0000000..a72e36c
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-arch.h
@@ -0,0 +1,12 @@
+
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc) \
+ extern void *fn(void *, const void *, size_t);
+
+#include "mem-memcpy-x86-64-asm-def.h"
+
+#undef MEMCPY_FN
+
+#endif
+
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
new file mode 100644
index 0000000..d588b87
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
@@ -0,0 +1,4 @@
+
+MEMCPY_FN(__memcpy,
+ "x86-64-unrolled",
+ "unrolled memcpy() in arch/x86/lib/memcpy_64.S")
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm.S b/tools/perf/bench/mem-memcpy-x86-64-asm.S
new file mode 100644
index 0000000..a57b66e
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm.S
@@ -0,0 +1,2 @@
+
+#include "../../../arch/x86/lib/memcpy_64.S"
diff --git a/tools/perf/util/include/asm/cpufeature.h b/tools/perf/util/include/asm/cpufeature.h
new file mode 100644
index 0000000..acffd5e
--- /dev/null
+++ b/tools/perf/util/include/asm/cpufeature.h
@@ -0,0 +1,9 @@
+
+#ifndef PERF_CPUFEATURE_H
+#define PERF_CPUFEATURE_H
+
+/* cpufeature.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define X86_FEATURE_REP_GOOD 0
+
+#endif /* PERF_CPUFEATURE_H */
diff --git a/tools/perf/util/include/asm/dwarf2.h b/tools/perf/util/include/asm/dwarf2.h
new file mode 100644
index 0000000..bb4198e
--- /dev/null
+++ b/tools/perf/util/include/asm/dwarf2.h
@@ -0,0 +1,11 @@
+
+#ifndef PERF_DWARF2_H
+#define PERF_DWARF2_H
+
+/* dwarf2.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define CFI_STARTPROC
+#define CFI_ENDPROC
+
+#endif /* PERF_DWARF2_H */
+
diff --git a/tools/perf/util/include/linux/linkage.h b/tools/perf/util/include/linux/linkage.h
new file mode 100644
index 0000000..06387cf
--- /dev/null
+++ b/tools/perf/util/include/linux/linkage.h
@@ -0,0 +1,13 @@
+
+#ifndef PERF_LINUX_LINKAGE_H_
+#define PERF_LINUX_LINKAGE_H_
+
+/* linkage.h ... for including arch/x86/lib/memcpy_64.S */
+
+#define ENTRY(name) \
+ .globl name; \
+ name:
+
+#define ENDPROC(name)
+
+#endif /* PERF_LINUX_LINKAGE_H_ */
--
1.6.5.2

2010-11-26 10:31:58

by Hitoshi Mitake

Subject: [tip:perf/core] perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'

Commit-ID: ea7872b9d6a81101f6ba0ec141544a62fea35876
Gitweb: http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
Author: Hitoshi Mitake <[email protected]>
AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'

This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
memcpy so that the kernel's memcpy() can be benchmarked in
userland, in a tricky and dirty way.

util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
util/include/linux/linkage.h are mostly dummy files with small
wrappers, so that we are able to include memcpy_64.S
unmodified.
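
Besides the dummy headers, the diff below registers the new routine
through a MEMCPY_FN() list that is expanded twice: once into extern
declarations (mem-memcpy-arch.h) and once into routines[] entries
(mem-memcpy.c). A single-file sketch of that pattern follows. It is
only an illustration: bytewise_copy(), EXTRA_ROUTINES and
find_routine() are hypothetical names that do not appear in the patch,
and the list is a macro here, whereas the patch keeps it in
mem-memcpy-x86-64-asm-def.h and pulls it in with #include.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_t)(void *, const void *, size_t);

/*
 * The list of extra routines, written once. (Assumption: a macro is used
 * so the sketch fits in one file; the patch uses a separate header.)
 */
#define EXTRA_ROUTINES \
	MEMCPY_FN(bytewise_copy, "bytewise", "byte-at-a-time copy loop")

/* First expansion: one prototype per list entry. */
#define MEMCPY_FN(fn, name, desc) \
	static void *fn(void *, const void *, size_t);
EXTRA_ROUTINES
#undef MEMCPY_FN

struct routine {
	const char *name;
	const char *desc;
	memcpy_t fn;
};

/* Second expansion: one table entry per list entry. */
#define MEMCPY_FN(fn, name, desc) { name, desc, fn },
static const struct routine routines[] = {
	{ "default", "memcpy() from the C library", memcpy },
	EXTRA_ROUTINES
	{ NULL, NULL, NULL }
};
#undef MEMCPY_FN

/* Hypothetical stand-in for an assembly routine such as __memcpy. */
static void *bytewise_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len--)
		*d++ = *s++;
	return dst;
}

/* Look a routine up by name, as the -r option does; NULL if unknown. */
static const struct routine *find_routine(const char *name)
{
	for (size_t i = 0; routines[i].name; i++)
		if (!strcmp(routines[i].name, name))
			return &routines[i];
	return NULL;
}
```

With this layout, adding another routine means adding one MEMCPY_FN()
line to the list; the declaration and the table entry cannot get out
of sync.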

Signed-off-by: Hitoshi Mitake <[email protected]>
Cc: [email protected]
Cc: Miao Xie <[email protected]>
Cc: Ma Ling <[email protected]>
Cc: Zhao Yakui <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andi Kleen <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/perf/Makefile | 11 +++++++++++
tools/perf/bench/mem-memcpy-arch.h | 12 ++++++++++++
tools/perf/bench/mem-memcpy-x86-64-asm-def.h | 4 ++++
tools/perf/bench/mem-memcpy-x86-64-asm.S | 2 ++
tools/perf/util/include/asm/cpufeature.h | 9 +++++++++
tools/perf/util/include/asm/dwarf2.h | 11 +++++++++++
tools/perf/util/include/linux/linkage.h | 13 +++++++++++++
7 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 74b684d..e0db197 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -185,7 +185,10 @@ ifeq ($(ARCH),i386)
ARCH := x86
endif
ifeq ($(ARCH),x86_64)
+ RAW_ARCH := x86_64
ARCH := x86
+ ARCH_CFLAGS := -DARCH_X86_64
+ ARCH_INCLUDE = ../../arch/x86/lib/memcpy_64.S
endif

# CFLAGS and LDFLAGS are for the users to override from the command line.
@@ -375,6 +378,7 @@ LIB_H += util/include/linux/prefetch.h
LIB_H += util/include/linux/rbtree.h
LIB_H += util/include/linux/string.h
LIB_H += util/include/linux/types.h
+LIB_H += util/include/linux/linkage.h
LIB_H += util/include/asm/asm-offsets.h
LIB_H += util/include/asm/bug.h
LIB_H += util/include/asm/byteorder.h
@@ -383,6 +387,8 @@ LIB_H += util/include/asm/swab.h
LIB_H += util/include/asm/system.h
LIB_H += util/include/asm/uaccess.h
LIB_H += util/include/dwarf-regs.h
+LIB_H += util/include/asm/dwarf2.h
+LIB_H += util/include/asm/cpufeature.h
LIB_H += perf.h
LIB_H += util/cache.h
LIB_H += util/callchain.h
@@ -417,6 +423,7 @@ LIB_H += util/probe-finder.h
LIB_H += util/probe-event.h
LIB_H += util/pstack.h
LIB_H += util/cpumap.h
+LIB_H += $(ARCH_INCLUDE)

LIB_OBJS += $(OUTPUT)util/abspath.o
LIB_OBJS += $(OUTPUT)util/alias.o
@@ -472,6 +479,9 @@ BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
# Benchmark modules
BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
+ifeq ($(RAW_ARCH),x86_64)
+BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
+endif
BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o

BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
@@ -898,6 +908,7 @@ BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
LIB_OBJS += $(COMPAT_OBJS)

ALL_CFLAGS += $(BASIC_CFLAGS)
+ALL_CFLAGS += $(ARCH_CFLAGS)
ALL_LDFLAGS += $(BASIC_LDFLAGS)

export TAR INSTALL DESTDIR SHELL_PATH
diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
new file mode 100644
index 0000000..a72e36c
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-arch.h
@@ -0,0 +1,12 @@
+
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc) \
+ extern void *fn(void *, const void *, size_t);
+
+#include "mem-memcpy-x86-64-asm-def.h"
+
+#undef MEMCPY_FN
+
+#endif
+
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
new file mode 100644
index 0000000..d588b87
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
@@ -0,0 +1,4 @@
+
+MEMCPY_FN(__memcpy,
+ "x86-64-unrolled",
+ "unrolled memcpy() in arch/x86/lib/memcpy_64.S")
diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm.S b/tools/perf/bench/mem-memcpy-x86-64-asm.S
new file mode 100644
index 0000000..a57b66e
--- /dev/null
+++ b/tools/perf/bench/mem-memcpy-x86-64-asm.S
@@ -0,0 +1,2 @@
+
+#include "../../../arch/x86/lib/memcpy_64.S"
diff --git a/tools/perf/util/include/asm/cpufeature.h b/tools/perf/util/include/asm/cpufeature.h
new file mode 100644
index 0000000..acffd5e
--- /dev/null
+++ b/tools/perf/util/include/asm/cpufeature.h
@@ -0,0 +1,9 @@
+
+#ifndef PERF_CPUFEATURE_H
+#define PERF_CPUFEATURE_H
+
+/* cpufeature.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define X86_FEATURE_REP_GOOD 0
+
+#endif /* PERF_CPUFEATURE_H */
diff --git a/tools/perf/util/include/asm/dwarf2.h b/tools/perf/util/include/asm/dwarf2.h
new file mode 100644
index 0000000..bb4198e
--- /dev/null
+++ b/tools/perf/util/include/asm/dwarf2.h
@@ -0,0 +1,11 @@
+
+#ifndef PERF_DWARF2_H
+#define PERF_DWARF2_H
+
+/* dwarf2.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
+
+#define CFI_STARTPROC
+#define CFI_ENDPROC
+
+#endif /* PERF_DWARF2_H */
+
diff --git a/tools/perf/util/include/linux/linkage.h b/tools/perf/util/include/linux/linkage.h
new file mode 100644
index 0000000..06387cf
--- /dev/null
+++ b/tools/perf/util/include/linux/linkage.h
@@ -0,0 +1,13 @@
+
+#ifndef PERF_LINUX_LINKAGE_H_
+#define PERF_LINUX_LINKAGE_H_
+
+/* linkage.h ... for including arch/x86/lib/memcpy_64.S */
+
+#define ENTRY(name) \
+ .globl name; \
+ name:
+
+#define ENDPROC(name)
+
+#endif /* PERF_LINUX_LINKAGE_H_ */

2010-11-26 10:31:59

by Hitoshi Mitake

Subject: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

Commit-ID: 49ce8fc651794878189fd5f273228832cdfb5be9
Gitweb: http://git.kernel.org/tip/49ce8fc651794878189fd5f273228832cdfb5be9
Author: Hitoshi Mitake <[email protected]>
AuthorDate: Thu, 25 Nov 2010 16:04:52 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 26 Nov 2010 08:15:57 +0100

perf bench: Print both of prefaulted and no prefaulted results by default

After applying this patch, perf bench mem memcpy prints both
the prefaulted and the non-prefaulted results of memcpy().

New options --no-prefault and --only-prefault are added
to print a single result, mainly for use in scripts.

Usage example:

| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 634.969014 MB/Sec
| 4.828062 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --only-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 4.705192 GB/Sec (with prefault)
| mitake@X201i:~/linux/.../tools/perf% ./perf bench mem memcpy -l 500MB --no-prefault
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 642.725568 MB/Sec
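
The gap between the two numbers comes from the page faults taken when
the buffers are touched for the first time. The measurement can be
sketched outside of perf roughly as follows. This is a minimal
illustration, not the code of the patch: measure_bps() and
format_bps() are hypothetical names, calloc() stands in for zalloc(),
and the clamp on a zero interval is an added assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define K 1024.0

typedef void *(*memcpy_t)(void *, const void *, size_t);

/*
 * Copy len bytes once and return the throughput in bytes per second,
 * or -1.0 if the buffers cannot be allocated. With prefault set, an
 * untimed warm-up copy touches every page first, so the timed copy
 * does not include the page faults.
 */
static double measure_bps(memcpy_t fn, size_t len, bool prefault)
{
	struct timeval start, end;
	unsigned char *src, *dst;
	volatile unsigned char sink;
	double secs;

	assert(len > 0);

	src = calloc(1, len);
	dst = calloc(1, len);
	if (!src || !dst) {
		free(src);
		free(dst);
		return -1.0;
	}

	if (prefault)
		fn(dst, src, len);

	gettimeofday(&start, NULL);
	fn(dst, src, len);
	gettimeofday(&end, NULL);

	/* read one byte back so the copy has an observable result */
	sink = dst[len - 1];
	(void)sink;

	secs = (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_usec - start.tv_usec) / 1000000.0;
	if (secs <= 0.0)
		secs = 1e-6;	/* assumption: clamp to the timer resolution */

	free(src);
	free(dst);
	return (double)len / secs;
}

/* Scale a bytes/sec value to B, KB, MB or GB per second (1 KB = 1024 B). */
static void format_bps(double bps, char *buf, size_t size)
{
	if (bps < K)
		snprintf(buf, size, "%f B/Sec", bps);
	else if (bps < K * K)
		snprintf(buf, size, "%f KB/Sec", bps / K);
	else if (bps < K * K * K)
		snprintf(buf, size, "%f MB/Sec", bps / K / K);
	else
		snprintf(buf, size, "%f GB/Sec", bps / K / K / K);
}
```

Calling measure_bps(memcpy, len, false) and then
measure_bps(memcpy, len, true) corresponds to the two lines of the
default output; the actual numbers depend on the machine.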

Signed-off-by: Hitoshi Mitake <[email protected]>
Cc: [email protected]
Cc: Miao Xie <[email protected]>
Cc: Ma Ling <[email protected]>
Cc: Zhao Yakui <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andi Kleen <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/perf/bench/mem-memcpy.c | 219 ++++++++++++++++++++++++++++++-----------
1 files changed, 162 insertions(+), 57 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 38dae74..db82021 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -12,6 +12,7 @@
#include "../util/parse-options.h"
#include "../util/header.h"
#include "bench.h"
+#include "mem-memcpy-arch.h"

#include <stdio.h>
#include <stdlib.h>
@@ -23,8 +24,10 @@

static const char *length_str = "1MB";
static const char *routine = "default";
-static bool use_clock = false;
+static bool use_clock;
static int clock_fd;
+static bool only_prefault;
+static bool no_prefault;

static const struct option options[] = {
OPT_STRING('l', "length", &length_str, "1MB",
@@ -34,19 +37,33 @@ static const struct option options[] = {
"Specify routine to copy"),
OPT_BOOLEAN('c', "clock", &use_clock,
"Use CPU clock for measuring"),
+ OPT_BOOLEAN('o', "only-prefault", &only_prefault,
+ "Show only the result with page faults before memcpy()"),
+ OPT_BOOLEAN('n', "no-prefault", &no_prefault,
+ "Show only the result without page faults before memcpy()"),
OPT_END()
};

+typedef void *(*memcpy_t)(void *, const void *, size_t);
+
struct routine {
const char *name;
const char *desc;
- void * (*fn)(void *dst, const void *src, size_t len);
+ memcpy_t fn;
};

struct routine routines[] = {
{ "default",
"Default memcpy() provided by glibc",
memcpy },
+#ifdef ARCH_X86_64
+
+#define MEMCPY_FN(fn, name, desc) { name, desc, fn },
+#include "mem-memcpy-x86-64-asm-def.h"
+#undef MEMCPY_FN
+
+#endif
+
{ NULL,
NULL,
NULL }
@@ -89,29 +106,98 @@ static double timeval2double(struct timeval *ts)
(double)ts->tv_usec / (double)1000000;
}

+static void alloc_mem(void **dst, void **src, size_t length)
+{
+ *dst = zalloc(length);
+ if (!*dst)
+ die("memory allocation failed - maybe length is too large?\n");
+
+ *src = zalloc(length);
+ if (!*src)
+ die("memory allocation failed - maybe length is too large?\n");
+}
+
+static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
+{
+ u64 clock_start = 0ULL, clock_end = 0ULL;
+ void *src = NULL, *dst = NULL;
+
+ alloc_mem(&src, &dst, len);
+
+ if (prefault)
+ fn(dst, src, len);
+
+ clock_start = get_clock();
+ fn(dst, src, len);
+ clock_end = get_clock();
+
+ free(src);
+ free(dst);
+ return clock_end - clock_start;
+}
+
+static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
+{
+ struct timeval tv_start, tv_end, tv_diff;
+ void *src = NULL, *dst = NULL;
+
+ alloc_mem(&src, &dst, len);
+
+ if (prefault)
+ fn(dst, src, len);
+
+ BUG_ON(gettimeofday(&tv_start, NULL));
+ fn(dst, src, len);
+ BUG_ON(gettimeofday(&tv_end, NULL));
+
+ timersub(&tv_end, &tv_start, &tv_diff);
+
+ free(src);
+ free(dst);
+ return (double)((double)len / timeval2double(&tv_diff));
+}
+
+#define pf (no_prefault ? 0 : 1)
+
+#define print_bps(x) do { \
+ if (x < K) \
+ printf(" %14lf B/Sec", x); \
+ else if (x < K * K) \
+ printf(" %14lf KB/Sec", x / K); \
+ else if (x < K * K * K) \
+ printf(" %14lf MB/Sec", x / K / K); \
+ else \
+ printf(" %14lf GB/Sec", x / K / K / K); \
+ } while (0)
+
int bench_mem_memcpy(int argc, const char **argv,
const char *prefix __used)
{
int i;
- void *dst, *src;
- size_t length;
- double bps = 0.0;
- struct timeval tv_start, tv_end, tv_diff;
- u64 clock_start, clock_end, clock_diff;
+ size_t len;
+ double result_bps[2];
+ u64 result_clock[2];

- clock_start = clock_end = clock_diff = 0ULL;
argc = parse_options(argc, argv, options,
bench_mem_memcpy_usage, 0);

- tv_diff.tv_sec = 0;
- tv_diff.tv_usec = 0;
- length = (size_t)perf_atoll((char *)length_str);
+ if (use_clock)
+ init_clock();
+
+ len = (size_t)perf_atoll((char *)length_str);

- if ((s64)length <= 0) {
+ result_clock[0] = result_clock[1] = 0ULL;
+ result_bps[0] = result_bps[1] = 0.0;
+
+ if ((s64)len <= 0) {
fprintf(stderr, "Invalid length:%s\n", length_str);
return 1;
}

+ /* same to without specifying either of prefault and no-prefault */
+ if (only_prefault && no_prefault)
+ only_prefault = no_prefault = false;
+
for (i = 0; routines[i].name; i++) {
if (!strcmp(routines[i].name, routine))
break;
@@ -126,61 +212,80 @@ int bench_mem_memcpy(int argc, const char **argv,
return 1;
}

- dst = zalloc(length);
- if (!dst)
- die("memory allocation failed - maybe length is too large?\n");
-
- src = zalloc(length);
- if (!src)
- die("memory allocation failed - maybe length is too large?\n");
-
- if (bench_format == BENCH_FORMAT_DEFAULT) {
- printf("# Copying %s Bytes from %p to %p ...\n\n",
- length_str, src, dst);
- }
-
- if (use_clock) {
- init_clock();
- clock_start = get_clock();
- } else {
- BUG_ON(gettimeofday(&tv_start, NULL));
- }
-
- routines[i].fn(dst, src, length);
+ if (bench_format == BENCH_FORMAT_DEFAULT)
+ printf("# Copying %s Bytes ...\n\n", length_str);

- if (use_clock) {
- clock_end = get_clock();
- clock_diff = clock_end - clock_start;
+ if (!only_prefault && !no_prefault) {
+ /* show both of results */
+ if (use_clock) {
+ result_clock[0] =
+ do_memcpy_clock(routines[i].fn, len, false);
+ result_clock[1] =
+ do_memcpy_clock(routines[i].fn, len, true);
+ } else {
+ result_bps[0] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, false);
+ result_bps[1] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, true);
+ }
} else {
- BUG_ON(gettimeofday(&tv_end, NULL));
- timersub(&tv_end, &tv_start, &tv_diff);
- bps = (double)((double)length / timeval2double(&tv_diff));
+ if (use_clock) {
+ result_clock[pf] =
+ do_memcpy_clock(routines[i].fn,
+ len, only_prefault);
+ } else {
+ result_bps[pf] =
+ do_memcpy_gettimeofday(routines[i].fn,
+ len, only_prefault);
+ }
}

switch (bench_format) {
case BENCH_FORMAT_DEFAULT:
- if (use_clock) {
- printf(" %14lf Clock/Byte\n",
- (double)clock_diff / (double)length);
- } else {
- if (bps < K)
- printf(" %14lf B/Sec\n", bps);
- else if (bps < K * K)
- printf(" %14lfd KB/Sec\n", bps / 1024);
- else if (bps < K * K * K)
- printf(" %14lf MB/Sec\n", bps / 1024 / 1024);
- else {
- printf(" %14lf GB/Sec\n",
- bps / 1024 / 1024 / 1024);
+ if (!only_prefault && !no_prefault) {
+ if (use_clock) {
+ printf(" %14lf Clock/Byte\n",
+ (double)result_clock[0]
+ / (double)len);
+ printf(" %14lf Clock/Byte (with prefault)\n",
+ (double)result_clock[1]
+ / (double)len);
+ } else {
+ print_bps(result_bps[0]);
+ printf("\n");
+ print_bps(result_bps[1]);
+ printf(" (with prefault)\n");
}
+ } else {
+ if (use_clock) {
+ printf(" %14lf Clock/Byte",
+ (double)result_clock[pf]
+ / (double)len);
+ } else
+ print_bps(result_bps[pf]);
+
+ printf("%s\n", only_prefault ? " (with prefault)" : "");
}
break;
case BENCH_FORMAT_SIMPLE:
- if (use_clock) {
- printf("%14lf\n",
- (double)clock_diff / (double)length);
- } else
- printf("%lf\n", bps);
+ if (!only_prefault && !no_prefault) {
+ if (use_clock) {
+ printf("%lf %lf\n",
+ (double)result_clock[0] / (double)len,
+ (double)result_clock[1] / (double)len);
+ } else {
+ printf("%lf %lf\n",
+ result_bps[0], result_bps[1]);
+ }
+ } else {
+ if (use_clock) {
+ printf("%lf\n", (double)result_clock[pf]
+ / (double)len);
+ } else
+ printf("%lf\n", result_bps[pf]);
+ }
break;
default:
/* reaching this means there's some disaster: */

2010-11-29 13:26:50

by Hitoshi Mitake

Subject: Re: [tip:perf/core] perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'

On 2010-11-26 19:31, tip-bot for Hitoshi Mitake wrote:
> Commit-ID: ea7872b9d6a81101f6ba0ec141544a62fea35876
> Gitweb: http://git.kernel.org/tip/ea7872b9d6a81101f6ba0ec141544a62fea35876
> Author: Hitoshi Mitake<[email protected]>
> AuthorDate: Thu, 25 Nov 2010 16:04:53 +0900
> Committer: Ingo Molnar<[email protected]>
> CommitDate: Fri, 26 Nov 2010 08:15:57 +0100
>
> perf bench: Add feature that measures the performance of the arch/x86/lib/memcpy_64.S memcpy routines via 'perf bench mem'
>
> This patch ports arch/x86/lib/memcpy_64.S to perf bench mem
> memcpy for benchmarking memcpy() in userland with tricky and
> dirty way.
>
> util/include/asm/cpufeature.h, util/include/asm/dwarf2.h, and
> util/include/linux/linkage.h are mostly dummy files with small
> wrappers, so that we are able to include memcpy_64.S
> unmodified.
>
> Signed-off-by: Hitoshi Mitake<[email protected]>
> Cc: [email protected]
> Cc: Miao Xie<[email protected]>
> Cc: Ma Ling<[email protected]>
> Cc: Zhao Yakui<[email protected]>
> Cc: Peter Zijlstra<[email protected]>
> Cc: Arnaldo Carvalho de Melo<[email protected]>
> Cc: Paul Mackerras<[email protected]>
> Cc: Frederic Weisbecker<[email protected]>
> Cc: Steven Rostedt<[email protected]>
> Cc: Andi Kleen<[email protected]>
> LKML-Reference:<[email protected]>
> Signed-off-by: Ingo Molnar<[email protected]>
> ---
> tools/perf/Makefile | 11 +++++++++++
> tools/perf/bench/mem-memcpy-arch.h | 12 ++++++++++++
> tools/perf/bench/mem-memcpy-x86-64-asm-def.h | 4 ++++
> tools/perf/bench/mem-memcpy-x86-64-asm.S | 2 ++
> tools/perf/util/include/asm/cpufeature.h | 9 +++++++++
> tools/perf/util/include/asm/dwarf2.h | 11 +++++++++++
> tools/perf/util/include/linux/linkage.h | 13 +++++++++++++
> 7 files changed, 62 insertions(+), 0 deletions(-)
>
> diff --git a/tools/perf/Makefile b/tools/perf/Makefile
> index 74b684d..e0db197 100644
> --- a/tools/perf/Makefile
> +++ b/tools/perf/Makefile
> @@ -185,7 +185,10 @@ ifeq ($(ARCH),i386)
> ARCH := x86
> endif
> ifeq ($(ARCH),x86_64)
> + RAW_ARCH := x86_64
> ARCH := x86
> + ARCH_CFLAGS := -DARCH_X86_64
> + ARCH_INCLUDE = ../../arch/x86/lib/memcpy_64.S
> endif
>
> # CFLAGS and LDFLAGS are for the users to override from the command line.
> @@ -375,6 +378,7 @@ LIB_H += util/include/linux/prefetch.h
> LIB_H += util/include/linux/rbtree.h
> LIB_H += util/include/linux/string.h
> LIB_H += util/include/linux/types.h
> +LIB_H += util/include/linux/linkage.h
> LIB_H += util/include/asm/asm-offsets.h
> LIB_H += util/include/asm/bug.h
> LIB_H += util/include/asm/byteorder.h
> @@ -383,6 +387,8 @@ LIB_H += util/include/asm/swab.h
> LIB_H += util/include/asm/system.h
> LIB_H += util/include/asm/uaccess.h
> LIB_H += util/include/dwarf-regs.h
> +LIB_H += util/include/asm/dwarf2.h
> +LIB_H += util/include/asm/cpufeature.h
> LIB_H += perf.h
> LIB_H += util/cache.h
> LIB_H += util/callchain.h
> @@ -417,6 +423,7 @@ LIB_H += util/probe-finder.h
> LIB_H += util/probe-event.h
> LIB_H += util/pstack.h
> LIB_H += util/cpumap.h
> +LIB_H += $(ARCH_INCLUDE)
>
> LIB_OBJS += $(OUTPUT)util/abspath.o
> LIB_OBJS += $(OUTPUT)util/alias.o
> @@ -472,6 +479,9 @@ BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
> # Benchmark modules
> BUILTIN_OBJS += $(OUTPUT)bench/sched-messaging.o
> BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
> +ifeq ($(RAW_ARCH),x86_64)
> +BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
> +endif
> BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
>
> BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
> @@ -898,6 +908,7 @@ BASIC_CFLAGS += -DSHA1_HEADER='$(SHA1_HEADER_SQ)' \
> LIB_OBJS += $(COMPAT_OBJS)
>
> ALL_CFLAGS += $(BASIC_CFLAGS)
> +ALL_CFLAGS += $(ARCH_CFLAGS)
> ALL_LDFLAGS += $(BASIC_LDFLAGS)
>
> export TAR INSTALL DESTDIR SHELL_PATH
> diff --git a/tools/perf/bench/mem-memcpy-arch.h b/tools/perf/bench/mem-memcpy-arch.h
> new file mode 100644
> index 0000000..a72e36c
> --- /dev/null
> +++ b/tools/perf/bench/mem-memcpy-arch.h
> @@ -0,0 +1,12 @@
> +
> +#ifdef ARCH_X86_64
> +
> +#define MEMCPY_FN(fn, name, desc) \
> + extern void *fn(void *, const void *, size_t);
> +
> +#include "mem-memcpy-x86-64-asm-def.h"
> +
> +#undef MEMCPY_FN
> +
> +#endif
> +
> diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm-def.h b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
> new file mode 100644
> index 0000000..d588b87
> --- /dev/null
> +++ b/tools/perf/bench/mem-memcpy-x86-64-asm-def.h
> @@ -0,0 +1,4 @@
> +
> +MEMCPY_FN(__memcpy,
> + "x86-64-unrolled",
> + "unrolled memcpy() in arch/x86/lib/memcpy_64.S")
> diff --git a/tools/perf/bench/mem-memcpy-x86-64-asm.S b/tools/perf/bench/mem-memcpy-x86-64-asm.S
> new file mode 100644
> index 0000000..a57b66e
> --- /dev/null
> +++ b/tools/perf/bench/mem-memcpy-x86-64-asm.S
> @@ -0,0 +1,2 @@
> +
> +#include "../../../arch/x86/lib/memcpy_64.S"
> diff --git a/tools/perf/util/include/asm/cpufeature.h b/tools/perf/util/include/asm/cpufeature.h
> new file mode 100644
> index 0000000..acffd5e
> --- /dev/null
> +++ b/tools/perf/util/include/asm/cpufeature.h
> @@ -0,0 +1,9 @@
> +
> +#ifndef PERF_CPUFEATURE_H
> +#define PERF_CPUFEATURE_H
> +
> +/* cpufeature.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
> +
> +#define X86_FEATURE_REP_GOOD 0
> +
> +#endif /* PERF_CPUFEATURE_H */
> diff --git a/tools/perf/util/include/asm/dwarf2.h b/tools/perf/util/include/asm/dwarf2.h
> new file mode 100644
> index 0000000..bb4198e
> --- /dev/null
> +++ b/tools/perf/util/include/asm/dwarf2.h
> @@ -0,0 +1,11 @@
> +
> +#ifndef PERF_DWARF2_H
> +#define PERF_DWARF2_H
> +
> +/* dwarf2.h ... dummy header file for including arch/x86/lib/memcpy_64.S */
> +
> +#define CFI_STARTPROC
> +#define CFI_ENDPROC
> +
> +#endif /* PERF_DWARF2_H */
> +
> diff --git a/tools/perf/util/include/linux/linkage.h b/tools/perf/util/include/linux/linkage.h
> new file mode 100644
> index 0000000..06387cf
> --- /dev/null
> +++ b/tools/perf/util/include/linux/linkage.h
> @@ -0,0 +1,13 @@
> +
> +#ifndef PERF_LINUX_LINKAGE_H_
> +#define PERF_LINUX_LINKAGE_H_
> +
> +/* linkage.h ... for including arch/x86/lib/memcpy_64.S */
> +
> +#define ENTRY(name) \
> + .globl name; \
> + name:
> +
> +#define ENDPROC(name)
> +
> +#endif /* PERF_LINUX_LINKAGE_H_ */
>

Thanks for applying this, Ingo!

BTW, I have a question.
Why does the symbol name of the rep-prefix memcpy() start with '.'?
A symbol whose name starts with '.', like ".Lmemcpy_c", cannot be seen
as a symbol name after compilation.

I couldn't find the reason why .Lmemcpy_c has to start with '.'.
For example, clear_page in arch/x86/lib/clear_page_64.S
does not start with '.', even though it is also an alternative function.

If there is no special reason, I'd like to rename it.

Thanks,
Hitoshi