2008-03-14 12:14:37

by Alan D. Brunelle

Subject: IO CPU affinity test results

Good morning Jens -

I had two machines running the latest patches hang last night:

o 2-way AMD64 - I inadvertently left the patched kernel running, and I was moving a ton of data (100+GB) back up over the net to this node. It hard hung (believe it or not) about 99% of the way through. Hard hang, wouldn't respond to anything.

o 4-way IA64 - I was performing a simple test: [mkfs / mount / untar linux sources / make allnoconfig / make -j 5 / umount] repeatedly switching rq_affinity to 0/1 between each run. After 22 passes it had a hard hang with rq_affinity set to 1.

Of course, there is no way of knowing if either hang had anything to do with the patches, but it seems a bit ominous as RQ=1 was set in both cases.

This same test worked fine for 30 passes on a 2-way AMD64 box, with the following results:

Part RQ MIN AVG MAX Dev
----- -- ------ ------ ------ ------
mkfs 0 41.656 41.862 42.086 0.141
mkfs 1 41.618 41.909 42.270 0.192

untar 0 18.055 19.611 20.906 0.720
untar 1 18.523 19.905 21.988 0.738

make 0 50.480 50.991 51.752 0.340
make 1 49.819 50.442 51.000 0.292

comb 0 110.433 112.464 114.176 0.932
comb 1 110.694 112.256 114.683 0.948

psys 0 10.28% 10.91% 11.29% 0.243
psys 1 10.21% 11.05% 11.80% 0.350


All results are in seconds (as measured by Python's time.time()), except for the psys - which was the average of mpstat's %sys column over the life of the whole run. The mkfs part consisted of [mkfs -t ext2 ; sync ; sync], untar [mount; untar linux sources; umount; sync; sync], make [mount; make allnoconfig; make -j 3; umount; sync; sync], and comb is the combined times of the mkfs, untar and make parts.
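
For reference, the driver script is roughly the shape sketched below. This is a minimal illustration only - the device, mount point and tarball paths are made up, and the real script also toggled rq_affinity between passes and ran mpstat alongside, both omitted here:

import os
import time

DEV = "/dev/sdb1"        # illustrative device
MNT = "/mnt/test"        # illustrative mount point
TAR = "/root/linux.tar"  # illustrative linux source tarball

def timed(cmds):
    # Run one part's commands through the shell; return elapsed seconds.
    start = time.time()
    for cmd in cmds:
        os.system(cmd)
    return time.time() - start

mkfs = timed(["mkfs -t ext2 %s" % DEV, "sync", "sync"])
untar = timed(["mount %s %s" % (DEV, MNT),
               "tar -C %s -xf %s" % (MNT, TAR),
               "umount %s" % MNT, "sync", "sync"])
make = timed(["mount %s %s" % (DEV, MNT),
              "cd %s/linux-2.6 && make allnoconfig && make -j 3" % MNT,
              "umount %s" % MNT, "sync", "sync"])
comb = mkfs + untar + make
print("mkfs %.3f untar %.3f make %.3f comb %.3f" % (mkfs, untar, make, comb))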

So, in a nutshell, we saw slightly better overall performance, but not conclusively, and we saw slightly elevated %system time to accomplish the task.

On the 4-way, results were much worse: the final data captured before the system hung showed the rq=1 passes taking significantly longer, albeit at lower %system. I'm going to try the runs again, but I have a feeling that the latest "clean" patch based upon Nick's single call mechanism is a step backwards.

Alan


2008-03-14 12:23:56

by Alan D. Brunelle

Subject: Re: IO CPU affinity test results

Alan D. Brunelle wrote:
> Good morning Jens -
>
> I had two machines running the latest patches hang last night:
>
> o 2-way AMD64 - I inadvertently left the patched kernel running, and I was moving a ton of data (100+GB) back up over the net to this node. It hard hung (believe it or not) about 99% of the way through. Hard hang, wouldn't respond to anything.
>
> o 4-way IA64 - I was performing a simple test: [mkfs / mount / untar linux sources / make allnoconfig / make -j 5 / umount] repeatedly switching rq_affinity to 0/1 between each run. After 22 passes it had a hard hang with rq_affinity set to 1.
>
> Of course, there is no way of knowing if either hang had anything to do with the patches, but it seems a bit ominous as RQ=1 was set in both cases.
>
> This same test worked fine for 30 passes on a 2-way AMD64 box, with the following results:
>
> Part RQ MIN AVG MAX Dev
> ----- -- ------ ------ ------ ------
> mkfs 0 41.656 41.862 42.086 0.141
> mkfs 1 41.618 41.909 42.270 0.192
>
> untar 0 18.055 19.611 20.906 0.720
> untar 1 18.523 19.905 21.988 0.738
>
> make 0 50.480 50.991 51.752 0.340
> make 1 49.819 50.442 51.000 0.292
>
> comb 0 110.433 112.464 114.176 0.932
> comb 1 110.694 112.256 114.683 0.948
>
> psys 0 10.28% 10.91% 11.29% 0.243
> psys 1 10.21% 11.05% 11.80% 0.350
>
>
> All results are in seconds (as measured by Python's time.time()), except for the psys - which was the average of mpstat's %sys column over the life of the whole run. The mkfs part consisted of [mkfs -t ext2 ; sync ; sync], untar [mount; untar linux sources; umount; sync; sync], make [mount; make allnoconfig; make -j 3; umount; sync; sync], and comb is the combined times of the mkfs, untar and make parts.
>
> So, in a nutshell, we saw slightly better overall performance, but not conclusively, and we saw slightly elevated %system time to accomplish the task.
>
> On the 4-way, results were much worse: the final data captured before the system hung showed the rq=1 passes taking significantly longer, albeit at lower %system. I'm going to try the runs again, but I have a feeling that the latest "clean" patch based upon Nick's single call mechanism is a step backwards.
>
> Alan

I was able to go back and capture the results after 17 passes on the 4-way IA64 box (before it hung), and with rq=1 they show a huge increase in the time needed for the combined tests - almost 18% longer, though with about 24% less system time.

Part RQ MIN AVG MAX Dev
----- -- ------ ------ ------ ------
mkfs 0 18.543 19.055 19.514 0.285
mkfs 1 18.730 19.217 19.812 0.316

untar 0 17.119 21.396 43.868 8.025
untar 1 16.987 28.155 44.637 10.175

make 0 23.105 23.866 24.487 0.359
make 1 24.015 28.384 37.829 3.598

comb 0 59.610 64.317 86.733 7.896
comb 1 63.181 75.755 94.079 10.489

psys 0 10.35% 14.16% 16.28% 1.375
psys 1 6.89% 10.73% 14.30% 2.368
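
To spell out the arithmetic, the two percentages come straight from the comb and psys AVG columns:

comb_rq0, comb_rq1 = 64.317, 75.755
psys_rq0, psys_rq1 = 14.16, 10.73
print("%.1f%% longer" % ((comb_rq1 / comb_rq0 - 1) * 100))    # 17.8% longer combined time
print("%.1f%% less sys" % ((1 - psys_rq1 / psys_rq0) * 100))  # 24.2% less %system time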

I'll try to snag some profile data to see what's up.

Alan
PS. Besides the AMD64/IA64 architectural difference, the underlying storage was different as well: U320 on the AMD64 and FC on the IA64. I don't know if that has anything to do with the different results seen on the two hosts.

2008-03-14 12:36:20

by Jens Axboe

Subject: Re: IO CPU affinity test results

On Fri, Mar 14 2008, Alan D. Brunelle wrote:
> Good morning Jens -
>
> I had two machines running the latest patches hang last night:
>
> o 2-way AMD64 - I inadvertently left the patched kernel running, and
> I was moving a ton of data (100+GB) back up over the net to this node.
> It hard hung (believe it or not) about 99% of the way through. Hard
> hang, wouldn't respond to anything.
>
> o 4-way IA64 - I was performing a simple test: [mkfs / mount / untar
> linux sources / make allnoconfig / make -j 5 / umount] repeatedly
> switching rq_affinity to 0/1 between each run. After 22 passes it had
> a hard hang with rq_affinity set to 1.
>
> Of course, there is no way of knowing if either hang had anything to
> do with the patches, but it seems a bit ominous as RQ=1 was set in
> both cases.
>
> This same test worked fine for 30 passes on a 2-way AMD64 box, with
> the following results:
>
> Part RQ MIN AVG MAX Dev
> ----- -- ------ ------ ------ ------
> mkfs 0 41.656 41.862 42.086 0.141
> mkfs 1 41.618 41.909 42.270 0.192
>
> untar 0 18.055 19.611 20.906 0.720
> untar 1 18.523 19.905 21.988 0.738
>
> make 0 50.480 50.991 51.752 0.340
> make 1 49.819 50.442 51.000 0.292
>
> comb 0 110.433 112.464 114.176 0.932
> comb 1 110.694 112.256 114.683 0.948
>
> psys 0 10.28% 10.91% 11.29% 0.243
> psys 1 10.21% 11.05% 11.80% 0.350
>
>
> All results are in seconds (as measured by Python's time.time()),
> except for the psys - which was the average of mpstat's %sys column
> over the life of the whole run. The mkfs part consisted of [mkfs -t
> ext2 ; sync ; sync], untar [mount; untar linux sources; umount; sync;
> sync], make [mount; make allnoconfig; make -j 3; umount; sync; sync],
> and comb is the combined times of the mkfs, untar and make parts.
>
> So, in a nutshell, we saw slightly better overall performance, but not
> conclusively, and we saw slightly elevated %system time to accomplish
> the task.

I think that is encouraging, for such a small setup. The make results
are particularly nice. The hangs are a bother; I have no good ideas on
why they occur. The fact that they happen on both archs indicates that
this is perhaps a generic problem, which is good. The code to support
this is relatively simple, so it should be possible to go over it with a
fine-toothed comb and see if anything shows up.

You didn't get any watchdog triggers on the serial console, or anything
like that?

> On the 4-way, results were much worse: the final data captured before
> the system hung showed the rq=1 passes taking significantly longer,
> albeit at lower %system. I'm going to try the runs again, but I have a
> feeling that the latest "clean" patch based upon Nick's single call
> mechanism is a step backwards.

As to the hangs, yes, it's definitely a step back. The cleanliness of
the code is better though, and it's something we can support, so I'm
inclined to try and make it work. The overhead of the cleaner approach
should not be a problem; it'll be a bit more costly than the IPI hack,
but not a lot. With the tuning and tweaks we got rid of the allocations
in the path, so it should be about as fast as the original.

So once the base is stable, we can go and profile and squeeze some more
performance out of it :-)

--
Jens Axboe

2008-03-14 12:38:04

by Jens Axboe

Subject: Re: IO CPU affinity test results

On Fri, Mar 14 2008, Alan D. Brunelle wrote:
> Alan D. Brunelle wrote:
> > Good morning Jens -
> >
> > I had two machines running the latest patches hang last night:
> >
> > o 2-way AMD64 - I inadvertently left the patched kernel running, and I was moving a ton of data (100+GB) back up over the net to this node. It hard hung (believe it or not) about 99% of the way through. Hard hang, wouldn't respond to anything.
> >
> > o 4-way IA64 - I was performing a simple test: [mkfs / mount / untar linux sources / make allnoconfig / make -j 5 / umount] repeatedly switching rq_affinity to 0/1 between each run. After 22 passes it had a hard hang with rq_affinity set to 1.
> >
> > Of course, there is no way of knowing if either hang had anything to do with the patches, but it seems a bit ominous as RQ=1 was set in both cases.
> >
> > This same test worked fine for 30 passes on a 2-way AMD64 box, with the following results:
> >
> > Part RQ MIN AVG MAX Dev
> > ----- -- ------ ------ ------ ------
> > mkfs 0 41.656 41.862 42.086 0.141
> > mkfs 1 41.618 41.909 42.270 0.192
> >
> > untar 0 18.055 19.611 20.906 0.720
> > untar 1 18.523 19.905 21.988 0.738
> >
> > make 0 50.480 50.991 51.752 0.340
> > make 1 49.819 50.442 51.000 0.292
> >
> > comb 0 110.433 112.464 114.176 0.932
> > comb 1 110.694 112.256 114.683 0.948
> >
> > psys 0 10.28% 10.91% 11.29% 0.243
> > psys 1 10.21% 11.05% 11.80% 0.350
> >
> >
> > All results are in seconds (as measured by Python's time.time()), except for the psys - which was the average of mpstat's %sys column over the life of the whole run. The mkfs part consisted of [mkfs -t ext2 ; sync ; sync], untar [mount; untar linux sources; umount; sync; sync], make [mount; make allnoconfig; make -j 3; umount; sync; sync], and comb is the combined times of the mkfs, untar and make parts.
> >
> > So, in a nutshell, we saw slightly better overall performance, but not conclusively, and we saw slightly elevated %system time to accomplish the task.
> >
> > On the 4-way, results were much worse: the final data captured before the system hung showed the rq=1 passes taking significantly longer, albeit at lower %system. I'm going to try the runs again, but I have a feeling that the latest "clean" patch based upon Nick's single call mechanism is a step backwards.
> >
> > Alan
>
> I was able to go back and capture the results after 17 passes on the
> 4-way IA64 box (before it hung), and with rq=1 they show a huge
> increase in the time needed for the combined tests - almost 18%
> longer, though with about 24% less system time.
>
> Part RQ MIN AVG MAX Dev
> ----- -- ------ ------ ------ ------
> mkfs 0 18.543 19.055 19.514 0.285
> mkfs 1 18.730 19.217 19.812 0.316
>
> untar 0 17.119 21.396 43.868 8.025
> untar 1 16.987 28.155 44.637 10.175
>
> make 0 23.105 23.866 24.487 0.359
> make 1 24.015 28.384 37.829 3.598
>
> comb 0 59.610 64.317 86.733 7.896
> comb 1 63.181 75.755 94.079 10.489
>
> psys 0 10.35% 14.16% 16.28% 1.375
> psys 1 6.89% 10.73% 14.30% 2.368
>
> I'll try to snag some profile data to see what's up.

That definitely looks like a bug. One thing to try would be to drop the
test for whether to send the IPI or not, and just unconditionally do it.
For ia64, that would be in
arch/ia64/kernel/smp.c:__smp_call_function_single(); just change

        if (ipi)
                send_IPI_single(cpu, IPI_CALL_FUNC_SINGLE);

to

        send_IPI_single(cpu, IPI_CALL_FUNC_SINGLE);

so that the target CPU always gets notified.

--
Jens Axboe

2008-03-14 13:06:00

by Jens Axboe

Subject: Re: IO CPU affinity test results

On Fri, Mar 14 2008, Jens Axboe wrote:
> I think that is encouraging, for such a small setup. The make results
> are particularly nice. The hangs are a bother; I have no good ideas on
> why they occur. The fact that they happen on both archs indicates that
> this is perhaps a generic problem, which is good. The code to support
> this is relatively simple, so it should be possible to go over it with a
> fine-toothed comb and see if anything shows up.
>
> You didn't get any watchdog triggers on the serial console, or anything
> like that?

Here's something that may explain it - if interrupts aren't disabled
when generic_smp_call_function_single_interrupt() is called, we could
deadlock on q->lock. I think the IPI invoke will run with interrupts
disabled, but I'm not 100% certain.

Can you see if this passes muster?

diff --git a/kernel/smp.c b/kernel/smp.c
index 852abd3..65808df 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -24,12 +24,13 @@ void __cpuinit generic_init_call_single_data(void)
 void generic_smp_call_function_single_interrupt(void)
 {
         struct call_single_queue *q;
+        unsigned long flags;
         LIST_HEAD(list);
 
         q = &__get_cpu_var(call_single_queue);
-        spin_lock(&q->lock);
+        spin_lock_irqsave(&q->lock, flags);
         list_replace_init(&q->list, &list);
-        spin_unlock(&q->lock);
+        spin_unlock_irqrestore(&q->lock, flags);
 
         while (!list_empty(&list)) {
                 struct call_single_data *data;

--
Jens Axboe

2008-03-14 13:41:32

by Alan D. Brunelle

Subject: Re: IO CPU affinity test results

Jens Axboe wrote:
> On Fri, Mar 14 2008, Jens Axboe wrote:
>> I think that is encouraging, for such a small setup. The make results
>> are particularly nice. The hangs are a bother; I have no good ideas on
>> why they occur. The fact that they happen on both archs indicates that
>> this is perhaps a generic problem, which is good. The code to support
>> this is relatively simple, so it should be possible to go over it with a
>> fine-toothed comb and see if anything shows up.
>>
>> You didn't get any watchdog triggers on the serial console, or anything
>> like that?
>
> Here's something that may explain it - if interrupts aren't disabled
> when generic_smp_call_function_single_interrupt() is called, we could
> deadlock on q->lock. I think the IPI invoke will run with interrupts
> disabled, but I'm not 100% certain.
>
> Can you see if this passes muster?
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 852abd3..65808df 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -24,12 +24,13 @@ void __cpuinit generic_init_call_single_data(void)
>  void generic_smp_call_function_single_interrupt(void)
>  {
>          struct call_single_queue *q;
> +        unsigned long flags;
>          LIST_HEAD(list);
>  
>          q = &__get_cpu_var(call_single_queue);
> -        spin_lock(&q->lock);
> +        spin_lock_irqsave(&q->lock, flags);
>          list_replace_init(&q->list, &list);
> -        spin_unlock(&q->lock);
> +        spin_unlock_irqrestore(&q->lock, flags);
>  
>          while (!list_empty(&list)) {
>                  struct call_single_data *data;
>

Well, putting in these two small patches (/always/ invoke send_IPI_single + the irqsave change to the two spin calls), things look better after two runs (without profiling enabled) on the 4-way IA64 box:

Part RQ MIN AVG MAX Dev
----- -- ------ ------ ------ ------
mkfs 0 18.151 18.311 18.472 0.227
mkfs 1 18.420 18.456 18.492 0.051

untar 0 18.260 18.420 18.581 0.228
untar 1 18.594 19.477 20.360 1.249

make 0 23.730 24.149 24.567 0.592
make 1 23.401 23.599 23.797 0.280

comb 0 60.141 60.881 61.620 1.046
comb 1 60.810 61.532 62.253 1.020

psys 0 4.16% 4.17% 4.18% 0.011
psys 1 3.80% 3.93% 4.06% 0.180

Will do a longer set of runs and report back on that. As you noted earlier, Jens, the make results look good...

Alan

2008-03-14 15:55:22

by Alan D. Brunelle

Subject: Re: IO CPU affinity test results

4-way IA64: better, but still not good after 30 runs (in single-user mode):

Part RQ MIN AVG MAX Dev
----- -- ------ ------ ------ ------
mkfs 0 18.893 19.429 20.129 0.351
mkfs 1 18.912 19.347 19.934 0.258

untar 0 17.007 17.538 18.829 0.416
untar 1 16.980 17.329 18.369 0.284

make 0 22.050 23.000 23.546 0.400
make 1 22.416 23.141 23.637 0.349

comb 0 58.138 59.967 61.923 0.793
comb 1 58.428 59.817 60.750 0.608

psys 0 7.69% 8.77% 9.93% 0.443
psys 1 8.47% 14.20% 16.33% 2.566

Better run time w/ rq=1 (though only by 0.25%), but at a huge cost in %sys. Will do some profiling to see what's up.

Alan