On Fri, May 30, 2014 at 07:49:52AM -0600, Jens Axboe wrote:
> On 2014-05-30 06:11, Shaohua Li wrote:
> >On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
> >>On 05/09/2014 08:12 AM, Jens Axboe wrote:
> >>>On 05/09/2014 03:17 AM, Matias Bjørling wrote:
> >>>>With multi-million IOPS and multi-node workloads, the atomic_t in_flight
> >>>>tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
> >>>>counters to alleviate this.
> >>>
> >>>The part stats are a pain in the butt, I've tried to come up with a
> >>>great fix for them too. But I don't think the percpu conversion is
> >>>necessarily the right one. The summing is part of the hotpath, so percpu
> >>>counters aren't necessarily the right way to go. I don't have a better
> >>>answer right now, otherwise it would have been fixed :-)
> >>
> >>Actual data point - this slows my test down ~14% compared to the stock
> >>kernel. Also, if you experiment with this, you need to watch for the
> >>out-of-core users of the part stats (like DM).
> >
> >I had a try with Matias's patch. Performance actually boosts significantly
> >(there are other cache line issues though, e.g. hd_struct_get). Jens, what did
> >you run? part_in_flight() has 3 usages. Two are for status output, which are
> >cold paths. part_round_stats_single() uses it too, but it's a cold path as
> >well, since we sample data every jiffy. Are you using HZ=1000? Maybe we should
> >sample the data every 10ms instead of every jiffy?
>
> I ran peak and normal benchmarks on a p320, on a 4-socket box (64
> cores). The problem is the one hot path of part_in_flight(); summing
> percpu counters for that is too expensive. On bigger systems than
> mine, it'd be even worse.
I ran a null_blk test on a 4-socket machine, and Matias's patch shows an
improvement. And I didn't find part_in_flight() called in any hot path.
Thanks,
Shaohua
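A minimal userspace sketch of the trade-off being debated above, assuming a
64-core box like the one Jens tested on: the percpu conversion makes the
in-flight increment/decrement a cheap local-CPU operation, but turns every
read into a sum over all CPUs. The names, padding and NR_CPUS value below are
illustrative assumptions, not the actual patch:

#include <stdio.h>

#define NR_CPUS 64	/* assumption: a 64-core box */

/* One counter slot per CPU, padded to a cache line so writers on
 * different CPUs don't bounce the same line. */
struct in_flight_pcpu {
	long in_flight[2];			/* [0] = read, [1] = write */
	char pad[64 - 2 * sizeof(long)];
};

static struct in_flight_pcpu pcpu[NR_CPUS];

/* Start/complete touch only the local CPU's slot: O(1), no contention. */
static void io_start(int cpu, int rw) { pcpu[cpu].in_flight[rw]++; }
static void io_done(int cpu, int rw)  { pcpu[cpu].in_flight[rw]--; }

/* The read side walks every CPU. This loop is the part_in_flight() cost
 * being objected to when it sits in the per-IO accounting path. */
static long part_in_flight(void)
{
	long sum = 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += pcpu[cpu].in_flight[0] + pcpu[cpu].in_flight[1];
	return sum;
}

int main(void)
{
	io_start(3, 1);
	io_start(7, 0);
	io_done(3, 1);
	printf("in flight: %ld\n", part_in_flight());	/* prints 1 */
	return 0;
}

Jens's NR_CPUS remark further down is about exactly this read-side loop: the
wider the possible-CPU range the kernel is configured for, the more cache
lines every sum has to touch.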
It's in

blk_account_io_start
  part_round_stats
    part_round_stats_single
      part_in_flight

I like the granularity idea.
Thanks,
Matias
On Wed, Jun 4, 2014 at 12:39 PM, Shaohua Li <[email protected]> wrote:
> On Fri, May 30, 2014 at 07:49:52AM -0600, Jens Axboe wrote:
>> On 2014-05-30 06:11, Shaohua Li wrote:
>> >On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
>> >>On 05/09/2014 08:12 AM, Jens Axboe wrote:
>> >>>On 05/09/2014 03:17 AM, Matias Bjørling wrote:
>> >>>>With multi-million IOPS and multi-node workloads, the atomic_t in_flight
>> >>>>tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
>> >>>>counters to alleviate this.
>> >>>
>> >>>The part stats are a pain in the butt, I've tried to come up with a
>> >>>great fix for them too. But I don't think the percpu conversion is
>> >>>necessarily the right one. The summing is part of the hotpath, so percpu
>> >>>counters aren't necessarily the right way to go. I don't have a better
>> >>>answer right now, otherwise it would have been fixed :-)
>> >>
>> >>Actual data point - this slows my test down ~14% compared to the stock
>> >>kernel. Also, if you experiment with this, you need to watch for the
>> >>out-of-core users of the part stats (like DM).
>> >
>> >I had a try with Matias's patch. Performance actually boosts significantly
>> >(there are other cache line issues though, e.g. hd_struct_get). Jens, what did
>> >you run? part_in_flight() has 3 usages. Two are for status output, which are
>> >cold paths. part_round_stats_single() uses it too, but it's a cold path as
>> >well, since we sample data every jiffy. Are you using HZ=1000? Maybe we should
>> >sample the data every 10ms instead of every jiffy?
>>
>> I ran peak and normal benchmarks on a p320, on a 4-socket box (64
>> cores). The problem is the one hot path of part_in_flight(); summing
>> percpu counters for that is too expensive. On bigger systems than
>> mine, it'd be even worse.
>
> I ran a null_blk test on a 4-socket machine, and Matias's patch shows an
> improvement. And I didn't find part_in_flight() called in any hot path.
>
> Thanks,
> Shaohua
On 2014-06-04 04:39, Shaohua Li wrote:
> On Fri, May 30, 2014 at 07:49:52AM -0600, Jens Axboe wrote:
>> On 2014-05-30 06:11, Shaohua Li wrote:
>>> On Fri, May 09, 2014 at 10:41:27AM -0600, Jens Axboe wrote:
>>>> On 05/09/2014 08:12 AM, Jens Axboe wrote:
>>>>> On 05/09/2014 03:17 AM, Matias Bjørling wrote:
>>>>>> With multi-million IOPS and multi-node workloads, the atomic_t in_flight
>>>>>> tracking becomes a bottleneck. Change the in-flight accounting to per-cpu
>>>>>> counters to alleviate this.
>>>>>
>>>>> The part stats are a pain in the butt, I've tried to come up with a
>>>>> great fix for them too. But I don't think the percpu conversion is
>>>>> necessarily the right one. The summing is part of the hotpath, so percpu
>>>>> counters aren't necessarily the right way to go. I don't have a better
>>>>> answer right now, otherwise it would have been fixed :-)
>>>>
>>>> Actual data point - this slows my test down ~14% compared to the stock
>>>> kernel. Also, if you experiment with this, you need to watch for the
>>>> out-of-core users of the part stats (like DM).
>>>
>>> I had a try with Matias's patch. Performance actually boosts significantly
>>> (there are other cache line issues though, e.g. hd_struct_get). Jens, what did
>>> you run? part_in_flight() has 3 usages. Two are for status output, which are
>>> cold paths. part_round_stats_single() uses it too, but it's a cold path as
>>> well, since we sample data every jiffy. Are you using HZ=1000? Maybe we should
>>> sample the data every 10ms instead of every jiffy?
>>
>> I ran peak and normal benchmarks on a p320, on a 4-socket box (64
>> cores). The problem is the one hot path of part_in_flight(); summing
>> percpu counters for that is too expensive. On bigger systems than
>> mine, it'd be even worse.
>
> I ran a null_blk test on a 4-socket machine, and Matias's patch shows an
> improvement. And I didn't find part_in_flight() called in any hot path.
It's done for every IO completion, which is (by definition) a hot path. I
tested on two devices here, and it was definitely slower. And my system's
NR_CPUS was only just large enough for its cores; I suspect it'd be much
worse on bigger systems.
--
Jens Axboe
On 06/04/2014 05:29 AM, Matias Bjørling wrote:
> It's in
>
> blk_account_io_start
>   part_round_stats
>     part_round_stats_single
>       part_in_flight
>
> I like the granularity idea.
And similarly from blk_account_io_done() - which makes it even worse,
since it's at both ends of the IO chain.
--
Jens Axboe
On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
> On 06/04/2014 05:29 AM, Matias Bjørling wrote:
> > It's in
> >
> > blk_account_io_start
> >   part_round_stats
> >     part_round_stats_single
> >       part_in_flight
> >
> > I like the granularity idea.
>
> And similarly from blk_account_io_done() - which makes it even worse,
> since it's at both ends of the IO chain.
But part_round_stats_single() is only supposed to call part_in_flight() once
per jiffy. Maybe we need something like the patch below:
1. set part->stamp immediately
2. use a fixed granularity
Untested though.
diff --git a/block/blk-core.c b/block/blk-core.c
index 40d6548..5f0acaa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
 			    unsigned long now)
 {
 	int inflight;
+	unsigned long old_stamp;
 
-	if (now == part->stamp)
+	if (time_before(now, part->stamp + msecs_to_jiffies(10)))
 		return;
+	old_stamp = part->stamp;
+	part->stamp = now;
 
 	inflight = part_in_flight(part);
 	if (inflight) {
 		__part_stat_add(cpu, part, time_in_queue,
-				inflight * (now - part->stamp));
-		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
+				inflight * (now - old_stamp));
+		__part_stat_add(cpu, part, io_ticks, (now - old_stamp));
 	}
-	part->stamp = now;
 }
 
 /**
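One detail worth noting about the patch above: time_before() is the kernel's
wraparound-safe jiffies comparison, so the 10ms window keeps working when the
jiffies counter overflows. A small userspace sketch, with the macro bodies
copied in for illustration and made-up values:

#include <stdio.h>

/* Userspace copies of the kernel's wrap-safe jiffies comparisons,
 * for illustration only. */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

int main(void)
{
	unsigned long stamp  = (unsigned long)-6;	/* jiffies about to wrap */
	unsigned long window = 3;			/* stamp + window doesn't wrap */
	unsigned long now    = stamp + 8;		/* 8 ticks later, past the wrap */

	/* A naive "now < stamp + window" wrongly claims we're still inside
	 * the window; the signed-difference trick gets it right. */
	printf("naive: %d, wrap-safe: %d\n",
	       now < stamp + window ? 1 : 0,
	       time_before(now, stamp + window) ? 1 : 0);	/* naive: 1, wrap-safe: 0 */
	return 0;
}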
On 2014-06-04 20:09, Shaohua Li wrote:
> On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
>> On 06/04/2014 05:29 AM, Matias Bjørling wrote:
>>> It's in
>>>
>>> blk_account_io_start
>>>   part_round_stats
>>>     part_round_stats_single
>>>       part_in_flight
>>>
>>> I like the granularity idea.
>>
>> And similarly from blk_account_io_done() - which makes it even worse,
>> since it's at both ends of the IO chain.
>
> But part_round_stats_single() is only supposed to call part_in_flight() once
> per jiffy. Maybe we need something like the patch below:
> 1. set part->stamp immediately
> 2. use a fixed granularity
> Untested though.
>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 40d6548..5f0acaa 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
> unsigned long now)
> {
> int inflight;
> + unsigned long old_stamp;
>
> - if (now == part->stamp)
> + if (time_before(now, part->stamp + msecs_to_jiffies(10)))
> return;
> + old_stamp = part->stamp;
> + part->stamp = now;
>
> inflight = part_in_flight(part);
> if (inflight) {
> __part_stat_add(cpu, part, time_in_queue,
> - inflight * (now - part->stamp));
> - __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
> + inflight * (now - old_stamp));
> + __part_stat_add(cpu, part, io_ticks, (now - old_stamp));
> }
> - part->stamp = now;
> }
>
> /**
It'd be a good improvement, and one we should be able to do without
screwing anything up. It'd be identical to anyone running at HZ==100
right now.

So the above we can easily do, and arguably should just do. We won't see
real scaling in the IO stats path until we fix up the hd_struct
referencing as well, however.
--
Jens Axboe
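The arithmetic behind that HZ==100 equivalence, assuming the usual
approximation that msecs_to_jiffies(ms) is ms * HZ / 1000 for these values
(illustrative, not the kernel implementation):

#include <stdio.h>

/* Illustrative approximation of msecs_to_jiffies() for round values;
 * not the kernel implementation. */
static unsigned long msecs_to_jiffies(unsigned long ms, unsigned long hz)
{
	return ms * hz / 1000;
}

int main(void)
{
	/* The patch's 10ms window, in jiffies, at the two common HZ settings: */
	printf("HZ=1000: %lu jiffies\n", msecs_to_jiffies(10, 1000));	/* 10 */
	printf("HZ=100:  %lu jiffies\n", msecs_to_jiffies(10, 100));	/* 1  */
	return 0;
}

So an HZ=1000 kernel would round the stats at most every 10 jiffies, the same
10ms cadence an HZ=100 kernel already gets from the stock once-per-jiffy check.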
On Wed, Jun 04, 2014 at 08:16:32PM -0600, Jens Axboe wrote:
> On 2014-06-04 20:09, Shaohua Li wrote:
> >On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
> >>On 06/04/2014 05:29 AM, Matias Bjørling wrote:
> >>>It's in
> >>>
> >>>blk_account_io_start
> >>>  part_round_stats
> >>>    part_round_stats_single
> >>>      part_in_flight
> >>>
> >>>I like the granularity idea.
> >>
> >>And similarly from blk_account_io_done() - which makes it even worse,
> >>since it's at both ends of the IO chain.
> >
> >But part_round_stats_single() is only supposed to call part_in_flight() once
> >per jiffy. Maybe we need something like the patch below:
> >1. set part->stamp immediately
> >2. use a fixed granularity
> >Untested though.
> >
> >
> >diff --git a/block/blk-core.c b/block/blk-core.c
> >index 40d6548..5f0acaa 100644
> >--- a/block/blk-core.c
> >+++ b/block/blk-core.c
> >@@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
> > unsigned long now)
> > {
> > int inflight;
> >+ unsigned long old_stamp;
> >
> >- if (now == part->stamp)
> >+ if (time_before(now, part->stamp + msecs_to_jiffies(10)))
> > return;
> >+ old_stamp = part->stamp;
> >+ part->stamp = now;
> >
> > inflight = part_in_flight(part);
> > if (inflight) {
> > __part_stat_add(cpu, part, time_in_queue,
> >- inflight * (now - part->stamp));
> >- __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
> >+ inflight * (now - old_stamp));
> >+ __part_stat_add(cpu, part, io_ticks, (now - old_stamp));
> > }
> >- part->stamp = now;
> > }
> >
> > /**
>
> It'd be a good improvement, and one we should be able to do without
> screwing anything up. It'd be identical to anyone running at HZ==100
> right now.
>
> So the above we can easily do, and arguably should just do. We won't
> see real scaling in the IO stats path until we fix up the hd_struct
> referencing as well, however.
That's true. Maybe a percpu_ref would work here.
Thanks,
Shaohua
On 2014-06-04 20:33, Shaohua Li wrote:
> On Wed, Jun 04, 2014 at 08:16:32PM -0600, Jens Axboe wrote:
>> On 2014-06-04 20:09, Shaohua Li wrote:
>>> On Wed, Jun 04, 2014 at 02:08:46PM -0600, Jens Axboe wrote:
>>>> On 06/04/2014 05:29 AM, Matias Bjørling wrote:
>>>>> It's in
>>>>>
>>>>> blk_account_io_start
>>>>>   part_round_stats
>>>>>     part_round_stats_single
>>>>>       part_in_flight
>>>>>
>>>>> I like the granularity idea.
>>>>
>>>> And similarly from blk_account_io_done() - which makes it even worse,
>>>> since it's at both ends of the IO chain.
>>>
>>> But part_round_stats_single() is only supposed to call part_in_flight() once
>>> per jiffy. Maybe we need something like the patch below:
>>> 1. set part->stamp immediately
>>> 2. use a fixed granularity
>>> Untested though.
>>>
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 40d6548..5f0acaa 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1270,17 +1270,19 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
>>> unsigned long now)
>>> {
>>> int inflight;
>>> + unsigned long old_stamp;
>>>
>>> - if (now == part->stamp)
>>> + if (time_before(now, part->stamp + msecs_to_jiffies(10)))
>>> return;
>>> + old_stamp = part->stamp;
>>> + part->stamp = now;
>>>
>>> inflight = part_in_flight(part);
>>> if (inflight) {
>>> __part_stat_add(cpu, part, time_in_queue,
>>> - inflight * (now - part->stamp));
>>> - __part_stat_add(cpu, part, io_ticks, (now - part->stamp));
>>> + inflight * (now - old_stamp));
>>> + __part_stat_add(cpu, part, io_ticks, (now - old_stamp));
>>> }
>>> - part->stamp = now;
>>> }
>>>
>>> /**
>>
>> It'd be a good improvement, and one we should be able to do without
>> screwing anything up. It'd be identical to anyone running at HZ==100
>> right now.
>>
>> So the above we can easily do, and arguably should just do. We won't
>> see real scaling in the IO stats path until we fix up the hd_struct
>> referencing as well, however.
>
> That's true. Maybe a percpu_ref would work here.
Maybe, but it would require more than a direct replacement. The
hd_struct stuff currently relies on things like atomic_inc_not_zero(),
which would not be cheap to replicate. And this does happen for every
new IO, so it can't be amortized over time like the part stats rounding.
--
Jens Axboe
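For reference, the percpu_ref machinery Shaohua suggests keeps a per-cpu
count while the ref is live and only falls back to a shared atomic after
percpu_ref_kill(), so a tryget on a live ref is a local increment rather than
an atomic_inc_not_zero(). A rough sketch of how hd_struct's refcounting might
look on top of it; the struct and helper names are hypothetical, and the
percpu_ref init/alloc details (which varied across kernel versions) are left
out:

#include <linux/kernel.h>
#include <linux/percpu-refcount.h>
#include <linux/slab.h>

/* Hypothetical stand-in for hd_struct with its refcount converted. */
struct hd_sketch {
	struct percpu_ref ref;
	/* ... partition fields ... */
};

static void hd_sketch_release(struct percpu_ref *ref)
{
	/* Last reference dropped: safe to free the partition. */
	kfree(container_of(ref, struct hd_sketch, ref));
}

/* Per-IO fast path: while the partition is live this is a percpu
 * increment, not a shared atomic RMW like atomic_inc_not_zero(). */
static bool hd_sketch_tryget(struct hd_sketch *hd)
{
	return percpu_ref_tryget_live(&hd->ref);
}

static void hd_sketch_put(struct hd_sketch *hd)
{
	percpu_ref_put(&hd->ref);
}

/* On partition removal: flip to atomic mode and drop the base ref;
 * hd_sketch_release() runs once all outstanding gets are put. */
static void hd_sketch_remove(struct hd_sketch *hd)
{
	percpu_ref_kill(&hd->ref);
}

Whether that plays nicely with the out-of-core users of the part stats that
Jens mentioned earlier (like DM) would still need checking.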