On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
> But other systems (more dense?) showed increased cache-hit rate
> up to 20%, i.e. this one:
Hello Gentlemen,
Any feedback on this?
Thanks!
--
Regards,
Alexander Gordeev
[email protected]
On 2014-04-22 01:10, Alexander Gordeev wrote:
> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>> But other systems (more dense?) showed increased cache-hit rate
>> up to 20%, i.e. this one:
>
> Hello Gentlemen,
>
> Any feedback on this?
Sorry for dropping the ball on this. Improvements wrt when to steal, how
much, and from whom are sorely needed in percpu_ida. I'll do a bench
with this on a system that currently falls apart with it.
--
Jens Axboe
On 04/22/2014 08:03 AM, Jens Axboe wrote:
> On 2014-04-22 01:10, Alexander Gordeev wrote:
>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>> But other systems (more dense?) showed increased cache-hit rate
>>> up to 20%, i.e. this one:
>>
>> Hello Gentlemen,
>>
>> Any feedback on this?
>
> Sorry for dropping the ball on this. Improvements wrt when to steal, how
> much, and from whom are sorely needed in percpu_ida. I'll do a bench
> with this on a system that currently falls apart with it.
Ran some quick numbers with three kernels:
stock     3.15-rc2
limit     3.15-rc2 + steal limit patch (attached)
limit+ag  3.15-rc2 + steal limit + your topology patch
Two tests were run - the device has an effective queue depth limit of
255, so I ran one test at QD=248 (low) and one at QD=512 (high) to both
exercise near-limit depth and over limit depth. 8 processes were used,
split into two groups. One group would always run on the local node, the
other would be run on the adjacent node (near) or on the far node (far).
Near + low
-----------
          IOPS      sys time
stock     1009.5K   55.78%
limit     1084.4K   54.47%
limit+ag  1058.1K   52.42%

Near + high
-----------
          IOPS      sys time
stock      949.1K   75.12%
limit      980.7K   64.74%
limit+ag  1010.1K   70.27%

Far + low
---------
          IOPS      sys time
stock      600.0K   72.28%
limit      761.7K   71.17%
limit+ag   762.5K   74.48%

Far + high
----------
          IOPS      sys time
stock      465.9K   91.66%
limit      716.2K   88.68%
limit+ag   758.0K   91.00%
One huge issue on this box is that it's a 4 socket/node machine, with 32
cores (64 threads). Combined with a 255 queue depth limit, the percpu
caching does not work well. I did not include stock+ag results, they
didn't change things very much for me. We simply have to limit the
stealing first, or we're still going to be hammering on percpu locks. If
we compare the top profiles from stock-far-high and limit+ag-far-high,
it looks pretty scary. Here's the stock one:
- 50,84%  fio  [kernel.kallsyms]
     _raw_spin_lock
      + 89,83% percpu_ida_alloc
      +  6,03% mtip_queue_rq
      +  2,90% percpu_ida_free
so 50% of the system time is spent acquiring a spinlock, with 90% of that
being percpu_ida. The limit+ag variant looks like this:
- 32,93%  fio  [kernel.kallsyms]
     _raw_spin_lock
      + 78,35% percpu_ida_alloc
      + 19,49% mtip_queue_rq
      +  1,21% __blk_mq_run_hw_queue
which is still pretty horrid and has plenty of room for improvement. I
think we need to make better decisions on the granularity of the tag
caching. If we ignore thread siblings, that'll double our effective
caching. If that's still not enough, I bet per-node/socket would be a
huge improvement.
--
Jens Axboe
Hi Jens,
On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe <[email protected]> wrote:
> On 04/22/2014 08:03 AM, Jens Axboe wrote:
>> On 2014-04-22 01:10, Alexander Gordeev wrote:
>>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>>> But other systems (more dense?) showed increased cache-hit rate
>>>> up to 20%, i.e. this one:
>>>
>>> Hello Gentlemen,
>>>
>>> Any feedback on this?
>>
>> Sorry for dropping the ball on this. Improvements wrt when to steal, how
>> much, and from whom are sorely needed in percpu_ida. I'll do a bench
>> with this on a system that currently falls apart with it.
>
> Ran some quick numbers with three kernels:
>
> stock 3.15-rc2
> limit 3.15-rc2 + steal limit patch (attached)
I am thinking about/working on this sort of improvement too, but my
idea is to compute tags->nr_max_cache as below:

nr_tags / hctx->max_nr_ctx

hctx->max_nr_ctx means the maximum number of sw queues mapped to the
hw queue; it would need to be introduced by this approach, and its
value should reflect the CPU topology.

It is a bit complicated to compute hctx->max_nr_ctx because
we need to take into account CPU hotplug and a possible
user-defined mapping callback.

If the user-defined mapping callback needn't be considered,
hctx->max_nr_ctx can be figured out before mapping the sw
queues in blk_mq_init_queue() by first supposing every CPU is
online; once that is done, the map for offline CPUs is cleared,
and then blk_mq_map_swqueue() is called.
In my null_blk test on a quad core SMP VM:
- 4 hw queues
- timer mode
With the above approach, tag allocation from local CPU can be
improved from:
5% -> 50% for boot CPU
30% -> 90% for non-boot CPU.
If no one objects to the idea, I'd like to post a patch for review.
Thanks,
--
Ming Lei
On 2014-04-22 18:53, Ming Lei wrote:
> Hi Jens,
>
> On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe <[email protected]> wrote:
>> On 04/22/2014 08:03 AM, Jens Axboe wrote:
>>> On 2014-04-22 01:10, Alexander Gordeev wrote:
>>>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>>>> But other systems (more dense?) showed increased cache-hit rate
>>>>> up to 20%, i.e. this one:
>>>>
>>>> Hello Gentlemen,
>>>>
>>>> Any feedback on this?
>>>
>>> Sorry for dropping the ball on this. Improvements wrt when to steal, how
>>> much, and from whom are sorely needed in percpu_ida. I'll do a bench
>>> with this on a system that currently falls apart with it.
>>
>> Ran some quick numbers with three kernels:
>>
>> stock 3.15-rc2
>> limit 3.15-rc2 + steal limit patch (attached)
>
> I am thinking about/working on this sort of improvement too, but my
> idea is to compute tags->nr_max_cache as below:
>
> nr_tags / hctx->max_nr_ctx
>
> hctx->max_nr_ctx means the maximum number of sw queues mapped to the
> hw queue; it would need to be introduced by this approach, and its
> value should reflect the CPU topology.
>
> It is a bit complicated to compute hctx->max_nr_ctx because
> we need to take into account CPU hotplug and a possible
> user-defined mapping callback.
We can always just update the caching info, that's not a big problem. We
update the mappings on those events anyway.
> If the user-defined mapping callback needn't be considered,
> hctx->max_nr_ctx can be figured out before mapping the sw
> queues in blk_mq_init_queue() by first supposing every CPU is
> online; once that is done, the map for offline CPUs is cleared,
> and then blk_mq_map_swqueue() is called.
I don't see how a user defined mapping would change things a whole lot.
It's just another point of updating the cache. Besides, user defined
mappings will be mostly (only?) for things like multiqueue, where the
caching info would likely remain static over a reconfigure.
> In my null_blk test on a quad core SMP VM:
>
> - 4 hw queue
> - timer mode
>
> With the above approach, tag allocation from local CPU can be
> improved from:
>
> 5% -> 50% for boot CPU
> 30% -> 90% for non-boot CPU.
>
> If no one objects to the idea, I'd like to post a patch for review.
Sent it out, that can't hurt. I'll take a look at it, and give it a test
spin as well.
--
Jens Axboe
Hi Jens,
On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe <[email protected]> wrote:
> On 2014-04-22 18:53, Ming Lei wrote:
>
>
>> In my null_blk test on a quad core SMP VM:
>>
>> - 4 hw queue
>> - timer mode
>>
>> With the above approach, tag allocation from local CPU can be
>> improved from:
>>
>> 5% -> 50% for boot CPU
>> 30% -> 90% for non-boot CPU.
>>
>> If no one objects the idea, I'd like to post a patch for review.
>
>
> Sent it out, that can't hurt. I'll take a look at it, and give it a test
> spin as well.
I have sent out the patch at the link below, and would appreciate it
if you could take a look:
marc.info/?l=linux-kernel&m=139826944613225&w=2
Thanks,
--
Ming Lei
On 04/25/2014 03:10 AM, Ming Lei wrote:
> Hi Jens,
>
> On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe <[email protected]> wrote:
>> On 2014-04-22 18:53, Ming Lei wrote:
>>
>>
>>> In my null_blk test on a quad core SMP VM:
>>>
>>> - 4 hw queue
>>> - timer mode
>>>
>>> With the above approach, tag allocation from local CPU can be
>>> improved from:
>>>
>>> 5% -> 50% for boot CPU
>>> 30% -> 90% for non-boot CPU.
>>>
>>> If no one objects the idea, I'd like to post a patch for review.
>>
>>
>> Sent it out, that can't hurt. I'll take a look at it, and give it a test
>> spin as well.
>
> I have sent out the patch at the link below, and would appreciate it
> if you could take a look:
>
> marc.info/?l=linux-kernel&m=139826944613225&w=2
Sorry, I did run it the other day. It has little to no effect here, but
that's mostly because there's so much other crap going on in there. The
most effective way to currently make it work better, is just to ensure
the caching pool is of a sane size.
I've got an alternative tagging scheme that I think would be useful for
the cases where the tag space to cpu ratio isn't big enough. So I think
we'll retain percpu_ida for the cases where it can cache enough, and
punt to an alternative scheme when not.
That doesn't mean we should not improve percpu_ida. There's quite a bit
of low hanging fruit in there.
--
Jens Axboe
Hi Jens,
On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <[email protected]> wrote:
> On 04/25/2014 03:10 AM, Ming Lei wrote:
>
> Sorry, I did run it the other day. It has little to no effect here, but
> that's mostly because there's so much other crap going on in there. The
> most effective way to currently make it work better, is just to ensure
> the caching pool is of a sane size.
Yes, that is just what the patch is doing, :-)
From the percpu_ida view, it is easy to observe that it can improve
allocation performance. I have several patches to export this
information via sysfs for monitoring percpu_ida performance.
>
> I've got an alternative tagging scheme that I think would be useful for
> the cases where the tag space to cpu ratio isn't big enough. So I think
> we'll retain percpu_ida for the cases where it can cache enough, and
> punt to an alternative scheme when not.
OK, care to comment on the patch or the idea of setting percpu cache
size as (nr_tags / hctx->nr_ctx)?
>
> That doesn't mean we should not improve percpu_ida. There's quite a bit
> of low hanging fruit in there.
IMO percpu_max_size in percpu_ida is very important for
performance, and it might need to be adjusted dynamically according
to the percpu allocation load, but that is far more complicated
to implement. The simplest way might be to fix the parameter
before percpu_ida_init().
Thanks,
--
Ming Lei
On 2014-04-25 18:01, Ming Lei wrote:
> Hi Jens,
>
> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <[email protected]> wrote:
>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>
>> Sorry, I did run it the other day. It has little to no effect here, but
>> that's mostly because there's so much other crap going on in there. The
>> most effective way to currently make it work better, is just to ensure
>> the caching pool is of a sane size.
>
> Yes, that is just what the patch is doing, :-)
But it's not enough. For instance, in my test case it's 255 tags and 64
CPUs. We end up in cross-cpu spinlock nightmare mode.
> From the percpu_ida view, it is easy to observe that it can improve
> allocation performance. I have several patches to export this
> information via sysfs for monitoring percpu_ida performance.
Sounds good!
>> I've got an alternative tagging scheme that I think would be useful for
>> the cases where the tag space to cpu ratio isn't big enough. So I think
>> we'll retain percpu_ida for the cases where it can cache enough, and
>> punt to an alternative scheme when not.
>
> OK, care to comment on the patch or the idea of setting percpu cache
> size as (nr_tags / hctx->nr_ctx)?
I think it's a good idea. The problem is that for percpu_ida to be
effective, you need a bigger cache than the 3 I'd get above. If that
isn't the case, it performs poorly. I'm just not convinced the design
can ever work in the realm of realistic queue depths.
>> That doesn't mean we should not improve percpu_ida. There's quite a bit
>> of low hanging fruit in there.
>
> IMO percpu_max_size in percpu_ida is very important for
> performance, and it might need to be adjusted dynamically according
> to the percpu allocation load, but that is far more complicated
> to implement. The simplest way might be to fix the parameter
> before percpu_ida_init().
That's what I did, essentially. Ensuring that the percpu_max_size is at
least 8 makes it a whole lot better here. But still slower than a
regular simple bitmap, which makes me sad. A fairly straightforward
cmpxchg-based scheme I tested here is around 20% faster than the bitmap
approach on a basic desktop machine, and around 35% faster on a
4-socket. Outside of NVMe, I can't think of cases where that approach
would not be faster than percpu_ida. That means all of SCSI, basically,
and the basic block drivers.
--
Jens Axboe
On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <[email protected]> wrote:
> On 2014-04-25 18:01, Ming Lei wrote:
>>
>> Hi Jens,
>>
>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <[email protected]> wrote:
>>>
>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>
>>> Sorry, I did run it the other day. It has little to no effect here, but
>>> that's mostly because there's so much other crap going on in there. The
>>> most effective way to currently make it work better, is just to ensure
>>> the caching pool is of a sane size.
>>
>>
>> Yes, that is just what the patch is doing, :-)
>
>
> But it's not enough.
Yes, the patch only helps cases with multiple hw queues and
offline CPUs.
> For instance, in my test case it's 255 tags and 64 CPUs.
> We end up in cross-cpu spinlock nightmare mode.
IMO, the scaling problem in the above case might be
caused either by the current percpu_ida design or by blk-mq's
usage of it.
One of the problems in blk-mq is that the 'set->queue_depth'
parameter from the driver isn't scalable; maybe it is reasonable to
introduce 'set->min_percpu_cache', so that 'tags->nr_max_cache'
can be computed as below:

max(nr_tags / hctx->nr_ctx, set->min_percpu_cache)

Another question for blk-mq is whether computing
tags->nr_max_cache as 'nr_tags / hctx->nr_ctx' can be improved at
all. The current approach assumes there is parallel I/O
activity on each CPU, but I wonder if that is the common
case in reality. Suppose there are N (N << online CPUs on a
big machine) concurrent I/Os on some of the CPUs; the percpu cache
could be increased a lot, to (nr_tags / N).
>
>
>> From percpu_ida view, it is easy to observe it can improve
>> allocation performance. I have several patches to export
>> these information by sysfs for monitoring percpu_ida
>> performance.
>
>
> Sounds good!
If we want to export percpu_ida allocation/free information via
sysfs for monitoring performance, percpu_ida needs a parent
kobject, which means it may be better to delay allocating the
percpu_ida tags until the sw/hw queues are initialized, as the patch does.
>
>
>>> I've got an alternative tagging scheme that I think would be useful for
>>> the cases where the tag space to cpu ratio isn't big enough. So I think
>>> we'll retain percpu_ida for the cases where it can cache enough, and
>>> punt to an alternative scheme when not.
>>
>>
>> OK, care to comment on the patch or the idea of setting percpu cache
>> size as (nr_tags / hctx->nr_ctx)?
>
>
> I think it's a good idea. The problem is that for percpu_ida to be
> effective, you need a bigger cache than the 3 I'd get above. If that isn't
> the case, it performs poorly. I'm just not convinced the design can ever
> work in the realm of realistic queue depths.
>
>
>
>>> That doesn't mean we should not improve percpu_ida. There's quite a bit
>>> of low hanging fruit in there.
>>
>>
>> IMO percpu_max_size in percpu_ida is very important for the
>> performance, and it might need to adjust dynamically according
>> to the percpu allocation loading, but it is far more complicated
>> to implement. And it might be the simplest way to fix the parameter
>> before percpu_ida_init().
>
>
> That's what I did, essentially. Ensuring that the percpu_max_size is at
> least 8 makes it a whole lot better here. But still slower than a regular
> simple bitmap, which makes me sad. A fairly straight forward cmpxchg based
> scheme I tested here is around 20% faster than the bitmap approach on a
> basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe,
> I can't think of cases where that approach would not be faster than
> percpu_ida. That means all of SCSI, basically, and the basic block drivers.
For percpu_ida to beat bitmap allocation, the local cache hit
ratio has to stay high; in my tests, that can be achieved with a
large enough local cache.
Thanks,
--
Ming Lei
On 04/29/2014 05:35 AM, Ming Lei wrote:
> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <[email protected]> wrote:
>> On 2014-04-25 18:01, Ming Lei wrote:
>>>
>>> Hi Jens,
>>>
>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <[email protected]> wrote:
>>>>
>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>
>>>> Sorry, I did run it the other day. It has little to no effect here, but
>>>> that's mostly because there's so much other crap going on in there. The
>>>> most effective way to currently make it work better, is just to ensure
>>>> the caching pool is of a sane size.
>>>
>>>
>>> Yes, that is just what the patch is doing, :-)
>>
>>
>> But it's not enough.
>
> Yes, the patch only helps cases with multiple hw queues and
> offline CPUs.
>
>> For instance, my test case, it's 255 tags and 64 CPUs.
>> We end up in cross-cpu spinlock nightmare mode.
>
> IMO, the scaling problem for the above case might be
> caused by either current percpu ida design or blk-mq's
> usage on it.
That is pretty much my claim, yes. Basically I don't think per-cpu tag
caching is ever going to be the best solution for the combination of
modern machines and the hardware that is out there (limited tags).
> One of the problems in blk-mq is that the 'set->queue_depth'
> parameter from the driver isn't scalable; maybe it is reasonable to
> introduce 'set->min_percpu_cache', so that 'tags->nr_max_cache'
> can be computed as below:
>
> max(nr_tags / hctx->nr_ctx, set->min_percpu_cache)
>
> Another question for blk-mq is whether computing
> tags->nr_max_cache as 'nr_tags / hctx->nr_ctx' can be improved at
> all. The current approach assumes there is parallel I/O
> activity on each CPU, but I wonder if that is the common
> case in reality. Suppose there are N (N << online CPUs on a
> big machine) concurrent I/Os on some of the CPUs; the percpu cache
> could be increased a lot, to (nr_tags / N).
That would certainly help the common case, but it'd still be slow for
the cases where you DO have IO from lots of sources. If we consider 8-16
tags the minimum for balanced performance, then that doesn't take a
whole lot of CPUs to spread out the tag space. Just looking at a case
today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the
"bigger" case of the Micron card, you still only have 255 active tags.
And we probably want to split that up into groups of 32, making the
problem even worse.
>> That's what I did, essentially. Ensuring that the percpu_max_size is at
>> least 8 makes it a whole lot better here. But still slower than a regular
>> simple bitmap, which makes me sad. A fairly straight forward cmpxchg based
>> scheme I tested here is around 20% faster than the bitmap approach on a
>> basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe,
>> I can't think of cases where that approach would not be faster than
>> percpu_ida. That means all of SCSI, basically, and the basic block drivers.
>
> For percpu_ida to beat bitmap allocation, the local cache hit
> ratio has to stay high; in my tests, that can be achieved with a
> large enough local cache.
Yes, that is exactly the issue, local cache hit must be high, and you
pretty much need a higher local cache count for that. And therein lies
the problem, you can't get that high local cache size for most common
cases. With enough tags we could, but that's not what most people will run.
--
Jens Axboe
On Wed, Apr 30, 2014 at 5:13 AM, Jens Axboe <[email protected]> wrote:
> On 04/29/2014 05:35 AM, Ming Lei wrote:
>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <[email protected]> wrote:
>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>
>>>> Hi Jens,
>>>>
>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <[email protected]> wrote:
>>>>>
>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>
>>>>> Sorry, I did run it the other day. It has little to no effect here, but
>>>>> that's mostly because there's so much other crap going on in there. The
>>>>> most effective way to currently make it work better, is just to ensure
>>>>> the caching pool is of a sane size.
>>>>
>>>>
>>>> Yes, that is just what the patch is doing, :-)
>>>
>>>
>>> But it's not enough.
>>
>> Yes, the patch is only for cases of mutli hw queue and having
>> offline CPUs existed.
>>
>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>> We end up in cross-cpu spinlock nightmare mode.
>>
>> IMO, the scaling problem for the above case might be
>> caused by either current percpu ida design or blk-mq's
>> usage on it.
>
> That is pretty much my claim, yes. Basically I don't think per-cpu tag
> caching is ever going to be the best solution for the combination of
> modern machines and the hardware that is out there (limited tags).
>
>> One of problems in blk-mq is that the 'set->queue_depth'
>> parameter from driver isn't scalable, maybe it is reasonable to
>> introduce 'set->min_percpu_cache', then ' tags->nr_max_cache'
>> can be computed as below:
>>
>> max(nr_tags / hctx->nr_ctx, set->min_percpu_cache)
>>
>> Another problem in blk-mq is that if it can be improved by computing
>> tags->nr_max_cache as 'nr_tags / hctx->nr_ctx' ? The current
>> approach should be based on that there are parallel I/O
>> activity on each CPU, but I am wondering if it is the common
>> case in reality. Suppose there are N(N << online CPUs in
>> big machine) concurrent I/O on some of CPUs, percpu cache
>> can be increased a lot by (nr_tags / N).
>
> That would certainly help the common case, but it'd still be slow for
> the cases where you DO have IO from lots of sources. If we consider
It may be difficult to figure out an efficient solution for the unusual case.
> 8-16
> tags the minimum for balanced performance, then that doesn't take a
> whole lot of CPUs to spread out the tag space. Just looking at a case
> today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the
> "bigger" case of the Micron card, you still only have 255 active tags.
> And we probably want to split that up into groups of 32, making the
> problem even worse.
Yes, that is a contradiction between having limited hardware
queue tags and needing a bigger local CPU cache. But it is really
a challenge to figure out an efficient approach when lots of
CPUs need to contend for very limited resources.

Maybe blk-mq can learn from network devices: move the hw
queue_depth constraint to a lower level (such as blk_mq_run_hw_queue()),
and keep adequate tags in the percpu pool, which means the pool's
nr_tags can be much bigger than queue_depth, leaving enough
percpu cache. When the hw queue is full, congestion control
can be applied in blk_mq_alloc_request()
to avoid the cross-cpu spinlock nightmare in percpu allocation/free.
But if the device requires tags to be less than queue_depth, more
work is needed for this approach.
Thanks,
--
Ming Lei