2014-01-04 21:06:53

by Kent Overstreet

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Tue, Dec 31, 2013 at 11:38:27AM +0800, Shaohua Li wrote:
>
> steal_tags only happens when free tags are more than half of the total tags.
> This is too restrictive and can cause a live lock. I found one cpu has free tags,
> but other cpus can't steal them (threads are bound to specific cpus), so threads which
> want to allocate tags are always sleeping. I found this when I ran the next patch,
> but I think this could happen without it.
>
> I did performance tests with null_blk too. Neither case (each cpu has enough percpu
> tags, or total tags are limited) shows any performance change.

This doesn't appear to me to fix anything wrong with the current code
(and it'll hurt performance) - we explicitly don't guarantee that all
the tags will be available for allocation at any given time, only half
of them. Can you explain more how this is being used where you're seeing
the issue? And I don't see the other patch in your patch series.


2014-01-05 13:18:27

by Shaohua Li

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Sat, Jan 04, 2014 at 01:08:04PM -0800, Kent Overstreet wrote:
> On Tue, Dec 31, 2013 at 11:38:27AM +0800, Shaohua Li wrote:
> >
> > steal_tags only happens when free tags are more than half of the total tags.
> > This is too restrictive and can cause a live lock. I found one cpu has free tags,
> > but other cpus can't steal them (threads are bound to specific cpus), so threads which
> > want to allocate tags are always sleeping. I found this when I ran the next patch,
> > but I think this could happen without it.
> >
> > I did performance tests with null_blk too. Neither case (each cpu has enough percpu
> > tags, or total tags are limited) shows any performance change.
>
> This doesn't appear to me to fix anything wrong with the current code
> (and it'll hurt performance)

I suspected this would hurt performance too, but it doesn't actually in my test.

> - we explicitly don't guarantee that all
> the tags will be available for allocation at any given time, only half
> of them.

Only half of the tags can be used? That is scary. Of course we hope all tags
are available.

> Can you explain more how this is being used where you're seeing
> the issue? And I don't see the other patch in your patch series.

Tasks are bound to specific cpus. And I saw one cpu had a lot of free tags
(but less than half of the total tags), but other cpus couldn't allocate tags,
so I saw a live lock.

Thanks,
Shaohua

2014-01-06 20:46:46

by Kent Overstreet

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Sun, Jan 05, 2014 at 09:13:00PM +0800, Shaohua Li wrote:
> On Sat, Jan 04, 2014 at 01:08:04PM -0800, Kent Overstreet wrote:
> > On Tue, Dec 31, 2013 at 11:38:27AM +0800, Shaohua Li wrote:
> > >
> > > steal_tags only happens when free tags are more than half of the total tags.
> > > This is too restrictive and can cause a live lock. I found one cpu has free tags,
> > > but other cpus can't steal them (threads are bound to specific cpus), so threads which
> > > want to allocate tags are always sleeping. I found this when I ran the next patch,
> > > but I think this could happen without it.
> > >
> > > I did performance tests with null_blk too. Neither case (each cpu has enough percpu
> > > tags, or total tags are limited) shows any performance change.
> >
> > This doesn't appear to me to fix anything wrong with the current code
> > (and it'll hurt performance)
>
> I suspected this would hurt performance too, but it doesn't actually in my test.
>
> > - we explicitly don't guarantee that all
> > the tags will be available for allocation at any given time, only half
> > of them.
>
> Only half of the tags can be used? That is scary. Of course we hope all tags
> are available.

No: that behaviour is explicitly documented and it's the behaviour we want.

There is an inherent tradeoff here between internal fragmentation
and performance: your patch tries to drive internal fragmentation to 0, but
under workloads that hit this, running on enough cores, this will have a VERY
high performance cost.

> > Can you explain more how this is being used where you're seeing
> > the issue? And I don't see the other patch in your patch series.
>
> Tasks are bound to specific cpus. And I saw one cpu had a lot of free tags
> (but less than half of the total tags), but other cpus couldn't allocate tags,
> so I saw a live lock.

You're using it wrong - don't assume all nr_tags can be allocated. If you need
to guarantee that you'll always be able to allocate a certain number of tags,
you pass twice that to percpu_ida_init() instead.
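
For illustration, a minimal consumer sketch of that over-provisioning approach
(hedged: the second argument to percpu_ida_alloc() is an assumption here, since
that prototype has been in flux; adjust it to whatever your tree uses):

#include <linux/gfp.h>
#include <linux/percpu_ida.h>

#define MY_QUEUE_DEPTH	64	/* allocations we must always be able to satisfy */

static struct percpu_ida my_tags;

static int my_tags_setup(void)
{
	/*
	 * Over-provision the pool: with 2 * MY_QUEUE_DEPTH tags, up to half of
	 * them may sit stranded on other CPUs' freelists while MY_QUEUE_DEPTH
	 * concurrent allocations still succeed.
	 */
	return percpu_ida_init(&my_tags, 2 * MY_QUEUE_DEPTH);
}

static int my_get_tag(void)
{
	/* gfp vs. task-state argument depends on the percpu_ida version in use */
	return percpu_ida_alloc(&my_tags, GFP_KERNEL);
}

static void my_put_tag(unsigned tag)
{
	percpu_ida_free(&my_tags, tag);
}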

Nacking this patch (and if you want me to look at it more, you still need to send me
the other patch in the series and show me your code that's using it).

2014-01-06 20:53:13

by Jens Axboe

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On 01/06/2014 01:46 PM, Kent Overstreet wrote:
> On Sun, Jan 05, 2014 at 09:13:00PM +0800, Shaohua Li wrote:

>>> - we explicitly don't guarantee that all
>>> the tags will be available for allocation at any given time, only half
>>> of them.
>>
>> Only half of the tags can be used? That is scary. Of course we hope all tags
>> are available.
>
> No: that behaviour is explicitly documented and it's the behaviour we want.

That is going to end horribly, for the cases (ie most cases) where we
really want to be able to use all tags. If we can't support that
reliably or in a fast manner with percpu ida, then that's pretty close
to a show stopper. And I suspect that will be the case for most use
cases. We can't just feasibly throw away half of the tag space, that's
crazy.

> There is an inherent tradeoff here between internal fragmentation
> and performance: your patch tries to drive internal fragmentation to 0, but
> under workloads that hit this, running on enough cores, this will have a VERY
> high performance cost.

For most workloads that are somewhat localized, we should be able to
support this without melting.

>>> Can you explain more how this is being used where you're seeing
>>> the issue? And I don't see the other patch in your patch series.
>>
>> Tasks are bound to specific cpus. And I saw one cpu had a lot of free tags
>> (but less than half of the total tags), but other cpus couldn't allocate tags,
>> so I saw a live lock.
>
> You're using it wrong - don't assume all nr_tags can be allocated. If you need
> to guarantee that you'll always be able to allocate a certain number of tags,
> you pass twice that to percpu_ida_init() instead.

But that won't work at all. To take a concrete example, let's assume the
device is some sort of NCQ. We have 0..31 tags. If we throw away half of
those tags, we're at 16. You can't double the tag space in software.

So what's the proposal?

--
Jens Axboe

2014-01-06 21:47:34

by Kent Overstreet

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Jan 06, 2014 at 01:52:19PM -0700, Jens Axboe wrote:
> On 01/06/2014 01:46 PM, Kent Overstreet wrote:
> > On Sun, Jan 05, 2014 at 09:13:00PM +0800, Shaohua Li wrote:
>
> >>> - we explicitly don't guarantee that all
> >>> the tags will be available for allocation at any given time, only half
> >>> of them.
> >>
> >> Only half of the tags can be used? That is scary. Of course we hope all tags
> >> are available.
> >
> > No: that behaviour is explicitly documented and it's the behaviour we want.
>
> That is going to end horribly, for the cases (ie most cases) where we
> really want to be able to use all tags. If we can't support that
> reliably or in a fast manner with percpu ida, then that's pretty close
> to a show stopper. And I suspect that will be the case for most use
> cases. We can't just feasibly throw away half of the tag space, that's
> crazy.

Sounds like we're coming at this from different use cases, maybe we're going to
need to add some flexibility for different use cases.

For background, I wrote this code for SSDs, where you get 16 bits for the tag
ID; we really don't want the queue to be anywhere near that deep so potentially
wasting half of however many tags you have is no big deal.

If you have a device where the max tag id is small enough that you really do
need to use the entire tag space... yeah, that's going to be an issue.

> But that won't work at all. To take a concrete example, let's assume the
> device is some sort of NCQ. We have 0..31 tags. If we throw away half of
> those tags, we're at 16. You can't double the tag space in software.

Ok, so I hadn't really given any thought to that kind of use case; insofar as I
had I would've been skeptical percpu tag allocation made sense for 32 different
tags at all.

We really don't want to screw over the users that aren't so constrained by the
size of their tag space; there really is a huge performance tradeoff here
(otherwise you're stealing tags and bouncing cachelines for _every_ tag
allocation when the queue is full, and your percpu tag allocation is no longer
very percpu).

I'm not sure what the best strategy is for NCQ-type max nr_tags, though -
thoughts?

Easy thing to do for now is just to add another parameter to percpu_ida_init()
for the number of tags that are allowed to sit unused on other cpus' freelists -
users that have large relatively unbounded nr_tags can set that to nr_tags / 2,
for NCQ you'd set it to 0.
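
Something like this, purely as a sketch of the interface (the function name and
the extra nr_stranded parameter are invented here, not existing API):

/* hypothetical variant: nr_stranded bounds tags left idle on remote CPUs */
int percpu_ida_init_bounded(struct percpu_ida *pool, unsigned long nr_tags,
			    unsigned long nr_stranded);

static int init_pools(struct percpu_ida *ssd_tags, struct percpu_ida *ncq_tags)
{
	int ret;

	/* large, SSD-style tag space: tolerate half the tags sitting idle */
	ret = percpu_ida_init_bounded(ssd_tags, 65536, 65536 / 2);
	if (ret)
		return ret;

	/* NCQ-style device with 32 tags: never leave tags stranded remotely */
	return percpu_ida_init_bounded(ncq_tags, 32, 0);
}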

I suspect we can do better for NCQ though, w.r.t. worst case performance.

2014-02-09 15:48:42

by Alexander Gordeev

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Jan 06, 2014 at 01:47:26PM -0800, Kent Overstreet wrote:
> Ok, so I hadn't really given any thought to that kind of use case; insofar as I
> had I would've been skeptical percpu tag allocation made sense for 32 different
> tags at all.
>
> We really don't want to screw over the users that aren't so constrained by the
> size of their tag space; there really is a huge performance tradeoff here
> (otherwise you're stealing tags and bouncing cachelines for _every_ tag
> allocation when the queue is full, and your percpu tag allocation is no longer
> very percpu).
>
> I'm not sure what the best strategy is for NCQ-type max nr_tags, though -
> thoughts?
>
> Easy thing to do for now is just to add another parameter to percpu_ida_init()
> for the number of tags that are allowed to sit unused on other cpus' freelists -
> users that have large relatively unbounded nr_tags can set that to nr_tags / 2,
> for NCQ you'd set it to 0.
>
> I suspect we can do better for NCQ though, w.r.t. worst case performance.

Yeah, that was my first thought when I posted the "percpu_ida: Allow variable
maximum number of cached tags" patch a few months ago. But I am back-
pedalling, as it does not appear to solve the fundamental problem - what is the
best threshold?

Maybe we can get away with a per-cpu timeout that flushes a batch of tags
from local caches to the pool? Each local allocation would restart the timer,
but once allocation requests stop coming on a CPU, the tags would not gather
dust in local caches.
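
Roughly like this (old-style timer callback; the struct layout, move_tags_to_pool()
and pop_local_tag() are stand-ins for the real internals, not actual code):

#include <linux/spinlock.h>
#include <linux/timer.h>

#define TAG_FLUSH_TIMEOUT	(HZ / 100)	/* arbitrary, would need tuning */

struct percpu_ida_cpu_cache {
	spinlock_t lock;
	struct timer_list flush_timer;		/* the hypothetical addition */
	unsigned nr_free;
	unsigned freelist[];
};

void move_tags_to_pool(struct percpu_ida_cpu_cache *cache, unsigned nr);
int pop_local_tag(struct percpu_ida_cpu_cache *cache);

/* fires once a CPU has seen no allocations for TAG_FLUSH_TIMEOUT jiffies */
static void tag_cache_flush(unsigned long data)
{
	struct percpu_ida_cpu_cache *cache = (void *)data;

	move_tags_to_pool(cache, cache->nr_free);
}

static int tag_cache_alloc(struct percpu_ida_cpu_cache *cache)
{
	int tag = pop_local_tag(cache);

	/* every local allocation pushes the flush further into the future */
	mod_timer(&cache->flush_timer, jiffies + TAG_FLUSH_TIMEOUT);
	return tag;
}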

--
Regards,
Alexander Gordeev
[email protected]

2014-02-10 10:32:35

by Christoph Hellwig

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote:
> Yeah, that was my first thought when I posted the "percpu_ida: Allow variable
> maximum number of cached tags" patch a few months ago. But I am back-
> pedalling, as it does not appear to solve the fundamental problem - what is the
> best threshold?
>
> Maybe we can get away with a per-cpu timeout that flushes a batch of tags
> from local caches to the pool? Each local allocation would restart the timer,
> but once allocation requests stop coming on a CPU, the tags would not gather
> dust in local caches.

We'll definitely need a fix to be able to allow use of the whole tag space.
For large numbers of tags per device the flush might work, but for
devices with a low number of tags we need something more efficient. The
case of fewer tags than CPUs isn't that unusual either, and we probably
want to switch to an allocator without per-cpu allocations for them to
avoid all this. E.g. for many ATA devices we just have a single tag,
and many scsi drivers also only want single-digit outstanding commands
per LUN.

2014-02-10 12:28:13

by Alexander Gordeev

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Feb 10, 2014 at 02:32:11AM -0800, Christoph Hellwig wrote:
> > Maybe we can get away with a per-cpu timeout that flushes a batch of tags
> > from local caches to the pool? Each local allocation would restart the timer,
> > but once allocation requests stop coming on a CPU, the tags would not gather
> > dust in local caches.
>
> We'll definitely need a fix to be able to allow use of the whole tag space.
> For large numbers of tags per device the flush might work, but for
> devices with a low number of tags we need something more efficient. The
> case of fewer tags than CPUs isn't that unusual either, and we probably
> want to switch to an allocator without per-cpu allocations for them to
> avoid all this. E.g. for many ATA devices we just have a single tag,
> and many scsi drivers also only want single-digit outstanding commands
> per LUN.

Do we really always need the pool for these classes of devices?

Pulling tags from local caches to the pool just to (nearly) drain it at
the very next iteration does not seem beneficial. Not to mention the caches
vs. pool locking complexities.

--
Regards,
Alexander Gordeev
[email protected]

2014-02-10 15:47:51

by Alexander Gordeev

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Feb 10, 2014 at 01:29:42PM +0100, Alexander Gordeev wrote:
> > We'll definitely need a fix to be able to allow use of the whole tag space.
> > For large numbers of tags per device the flush might work, but for
> > devices with a low number of tags we need something more efficient. The
> > case of fewer tags than CPUs isn't that unusual either, and we probably
> > want to switch to an allocator without per-cpu allocations for them to
> > avoid all this. E.g. for many ATA devices we just have a single tag,
> > and many scsi drivers also only want single-digit outstanding commands
> > per LUN.
>
> Do we really always need the pool for these classes of devices?
>
> Pulling tags from local caches to the pool just to (nearly) drain it at
> the very next iteration does not seem beneficial. Not to mention the caches
> vs. pool locking complexities.

And I meant here that we do not scrap per-cpu allocations.

--
Regards,
Alexander Gordeev
[email protected]

2014-02-10 16:16:35

by Christoph Hellwig

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Feb 10, 2014 at 04:49:17PM +0100, Alexander Gordeev wrote:
> > Do we really always need the pool for these classes of devices?
> >
> > Pulling tags from local caches to the pool just to (nearly) drain it at
> > the very next iteration does not seem beneficial. Not to mention the caches
> > vs. pool locking complexities.
>
> And I meant here that we do not scrap per-cpu allocations.

I'm still a bit confused about what you're proposing. Part of the problem
might be that I'm not exactly familiar with the allocator and am just
looking at it from a consumer perspective.

2014-02-10 16:26:22

by Jens Axboe

Subject: Re: [patch 1/2]percpu_ida: fix a live lock



On 02/10/2014 03:32 AM, Christoph Hellwig wrote:
> On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote:
>> Yeah, that was my first thought when I posted the "percpu_ida: Allow variable
>> maximum number of cached tags" patch a few months ago. But I am back-
>> pedalling, as it does not appear to solve the fundamental problem - what is the
>> best threshold?
>>
>> Maybe we can get away with a per-cpu timeout that flushes a batch of tags
>> from local caches to the pool? Each local allocation would restart the timer,
>> but once allocation requests stop coming on a CPU, the tags would not gather
>> dust in local caches.
>
> We'll definitely need a fix to be able to allow use of the whole tag space.

Certainly. The current situation of effectively only allowing half the
tags (if spread) is pretty crappy with (by far) most hardware.

> For large numbers of tags per device the flush might work, but for
> devices with a low number of tags we need something more efficient. The
> case of fewer tags than CPUs isn't that unusual either, and we probably
> want to switch to an allocator without per-cpu allocations for them to
> avoid all this. E.g. for many ATA devices we just have a single tag,
> and many scsi drivers also only want single-digit outstanding commands
> per LUN.

Even for cases where you have as many (or more) CPUs than tags, per-cpu
allocation is not necessarily a bad idea. It's a rare case where you
have all the CPUs touching the device at the same time, after all.

--
Jens Axboe

2014-02-10 22:41:51

by Kent Overstreet

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Feb 10, 2014 at 09:26:15AM -0700, Jens Axboe wrote:
>
>
> On 02/10/2014 03:32 AM, Christoph Hellwig wrote:
> >On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote:
> >>Yeah, that was my first thought when I posted the "percpu_ida: Allow variable
> >>maximum number of cached tags" patch a few months ago. But I am back-
> >>pedalling, as it does not appear to solve the fundamental problem - what is the
> >>best threshold?
> >>
> >>Maybe we can get away with a per-cpu timeout that flushes a batch of tags
> >>from local caches to the pool? Each local allocation would restart the timer,
> >>but once allocation requests stop coming on a CPU, the tags would not gather
> >>dust in local caches.
> >
> >We'll definitely need a fix to be able to allow use of the whole tag space.
>
> Certainly. The current situation of effectively only allowing half
> the tags (if spread) is pretty crappy with (by far) most hardware.
>
> >For large numbers of tags per device the flush might work, but for
> >devices with a low number of tags we need something more efficient. The
> >case of fewer tags than CPUs isn't that unusual either, and we probably
> >want to switch to an allocator without per-cpu allocations for them to
> >avoid all this. E.g. for many ATA devices we just have a single tag,
> >and many scsi drivers also only want single-digit outstanding commands
> >per LUN.
>
> Even for cases where you have as many (or more) CPUs than tags,
> per-cpu allocation is not necessarily a bad idea. It's a rare case
> where you have all the CPUs touching the device at the same time,
> after all.

<just back from Switzerland, probably forgetting some of where I left off>

You do still need to have enough tags to shard across the number of cpus
_currently_ touching the device. I think I'm with Christoph here, I'm not sure
how percpu tag allocation would be helpful when we have single digits/low double
digits of tags available.

I would expect that in that case we're better off with just a well implemented
atomic bit vector and waitlist. However, I don't know where the crossover point
is and I think Jens has done by far the most and most relevant benchmarking
here.
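
For reference, the sort of thing that would mean - a minimal bitmap-plus-waitqueue
sketch (illustrative only; no fairness, no batching, nothing percpu):

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/wait.h>

#define NR_TAGS	32

static DECLARE_BITMAP(tag_map, NR_TAGS);
static DECLARE_WAIT_QUEUE_HEAD(tag_wait);

static int bitmap_tag_alloc(void)
{
	int tag;

	for (;;) {
		tag = find_first_zero_bit(tag_map, NR_TAGS);
		if (tag >= NR_TAGS) {
			/* queue full: sleep until someone frees a tag */
			wait_event(tag_wait,
				   find_first_zero_bit(tag_map, NR_TAGS) < NR_TAGS);
			continue;
		}
		if (!test_and_set_bit(tag, tag_map))
			return tag;
		/* lost the race for this bit to another CPU; rescan */
	}
}

static void bitmap_tag_free(int tag)
{
	clear_bit(tag, tag_map);
	wake_up(&tag_wait);
}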

How about we just make the number of tags that are allowed to be stranded an
explicit parameter (somehow) - then it can be up to device drivers to do
something sensible with it. Half is probably an ideal default for devices where
that works, but this way more constrained devices will be able to futz with it
however they want.

2014-02-10 23:06:34

by Jens Axboe

Subject: Re: [patch 1/2]percpu_ida: fix a live lock



On 02/10/2014 03:41 PM, Kent Overstreet wrote:
> On Mon, Feb 10, 2014 at 09:26:15AM -0700, Jens Axboe wrote:
>>
>>
>> On 02/10/2014 03:32 AM, Christoph Hellwig wrote:
>>> On Sun, Feb 09, 2014 at 04:50:07PM +0100, Alexander Gordeev wrote:
>>>> Yeah, that was my first thought when I posted the "percpu_ida: Allow variable
>>>> maximum number of cached tags" patch a few months ago. But I am back-
>>>> pedalling, as it does not appear to solve the fundamental problem - what is the
>>>> best threshold?
>>>>
>>>> Maybe we can get away with a per-cpu timeout that flushes a batch of tags
>>>> from local caches to the pool? Each local allocation would restart the timer,
>>>> but once allocation requests stop coming on a CPU, the tags would not gather
>>>> dust in local caches.
>>>
>>> We'll definitely need a fix to be able to allow use of the whole tag space.
>>
>> Certainly. The current situation of effectively only allowing half
>> the tags (if spread) is pretty crappy with (by far) most hardware.
>>
>>> For large numbers of tags per device the flush might work, but for
>>> devices with a low number of tags we need something more efficient. The
>>> case of fewer tags than CPUs isn't that unusual either, and we probably
>>> want to switch to an allocator without per-cpu allocations for them to
>>> avoid all this. E.g. for many ATA devices we just have a single tag,
>>> and many scsi drivers also only want single-digit outstanding commands
>>> per LUN.
>>
>> Even for cases where you have as many (or more) CPUs than tags,
>> per-cpu allocation is not necessarily a bad idea. It's a rare case
>> where you have all the CPUs touching the device at the same time,
>> after all.
>
> <just back from Switzerland, probably forgetting some of where I left off>
>
> You do still need to have enough tags to shard across the number of cpus
> _currently_ touching the device. I think I'm with Christoph here, I'm not sure
> how percpu tag allocation would be helpful when we have single digits/low double
> digits of tags available.

For the common case, I'd assume that anywhere between 31..256 tags is
"normal". That's where the majority of devices will end up being,
largely. So single digits would be an anomaly.

And even for the case of 31 tags and, eg, a 64 cpu system, over windows
of access I don't think it's unreasonable to expect that you are not
going to have 64 threads banging on the same device.

It obviously all depends on the access pattern. X threads for X tags
would work perfectly well with per-cpu tagging, if they are doing sync
IO. And similarly, 8 threads each having low queue depth would be fine.
However, it all falls apart pretty quickly if threads*qd > tag space.

> I would expect that in that case we're better off with just a well implemented
> atomic bit vector and waitlist. However, I don't know where the crossover point
> is and I think Jens has done by far the most and most relevant benchmarking
> here.

The problem with that is when you have some of those threads on
different nodes, it ends up collapsing pretty quickly again. Maybe the
solution is to have a hierarchy of caching instead - per-node, per-cpu.
At least that has the potential to make the common case still perform
better.
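
Schematically the allocation path would then try caches from cheapest to most
shared - all of the names below are made up, only the lookup order matters:

/* hypothetical two-level cache in front of the global pool */
static int hierarchical_tag_alloc(struct tag_pool *pool)
{
	int tag;

	/* 1) this CPU's private cache: no shared cachelines at all */
	tag = try_percpu_cache(pool, smp_processor_id());
	if (tag >= 0)
		return tag;

	/* 2) the cache shared by this NUMA node: node-local bouncing only */
	tag = try_pernode_cache(pool, numa_node_id());
	if (tag >= 0)
		return tag;

	/* 3) last resort: the global pool, possibly stealing from other nodes */
	return global_pool_alloc(pool);
}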

> How about we just make the number of tags that are allowed to be stranded an
> explicit parameter (somehow) - then it can be up to device drivers to do
> something sensible with it. Half is probably an ideal default for devices where
> that works, but this way more constrained devices will be able to futz with it
> however they want.

I don't think we should involve device drivers in this, that's punting a
complicated issue to someone who likely has little idea what to do about
it. This needs to be handled sensibly in the core, not in a device
driver. If we can't come up with a sensible algorithm to handle this,
how can we expect someone writing a device driver to do so?

--
Jens Axboe

2014-02-11 09:12:43

by Christoph Hellwig

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Feb 10, 2014 at 04:06:27PM -0700, Jens Axboe wrote:
> For the common case, I'd assume that anywhere between 31..256 tags
> is "normal". That's where the majority of devices will end up being,
> largely. So single digits would be an anomaly.

Unfortunately that's not true in SCSI land, where most drivers do per-lun
tagging, and the cmd_per_lun values are very low and very often
single digits, as a simple grep for cmd_per_lun will tell. Now it might
be that the tag space actually is much bigger in the hardware and the
driver authors for some reason want to limit the number of outstanding
commands, but the interface to the drivers doesn't allow them to express
such a difference at the moment.

> >How about we just make the number of tags that are allowed to be stranded an
> >explicit parameter (somehow) - then it can be up to device drivers to do
> >something sensible with it. Half is probably an ideal default for devices where
> >that works, but this way more constrained devices will be able to futz with it
> >however they want.
>
> I don't think we should involve device drivers in this, that's
> punting a complicated issue to someone who likely has little idea
> what to do about it. This needs to be handled sensibly in the core,
> not in a device driver. If we can't come up with a sensible
> algorithm to handle this, how can we expect someone writing a device
> driver to do so?

Agreed, punting this to the drivers is a bad idea. But at least
exposing variables for the allowed tag space and allowed outstanding
commands to be able to make a smarter decision might be a good idea. On
the other hand, this will require us to count the outstanding commands
again, introducing more cachelines touched than necessary. To make
things worse, for complex topologies like SCSI we might have to limit the
outstanding commands at up to three levels in the hierarchy.

2014-02-11 14:42:46

by James Bottomley

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Tue, 2014-02-11 at 01:12 -0800, Christoph Hellwig wrote:
> On Mon, Feb 10, 2014 at 04:06:27PM -0700, Jens Axboe wrote:
> > For the common case, I'd assume that anywhere between 31..256 tags
> > is "normal". That's where the majority of devices will end up being,
> > largely. So single digits would be an anomaly.
>
> Unfortunately that's not true in SCSI land, where most drivers do per-lun
> tagging, and the cmd_per_lun values are very low and very often
> single digits, as a simple grep for cmd_per_lun will tell.

Remember we do shared (all queue) tags on qla, aic and a few other
drivers (it's actually the mailbox slot tag for the HBA).

> Now it might
> be that the tag space actually is much bigger in the hardware and the
> driver authors for some reason want to limit the number of outstanding
> commands, but the interface to the drivers doesn't allow them to express
> such a difference at the moment.

Tag space is dependent on SCSI protocol. It's 256 for SPI, 65536 for
SAS and I'm not sure for FCP.

> > >How about we just make the number of tags that are allowed to be stranded an
> > >explicit parameter (somehow) - then it can be up to device drivers to do
> > >something sensible with it. Half is probably an ideal default for devices where
> > >that works, but this way more constrained devices will be able to futz with it
> > >however they want.
> >
> > I don't think we should involve device drivers in this, that's
> > punting a complicated issue to someone who likely has little idea
> > what to do about it. This needs to be handled sensibly in the core,
> > not in a device driver. If we can't come up with a sensible
> > algorithm to handle this, how can we expect someone writing a device
> > driver to do so?
>
> Agreed, punting this to the drivers is a bad idea. But at least
> exposing variable for the allowed tag space and allowed outstanding
> commands to be able to make a smarter decision might be a good idea. On
> the other hand this will require us to count the outstanding commands
> again, introducing more cachelines touched than nessecary. To make
> things worse for complex topologies like SCSI we might have to limit the
> outstanding commands at up to three levels in the hierarchy.

The list seems to be missing prior context but a few SPI drivers use the
clock algorithm for tag starvation in the driver. The NCR ones are the
ones I know about: tag allocation works like the hands of a clock sweeping
around (one hand for the last allocated tag, one for the last outstanding
tag). The hands
are never allowed to cross, so if a tag gets starved the hands try to
cross and the driver stops issuing until the missing tag returns. Tag
starvation used to be a known problem for Parallel devices; I haven't
seen much in the way of tag starvation algorithms for other types of
devices, so I assume the problem went away.
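
A rough sketch of that two-hands invariant (not the NCR code; locking and the
bookkeeping for out-of-order completion are omitted):

struct clock_tags {
	unsigned nr_tags;
	unsigned nr_out;		/* tags currently outstanding */
	unsigned alloc_hand;		/* next tag to hand out */
	unsigned complete_hand;		/* oldest outstanding tag */
};

static int clock_tag_alloc(struct clock_tags *ct)
{
	unsigned tag;

	/* the hands may never cross: stop issuing while the oldest tag is starved */
	if (ct->nr_out && ct->alloc_hand == ct->complete_hand)
		return -1;

	tag = ct->alloc_hand;
	ct->alloc_hand = (ct->alloc_hand + 1) % ct->nr_tags;
	ct->nr_out++;
	return tag;
}

static void clock_tag_complete(struct clock_tags *ct, unsigned tag)
{
	ct->nr_out--;
	/*
	 * Only completion of the oldest outstanding tag moves this hand (a real
	 * driver would then also skip over tags that completed out of order).
	 */
	if (tag == ct->complete_hand)
		ct->complete_hand = (ct->complete_hand + 1) % ct->nr_tags;
}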

James

2014-02-11 14:54:04

by Christoph Hellwig

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Tue, Feb 11, 2014 at 06:42:40AM -0800, James Bottomley wrote:
> > Unfortunately that's not true in SCSI land, where most drivers do per-lun
> > tagging, and the cmd_per_lun values are very low and very often
> > single digits, as a simple grep for cmd_per_lun will tell.
>
> Remember we do shared (all queue) tags on qla, aic and a few other
> drivers (it's actually the mailbox slot tag for the HBA).

That's why I said most and not all. We have a fair number of drivers
using host-wide tags. Having to support both models actually is a bit
of an issue with the blk-mq support in scsi.

At this point I suspect we'll have to detangle the concept of tag space
and the queueing limits that blk-mq merged into a single problem space
again to support SCSI properly.

> > Now it might
> > be that the tag space actually is much bigger in the hardware and the
> > driver authors for some reason want to limit the number of outstanding
> > commands, but the interface to the drivers doesn't allow them to express
> > such a difference at the moment.
>
> Tag space is dependent on SCSI protocol. It's 256 for SPI, 65536 for
> SAS and I'm not sure for FCP.

But that doesn't mean an HBA can support the full space. Unless we know
each piece of hardware in enough detail, we can't simply increase the
tag space, as we can be sure something will break.

> The list seems to be missing prior context but a few SPI drivers use the
> clock algorithm for tag starvation in the driver. The NCR ones are the
> ones I know about: tag allocation works like the hands of a clock sweeping
> around (one hand for the last allocated tag, one for the last outstanding
> tag). The hands
> are never allowed to cross, so if a tag gets starved the hands try to
> cross and the driver stops issuing until the missing tag returns. Tag
> starvation used to be a known problem for Parallel devices; I haven't
> seen much in the way of tag starvation algorithms for other types of
> devices, so I assume the problem went away.

As long as the drivers don't rely on the block layer tag handling they
can keep doing whatever they want. Given that it's mostly SPI drivers,
I doubt the tag allocation is a performance bottleneck of any sort for
them anyway.

2014-02-14 10:34:47

by Alexander Gordeev

Subject: Re: [patch 1/2]percpu_ida: fix a live lock

On Mon, Feb 10, 2014 at 04:06:27PM -0700, Jens Axboe wrote:
> It obviously all depends on the access pattern. X threads for X tags
> would work perfectly well with per-cpu tagging, if they are doing
> sync IO. And similarly, 8 threads each having low queue depth would
> be fine. However, it all falls apart pretty quickly if threads*qd >
> tag space.

What about introducing two modes: gentle and aggressive?

Gentle would be what is implemented now, with occasional switches into
aggressive mode and back (more about this below).

In aggressive mode percpu_ida_alloc() would ignore the nr_cpus/2 threshold
and rob remote caches of tags whenever available. Further, percpu_ida_free()
would always wake waiters, even under the percpu_max_size limit.

In essence, the aggressive mode would be used to overcome tag space
fragmentation (a garbage collector, if you want), but not only that. Some classes
of devices would enable it permanently - that would be the case for e.g.
libata, paired with the TASK_RUNNING condition.

The most challenging question here is when to switch from gentle mode to
aggressive and back. I am thinking about a timeout that indicates a
reasonable average time for an IO to complete:

* Once a CPU has failed to obtain a tag, cpus_have_tags is not empty and the
timeout has expired, the aggressive mode is enforced, meaning there is no IO
activity on other CPUs while tags are out there;

* Once no tags have been requested and the timeout has expired, the aggressive
mode is switched back to gentle;

* Once the second tag has returned to a local cache, the aggressive mode is
switched back to gentle, meaning no threads were/are in the waitqueue,
hungry for tags;

The value of the timeout is calculated based on recent IO activity, probably
with some hints from the device driver. The good thing here is that occasional
unnecessary switches to the aggressive mode would not hurt the overall
performance.
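
In (purely hypothetical) code, the allocation-side check could look like this -
none of the helpers or fields below exist today:

enum ida_mode { IDA_GENTLE, IDA_AGGRESSIVE };

static int alloc_with_modes(struct percpu_ida *pool)
{
	int tag = try_local_cache(pool);		/* fast path, as implemented now */

	if (tag < 0 && pool_mode(pool) == IDA_AGGRESSIVE)
		tag = steal_any_remote_tag(pool);	/* ignore the nr_cpus/2 threshold */

	if (tag < 0 && other_cpus_have_tags(pool) &&
	    time_after(jiffies, last_alloc(pool) + io_timeout(pool)))
		set_pool_mode(pool, IDA_AGGRESSIVE);	/* tags exist but are stranded */

	return tag;
}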

Quite complicated, huh? :)

--
Regards,
Alexander Gordeev
[email protected]