2011-03-01 00:20:18

by Justin TerAvest

Subject: RFC: default group_isolation to 1, remove option

Hi Vivek,

I'd like to propose removing the group_isolation setting and changing
the default to 1. Do we know if anyone is using group_isolation=0 to get
easy group separation between sequential readers and random readers?

Allowing group_isolation complicates implementing per-cgroup request
descriptor pools when a queue is moved to the root group. Specifically,
if we have pools per-cgroup, we would be forced to use request
descriptors from the pool for the "original" cgroup, while the requests
are actually being serviced by the root cgroup.

That might be acceptable, but I figured this would be a good opportunity
to revisit keeping queues within their original groups.

I know this discussion has come up before. I'm curious if we have a good
reason to keep it around right now. I'd be happy to do some investigation
to help make my case.

Thanks,
Justin


2011-03-01 14:20:15

by Vivek Goyal

Subject: Re: RFC: default group_isolation to 1, remove option

On Mon, Feb 28, 2011 at 04:19:43PM -0800, Justin TerAvest wrote:
> Hi Vivek,
>
> I'd like to propose removing the group_isolation setting and changing
> the default to 1. Do we know if anyone is using group_isolation=0 to get
> easy group separation between sequential readers and random readers?

CCing Corrado.

I like the idea of making group_isolation=1 the default. So far I have
not found anybody trying to use group_isolation=0, and every time somebody
was not seeing service differentiation I had to tell them to try setting
group_isolation to 1.

I think I would not mind removing it altogether either. This will also
remove some code from CFQ. The reason we introduced group_isolation is
that by default we idle on the sync-noidle tree, and idling on every
group's sync-noidle tree can be very harmful for throughput, especially
on faster storage like storage arrays.

One solution for that problem is to run with slice idling enabled on SATA
disks, and run with slice_idle=0 (and possibly group_idle=0) on faster
storage. Setting idling to 0 will increase throughput but at the same time
significantly reduce isolation. But I guess this is the performance vs.
isolation trade-off.

>
> Allowing group_isolation complicates implementing per-cgroup request
> descriptor pools when a queue is moved to the root group. Specifically,
> if we have pools per-cgroup, we would be forced to use request
> descriptors from the pool for the "original" cgroup, while the requests
> are actually being serviced by the root cgroup.

I think creating per-group request pools will complicate the implementation
further (we have done that once in the past). Jens once mentioned that he
liked a per-iocontext limit on the number of requests better than an overall
queue limit. So if we implement a per-iocontext limit, it will get rid of
the need to do anything extra for the group infrastructure.

Jens, do you think a per-iocontext, per-queue limit on request descriptors
makes sense, so that we can get rid of the overall per-queue limit?
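
Just to make the idea concrete, roughly something like this (a standalone
sketch with made-up names, no locking, not actual block layer code):

/*
 * Standalone sketch of per-iocontext request accounting.
 */
struct ioc_rq_state {
        int nr_requests;        /* requests this io context has in flight */
        int max_requests;       /* per-ioc limit, replacing q->nr_requests */
};

/* Called before allocating a request descriptor for this io context. */
static int ioc_may_queue(struct ioc_rq_state *ioc)
{
        /*
         * Block (or back off) only when this io context is over its own
         * limit; no global per-queue limit is consulted any more.
         */
        return ioc->nr_requests < ioc->max_requests;
}

static void ioc_request_allocated(struct ioc_rq_state *ioc)
{
        ioc->nr_requests++;
}

static void ioc_request_completed(struct ioc_rq_state *ioc)
{
        ioc->nr_requests--;
}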

Thanks
Vivek

2011-03-01 18:45:22

by Justin TerAvest

Subject: Re: RFC: default group_isolation to 1, remove option

On Tue, Mar 1, 2011 at 6:20 AM, Vivek Goyal <[email protected]> wrote:
> On Mon, Feb 28, 2011 at 04:19:43PM -0800, Justin TerAvest wrote:
>> Hi Vivek,
>>
>> I'd like to propose removing the group_isolation setting and changing
>> the default to 1. Do we know if anyone is using group_isolation=0 to get
>> easy group separation between sequential readers and random readers?
>
> CCing Corrado.
>
> I like the idea of setting group_isolation = 1 to default. So far I have
> not found anybody trying to use group_isolation=0 and every time I had
> to say can you try setting group_isolation to 1 if you are not seeing
> service differentiation.
>
> I think I would not mind removing it completely altogether also. This will
> also remove some code from CFQ. The reason we introduced group_isolation
> because by default we idle on sync-noidle tree and on fast devices idling on
> every syn-noidle tree can be very harmful for throughput, especially on faster
> storage like storage arrays.
>
> One of the soutions for that problem can be that run with slice idle
> enabled on SATA disks and run with slice_idle=0 and possibly group_idle=0
> also on faster storage. Setting idling to 0 will increase throughput but
> at the same time reduce the isolation significantly. But I guess this
> is the performance vs isolation trade off.

I agree. Thanks! I'll send a patch shortly, CCing everyone here and we
can have any further discussion there.

>
>>
>> Allowing group_isolation complicates implementing per-cgroup request
>> descriptor pools when a queue is moved to the root group. Specifically,
>> if we have pools per-cgroup, we would be forced to use request
>> descriptors from the pool for the "original" cgroup, while the requests
>> are actually being serviced by the root cgroup.
>
> I think creating per group request pool will complicate the implementation
> further. (we have done that once in the past). Jens once mentioned that
> he liked number of requests per iocontext limit better than overall queue
> limit. So if we implement per iocontext limit, it will get rid of need
> of doing anything extra for group infrastructure.

I will go read the discussion history for this, but I am concerned that doing
page tracking to look up the iocontext will be more complicated than
tracking dirtied pages per-cgroup. I would hope there is a big advantage
to per-iocontext limits.

Thanks,
Justin

>
> Jens, do you think per iocontext per queue limit on request descriptors make
> sense and we can get rid of per queue overall limit?
>
> Thanks
> Vivek
>

2011-03-02 21:29:05

by Vivek Goyal

Subject: Re: RFC: default group_isolation to 1, remove option

On Tue, Mar 01, 2011 at 10:44:52AM -0800, Justin TerAvest wrote:
> On Tue, Mar 1, 2011 at 6:20 AM, Vivek Goyal <[email protected]> wrote:
> > On Mon, Feb 28, 2011 at 04:19:43PM -0800, Justin TerAvest wrote:
> >> Hi Vivek,
> >>
> >> I'd like to propose removing the group_isolation setting and changing
> >> the default to 1. Do we know if anyone is using group_isolation=0 to get
> >> easy group separation between sequential readers and random readers?
> >
> > CCing Corrado.
> >
> > I like the idea of setting group_isolation = 1 to default. So far I have
> > not found anybody trying to use group_isolation=0 and every time I had
> > to say can you try setting group_isolation to 1 if you are not seeing
> > service differentiation.
> >
> > I think I would not mind removing it completely altogether also. This will
> > also remove some code from CFQ. The reason we introduced group_isolation
> > because by default we idle on sync-noidle tree and on fast devices idling on
> > every syn-noidle tree can be very harmful for throughput, especially on faster
> > storage like storage arrays.
> >
> > One of the soutions for that problem can be that run with slice idle
> > enabled on SATA disks and run with slice_idle=0 and possibly group_idle=0
> > also on faster storage. Setting idling to 0 will increase throughput but
> > at the same time reduce the isolation significantly. But I guess this
> > is the performance vs isolation trade off.
>
> I agree. Thanks! I'll send a patch shortly, CCing everyone here and we
> can have any further discussion there.
>
> >
> >>
> >> Allowing group_isolation complicates implementing per-cgroup request
> >> descriptor pools when a queue is moved to the root group. Specifically,
> >> if we have pools per-cgroup, we would be forced to use request
> >> descriptors from the pool for the "original" cgroup, while the requests
> >> are actually being serviced by the root cgroup.
> >
> > I think creating per group request pool will complicate the implementation
> > further. (we have done that once in the past). Jens once mentioned that
> > he liked number of requests per iocontext limit better than overall queue
> > limit. So if we implement per iocontext limit, it will get rid of need
> > of doing anything extra for group infrastructure.
>
> I will go read the discussion history for this, but I am concerned that doing
> page tracking to look up the iocontext will be more complicated than
> tracking dirtied pages per-cgroup. I would hope there is a big advantage
> to per icontext limits.

The advantage of an iocontext limit, I think, is that we don't have to
implement per-group request descriptor pools or limits.

Secondly, it also allows some isolation between two processes/iocontexts
within the group.

So we could think of a page taking a reference on the iocontext, but the
real issue would be how to store the pointer to that iocontext in
page_cgroup. We are already hard pressed for space there, and a css id
consumes fewer bits than a pointer.

So maybe keeping it per group might turn out to be simpler.
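
Roughly, the storage trade-off looks like this (illustrative layouts only,
not the real page_cgroup definition):

/*
 * Why storing an ioc pointer per page is costly compared to a css id.
 */
struct io_context;                      /* opaque here */

struct page_cgroup_with_cssid {
        unsigned long flags;
        unsigned short blkio_css_id;    /* a css id fits in ~16 bits */
};

struct page_cgroup_with_ioc {
        unsigned long flags;
        struct io_context *ioc;         /* full pointer per page, plus the
                                         * reference counting needed to keep
                                         * the ioc alive as long as the page */
};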

Also, we were once talking (I think at LSF last year) about associating an
inode with a cgroup. A bio can point to a page, the page can point to its
inode, and from the inode we get the pointer to the cgroup. This approach
should not require tracking information in the page, and at a coarse level
it should work (as long as applications are working on separate inodes).

I think another advantage of this scheme is that the request queue can
export a per-cgroup congestion mechanism. So if a flusher thread picks
an inode for writing back pages, it can retrieve the cgroup from the
inode and ask the block layer whether the associated cgroup is congested
on the device; if it is, the flusher thread can move on to the next inode.
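
In rough pseudo-C, the flusher-side check could look like this
(inode_to_blkcg() and bdi_blkcg_congested() are made-up names just for
illustration; nothing like this exists yet):

struct inode;                           /* opaque here */
struct backing_dev_info;                /* opaque here */
struct blkcg;                           /* per-cgroup block IO state, opaque here */

struct blkcg *inode_to_blkcg(struct inode *inode);
int bdi_blkcg_congested(struct backing_dev_info *bdi, struct blkcg *blkcg);

/*
 * Before writing an inode back, check whether the cgroup that owns it is
 * congested on this device; if so, skip the inode instead of blocking
 * behind a slow group.
 */
static int flusher_should_skip(struct inode *inode,
                               struct backing_dev_info *bdi)
{
        return bdi_blkcg_congested(bdi, inode_to_blkcg(inode));
}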

In fact, if we are just worried about isolating READS from WRITES, then
we can skip implementing the per-group congestion idea. Even if the flusher
thread gets blocked because of request descriptors, it does not impact
reads. I think at the last summit people did not like the idea of per-group,
per-bdi congestion.

One could do something similar using a page->cgroup pointer, but then the
flusher thread would also have to go through all the pages of an inode
before it could decide whether to dispatch some IO, and that might slow
down writeback.

So if we are ready to sacrifice some granularity of async writeback
control and do it on a per-inode basis instead of per page, it might
simplify the implementation and reduce the performance overhead.

CCing Greg and Andrea. They might have thoughts on this.

Thanks
Vivek

2011-03-03 03:45:33

by Jens Axboe

Subject: Re: RFC: default group_isolation to 1, remove option

On 2011-03-01 09:20, Vivek Goyal wrote:
> I think creating per group request pool will complicate the
> implementation further. (we have done that once in the past). Jens
> once mentioned that he liked number of requests per iocontext limit
> better than overall queue limit. So if we implement per iocontext
> limit, it will get rid of need of doing anything extra for group
> infrastructure.
>
> Jens, do you think per iocontext per queue limit on request
> descriptors make sense and we can get rid of per queue overall limit?

Since we practically don't need a limit anymore to begin with (or so is
the theory), then yes we can move to per-ioc limits instead and get rid
of that queue state. We'd have to hold on to the ioc for the duration of
the IO explicitly from the request then.
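
Roughly what I mean, just to show the shape of it (standalone sketch,
made-up names, not the actual block layer structures):

/*
 * The point is only that the request pins the io context from allocation
 * to completion, so per-ioc accounting stays valid for the whole IO.
 */
struct ioc_sketch {
        int refcount;
        int nr_requests;
};

struct request_sketch {
        struct ioc_sketch *ioc;         /* reference held for request lifetime */
};

static void rq_attach_ioc(struct request_sketch *rq, struct ioc_sketch *ioc)
{
        ioc->refcount++;                /* grab a reference at allocation time */
        ioc->nr_requests++;
        rq->ioc = ioc;
}

static void rq_complete(struct request_sketch *rq)
{
        rq->ioc->nr_requests--;         /* account the completion ... */
        rq->ioc->refcount--;            /* ... then drop the ioc reference */
        rq->ioc = 0;
}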

I primarily like that implementation since it means we can make the IO
completion lockless, at least on the block layer side. We still have
state to complete in the schedulers that require that, but it's a good
step at least.

--
Jens Axboe

2011-03-03 15:30:24

by Vivek Goyal

Subject: Per iocontext request descriptor limits (Was: Re: RFC: default group_isolation to 1, remove option)

On Wed, Mar 02, 2011 at 10:45:20PM -0500, Jens Axboe wrote:
> On 2011-03-01 09:20, Vivek Goyal wrote:
> > I think creating per group request pool will complicate the
> > implementation further. (we have done that once in the past). Jens
> > once mentioned that he liked number of requests per iocontext limit
> > better than overall queue limit. So if we implement per iocontext
> > limit, it will get rid of need of doing anything extra for group
> > infrastructure.
> >
> > Jens, do you think per iocontext per queue limit on request
> > descriptors make sense and we can get rid of per queue overall limit?
>
> Since we practically don't need a limit anymore to begin with (or so is
> the theory).

So what has changed that we don't need queue limits on nr_requests anymore?
If we get rid of queue limits then we need to get rid of the bdi congestion
logic also and come up with some kind of ioc congestion logic, so that a
thread which does not want to sleep while submitting a request can check
its own ioc for congestion on a specific device/bdi.

>then yes we can move to per-ioc limits instead and get rid
> of that queue state. We'd have to hold on to the ioc for the duration of
> the IO explicitly from the request then.

I think every request submitted on the request queue already takes a
reference on the ioc (in set_request) and the reference is not dropped till
completion. So the ioc is around till the request completes anyway.

>
> I primarily like that implementation since it means we can make the IO
> completion lockless, at least on the block layer side. We still have
> state to complete in the schedulers that require that, but it's a good
> step at least.

OK, so in the completion path the contention will move from the queue_lock
to an ioc lock or something like that. (We hope that there are no other
dependencies on the queue here; the devil lies in the details :-))

The other potential issue with this approach is how we will handle the
case of the flusher thread submitting IO. At some point we want to
account it to the right cgroup.

Retrieving the iocontext from a bio will be hard, as it will at least
require an extra pointer in page_cgroup, and I am not sure how feasible
that is.

Or we could come up with the concept of a group iocontext. With the help
of page_cgroup we should be able to get to the cgroup, retrieve the right
group iocontext and check the limit against that. But I guess this
gets complicated.

So if we move to an ioc-based limit, then for async IO a reasonable way
would be to find the io context of the submitting task and operate on that,
even if that means an increased page_cgroup size.

Thanks
Vivek

2011-03-03 15:44:27

by Jens Axboe

Subject: Re: Per iocontext request descriptor limits (Was: Re: RFC: default group_isolation to 1, remove option)

On 2011-03-03 10:30, Vivek Goyal wrote:
> On Wed, Mar 02, 2011 at 10:45:20PM -0500, Jens Axboe wrote:
>> On 2011-03-01 09:20, Vivek Goyal wrote:
>>> I think creating per group request pool will complicate the
>>> implementation further. (we have done that once in the past). Jens
>>> once mentioned that he liked number of requests per iocontext limit
>>> better than overall queue limit. So if we implement per iocontext
>>> limit, it will get rid of need of doing anything extra for group
>>> infrastructure.
>>>
>>> Jens, do you think per iocontext per queue limit on request
>>> descriptors make sense and we can get rid of per queue overall limit?
>>
>> Since we practically don't need a limit anymore to begin with (or so is
>> the theory).
>
> So what has changed that we don't need queue limits on nr_requests anymore?
> If we get rid of queue limits then we need to get rid of bdi congestion
> logic also and come up with some kind of ioc congestion logic so that
> a thread which does not want to sleep while submitting the request needs to
> checks it own ioc for being congested or not for a specific device/bdi.

Right now congestion is a measure of request starvation on the OS side.
It may make sense to keep the notion of a congested device when we are
operating at the device limits. But as a blocking measure it should go
away.

There is no single recent change that lets us throw away the limit. It
used to be that the vm got really unhappy with long queues, since you
could have tons of memory dirty. This works a LOT better now. And one
would hope that it does, since there are a number of drivers that don't
have limits. So when I say we "practically" don't need limits anymore, the
hope is that we'll behave well enough with just per-ioc limits in place.

>> then yes we can move to per-ioc limits instead and get rid
>> of that queue state. We'd have to hold on to the ioc for the duration of
>> the IO explicitly from the request then.
>
> I think every request submitted on request queue already takes a reference
> on ioc (set_request) and reference is not dropped till completion. So
> ioc is anyway around till request completes.

That's only true for CFQ; it's not a block layer property. This would
have to be done explicitly.

>> I primarily like that implementation since it means we can make the IO
>> completion lockless, at least on the block layer side. We still have
>> state to complete in the schedulers that require that, but it's a good
>> step at least.
>
> Ok so in completion path the contention will move from queue_lock to
> ioc lock or something like that. (We hope that there are no other
> dependencies on queue here, devil lies in details :-))

Right, so it's spread out and in most cases the ioc will be completely
uncontended since it's usually private to the process.

> The other potential issue with this approach is how will we handle the
> case of flusher thread submitting IO. At some point of time we want to
> account it to right cgroup.
>
> Retrieving iocontext from bio will be hard as it will atleast require
> on extra pointer in page_cgroup and I am not sure how feasible that is.
>
> Or we could come up with the concept of group iocontext. With the help
> of page cgroup we should be able to get to cgroup, retrieve the right
> group iocontext and check the limit against that. But I guess this
> get complicated.
>
> So if we move to ioc based limit, then for async IO, a reasonable way
> would be to find the io context of submitting task and operate on that
> even if that means increased page_cgroup size.

For now it's not a complicated effort, I already have a patch for this.
If page tracking needs extra complexity, it'll have to remain in the
page tracking code.

--
Jens Axboe

2011-03-03 16:57:35

by Vivek Goyal

Subject: Re: Per iocontext request descriptor limits (Was: Re: RFC: default group_isolation to 1, remove option)

On Thu, Mar 03, 2011 at 10:44:17AM -0500, Jens Axboe wrote:
> On 2011-03-03 10:30, Vivek Goyal wrote:
> > On Wed, Mar 02, 2011 at 10:45:20PM -0500, Jens Axboe wrote:
> >> On 2011-03-01 09:20, Vivek Goyal wrote:
> >>> I think creating per group request pool will complicate the
> >>> implementation further. (we have done that once in the past). Jens
> >>> once mentioned that he liked number of requests per iocontext limit
> >>> better than overall queue limit. So if we implement per iocontext
> >>> limit, it will get rid of need of doing anything extra for group
> >>> infrastructure.
> >>>
> >>> Jens, do you think per iocontext per queue limit on request
> >>> descriptors make sense and we can get rid of per queue overall limit?
> >>
> >> Since we practically don't need a limit anymore to begin with (or so is
> >> the theory).
> >
> > So what has changed that we don't need queue limits on nr_requests anymore?
> > If we get rid of queue limits then we need to get rid of bdi congestion
> > logic also and come up with some kind of ioc congestion logic so that
> > a thread which does not want to sleep while submitting the request needs to
> > checks it own ioc for being congested or not for a specific device/bdi.
>
> Right now congestion is a measure of request starvation on the OS side.
> It may make sense to keep the notion of a congested device when we are
> operating at the device limits. But as a blocking measure it should go
> away.

OK, so keep q->nr_requests around only to figure out whether a queue/device
is congested, but a submitter does not actually block on a congested
device. A submitter will block only if its ioc->nr_requests limit is
exceeded.

So keeping the notion of bdi congestion will not hurt.
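
So the allocation path would roughly do something like this (illustrative
sketch only, made-up names, no locking):

struct queue_sketch {
        int nr_queued;          /* requests currently allocated on the queue */
        int nr_requests;        /* kept only as a congestion threshold */
        int congested;          /* advisory flag for flushers etc. */
};

struct ioc_sketch {
        int nr_requests;        /* requests this io context has in flight */
        int max_requests;       /* the only limit a submitter blocks on */
};

static int may_allocate_request(struct queue_sketch *q, struct ioc_sketch *ioc)
{
        /* Mark congestion at the device level, but never block on it. */
        q->congested = (q->nr_queued >= q->nr_requests);

        /* The submitter sleeps (returns 0) only on its own ioc limit. */
        return ioc->nr_requests < ioc->max_requests;
}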

>
> No recent change is causing us to be able to throw away the limit. It
> used to be that the vm got really unhappy with long queues, since you
> could have tons of memory dirty. This works a LOT better now. And one
> would hope that it does, since there are a number of drivers that don't
> have limts. So when I say "practically" don't need limits anymore, the
> hope is that we'll behave well enough with just per-ioc limits in place.

Ok. Understood.

>
> >> then yes we can move to per-ioc limits instead and get rid
> >> of that queue state. We'd have to hold on to the ioc for the duration of
> >> the IO explicitly from the request then.
> >
> > I think every request submitted on request queue already takes a reference
> > on ioc (set_request) and reference is not dropped till completion. So
> > ioc is anyway around till request completes.
>
> That's only true for CFQ, it's not a block layer property. This would
> have to be explicitly done.

Oh yes, only CFQ's set_request call takes a reference, and it does that
because CFQ looks into the ioc for ioprio and class, and cfq queues are per
ioc. So yes, this notion will have to be brought to the block layer.

>
> >> I primarily like that implementation since it means we can make the IO
> >> completion lockless, at least on the block layer side. We still have
> >> state to complete in the schedulers that require that, but it's a good
> >> step at least.
> >
> > Ok so in completion path the contention will move from queue_lock to
> > ioc lock or something like that. (We hope that there are no other
> > dependencies on queue here, devil lies in details :-))
>
> Right, so it's spread out and in most cases the ioc will be completely
> uncontended since it's usually private to the process.

OK, that makes sense. The ioc is per process, so lock contention in the
completion path goes down.

>
> > The other potential issue with this approach is how will we handle the
> > case of flusher thread submitting IO. At some point of time we want to
> > account it to right cgroup.
> >
> > Retrieving iocontext from bio will be hard as it will atleast require
> > on extra pointer in page_cgroup and I am not sure how feasible that is.
> >
> > Or we could come up with the concept of group iocontext. With the help
> > of page cgroup we should be able to get to cgroup, retrieve the right
> > group iocontext and check the limit against that. But I guess this
> > get complicated.
> >
> > So if we move to ioc based limit, then for async IO, a reasonable way
> > would be to find the io context of submitting task and operate on that
> > even if that means increased page_cgroup size.
>
> For now it's not a complicated effort, I already have a patch for this.
> If page tracking needs extra complexity, it'll have to remain in the
> page tracking code.

Great. I am hoping that once you get some free time, you will clean up and
post that patch.

Thanks
Vivek

2011-03-03 18:03:43

by Vivek Goyal

Subject: Re: Per iocontext request descriptor limits (Was: Re: RFC: default group_isolation to 1, remove option)

On Thu, Mar 03, 2011 at 11:57:17AM -0500, Vivek Goyal wrote:

[..]
> > For now it's not a complicated effort, I already have a patch for this.
> > If page tracking needs extra complexity, it'll have to remain in the
> > page tracking code.
>
> Great. I am hoping once you get some free time, you will cleanup and post
> that patch.

Hmmm... now I have gotten interested in this stuff. In case you would like
to see your patch go in but are finding it hard to find time to clean it up,
I can do that and repost it, and also fix any bugs.

Thanks
Vivek

2011-03-04 11:01:45

by Jens Axboe

Subject: Re: Per iocontext request descriptor limits (Was: Re: RFC: default group_isolation to 1, remove option)

On 2011-03-03 19:03, Vivek Goyal wrote:
> On Thu, Mar 03, 2011 at 11:57:17AM -0500, Vivek Goyal wrote:
>
> [..]
>>> For now it's not a complicated effort, I already have a patch for this.
>>> If page tracking needs extra complexity, it'll have to remain in the
>>> page tracking code.
>>
>> Great. I am hoping once you get some free time, you will cleanup and post
>> that patch.
>
> Hmmm..., Now I got interested in this stuff. In case you would like to see
> your patch to be in and finding it hard to find time to clean it up, I can
> do that and repost and also fix any bugs.

I did an initial run on the plane:

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-2.6.39/ioc-rq-alloc

Not tested at all, so there are likely a few bugs. Congestion is gone;
we'll need to consider that. Before, the device was either congested or
not congested; now the question to ask would be "is the device congested
for ME", as the answer will depend on who is asking.

--
Jens Axboe

2011-03-04 21:32:01

by Vivek Goyal

Subject: Re: Per iocontext request descriptor limits (Was: Re: RFC: default group_isolation to 1, remove option)

On Fri, Mar 04, 2011 at 12:01:41PM +0100, Jens Axboe wrote:
> On 2011-03-03 19:03, Vivek Goyal wrote:
> > On Thu, Mar 03, 2011 at 11:57:17AM -0500, Vivek Goyal wrote:
> >
> > [..]
> >>> For now it's not a complicated effort, I already have a patch for this.
> >>> If page tracking needs extra complexity, it'll have to remain in the
> >>> page tracking code.
> >>
> >> Great. I am hoping once you get some free time, you will cleanup and post
> >> that patch.
> >
> > Hmmm..., Now I got interested in this stuff. In case you would like to see
> > your patch to be in and finding it hard to find time to clean it up, I can
> > do that and repost and also fix any bugs.
>
> I did an initial run on the plane:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-2.6.39/ioc-rq-alloc
>
> Not tested at all, so there are likely a few bugs. Congestion is gone,
> we'll need to consider that. Before the device was either congested or
> not congested, now the question to ask would be "is the device congested
> for ME" as the answer will depend on who is asking.


Thanks Jens. I was having a quick look and noticed that the ioc->count[]
state is global across queues. I guess what we want is to keep track of ioc
state per queue (something like cfq_io_context). By maintaining per-device
state in the ioc, we can also tell whether a particular ioc is congested
on a specific device.
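
Something along these lines, perhaps (illustrative structs only, not a
proposal for the real layout):

struct request_queue;                   /* opaque here */

struct ioc_queue_state {
        struct request_queue *q;        /* which device this entry is for */
        int nr_requests;                /* this ioc's requests on that queue */
        int max_requests;               /* per-ioc, per-queue limit */
        struct ioc_queue_state *next;   /* simple per-ioc list keyed by queue */
};

/* "Is this io context congested on this particular device?" */
static int ioc_congested_on(struct ioc_queue_state *list,
                            struct request_queue *q)
{
        for (; list; list = list->next)
                if (list->q == q)
                        return list->nr_requests >= list->max_requests;
        return 0;
}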

Thanks
Vivek

2011-03-04 21:34:29

by Jens Axboe

Subject: Re: Per iocontext request descriptor limits (Was: Re: RFC: default group_isolation to 1, remove option)

On 2011-03-04 22:31, Vivek Goyal wrote:
> On Fri, Mar 04, 2011 at 12:01:41PM +0100, Jens Axboe wrote:
>> On 2011-03-03 19:03, Vivek Goyal wrote:
>>> On Thu, Mar 03, 2011 at 11:57:17AM -0500, Vivek Goyal wrote:
>>>
>>> [..]
>>>>> For now it's not a complicated effort, I already have a patch for this.
>>>>> If page tracking needs extra complexity, it'll have to remain in the
>>>>> page tracking code.
>>>>
>>>> Great. I am hoping once you get some free time, you will cleanup and post
>>>> that patch.
>>>
>>> Hmmm..., Now I got interested in this stuff. In case you would like to see
>>> your patch to be in and finding it hard to find time to clean it up, I can
>>> do that and repost and also fix any bugs.
>>
>> I did an initial run on the plane:
>>
>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=refs/heads/for-2.6.39/ioc-rq-alloc
>>
>> Not tested at all, so there are likely a few bugs. Congestion is gone,
>> we'll need to consider that. Before the device was either congested or
>> not congested, now the question to ask would be "is the device congested
>> for ME" as the answer will depend on who is asking.
>
>
> Thanks Jens. I was having a quick look and noticed that ioc->count[] state
> is global across queues. I guess what we want is to keep track of ioc
> state per queue (something like cfq_io_context). By maintaining per device
> state in ioc, we can also tell whether a particular ioc is congested or
> not on a specific device.

Indeed, something similar to how we set up cfq_io_contexts would be useful
there... Requires a bit of thinking.

I updated the patchset btw, found some silly errors in it. But yes, we
definitely need some better ->count[] tracking.

--
Jens Axboe

2011-03-06 16:06:26

by Andrea Righi

Subject: Re: RFC: default group_isolation to 1, remove option

On Wed, Mar 02, 2011 at 04:28:45PM -0500, Vivek Goyal wrote:
> On Tue, Mar 01, 2011 at 10:44:52AM -0800, Justin TerAvest wrote:
> > On Tue, Mar 1, 2011 at 6:20 AM, Vivek Goyal <[email protected]> wrote:
> > > On Mon, Feb 28, 2011 at 04:19:43PM -0800, Justin TerAvest wrote:
> > >> Hi Vivek,
> > >>
> > >> I'd like to propose removing the group_isolation setting and changing
> > >> the default to 1. Do we know if anyone is using group_isolation=0 to get
> > >> easy group separation between sequential readers and random readers?
> > >
> > > CCing Corrado.
> > >
> > > I like the idea of setting group_isolation = 1 to default. So far I have
> > > not found anybody trying to use group_isolation=0 and every time I had
> > > to say can you try setting group_isolation to 1 if you are not seeing
> > > service differentiation.
> > >
> > > I think I would not mind removing it completely altogether also. This will
> > > also remove some code from CFQ. The reason we introduced group_isolation
> > > because by default we idle on sync-noidle tree and on fast devices idling on
> > > every syn-noidle tree can be very harmful for throughput, especially on faster
> > > storage like storage arrays.
> > >
> > > One of the soutions for that problem can be that run with slice idle
> > > enabled on SATA disks and run with slice_idle=0 and possibly group_idle=0
> > > also on faster storage. Setting idling to 0 will increase throughput but
> > > at the same time reduce the isolation significantly. But I guess this
> > > is the performance vs isolation trade off.
> >
> > I agree. Thanks! I'll send a patch shortly, CCing everyone here and we
> > can have any further discussion there.
> >
> > >
> > >>
> > >> Allowing group_isolation complicates implementing per-cgroup request
> > >> descriptor pools when a queue is moved to the root group. Specifically,
> > >> if we have pools per-cgroup, we would be forced to use request
> > >> descriptors from the pool for the "original" cgroup, while the requests
> > >> are actually being serviced by the root cgroup.
> > >
> > > I think creating per group request pool will complicate the implementation
> > > further. (we have done that once in the past). Jens once mentioned that
> > > he liked number of requests per iocontext limit better than overall queue
> > > limit. So if we implement per iocontext limit, it will get rid of need
> > > of doing anything extra for group infrastructure.
> >
> > I will go read the discussion history for this, but I am concerned that doing
> > page tracking to look up the iocontext will be more complicated than
> > tracking dirtied pages per-cgroup. I would hope there is a big advantage
> > to per icontext limits.
>
> Advantage of iocontext limit I think is that we don't have to implement
> per group request descriptor pools or limits.
>
> Secondly, it also allows some isolation between two processes/iocontexts
> with-in the group.
>
> So we could think of a page taking a reference on iocontext but the
> real issue would be how to store the pointer of that iocontext in
> page_cgroup. We are already hard pressed for space and cssid consumes
> less bits.
>
> So may be keeping it per group might turn out to be simpler.
>
> Also once we were talking (I think at LSF last year) about associating an
> inode to a cgroup. So a bio can point to page and page can point to its inode
> and we get the pointer to the cgroup. This approach should not require tracking
> information in page and at coarse level it should work (as long as
> applications are working on separate inodes).
>
> I think another advantage of this scheme is that request queue can
> export per cgroup congestion mechanism. So if a flusher thread picks
> an inode for writting back the pages, it can retrieve the cgroup from
> inode and enquire block layer if associated cgroup on the device is
> congested and if it is, flusher thread can move on to next inode.
>
> In fact if we are just worried about isolating READS from WRITES, then
> we can skip implementing per group congestion idea. Even if flusher
> thread gets blocked because of request descriptors, it does not impact
> reads. I think at last summit people did not like the idea of congestion
> per group per bdi.

I'm not so confident about this approach, and I think we should never
block flusher threads. READs are probably not affected, but the
writeback of dirty pages for that particular block device will be
stopped. This means that all the tasks that are generating dirty pages
for that device will soon exceed the dirty_ratio limit and will start to
actively write back dirty pages.

I would consider writeback IO performed (only) by the flusher thread
a "good" condition. Or better: if flusher threads are able to keep
the amount of dirty pages below the dirty_background_ratio limit, then they
shouldn't be throttled.

Instead, when normal tasks are forced to write back dirty pages to disk
via writeback_inodes_wb() because flusher threads aren't able to keep
the amount of dirty pages below the limit, only in that case, IMHO, should
each task be forced to write back the inodes of the cgroup it belongs to
(once we have the inode->cgroup mapping, or something like it), and be
stopped if blkio limits are exceeded.

>
> One could do similar things with getting page->cgroup pointer also but
> then flusher threads will also have to go throgh all the pages of
> inode before it can decide whether to dispatch some IO or not and that
> might slow down the writeback.
>
> So if we are ready to sacrifice some granularity of async writeback
> control and do it per inode basis instead of per page basis, it might
> simplify the implementation and reduce performance overhead.
>
> CCing Greg and Andrea. They might have thoughts on this.
>
> Thanks
> Vivek

Thanks,
-Andrea

2011-03-07 18:20:34

by Justin TerAvest

Subject: Re: RFC: default group_isolation to 1, remove option

On Wed, Mar 2, 2011 at 7:45 PM, Jens Axboe <[email protected]> wrote:
> On 2011-03-01 09:20, Vivek Goyal wrote:
>> I think creating per group request pool will complicate the
>> implementation further. (we have done that once in the past). Jens
>> once mentioned that he liked number of requests per iocontext limit
>> better than overall queue limit. So if we implement per iocontext
>> limit, it will get rid of need of doing anything extra for group
>> infrastructure.
>>
>> Jens, do you think per iocontext per queue limit on request
>> descriptors make sense and we can get rid of per queue overall limit?
>
> Since we practically don't need a limit anymore to begin with (or so is
> the theory), then yes we can move to per-ioc limits instead and get rid
> of that queue state. We'd have to hold on to the ioc for the duration of
> the IO explicitly from the request then.
>
> I primarily like that implementation since it means we can make the IO
> completion lockless, at least on the block layer side. We still have
> state to complete in the schedulers that require that, but it's a good
> step at least.

So, the primary advantage of using per-ioc limits is that we can make
IO completions lockless?

I'm concerned that looking up the correct iocontext for a page will be more
complicated, and require more storage (than a css_id, anyway). I think Vivek
mentioned this too.

I don't understand what the advantage is of offering isolation between
iocontexts within a cgroup; if the user wanted isolation, shouldn't
they just create multiple cgroups? It seems like per-cgroup limits would work
as well.

Thanks,
Justin

>
> --
> Jens Axboe
>
>

2011-03-07 19:39:55

by Jens Axboe

Subject: Re: RFC: default group_isolation to 1, remove option

On 2011-03-07 19:20, Justin TerAvest wrote:
> On Wed, Mar 2, 2011 at 7:45 PM, Jens Axboe <[email protected]> wrote:
>> On 2011-03-01 09:20, Vivek Goyal wrote:
>>> I think creating per group request pool will complicate the
>>> implementation further. (we have done that once in the past). Jens
>>> once mentioned that he liked number of requests per iocontext limit
>>> better than overall queue limit. So if we implement per iocontext
>>> limit, it will get rid of need of doing anything extra for group
>>> infrastructure.
>>>
>>> Jens, do you think per iocontext per queue limit on request
>>> descriptors make sense and we can get rid of per queue overall limit?
>>
>> Since we practically don't need a limit anymore to begin with (or so is
>> the theory), then yes we can move to per-ioc limits instead and get rid
>> of that queue state. We'd have to hold on to the ioc for the duration of
>> the IO explicitly from the request then.
>>
>> I primarily like that implementation since it means we can make the IO
>> completion lockless, at least on the block layer side. We still have
>> state to complete in the schedulers that require that, but it's a good
>> step at least.
>
> So, the primary advantage of using per-ioc limits that we can make IO
> completions lockless?

Primarily, yes. The rq pool and its accounting are the only state left we
have to touch from both queuing IO and completing it.

> I'm concerned that looking up the correct iocontext for a page will be
> more complicated, and require more storage (than a css_id, anyway). I
> think Vivek mentioned this too.

A contained cgroup, is that sharing an IO context across the processes?

> I don't understand what the advantage is of offering isolation between
> iocontexts within a cgroup; if the user wanted isolation, shouldn't
> they just create multiple cgroups? It seems like per-cgroup limits
> would work as well.

It's at least not my goal; it has nothing to do with isolation. Since we
have ->make_request_fn() drivers operating completely without queuing
limits, it may just be that we can drop the tracking completely on the
request side. Either one is currently broken, or both will work that
way. And if that is the case, then we don't have to do this ioc tracking
at all. With the additional complication of now needing per-disk,
per-process io contexts, that approach is looking a lot more tasty right
now.

Or not get rid of limits completely, but do a lot more relaxed
accounting at the queue level still. That will not require any
additional tracking of io contexts etc, but still impose some limit on
the number of queued IOs.

--
Jens Axboe

2011-03-07 20:24:42

by Vivek Goyal

Subject: Re: RFC: default group_isolation to 1, remove option

On Mon, Mar 07, 2011 at 08:39:52PM +0100, Jens Axboe wrote:
> On 2011-03-07 19:20, Justin TerAvest wrote:
> > On Wed, Mar 2, 2011 at 7:45 PM, Jens Axboe <[email protected]> wrote:
> >> On 2011-03-01 09:20, Vivek Goyal wrote:
> >>> I think creating per group request pool will complicate the
> >>> implementation further. (we have done that once in the past). Jens
> >>> once mentioned that he liked number of requests per iocontext limit
> >>> better than overall queue limit. So if we implement per iocontext
> >>> limit, it will get rid of need of doing anything extra for group
> >>> infrastructure.
> >>>
> >>> Jens, do you think per iocontext per queue limit on request
> >>> descriptors make sense and we can get rid of per queue overall limit?
> >>
> >> Since we practically don't need a limit anymore to begin with (or so is
> >> the theory), then yes we can move to per-ioc limits instead and get rid
> >> of that queue state. We'd have to hold on to the ioc for the duration of
> >> the IO explicitly from the request then.
> >>
> >> I primarily like that implementation since it means we can make the IO
> >> completion lockless, at least on the block layer side. We still have
> >> state to complete in the schedulers that require that, but it's a good
> >> step at least.
> >
> > So, the primary advantage of using per-ioc limits that we can make IO
> > completions lockless?
>
> Primarily, yes. The rq pool and accounting is the only state left we
> have to touch from both queuing IO and completing it.
>
> > I'm concerned that looking up the correct iocontext for a page will be
> > more complicated, and require more storage (than a css_id, anyway). I
> > think Vivek mentioned this too.
>
> A contained cgroup, is that sharing an IO context across the processes?

Not necessarily. We can have tasks in an IO cgroup which are not sharing
io context.

>
> > I don't understand what the advantage is of offering isolation between
> > iocontexts within a cgroup; if the user wanted isolation, shouldn't
> > they just create multiple cgroups? It seems like per-cgroup limits
> > would work as well.
>
> It's at least not my goal, it has nothing to do with isolation. Since we
> have ->make_request_fn() drivers operating completely without queuing
> limits, it may just be that we can drop the tracking completely on the
> request side. Either one is currently broken, or both will work that
> way. And if that is the case, then we don't have to do this ioc tracking
> at all. With the additional complication of now needing
> per-disk-per-process io contexts, that approach is looking a lot more
> tasty right now.

I am writing the code for per-disk, per-process io contexts, and it is a
significant amount of code; as the code size grows I am also wondering
whether it is worth the complication.

Currently the request queue blocks a process if the device is congested. It
might happen that one process in a low-weight cgroup is doing writes and has
consumed all available request descriptors (this is really easy to reproduce)
and now the device is congested. Now any writes from a high-weight/priority
cgroup will not even be submitted to the request queue, and hence they
cannot be given priority by CFQ.

>
> Or not get rid of limits completely, but do a lot more relaxed
> accounting at the queue level still. That will not require any
> additional tracking of io contexts etc, but still impose some limit on
> the number of queued IOs.

A lot more relaxed limit accounting should help a bit, but after a
while it might happen that slow movers eat up lots of request descriptors
while not making much progress.

A long time back I had implemented this additional notion of
q->nr_group_requests, where we defined a per-group number of requests
allowed before the submitter is put to sleep. I also extended it to export
a per-bdi, per-group congestion notion. A flusher thread can then look at a
page and the cgroup of the page and determine whether the respective cgroup
is congested. If the cgroup is congested, the flusher thread can move on to
the next inode so that it is not put to sleep behind a slow mover.

A completely limitless queue will solve the problem completely. But I guess
then we would get back the complaint that the flusher thread submitted too
much IO to the device.

So given the fact that per-ioc, per-disk accounting of request descriptors
makes the accounting complicated and also makes it hard for the block IO
controller to use, the other approach of implementing per-group limits
and per-group, per-bdi congestion might be reasonable. Having said that, the
patch I had written for per-group descriptors was also not necessarily very
simple.

Thanks
Vivek

2011-03-07 20:32:57

by Jens Axboe

Subject: Re: RFC: default group_isolation to 1, remove option

On 2011-03-07 21:24, Vivek Goyal wrote:
> On Mon, Mar 07, 2011 at 08:39:52PM +0100, Jens Axboe wrote:
>> It's at least not my goal, it has nothing to do with isolation. Since we
>> have ->make_request_fn() drivers operating completely without queuing
>> limits, it may just be that we can drop the tracking completely on the
>> request side. Either one is currently broken, or both will work that
>> way. And if that is the case, then we don't have to do this ioc tracking
>> at all. With the additional complication of now needing
>> per-disk-per-process io contexts, that approach is looking a lot more
>> tasty right now.
>
> I am writting the code for per-disk-per-process io context and it is
> significant amount of code and as code size is growing I am also wondering
> if it worth the complication.

Yep, I don't think we should do that.

> Currently request queue blocks a process if device is congested. It might
> happen that one process in a low weight cgroup is doing writes and has
> consumed all available request descriptors (it is really easy to produce)
> and now device is congested. Now any writes from high weight/prio cgroup
> will not even be submitted on request queue and hence they can not be
> given priority by CFQ.
>
>>
>> Or not get rid of limits completely, but do a lot more relaxed
>> accounting at the queue level still. That will not require any
>> additional tracking of io contexts etc, but still impose some limit on
>> the number of queued IOs.
>
> A lot more relaxed limit accounting should help a bit but it after a
> while it might happen that slow movers eat up lots of request descriptors
> and making not much of progress.
>
> Long back I had implemented this additional notion of q->nr_group_requests
> where we defined per group number of requests allowed submitter will
> be put to sleep. I also extended it to also export per bdi per group
> congestion notion. So a flusher thread can look at the page and cgroup
> of the page and determine if respective cgroup is congested or not. If
> cgroup is congested, flusher thread can move to next inode so that it
> is not put to sleep behind a slow mover.
>
> Completely limitless queueu will solve the problem completely. But I guess
> then we can get that complain back that flusher thread submitted too much
> of IO to device.
>
> So given then fact that per-ioc-per-disk accounting of request descriptors
> makes the accounting complicated and also makes it hard for block IO
> controller to use it, the other approach of implementing per group limit
> and per-group-per-bdi congested might be reasonable. Having said that, the
> patch I had written for per group descritor was also not necessarily very
> simple.

So before all of this gets overdesigned a lot... If we get rid of the
one remaining direct buffered writeback in bdp(), then only the flusher
threads should be sending huge amounts of IO. So if we attack the
problem from that end instead, we can have it do that accounting in the bdi.
With that in place, I'm fairly confident that we can remove the request
limits.

Basically just replace the congestion_wait() in there with a bit of
accounting logic. Since it's per bdi anyway, we don't even have to
maintain that state in the bdi itself. It can remain in the thread
stack.
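
Something like this, roughly (untested sketch, made-up helper names, just
to show where the accounting would sit):

enum { WB_INFLIGHT_LIMIT = 1024 };              /* illustrative threshold */

long submit_more_writeback(void);               /* returns nr of requests queued */
void wait_for_completions(long *in_flight);     /* illustrative back-off helper */

static void flusher_pass(long nr_chunks)
{
        long in_flight = 0;                     /* thread-stack state only */

        while (nr_chunks--) {
                in_flight += submit_more_writeback();

                /* This accounting check replaces congestion_wait(). */
                if (in_flight > WB_INFLIGHT_LIMIT)
                        wait_for_completions(&in_flight);
        }
}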

--
Jens Axboe

2011-03-07 20:34:30

by Vivek Goyal

Subject: Re: RFC: default group_isolation to 1, remove option

On Mon, Mar 07, 2011 at 03:24:32PM -0500, Vivek Goyal wrote:

[..]
> >
> > Or not get rid of limits completely, but do a lot more relaxed
> > accounting at the queue level still. That will not require any
> > additional tracking of io contexts etc, but still impose some limit on
> > the number of queued IOs.
>
> A lot more relaxed limit accounting should help a bit but it after a
> while it might happen that slow movers eat up lots of request descriptors
> and making not much of progress.
>
> Long back I had implemented this additional notion of q->nr_group_requests
> where we defined per group number of requests allowed submitter will
> be put to sleep. I also extended it to also export per bdi per group
> congestion notion. So a flusher thread can look at the page and cgroup
> of the page and determine if respective cgroup is congested or not. If
> cgroup is congested, flusher thread can move to next inode so that it
> is not put to sleep behind a slow mover.
>
> Completely limitless queueu will solve the problem completely. But I guess
> then we can get that complain back that flusher thread submitted too much
> of IO to device.

Also wanted to add that currently the blk-throttle code implements limitless
queuing of bios. The reason I did not enforce a limit yet is the same: I
would run into issues with async WRITES and the flusher thread.

So once we have figured out what's the right thing to do here, I can
implement a similar solution for throttling too.

One side effect of limitless bio queueing is that an AIO process can queue up
lots of bios in a group, and if one tries to kill the process, it waits for
all the IOs to finish, which can take a very long time depending on the
throttling limits of the group.

Thanks
Vivek

2011-03-07 20:36:54

by Jens Axboe

Subject: Re: RFC: default group_isolation to 1, remove option

On 2011-03-07 21:34, Vivek Goyal wrote:
> One side affect of limitless bio queueing is an AIO process can queue
> up lots of bios in a group and if one tries to kill the process, it
> waits for all the IOs to finish and can take up a very long time
> depending on throttling limits of the group.

But I doubt this is a problem, really. Sure, you could queue tons of IO
with aio if you really wanted, but aio does maintain a per-system
limit. So it's not infinite. The fact that waiting for pending aios can
take a while if you ctrl-c or kill the process is not a real worry,
imho.

--
Jens Axboe

2011-03-07 20:47:01

by Vivek Goyal

Subject: Re: RFC: default group_isolation to 1, remove option

On Mon, Mar 07, 2011 at 09:32:54PM +0100, Jens Axboe wrote:

[..]
> > So given then fact that per-ioc-per-disk accounting of request descriptors
> > makes the accounting complicated and also makes it hard for block IO
> > controller to use it, the other approach of implementing per group limit
> > and per-group-per-bdi congested might be reasonable. Having said that, the
> > patch I had written for per group descritor was also not necessarily very
> > simple.
>
> So before all of this gets over designed a lot... If we get rid of the
> one remaining direct buffered writeback in bdp(), then only the flusher
> threads should be sending huge amounts of IO. So if we attack the
> problem from that end instead, have it do that accounting in the bdi.
> With that in place, I'm fairly confident that we can remove the request
> limits.
>
> Basically just replace the congestion_wait() in there with a bit of
> accounting logic. Since it's per bdi anyway, we don't even have to
> maintain that state in the bdi itself. It can remain in the thread
> stack.

Moving the accounting up sounds interesting. For the cgroup stuff we shall
again have to do something additional, like having per-cgroup, per-bdi
flusher threads, or maintaining the number of pending IOs per group so that
the flusher thread does not submit IOs for groups which already have lots of
pending IOs (to avoid a faster group getting blocked behind a slower one).

Thanks
Vivek

2011-03-07 20:48:02

by Jens Axboe

Subject: Re: RFC: default group_isolation to 1, remove option

On 2011-03-07 21:46, Vivek Goyal wrote:
> On Mon, Mar 07, 2011 at 09:32:54PM +0100, Jens Axboe wrote:
>
> [..]
>>> So given then fact that per-ioc-per-disk accounting of request descriptors
>>> makes the accounting complicated and also makes it hard for block IO
>>> controller to use it, the other approach of implementing per group limit
>>> and per-group-per-bdi congested might be reasonable. Having said that, the
>>> patch I had written for per group descritor was also not necessarily very
>>> simple.
>>
>> So before all of this gets over designed a lot... If we get rid of the
>> one remaining direct buffered writeback in bdp(), then only the flusher
>> threads should be sending huge amounts of IO. So if we attack the
>> problem from that end instead, have it do that accounting in the bdi.
>> With that in place, I'm fairly confident that we can remove the request
>> limits.
>>
>> Basically just replace the congestion_wait() in there with a bit of
>> accounting logic. Since it's per bdi anyway, we don't even have to
>> maintain that state in the bdi itself. It can remain in the thread
>> stack.
>
> Moving the accounting up sounds interesting. For cgroup stuff we again
> shall have to do something additional like having per cgroup per bdi
> flusher threads or mainting the number of pending IO per group and not
> flusher thread does not submitting IOs for groups which have lots of
> pending IOs (to avoid faster group getting blocked behind slower one).

So since there are at least two use cases, we could easily provide
helpers to do that sort of blocking to not throw too much work at it.

I think we are making progress :-)

--
Jens Axboe

2011-03-07 23:42:15

by Justin TerAvest

Subject: Re: RFC: default group_isolation to 1, remove option

On Mon, Mar 7, 2011 at 12:47 PM, Jens Axboe <[email protected]> wrote:
> On 2011-03-07 21:46, Vivek Goyal wrote:
>> On Mon, Mar 07, 2011 at 09:32:54PM +0100, Jens Axboe wrote:
>>
>> [..]
>>>> So given then fact that per-ioc-per-disk accounting of request descriptors
>>>> makes the accounting complicated and also makes it hard for block IO
>>>> controller to use it, the other approach of implementing per group limit
>>>> and per-group-per-bdi congested might be reasonable. Having said that, the
>>>> patch I had written for per group descritor was also not necessarily very
>>>> simple.
>>>
>>> So before all of this gets over designed a lot... If we get rid of the
>>> one remaining direct buffered writeback in bdp(), then only the flusher
>>> threads should be sending huge amounts of IO. So if we attack the
>>> problem from that end instead, have it do that accounting in the bdi.
>>> With that in place, I'm fairly confident that we can remove the request
>>> limits.
>>>
>>> Basically just replace the congestion_wait() in there with a bit of
>>> accounting logic. Since it's per bdi anyway, we don't even have to
>>> maintain that state in the bdi itself. It can remain in the thread
>>> stack.
>>
>> Moving the accounting up sounds interesting. For cgroup stuff we again
>> shall have to do something additional like having per cgroup per bdi
>> flusher threads or mainting the number of pending IO per group and not
>> flusher thread does not submitting IOs for groups which have lots of
>> pending IOs (to avoid faster group getting blocked behind slower one).
>
> So since there are at least two use cases, we could easily provide
> helpers to do that sort of blocking to not throw too much work at it.
>
> I think we are making progress :-)

This generally sounds good to me, though I didn't think per-cgroup limits
were terribly complicated.

I wanted to make a quick note: it sounds like part of the intent here is to
avoid doing any page tracking in the page_cgroup structure, but I think that
we will inevitably have to do some tracking there for css ids, to provide
isolation between buffered writers. I'd like to send out a patchset soon
to track buffered writers, but we should probably work out the request
descriptor limits first.


>
> --
> Jens Axboe
>
>

2011-03-08 00:06:06

by Vivek Goyal

Subject: Re: RFC: default group_isolation to 1, remove option

On Mon, Mar 07, 2011 at 09:32:54PM +0100, Jens Axboe wrote:
> On 2011-03-07 21:24, Vivek Goyal wrote:
> > On Mon, Mar 07, 2011 at 08:39:52PM +0100, Jens Axboe wrote:
> >> It's at least not my goal, it has nothing to do with isolation. Since we
> >> have ->make_request_fn() drivers operating completely without queuing
> >> limits, it may just be that we can drop the tracking completely on the
> >> request side. Either one is currently broken, or both will work that
> >> way. And if that is the case, then we don't have to do this ioc tracking
> >> at all. With the additional complication of now needing
> >> per-disk-per-process io contexts, that approach is looking a lot more
> >> tasty right now.
> >
> > I am writting the code for per-disk-per-process io context and it is
> > significant amount of code and as code size is growing I am also wondering
> > if it worth the complication.
>
> Yep, I don't think we should do that.
>
> > Currently request queue blocks a process if device is congested. It might
> > happen that one process in a low weight cgroup is doing writes and has
> > consumed all available request descriptors (it is really easy to produce)
> > and now device is congested. Now any writes from high weight/prio cgroup
> > will not even be submitted on request queue and hence they can not be
> > given priority by CFQ.
> >
> >>
> >> Or not get rid of limits completely, but do a lot more relaxed
> >> accounting at the queue level still. That will not require any
> >> additional tracking of io contexts etc, but still impose some limit on
> >> the number of queued IOs.
> >
> > A lot more relaxed limit accounting should help a bit but it after a
> > while it might happen that slow movers eat up lots of request descriptors
> > and making not much of progress.
> >
> > Long back I had implemented this additional notion of q->nr_group_requests
> > where we defined per group number of requests allowed submitter will
> > be put to sleep. I also extended it to also export per bdi per group
> > congestion notion. So a flusher thread can look at the page and cgroup
> > of the page and determine if respective cgroup is congested or not. If
> > cgroup is congested, flusher thread can move to next inode so that it
> > is not put to sleep behind a slow mover.
> >
> > Completely limitless queueu will solve the problem completely. But I guess
> > then we can get that complain back that flusher thread submitted too much
> > of IO to device.
> >
> > So given then fact that per-ioc-per-disk accounting of request descriptors
> > makes the accounting complicated and also makes it hard for block IO
> > controller to use it, the other approach of implementing per group limit
> > and per-group-per-bdi congested might be reasonable. Having said that, the
> > patch I had written for per group descritor was also not necessarily very
> > simple.
>
> So before all of this gets over designed a lot... If we get rid of the
> one remaining direct buffered writeback in bdp(), then only the flusher
> threads should be sending huge amounts of IO. So if we attack the
> problem from that end instead, have it do that accounting in the bdi.
> With that in place, I'm fairly confident that we can remove the request
> limits.
>
> Basically just replace the congestion_wait() in there with a bit of
> accounting logic. Since it's per bdi anyway, we don't even have to
> maintain that state in the bdi itself. It can remain in the thread
> stack.

I am wondering if we can make use of the per-bdi BDI_WRITEBACK state
to keep track of bdi state and do the congestion_wait() accounting
accordingly. It is per-cpu, so we will introduce some inaccuracy, but I
guess the goal here is not to do very accurate nr_requests accounting. That
should at least remove nr_requests accounting from the queue.

For the cgroup stuff, maybe we can maintain some state in the memory
cgroups; for example, some kind of per-bdi count of writeback in progress
from that memory cgroup. If a bdi is congested, then we can additionally
check whether any IO from this cgroup is in progress on that particular
bdi. If yes, we throttle the writeout; otherwise we allow the thread to
submit more IO.

So in practice q->nr_requests gets replaced with bdi->bdi_stat[BDI_WRITEBACK].
That way nr_requests moves out of the request queue. As bdi_stat is per-cpu,
the locking overhead should go down overall. I think the tricky part is how
to keep track of per-cgroup, per-bdi stats. We need some kind of simple
approximation to allow IO from one cgroup to multiple bdis at the same
time.
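
Roughly, the device-busy check could then become something like this
(untested; bdi_stat() and BDI_WRITEBACK are the existing per-bdi counters,
the threshold and helper name below are made up for illustration):

#include <linux/backing-dev.h>

#define BDI_BUSY_THRESH 1024            /* illustrative, would need tuning */

static int bdi_writeback_busy(struct backing_dev_info *bdi)
{
        /*
         * bdi_stat() reads the per-cpu counter, so some error is expected;
         * that is fine since we only want a coarse congestion signal here.
         */
        return bdi_stat(bdi, BDI_WRITEBACK) > BDI_BUSY_THRESH;
}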

Thanks
Vivek