Subject: aio: questions with ioctx_alloc() and large num_possible_cpus()

Hi Benjamin, Kent, and others,

Would you please comment / answer about this possible problem?
Any feedback is appreciated.

Since commit e1bdd5f27a5b ("aio: percpu reqs_available") the maximum
system-wide number of aio nr_events may be a function of
num_possible_cpus(), and is actually /inversely proportional/ to it
(i.e., more CPUs allow fewer system-wide aio nr_events). This is a
problem on larger systems.

That's because if "nr_events < num_possible_cpus() * 4" (for example,
nr_events == 1), the request is accounted as "num_possible_cpus() * 4"
into aio_nr and checked against aio_max_nr:

static struct kioctx *ioctx_alloc(unsigned nr_events)
...
nr_events = max(nr_events, num_possible_cpus() * 4);
nr_events *= 2;
...
/* limit the number of system wide aios */
....
if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
...
err = -EAGAIN;
...
aio_nr += ctx->max_reqs;
...

That problem is easily noticeable on a common POWER8 system: 160 CPUs
(2 sockets * 10 cores/socket * 8 threads/core = 160 CPUs) limit the
maximum number of AIO contexts created with "io_setup(1, )" to 102 out
of 64k (the default aio-max-nr):

# cat /sys/devices/system/cpu/possible
0-159

# cat /proc/sys/fs/aio-max-nr
65536

# echo $(( 65536 / (160 * 4) ))
102
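
If I'm reading the quoted ioctx_alloc() code right, each io_setup(1, )
accounts max(1, 160 * 4) * 2 = 1280 into aio_nr, and the check is made
against aio_max_nr * 2 = 131072 (the doubling on both sides cancels
out, which is why the simpler division above gives the same result):

# echo $(( 65536 * 2 / 1280 ))
102

# echo $(( 102 * 1280 ))
130560

so the 103rd call fails with EAGAIN, and aio-nr sits at 130560 while
those 102 contexts exist -- already above aio-max-nr (see below).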

test-case snippet & output:

for (i = 0; i < 65536; i++)
        if (rc = io_setup(1, &ioctx[i]))
                break;

printf("rc = %d, i = %d\n", rc, i);

> rc = -11, i = 102
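
For completeness, a self-contained version of the test case -- a sketch
assuming libaio's io_setup() wrapper, which returns a negative errno
(hence rc = -11 == -EAGAIN above); build with -laio:

#include <libaio.h>
#include <stdio.h>

#define MAX_CTXS 65536

int main(void)
{
        /* io_context_t must be zeroed before io_setup();
         * a static array already is. */
        static io_context_t ioctx[MAX_CTXS];
        int rc = 0, i;

        for (i = 0; i < MAX_CTXS; i++) {
                rc = io_setup(1, &ioctx[i]);    /* request 1 event slot */
                if (rc)
                        break;
        }

        printf("rc = %d, i = %d\n", rc, i);
        return 0;
}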

(another problem is that the sysctl aio-nr grows larger than aio-max-nr,
since it's checked against "aio_max_nr * 2")

So,

I've been trying to understand/fix this, but soon got stuck on options,
as I didn't quite get a few points. If you could provide some insight,
that would be really helpful:

- why "num_possible_cpus() * 4", and why "max(nr_events, <it>)" ?

Is it just related to req_batch, as a reasonable constant, or are
there other implications? (e.g., related to "up to half of the slots
on other cpu's percpu counters" -- which would be nice to understand
as well.)

- "struct kioctx" says max_reqs is

" is what userspace passed to io_setup(), it's not used for
anything but counting against the global max_reqs quota. "

However, we see it incremented by the modified nr_events, thus
not really the value from userspace anymore, and used to derive
nr_events in aio_setup_ring(). Is the comment wrong nowadays,
or is the code usage of max_reqs wrong/abusing it, or... ? :)

- what is aio-nr really expected to count: nr_events (er.. the value
actually requested by userspace?), or the number of times io_setup(N, )
returned successfully (i.e., the number of io contexts), regardless
of the total/sum of their nr_events?

- any other comments/suggestions are appreciated.

Thanks in advance,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center


2016-10-05 06:34:42

by Kent Overstreet

Subject: Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

On Tue, Oct 04, 2016 at 07:55:12PM -0300, Mauricio Faria de Oliveira wrote:
> [snip]
>
> - why "num_possible_cpus() * 4", and why "max(nr_events, <it>)" ?

For the scheme to work - percpu allocation of slots - we have to ensure that
there aren't too many unused slots stranded on other CPUs. The stranding is
limited to 1/4th of the slots as I figured any more than that could be too
unpredictable - the effective maximum number of in flight iocbs would vary too
much.

For systems with large numbers of CPUs, what I'd prefer to do is make it per
core or NUMA node or some such. But we don't have any infrastructure for that
equivalent to the alloc_percpu() stuff, so that's why I didn't do it at the
time.
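
To make "stranded" concrete, here's a toy userspace model of the batching
(purely illustrative -- the real logic lives in get_reqs_available() in
fs/aio.c and uses percpu variables, atomics and irq protection; NCPUS,
NR_SLOTS and get_req() below are made up for the example):

#include <stdio.h>

#define NCPUS    4
#define NR_SLOTS 64                     /* plays the role of ctx->nr_events */

static int shared    = NR_SLOTS - 1;    /* ctx->reqs_available */
static int req_batch = (NR_SLOTS - 1) / (NCPUS * 4);   /* ctx->req_batch */
static int cache[NCPUS];                /* per-cpu cached slots */

/* take one slot on 'cpu', refilling its cache in req_batch-sized chunks */
static int get_req(int cpu)
{
        if (!cache[cpu]) {
                if (shared < req_batch)
                        return 0;       /* EAGAIN, even if other cpus
                                         * still hold cached slots */
                shared -= req_batch;
                cache[cpu] = req_batch;
        }
        cache[cpu]--;
        return 1;
}

int main(void)
{
        int cpu, got = 0, stranded = 0;

        /* one request is submitted on each of the other cpus ... */
        for (cpu = 1; cpu < NCPUS; cpu++)
                get_req(cpu);

        /* ... then a single thread pinned to cpu 0 takes all it can */
        while (get_req(0))
                got++;

        for (cpu = 1; cpu < NCPUS; cpu++)
                stranded += cache[cpu];

        printf("req_batch=%d: cpu0 got %d of %d, %d stranded on other cpus\n",
               req_batch, got, NR_SLOTS - 1, stranded);
        return 0;
}

In this toy run, cpu 0 ends up with 54 of the 63 slots while 6 sit
unused in the caches of cpus 1-3: that cached-but-unused portion is the
stranding, bounded in the worst case to about 1/4th of the slots by the
num_possible_cpus() * 4 factor.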

Subject: Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

Hi Kent,

Thanks for commenting. I understood more of the code while trying to
make sense of your point, but some things are still unclear about it;
could you help a bit more, please?

Can you describe how a single thread might not be able to use all the
slots because 'up to about half of the reqs_available slots might
be on other percpu reqs_available' ?

I see that the thread might be scheduled on different CPUs (say, only
2 possible CPUs) and perform get_reqs_available() on both -- but that
only gives one req_batch to each CPU, and for req_batch to be half of
reqs_available its denominator would need to be 2, which doesn't happen
with num_possible_cpus() * 4 -- which is 8. So I'm a bit confused here.

atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
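
Plugging the 2-CPU / io_setup(1, ) case into those lines (ignoring the
rounding that aio_setup_ring() may later apply to nr_events):

nr_events      = max(1, 2 * 4) * 2 = 16
reqs_available = 16 - 1            = 15
req_batch      = 15 / (2 * 4)      = 1

i.e., each refill moves a single slot -- nowhere near half of them.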

On 10/05/2016 03:34 AM, Kent Overstreet wrote:
>> - why "num_possible_cpus() * 4", and why "max(nr_events, <it>)" ?

> For the scheme to work - percpu allocation of slots - we have to ensure that
> there aren't too many unused slots stranded on other CPUs. The stranding is
> limited to 1/4th of the slots [snip]

By 'unused slots', do you mean the slots included in the batch allocated
to a particular CPU but not actually used by a thread on that CPU?
(e.g., get_reqs_available() called once, unused_slots == req_batch - 1)

Can you please detail a bit more how the limit of 1/4th of the slots is
ensured by "num_possible_cpus() * 4", and what scenario the math is
based on? I've been plugging in values for a while now and haven't
figured out the point where / how it occurs.

Thanks for your support,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

Subject: Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

Hi Benjamin,

On 10/05/2016 02:41 PM, Benjamin LaHaise wrote:
> I'd suggest increasing the default limit by changing how it is calculated.
> The current number came about 13 years ago when machines had orders of
> magnitude less RAM than they do today.

Thanks for the suggestion.

Does the default also have implications other than memory usage?
For example, the concurrency/performance of that many aio contexts
running, or whether userspace could try to exploit something with a
larger number?

Wondering about it because it can be set based on num_possible_cpus(),
but that might be really large on high-end systems.

Regards,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

2016-10-05 18:02:28

by Benjamin LaHaise

Subject: Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

On Tue, Oct 04, 2016 at 07:55:12PM -0300, Mauricio Faria de Oliveira wrote:
> Hi Benjamin, Kent, and others,
>
> Would you please comment / answer about this possible problem?
> Any feedback is appreciated.

I'd suggest increasing the default limit by changing how it is calculated.
The current number came about 13 years ago when machines had orders of
magnitude less RAM than they do today.

-ben
--
"Thought is the essence of where you are now."

2016-10-05 18:17:43

by Benjamin LaHaise

Subject: Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

On Wed, Oct 05, 2016 at 02:58:12PM -0300, Mauricio Faria de Oliveira wrote:
> Hi Benjamin,
>
> On 10/05/2016 02:41 PM, Benjamin LaHaise wrote:
> >I'd suggest increasing the default limit by changing how it is calculated.
> >The current number came about 13 years ago when machines had orders of
> >magnitude less RAM than they do today.
>
> Thanks for the suggestion.
>
> Does the default also have implications other than memory usage?
> For example, the concurrency/performance of that many aio contexts
> running, or whether userspace could try to exploit something with a
> larger number?

Anything's possible when a local user can run code. It's the same problem
as determining how much memory can be mlock()ed, or how much i/o a process
should be allowed to do. Nothing prevents an app from doing a huge number
of readahead() calls to make the system prefetch gigabytes of data. That
said, local users tend not to DoS themselves.

> Wondering about it because it can be set based on num_possible_cpus(),
> but that might be really large on high-end systems.

Today's high end systems are tomorrow's desktops... It probably makes
sense to implement per-user limits rather than the current global limit,
and maybe even convert them to an rlimit to better fit in with the
available frameworks for managing these things.

-ben

--
"Thought is the essence of where you are now."

Subject: Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

Ben,

On 10/05/2016 03:17 PM, Benjamin LaHaise wrote:
> Anything's possible when a local user can run code. [snip] That
> said, local users tend not to DoS themselves.

Agreed. I thought there might be something particular to the aio
implementation, but I guess there's nothing so special, then.

> [snip] It probably makes
> sense to implement per-user limits rather than the current global limit,
> and maybe even convert them to an rlimit to better fit in with the
> available frameworks for managing these things.

I see it would be a nice improvement, but unfortunately that's not a
task I can take on at the moment. For now, the most I'm able to do is
to continue trying to understand whether there's something we can do
to help the small nr_events case be more on par with the large
nr_events case... or any other beginner-level fixes in the area. :)

Thanks again,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center

2016-10-28 18:59:45

by Jeff Moyer

Subject: Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

Benjamin LaHaise <[email protected]> writes:

> Today's high end systems are tomorrow's desktops... It probably makes

Well, to some degree I agree with you. >100 processor high end systems
have been around for a long time, but we still don't have those on the
desktop. ;-)

> sense to implement per-user limits rather than the current global limit,
> and maybe even convert them to an rlimit to better fit in with the
> available frameworks for managing these things.

I actually wrote a patch to do this back in 2007:
http://www.gossamer-threads.com/lists/linux/kernel/1043934

It used the mlock rlimit. I ultimately decided to rescind it, since
there were years of experience with the current tunable, and plenty of
documentation on it, too. We could put aio-max-nr on the deprecated
path, though, if folks want to go that route. Let me know and I can
investigate resurrecting that patch. Though I would like input on
whether a new rlimit is desired.

Cheers,
Jeff