2021-09-07 16:46:41

by Jens Axboe

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On 9/7/21 10:39 AM, Linus Torvalds wrote:
> On Tue, Sep 7, 2021 at 8:46 AM Jens Axboe <[email protected]> wrote:
>>
>> Are we at all worried about these? There's been a number of them
>> reported, basically for all the accounting enablements that have been
>> done in this merge window.
>
> We are worried about them. I'm considering reverting several of them
> because I think the problems are
>
> (a) big
>
> (b) nontrivial
>
> and the patches clearly weren't ready and people weren't aware of this issue.

I think that is prudent. When I first enabled it for io_uring it was a
bit of a shit show in terms of performance degradations, and some work
had to be done before it could get enabled in a sane fashion.

The accounting needs to be more efficient if we're seeing 30-50%
slowdowns simply by enabling it on a kmem cache.
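
For context, "enabling it on a kmem cache" is basically a one-flag change at
cache creation time. Roughly what the enablement patches do, sketched here
against the file lock caches (treat the names and flags as illustrative, not
a quote of the actual patch):

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/slab.h>

static struct kmem_cache *filelock_cache;

static int __init filelock_init(void)
{
        /* Adding SLAB_ACCOUNT makes every allocation from this cache go
         * through memcg charging/uncharging, which is where the reported
         * per-allocation overhead comes from. */
        filelock_cache = kmem_cache_create("file_lock_cache",
                        sizeof(struct file_lock), 0,
                        SLAB_PANIC | SLAB_ACCOUNT, NULL);
        return 0;
}
core_initcall(filelock_init);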

--
Jens Axboe


2021-09-07 17:14:27

by Roman Gushchin

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On Tue, Sep 07, 2021 at 10:43:46AM -0600, Jens Axboe wrote:
> On 9/7/21 10:39 AM, Linus Torvalds wrote:
> > On Tue, Sep 7, 2021 at 8:46 AM Jens Axboe <[email protected]> wrote:
> >>
> >> Are we at all worried about these? There's been a number of them
> >> reported, basically for all the accounting enablements that have been
> >> done in this merge window.
> >
> > We are worried about them. I'm considering reverting several of them
> > because I think the problems are
> >
> > (a) big
> >
> > (b) nontrivial
> >
> > and the patches clearly weren't ready and people weren't aware of this issue.
>
> I think that is prudent. When I first enabled it for io_uring it was a
> bit of a shit show in terms of performance degradations, and some work
> had to be done before it could get enabled in a sane fashion.
>
> The accounting needs to be more efficient if we're seeing 30-50%
> slowdowns simply by enabling it on a kmem cache.

There are two polar cases:
1) a large number of relatively short-lived allocations, whose lifetime is well
bounded (e.g. by the lifetime of a task),
2) a relatively small number of long-lived allocations, whose lifetime
is potentially indefinite (e.g. struct mount).

We can't use the same approach for both cases, otherwise we'll run into either
performance or garbage collection problems (which also lead to performance
problems, but delayed).

I'm thinking of maybe building a generic cache layer for accounted allocations,
which could be used in cases like io_uring. Shakeel, what's your plan here?
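
Purely as a sketch of the idea (all names here are hypothetical, this is not
an existing API): each user keeps a small stash of objects that are already
charged to its obj_cgroup, so the charge/uncharge cost is amortized over many
allocations instead of being paid on every single one:

#include <linux/slab.h>

#define ACCT_CACHE_BATCH        16

struct acct_cache {
        struct kmem_cache       *cache; /* backing SLAB_ACCOUNT cache */
        void                    *objs[ACCT_CACHE_BATCH];
        unsigned int            nr;     /* cached, already-charged objects */
};

static void *acct_cache_alloc(struct acct_cache *ac, gfp_t gfp)
{
        if (ac->nr)
                return ac->objs[--ac->nr];       /* fast path: no charging */
        return kmem_cache_alloc(ac->cache, gfp); /* slow path: charge as usual */
}

static void acct_cache_free(struct acct_cache *ac, void *obj)
{
        if (ac->nr < ACCT_CACHE_BATCH)
                ac->objs[ac->nr++] = obj;        /* keep the charge for reuse */
        else
                kmem_cache_free(ac->cache, obj); /* uncharge via the slab path */
}

The tricky parts are elsewhere: flushing the stash on task exit or cgroup
migration, and not letting it pin memory indefinitely.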

As of now, I agree that reverting the patches causing a significant regression
is the best way forward.

2021-09-07 17:46:05

by Tejun Heo

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

Hello,

On Tue, Sep 07, 2021 at 10:11:21AM -0700, Roman Gushchin wrote:
> There are two polar cases:
> 1) a large number of relatively short-lived allocations, whose lifetime is well
> bounded (e.g. by the lifetime of a task),
> 2) a relatively small number of long-lived allocations, whose lifetime
> is potentially indefinite (e.g. struct mount).
>
> We can't use the same approach for both cases, otherwise we'll run into either
> performance or garbage collection problems (which also lead to performance
> problems, but delayed).

Wouldn't a front cache which expires after some seconds catch both cases?

Thanks.

--
tejun

2021-09-07 17:46:28

by Jens Axboe

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On 9/7/21 11:14 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 07, 2021 at 10:11:21AM -0700, Roman Gushchin wrote:
>> There are two polar cases:
>> 1) a large number of relatively short-lived allocations, whose lifetime is well
>> bounded (e.g. by the lifetime of a task),
>> 2) a relatively small number of long-lived allocations, whose lifetime
>> is potentially indefinite (e.g. struct mount).
>>
>> We can't use the same approach for both cases, otherwise we'll run into either
>> performance or garbage collection problems (which also lead to performance
>> problems, but delayed).
>
> Wouldn't a front cache which expires after some seconds catch both cases?

A purely time-based approach might be problematic, as you can allocate a
LOT of data in a short amount of time. Heuristics probably need to be a
hybrid of "this much time has passed" OR "we're over the front cache
threshold in terms of deferred accounting". But yes, I don't see why
we'd necessarily need different approaches for short vs long lifetimes.
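
Something along these lines, purely as a sketch (the thresholds are made up):

#include <linux/jiffies.h>
#include <linux/types.h>

#define FRONT_CACHE_MAX_BYTES   (64 * 1024)     /* made-up threshold */
#define FRONT_CACHE_MAX_AGE     (2 * HZ)        /* made-up age limit */

/* Flush the deferred accounting if enough time has passed since the last
 * flush OR if too much memory has piled up unaccounted, so that a short
 * burst of allocations can't hide from accounting for long. */
static bool front_cache_should_flush(unsigned long last_flush_jiffies,
                                     unsigned long deferred_bytes)
{
        return time_after(jiffies, last_flush_jiffies + FRONT_CACHE_MAX_AGE) ||
               deferred_bytes > FRONT_CACHE_MAX_BYTES;
}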

--
Jens Axboe

2021-09-07 17:46:31

by Tejun Heo

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On Tue, Sep 07, 2021 at 11:18:21AM -0600, Jens Axboe wrote:
> A purely time-based approach might be problematic, as you can allocate a
> LOT of data in a short amount of time. Heuristics probably need to be a
> hybrid of "this much time has passed" OR "we're over the front cache
> threshold in terms of deferred accounting". But yes, I don't see why
> we'd necessarily need different approaches for short vs long lifetimes.

Yeah, it'd need some heuristics to guard against the cache exploding and a
lot of laziness in the expiration logic, but none of that should be too
complicated.
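
E.g. a single self-re-arming delayed work item keeps the expiration lazy
without any per-object timers (sketch only, the helper is hypothetical):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

static void front_cache_reap(struct work_struct *work);
static DECLARE_DELAYED_WORK(front_cache_reaper, front_cache_reap);

/* Hypothetical helper: drop objects not reused since the last pass and
 * return true if anything is still cached. */
static bool front_cache_drop_idle_objects(void);

static void front_cache_reap(struct work_struct *work)
{
        /* Re-arm only while there is something left to expire, so an
         * idle cache costs nothing. */
        if (front_cache_drop_idle_objects())
                schedule_delayed_work(&front_cache_reaper, 2 * HZ);
}

Whoever first populates the cache would kick the work off with
schedule_delayed_work() as well.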

Thanks.

--
tejun

2021-09-07 17:49:26

by Roman Gushchin

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On Tue, Sep 07, 2021 at 07:14:45AM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 07, 2021 at 10:11:21AM -0700, Roman Gushchin wrote:
> > There are two polar cases:
> > 1) a large number of relatively short-lived allocations, whose lifetime is well
> > bounded (e.g. by the lifetime of a task),
> > 2) a relatively small number of long-lived allocations, whose lifetime
> > is potentially indefinite (e.g. struct mount).
> >
> > We can't use the same approach for both cases, otherwise we'll run into either
> > performance or garbage collection problems (which also lead to performance
> > problems, but delayed).
>
> Wouldn't a front cache which expires after some seconds catch both cases?

I'm not sure. For the second case we need to pack allocations from different
tasks/cgroups into a small number of shared pages. That means the front cache
should be really small/non-existent. For the first case we likely need a
substantial cache. Maybe we can do something really smart with scattering
the cache over multiple pages, but I really doubt it.

2021-09-07 17:51:27

by Shakeel Butt

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On Tue, Sep 7, 2021 at 10:31 AM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Sep 07, 2021 at 07:14:45AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Sep 07, 2021 at 10:11:21AM -0700, Roman Gushchin wrote:
> > > There are two polar cases:
> > > 1) a large number of relatively short-lived allocations, whose lifetime is well
> > > bounded (e.g. by the lifetime of a task),
> > > 2) a relatively small number of long-lived allocations, whose lifetime
> > > is potentially indefinite (e.g. struct mount).
> > >
> > > We can't use the same approach for both cases, otherwise we'll run into either
> > > performance or garbage collection problems (which also lead to performance
> > > problems, but delayed).
> >
> > Wouldn't a front cache which expires after some seconds catch both cases?
>
> I'm not sure. For the second case we need to pack allocations from different
> tasks/cgroups into a small number of shared pages. That means the front cache
> should be really small/non-existent. For the first case we likely need a
> substantial cache. Maybe we can do something really smart with scattering
> the cache over multiple pages, but I really doubt it.

I think we need to prototype this to evaluate it sensibly. Let me know if
you want to take a stab at it; otherwise I can try.

2021-09-07 20:51:08

by Roman Gushchin

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On Tue, Sep 07, 2021 at 10:48:06AM -0700, Shakeel Butt wrote:
> On Tue, Sep 7, 2021 at 10:31 AM Roman Gushchin <[email protected]> wrote:
> >
> > On Tue, Sep 07, 2021 at 07:14:45AM -1000, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Tue, Sep 07, 2021 at 10:11:21AM -0700, Roman Gushchin wrote:
> > > > There are two polar cases:
> > > > 1) a large number of relatively short-lived allocations, whose lifetime is well
> > > > bounded (e.g. by the lifetime of a task),
> > > > 2) a relatively small number of long-lived allocations, whose lifetime
> > > > is potentially indefinite (e.g. struct mount).
> > > >
> > > > We can't use the same approach for both cases, otherwise we'll run into either
> > > > performance or garbage collection problems (which also lead to performance
> > > > problems, but delayed).
> > >
> > > Wouldn't a front cache which expires after some seconds catch both cases?
> >
> > I'm not sure. For the second case we need to pack allocations from different
> > tasks/cgroups into a small number of shared pages. That means the front cache
> > should be really small/non-existent. For the first case we likely need a
> > substantial cache. Maybe we can do something really smart with scattering
> > the cache over multiple pages, but I really doubt it.
>
> I think we need to prototype this to evaluate it sensibly. Let me know if
> you want to take a stab at it; otherwise I can try.

If you have time and are ready to jump in, please go ahead. Otherwise I can start
working on it in a few weeks. In any case, I'm happy to help with discussions, code
reviews & whatever else I can do.

Thanks!

2021-09-08 08:16:37

by Vasily Averin

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

On 9/7/21 10:42 PM, Roman Gushchin wrote:
> On Tue, Sep 07, 2021 at 10:48:06AM -0700, Shakeel Butt wrote:
>> On Tue, Sep 7, 2021 at 10:31 AM Roman Gushchin <[email protected]> wrote:
>>>
>>> On Tue, Sep 07, 2021 at 07:14:45AM -1000, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Tue, Sep 07, 2021 at 10:11:21AM -0700, Roman Gushchin wrote:
>>>>> There are two polar cases:
>>>>> 1) a large number of relatively short-lived allocations, whose lifetime is well
>>>>> bounded (e.g. by the lifetime of a task),
>>>>> 2) a relatively small number of long-lived allocations, whose lifetime
>>>>> is potentially indefinite (e.g. struct mount).
>>>>>
>>>>> We can't use the same approach for both cases, otherwise we'll run into either
>>>>> performance or garbage collection problems (which also lead to performance
>>>>> problems, but delayed).
>>>>
>>>> Wouldn't a front cache which expires after some seconds catch both cases?
>>>
>>> I'm not sure. For the second case we need to pack allocations from different
>>> tasks/cgroups into a small number of shared pages. That means the front cache
>>> should be really small/non-existent. For the first case we likely need a
>>> substantial cache. Maybe we can do something really smart with scattering
>>> the cache over multiple pages, but I really doubt it.
>>
>> I think we need to prototype this to evaluate it sensibly. Let me know if
>> you want to take a stab at it; otherwise I can try.
>
> If you have time and are ready to jump in, please go ahead. Otherwise I can start
> working on it in a few weeks. In any case, I'm happy to help with discussions, code
> reviews & whatever else I can do.

(It seems I'm looking at the dubious achievement of having my upstream patch
reverted personally by Linus.)
Please keep me informed about these tasks too. Unfortunately I cannot help right now;
however, perhaps we will need to backport these patches to OpenVZ kernels.

Thank you,
Vasily Averin

2022-05-02 04:19:00

by Michal Koutný

Subject: Re: [memcg] 0f12156dff: will-it-scale.per_process_ops -33.6% regression

Hello.

On Wed, Sep 08, 2021 at 11:13:57AM +0300, Vasily Averin <[email protected]> wrote:
> Please keep me informed about these tasks too. Unfortunately I cannot help right now;
> however, perhaps we will need to backport these patches to OpenVZ kernels.

Is there any progress on efficient charging for the POSIX lock caches?
(Basically, I'm just asking whether there's anything to build on now
beyond the earlier theoretical idea.)

Some people attach security weight to this [1].

Thanks,
Michal

[1] https://github.com/kata-containers/kata-containers/issues/3373