by John Johansen

[permalink] [raw]

Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 5/30/24 02:47, Mateusz Guzik wrote:
> On Thu, May 30, 2024 at 7:59 AM John Johansen
> <[email protected]> wrote:
>>
>> On 5/29/24 21:19, Neeraj Upadhyay wrote:
>>> Hi John,
>>>
>>> Thanks for taking a look at the series!
>>>
>>> On 5/29/2024 6:07 AM, John Johansen wrote:
>>>> On 5/28/24 06:29, Mateusz Guzik wrote:
>>>>> On Fri, May 24, 2024 at 11:52 PM John Johansen
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 5/24/24 14:10, Mateusz Guzik wrote:
>>>>>>> On Fri, Mar 8, 2024 at 9:09 PM John Johansen
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 3/2/24 02:23, Mateusz Guzik wrote:
>>>>>>>>> On 2/9/24, John Johansen <[email protected]> wrote:
>>>>>>>>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>>>>>>>>>>> Gentle ping.
>>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> Could you please confirm that:
>>>>>>>>>>>
>>>>>>>>>>> a. The AppArmor refcount usage described in the RFC is correct?
>>>>>>>>>>> b. Approach taken to fix the scalability issue is valid/correct?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Neeraj,
>>>>>>>>>>
>>>>>>>>>> I know your patchset has been waiting on review for a long time.
>>>>>>>>>> Unfortunately I have been very, very busy lately. I will try to
>>>>>>>>>> get to it this weekend, but I can't promise that I will be able
>>>>>>>>>> to get the review fully done.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gentle prod.
>>>>>>>>>
>>>>>>>>> Any chances of this getting reviewed in the foreseeable future? Would
>>>>>>>>> be a real bummer if the patchset fell through the cracks.
>>>>>>>>>
>>>>>>>>
>>>>>>>> yes, sorry I have been unavailable for the last couple of weeks. I am
>>>>>>>> now back, I have a rather large backlog to try catching up on but this
>>>>>>>> is has an entry on the list.
>>>>>>>>
>>>>>>>
>>>>>>> So where do we stand here?
>>>>>>>
>>>>>> sorry I am still trying to dig out of my backlog, I will look at this,
>>>>>> this weekend.
>>>>>>
>>>>>
>>>>> How was the weekend? ;)
>>>>>
>>>>
>>>> lets say it was busy. Have I looked at this, yes. I am still digesting it.
>>>> I don't have objections to moving towards percpu refcounts, but the overhead
>>>> of a percpu stuct per label is a problem when we have thousands of labels
>>>> on the system. That is to say, this would have to be a config option. We
>>>> moved buffers from kmalloc to percpu to reduce memory overhead to reduce
>>>> contention. The to percpu, to a global pool because the percpu overhead was
>>>> too high for some machines, and then from a global pool to a hybrid scheme
>>>> because of global lock contention. I don't see a way of doing that with the
>>>> label, which means a config would be the next best thing.
>>>>
>>>
>>> For the buffers, what was the percpu overhead roughly? For
>>> thousands of labels, I think, the extra memory overhead roughly would
>>> be in the range of few MBs (need to be profiled though). This extra
>>> label overhead would be considered high for the machines where percpu
>>> buffer overhead was considered high?
>>>
>>
>> It of course varies. It was fixed at 2-8K per cpu core depending on the buffer
>> size. So on a 192 cpu machine you we are talking a couple MBs. Obviously more
>> on bigger machines. The problem here is say the percpu refcount while smaller
>> per label, will be more in situations with lots of cpus. Which is fine if that
>> is what it needs to be, but for other use cases tuning it to be smaller would
>> be nice.
>>
>>
>>> Please correct me here, so you are proposing that we use a kconfig to
>>> use either 'struct percpu_ref' or a 'struct kref' (using a union maybe)
>>> inside the 'struct aa_label' and update the refcount operations accordingly?
>>> If yes, I will work on a patch with this kconfig based selection of
>>> refcounting mode to see how it pans out.
>>>
>> possibly, I am still mulling over how we want to approach this
>>
>>> @Mateusz can you share the dynamic switching counter mode patch series please?
>>>
>> yes I am interested in looking at this as well.
>>
>
> https://lore.kernel.org/lkml/[email protected]/
>
>>> In addition, for long term, there is an ongoing work (by Paul, Boqun and myself)
>>> on implementing hazard pointers as a scalable refcounting scheme [1] in kernel,
>>> which would not have memory usage overhead as in percpu refcount. At this point the
>>> API design/implementation is in early prototype stage.
>>>
>>>
>>> [1] https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/edit?usp=sharing
>>
>> okay, I will take a look
>>
>>>
>>>> Not part of your patch but something to be considered is that the label tree
>>>> needs a rework, its locking needs to move to read side a read side lock less
>>>> scheme, and the plan was to make it also use a linked list such that new
>>>> labels are always queued at the end, allowing dynamically created labels to
>>>> be lazily added to the tree.
>>>>
>>>
>>> Read side would be rcu read lock protected in this scheme?
>>> The linked list would store the dynamically created compound labels?
>>> What is the advantage of using this lazy addition to the tree? We optimize
>>> on the label search, addition/deletion for dynamic labels? The lazy addition
>>> to the tree is done when a label find operation on the list succeeds?
>>>
>> there are contexts where we are creating labels, and do not want to wait on
>> some of the longer tree walk profile updates/replacements. If a replacement is
>> on going the idea is to just add the label to the end of a list and let the
>> process that is doing the tree update take the hit of inserting and rebalancing
>> the tree.
>>
>>
>>>> I see the use of the kworker as problematic as well, especially if we are
>>>> talking using kconfig to switch reference counting modes. I am futzing with
>>>> some ideas, on how to deal with this.
>>>>
>>>
>>> We can disable queuing of label reclaim work for non-percpu case?
>>>
>> maybe, I am pondering ways we can deal with this. I have been pondering the
>> if we might be able to leverage a seqlock here, but I will also take a look
>> at hazard pointers.
>>
>
> Since there is some elaborate talk going about this, let me throw in
> my own $0,03 -- I may happen to have a simple solution which will sort
> it out and it boils down to storing local ref updates in task_struct.
>
> Here is the context: creds are always refed and unrefed when creating
> and destroying a file object. Should you have one instance of
> credentials for a given user across different processes they would
> serialize on updating the ref. Linux mostly dodges the problem by
> always creating a copy on fork, thus only serializing within threads
> of a given process. Even then that induces avoidable overhead if only
> from single-threaded standpoint -- that's 2 atomics slapped for every
> open/close cycle.
>
so the apparmor label can and will update beyond the open/close cycle.
Yes they are used in the cred as well but, for more than that. The
apparmor label on file can be updated by other tasks, for various
reasons.

> $elsewhere I devised an idea where cred ref updates to creds matching
> current->cred only modify a local counter. They get rolled up when
> current->creds is changed. That is to say there is 0 atomics or
> modifying memory seen by other cpus as long as the process does not
> change creds, which almost never happens compared to how often refing
> and unrefing is implemented.
>
right, we do something like this for the task cred with a crit section
marked out by

label = begin_current_label_crit_section()

end_current_label_crit_section(label);

if everything works out, no reference counts are taken. The purpose
of the fns is to deal with the cases where for one reason or another
a refcount needs to be taken (generally because of live policy
replacement, and the task hasn't been able to update its cred yet).

> In struct cred apart from regular refs you would have "user" counter
> and "distributed" counter. switching user to > 0 grabs a normal ref on
> creds, the value of the "distributed" counter is arbitrary as long as
> user > 0. users going back to 0 means we can release the special ref
> held for that purpose.
>
So I don't see how this will generally help for labels which exist
on many different objects.

> I was considering implementing this for Linux. In my original code all
> cred handling is augmented like this, but for Linux one could add a
> dedicated get_cred_localref or similar machinery.
>

sure, but I am not sure its needed. The rules for task creds is only
task can update its cred. The task can look at its cred and do most
things without having to take a count. Most cred refs should just
be being taken for objects.

> Skimming through apparmor suggests the bit which does cause
> performance problems can be sorted out in the same manner.
>
I don't see it. The file cred is very much updated live, async to
the task cred. And while currently it always starts as the task
cred, that won't even be true much longer.

> Maybe i'll hack it up as a demo just for apparmor.
>
> This would come with some extra space usage in task_struct which on
> the surface may sounds like a non-starter. However, should you take a
> look at the struct with pahole you will find it is riddles with
> padding. If I wanted to I could add all fields I need to the struct
> and not grow it on LP64.
>