2024-01-10 11:11:54

by Neeraj Upadhyay

Subject: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

Problem Statement
=================

Nginx performance testing with AppArmor enabled (with nginx
running in the unconfined profile), on kernel versions 6.1 and 6.5,
shows a significant drop in throughput scalability when Nginx
workers are scaled to a higher number of CPUs across various
L3 cache domains.

Below is one sample of the throughput scalability data, based on
results on an AMD Zen4 system with 96 cores and SMT count 2; so,
overall, 192 CPUs:

Config     Cache Domains   apparmor=off      apparmor=on
                           scaling eff (%)   scaling eff (%)
8C16T            1             100%              100%
16C32T           2              95%               94%
24C48T           3              94%               93%
48C96T           6              92%               88%
96C192T         12              85%               68%

Looking at the above data, there is a significant drop in
scaling efficiency when we move to 96 CPUs/192 SMT threads.

The perf tool shows most of the contention coming from the functions below:
6.56% nginx [kernel.vmlinux] [k] apparmor_current_getsecid_subj
6.22% nginx [kernel.vmlinux] [k] apparmor_file_open

The majority of the CPU cycles are found to be spent on memory contention
in the atomic_fetch_add() and atomic_fetch_sub() operations from kref_get()
and kref_put() calls on the label.

Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
which is part of 6.7, alleviates the issue to an extent, but not completely:

Config     Cache Domains   apparmor=on       apparmor=on (patched)
                           scaling eff (%)   scaling eff (%)
8C16T            1             100%              100%
16C32T           2              97%               93%
24C48T           3              94%               92%
48C96T           6              88%               88%
96C192T         12              65%               79%

This adverse impact gets more pronounced when we move to >192 CPUs.
The memory contention and its impact increase when there is a high
frequency of label update operations and labels are marked stale
more frequently.


Label Refcount Management
=========================

Apparmor uses label objects (struct aa_label) to manage refcounts for
the below set of objects:

- Applicable profiles
- Namespaces (unconfined profile)
- Other non-profile references

These label references are acquired in various apparmor LSM hooks,
for operations such as file open, task kill, socket bind, and other
file, socket, and misc operations which use the current task's cred,
when the label for the current cred has been marked stale. This is
done to check these operations against the set of operations allowed
for the task performing them.
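
For illustration, below is a rough sketch (not verbatim kernel code;
the permission-check helper name is a placeholder) of how an LSM hook
ends up holding a label reference:

    static int example_file_open_hook(struct file *file)
    {
            struct aa_label *label;
            int error;

            /*
             * Returns the current cred's label; if that label has been
             * marked stale, aa_get_newest_label() takes a reference on
             * the replacement label through the proxy.
             */
            label = begin_current_label_crit_section();

            /* Check the operation against the policy carried by the label. */
            error = example_check_perm(label, file);        /* placeholder */

            /* Drops the reference, if one was taken above. */
            end_current_label_crit_section(label);

            return error;
    }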

Use Percpu refcount for ref management?
=======================================

The ref put operations (percpu_ref_put()) of a percpu refcount,
in active (percpu) mode, do not check whether the ref count has
dropped to 0. Users of the percpu_ref need to explicitly invoke
a percpu_ref_kill() operation in their shutdown paths, to drop the
initial reference. After the percpu_ref_kill() operation, the ref
switches to atomic mode, and any new percpu_ref_put() operation
checks for the drop to 0 and invokes the release operation on
that label.
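
For reference, a minimal sketch of the stock percpu_ref lifecycle
described above (the object and release-function names are
illustrative, not taken from the apparmor code):

    struct obj {
            struct percpu_ref count;
            struct rcu_head rcu;
    };

    static void obj_release(struct percpu_ref *ref)
    {
            struct obj *o = container_of(ref, struct obj, count);

            kfree_rcu(o, rcu);
    }

    /* Init: the ref starts live, in percpu mode, holding the initial ref. */
    if (percpu_ref_init(&o->count, obj_release, 0, GFP_KERNEL))
            return -ENOMEM;

    /* Hot paths: no drop-to-0 check while the ref is in percpu mode. */
    percpu_ref_get(&o->count);
    percpu_ref_put(&o->count);

    /*
     * Shutdown path: drop the initial ref. The ref switches to atomic
     * mode, and subsequent percpu_ref_put() calls check for 0 and
     * invoke obj_release() on the final put.
     */
    percpu_ref_kill(&o->count);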

Labels are marked stale in situations such as profile removal and
profile updates. For a namespace, the unconfined label reference
is dropped when the namespace is destroyed. These points
are potential shutdown points for labels. However, killing
the percpu ref from these points has a few issues:

- The label could still be referenced by tasks which are
still holding a reference to the now-stale label.
Killing the label ref while these operations are in progress
makes all subsequent ref-put operations on the stale label
atomic, which would still result in memory contention.
Also, any new reference to the stale label, which is acquired
with the elevated refcount, will have atomic op contention.

- The label is marked stale using a non-atomic write operation.
It is possible that new operations do not observe this flag
and still reference it for quite some time.

- Explicitly tracking the shutdown points might not be maintainable
at a per-label granularity, as there are various paths where the
label reference could get dropped, such as before the label has
gone live (object initialization error paths). Also, tracking
the shutdown points for labels which reference other labels
(subprofiles, merged labels) requires careful analysis, and adds
a heavy burden of ensuring that memory contention is not introduced
by these ref kill points.


Proposed Solution
=================

One potential solution to the refcount scalability problem is to
convert the label refcount to a percpu refcount, and manage
the initial reference from a kworker context. The kworker
keeps an extra reference to the label, periodically scans the
labels, and releases them if their refcount has dropped to 0.

Below is the sequence of operations, which shows the refcount
management with this approach:

1. During label initialization, the percpu ref is initialized in
atomic mode. This is done to handle the cases where the
label hasn't gone live (->ns isn't assigned), mostly in
initialization error paths.

2. Labels are switched to percpu mode at various points -
when a label is added to the labelset tree, and when an unconfined
profile has been assigned a namespace.

3. As part of the initial prototype, only the in-tree labels
are managed by the kworker. These labels are added to a lockless
list. The unconfined labels invoke a percpu_ref_kill() operation
when the namespace is destroyed.

4. The kworker does a periodic scan of all the labels in the
llist. It does the below sequence of operations:

a. Enqueue a dummy node to mark the start of the scan. This dummy
node is used as the start point of the scan and ensures that
there is no additional synchronization required with new
label node additions to the llist. Any new labels will
be processed in the next run of the kworker.

                       SCAN START PTR
                             |
                             v
 +----------+            +-------+      +-------+      +-------+
 |          |            |       |      |       |      |       |
 |   head   |----------->| dummy |----->| label |----->| label |--->NULL
 |          |            | node  |      |       |      |       |
 +----------+            +-------+      +-------+      +-------+


New label addition:

                                          SCAN START PTR
                                                |
                                                v
 +----------+      +-------+             +-------+      +-------+      +-------+
 |          |      |       |             |       |      |       |      |       |
 |   head   |----->| label |------------>| dummy |----->| label |----->| label |--->NULL
 |          |      |       |             | node  |      |       |      |       |
 +----------+      +-------+             +-------+      +-------+      +-------+

b. Traverse the llist, starting from dummy->next.
If the node is a dummy node, mark it free.
If the node is a label node, do:

i) Switch the label ref to atomic mode. The ref switch waits
for the existing percpu_ref_get() and percpu_ref_put()
operations to complete, by waiting for an RCU grace period.

Once the switch is complete, from this point onwards, any
percpu_ref_get() and percpu_ref_put() operations use
atomic operations.

ii) Drop the initial reference, which was taken while adding
the label node to the llist.

iii) Use a percpu_ref_tryget() increment operation on the
ref, to see if we dropped the last ref count. If we
dropped the last count, we remove the node from the llist.

All of these operations are done inside an RCU critical
section, to avoid a race with the release operation,
which can potentially trigger as soon as we drop
the initial ref count.

iv) If we didn't drop the last ref, switch the counter back
to percpu mode. (A condensed code sketch of steps i)-iv)
is shown below.)
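
In condensed form (patch 4 contains the actual worker; prev/pos are
the llist traversal cursors there), steps i)-iv) for one label look
roughly like:

    percpu_ref_switch_to_atomic_sync(&label->count);  /* i) waits for an RCU GP */

    rcu_read_lock();
    percpu_ref_put(&label->count);             /* ii) drop the ref taken at llist add */
    held = percpu_ref_tryget(&label->count);   /* iii) was that the last reference? */
    if (!held)
            prev->next = pos->next;            /* release triggered; unlink the node */
    rcu_read_unlock();

    if (held)
            percpu_ref_switch_to_percpu(&label->count);  /* iv) keep it in percpu mode */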

Using this approach to move the atomic refcount manipulation out of the
contended paths, there is a significant scalability improvement seen in
the nginx test, and the scaling efficiency is close to the apparmor=off case:

Config     Cache Domains   apparmor=on (percpuref)
                           scaling eff (%)
8C16T            1             100%
16C32T           2              96%
24C48T           3              94%
48C96T           6              93%
96C192T         12              90%

Limitations
===========

1. Switching to percpu refcount increases the memory overhead, as
percpu memory is allocated for all labels.

2. Deferring label reclaim could potentially result in memory
pressure, when there is a high frequency of label update operations.

3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
These can impact energy efficiency, due to back-to-back hurry
callbacks. Using a deferrable workqueue partly mitigates this.
However, deferring the kworker can delay reclaims.

4. Back-to-back label switches can delay other percpu_ref users, as
there is a single global switch spinlock used by the percpu refcount
lib.

5. A long-running kworker can delay other use cases like system suspend.
This is mitigated by using a freezable workqueue and limiting node
scans to a max count.

6. There is a window where the label operates in atomic mode, while its
counter is being checked for zero. This can potentially result
in high memory contention during this window, which spans an RCU
grace period (plus callback execution). For example, while
scanning the label corresponding to the unconfined profile, all
applications which use the unconfined profile would be using
atomic ref increment and decrement operations.

There are a few approaches which were tried to mitigate this issue:

a. At a lower time interval, check whether the scanned label's counter
has changed since the start of the label scan. If there is a change
in the count, terminate the switch to atomic mode. Below is the
approach using rcuwait:

    static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
    {
            WRITE_ONCE(aa_atomic_switch_complete, true);
            rcuwait_wake_up(&aa_reclaim_rcuwait);
    }

    rcuwait_init(&aa_reclaim_rcuwait);
    percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);

    atomic_count = percpu_ref_count_read(&label->count);
    do {
            rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
                    (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
                    TASK_IDLE,
                    msecs_to_jiffies(5));
            if (percpu_ref_count_read(&label->count) != atomic_count)
                    break;
    } while (!READ_ONCE(switch_complete));

However, this approach does not work, as the percpu refcount lib does not
allow terminating an ongoing switch operation. Also, the counter
can return to its original value with a set of get() and put() operations
before we check the current value.

b. Approaches which notify the reclaim kworker from the ref get and put
operations can potentially disturb the cache line state between the
various CPU contexts which are referencing the label, and can impact
scalability again.

c. Switch the label to an immortal percpu ref, while the scan operates
on the current counter.

Below is the sequence of operations to do this:

1. Ensure that both the immortal ref and the label ref are in percpu mode.
Reinit the immortal ref in percpu mode.

Swap the percpu and atomic counters of the label refcount and the immortal ref:
                      percpu-ref
                 +-------------------+
 +-------+       | percpu-ctr-addr1  |
 | label |------>|-------------------|      +----------------+
 +-------+       |       data        |----->| Atomic counter1|
                 +-------------------+      +----------------+

 +-------+       +-------------------+
 |ImmLbl |------>| percpu-ctr-addr2  |      +----------------+
 +-------+       |-------------------|----->| Atomic counter2|
                 |       data        |      +----------------+
                 +-------------------+

label ->percpu-ctr-addr = percpu-ctr-addr2
ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
label ->data->count = Atomic counter2
ImmLbl ->data->count = Atomic counter1


2. Check the counters collected in the immortal label, by switching it
to atomic mode.

3. If the count is 0, do:
a. Switch the immortal counter to percpu again, giving it an
initial count of 1.
b. Swap the label and immortal counters again. The immortal
ref now has the counter values from the new percpu ref get
and put operations on the label ref, from the point
when we did the initial swap operation.
c. Transfer the percpu counts in the immortal ref to the atomic
counter of the label percpu refcount.
d. Kill the immortal ref, for reinit on the next iteration.
e. Switch the label percpu ref to atomic mode.
f. If the counter is 1, drop the initial ref.

4. If the count is not 0, re-swap the counters:
a. Switch the immortal counter to percpu again, giving it an
initial count of 1.
b. Swap the label and immortal counters again. The immortal
ref now has the counter values from the new percpu ref get
and put operations on the label ref, from the point
when we did the initial swap operation.
c. Transfer the percpu counts in the immortal ref to the atomic
counter of the label percpu refcount.
d. Kill the immortal ref, for reinit on the next iteration.


Using this approach, we ensure that label ref users do not switch
to atomic mode while there are active references on the label.
However, this approach requires multiple percpu ref mode switches
and adds high overhead and complexity to the scanning code.
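
For reference, a condensed sketch of one scan iteration of approach c.,
using the swap/transfer helpers introduced in patch 6 (the complete
sequence, including the list unlinking, is in that patch):

    /*
     * Swap counters: the label now counts from a fresh percpu counter,
     * while the immortal ref holds the label's accumulated counts.
     */
    percpu_ref_swap_percpu_sync(&label->count, &aa_label_reclaim_ref);

    /* Collapse the collected counts and check them for zero. */
    percpu_ref_switch_to_atomic_sync(&aa_label_reclaim_ref);
    percpu_ref_put(&aa_label_reclaim_ref);     /* drop the llist reference */
    ref_is_zero = percpu_ref_is_zero(&aa_label_reclaim_ref);
    percpu_ref_get(&aa_label_reclaim_ref);     /* restore the initial count */

    /*
     * Swap back, fold the gets/puts that raced with the scan into the
     * label's atomic counter, and kill the immortal ref for reinit.
     */
    percpu_ref_switch_to_percpu(&aa_label_reclaim_ref);
    percpu_ref_swap_percpu_sync(&label->count, &aa_label_reclaim_ref);
    percpu_ref_transfer_percpu_count(&label->count, &aa_label_reclaim_ref);
    percpu_ref_kill(&aa_label_reclaim_ref);

    if (ref_is_zero) {
            /* Only now switch the label itself to atomic mode and drop
             * its initial reference (steps 3.e and 3.f above). */
    }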

Extended/Future Work
====================

1. Look for ways to fix the limitations, as described in the "Limitations"
section.

2. Generalize the approach to percpu rcuref, which is used for contexts
where release path uses RCU grace period for release operations. Patch
7 creates an initial prototype for this.

3. Explore hazard pointers for scalable refcounting of labels.

Highly appreciate any feedback/suggestions on the design approach.

The patches in this patchset introduce the following changes:

1. Documentation of Apparmor Refcount management.

2. Switch labels to percpu refcount in atomic mode.

Use percpu refcount for apparmor labels. Initial patch to init
the percpu ref in atomic mode, to evaluate the potential
impact of percpu ref on top of the kref-based implementation.

3. Switch unconfined namespaces refcount to percpu mode.

Switch unconfined ns labels to percpu mode, and kill the
initial refcount from namespace destroy path.

4. Add infrastructure to reclaim percpu labels.

Add a label reclaim infrastructure for labels which are
in percpu mode, for managing their initial refcount.

5. Switch intree labels to percpu mode.

Use the label reclaim infrastructure to manage in-tree labels.

6. Initial prototype for optimizing ref switch.

Prototype for reducing the time window when a label
scan switches the label ref to atomic mode.

7. percpu-rcuref: Add basic infrastructure.

Prototype for Percpu refcounts for objects, which protect
their object reclaims using RCU grace period.

8. Switch labels to percpu rcurefcount in unmanaged mode.

Use percpu rcuref for labels. Start with unmanaged/atomic
mode.

9. Switch unconfined and in tree labels to managed ref mode.

Use percpu mode with manager worker for unconfined and intree
labels.


------------------------------------------------------------------------

b/Documentation/admin-guide/LSM/ApparmorRefcount.rst | 351 ++++++++++++++++++++++++++++++++++++++++++++++++++
b/Documentation/admin-guide/LSM/index.rst | 1
b/Documentation/admin-guide/kernel-parameters.txt | 8 +
b/include/linux/percpu-rcurefcount.h | 115 ++++++++++++++++
b/include/linux/percpu-refcount.h | 2
b/lib/Makefile | 2
b/lib/percpu-rcurefcount.c | 336 +++++++++++++++++++++++++++++++++++++++++++++++
b/lib/percpu-refcount.c | 93 +++++++++++++
b/security/apparmor/include/label.h | 16 +-
b/security/apparmor/include/policy.h | 8 -
b/security/apparmor/include/policy_ns.h | 24 +++
b/security/apparmor/label.c | 11 +
b/security/apparmor/lsm.c | 145 ++++++++++++++++++++
b/security/apparmor/policy_ns.c | 6
include/linux/percpu-refcount.h | 2
lib/percpu-refcount.c | 93 -------------
security/apparmor/include/label.h | 17 +-
security/apparmor/include/policy.h | 56 +++----
security/apparmor/include/policy_ns.h | 24 ---
security/apparmor/label.c | 11 -
security/apparmor/lsm.c | 325 ++++++++++++----------------------------------
security/apparmor/policy_ns.c | 8 -
22 files changed, 1237 insertions(+), 417 deletions(-)

base-commit: ab27740f7665


2024-01-10 11:21:07

by Neeraj Upadhyay

Subject: [RFC 2/9] apparmor: Switch labels to percpu refcount in atomic mode

In preparation for using percpu refcount for labels,
this patch replaces the label kref with a percpu refcount.
The percpu ref is initialized in atomic mode, as using
percpu mode requires tracking ref kill points. As the
atomic counter is in a different cacheline now, rearrange
some of the fields (flags, proxy) to optimize some of the
fast paths for unconfined labels.

In addition to the requirement to clean up the percpu
ref using percpu_ref_exit() in the label destruction path,
other potential impacts of this patch are:

- Increase in memory requirement (for per-cpu counters)
  for each label.

- Displacement of aa_label struct members to a different
  cacheline, as the percpu ref takes the space of 2 pointers.

- Moving of the atomic counter outside of the cacheline
  of the aa_label struct.

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
security/apparmor/include/label.h | 16 ++++++++--------
security/apparmor/include/policy.h | 8 ++++----
security/apparmor/label.c | 11 ++++++++---
3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/security/apparmor/include/label.h b/security/apparmor/include/label.h
index 2a72e6b17d68..4b29a4679c74 100644
--- a/security/apparmor/include/label.h
+++ b/security/apparmor/include/label.h
@@ -121,12 +121,12 @@ struct label_it {
* @ent: set of profiles for label, actual size determined by @size
*/
struct aa_label {
- struct kref count;
+ struct percpu_ref count;
+ long flags;
+ struct aa_proxy *proxy;
struct rb_node node;
struct rcu_head rcu;
- struct aa_proxy *proxy;
__counted char *hname;
- long flags;
u32 secid;
int size;
struct aa_profile *vec[];
@@ -276,7 +276,7 @@ void __aa_labelset_update_subtree(struct aa_ns *ns);

void aa_label_destroy(struct aa_label *label);
void aa_label_free(struct aa_label *label);
-void aa_label_kref(struct kref *kref);
+void aa_label_percpu_ref(struct percpu_ref *ref);
bool aa_label_init(struct aa_label *label, int size, gfp_t gfp);
struct aa_label *aa_label_alloc(int size, struct aa_proxy *proxy, gfp_t gfp);

@@ -373,7 +373,7 @@ int aa_label_match(struct aa_profile *profile, struct aa_ruleset *rules,
*/
static inline struct aa_label *__aa_get_label(struct aa_label *l)
{
- if (l && kref_get_unless_zero(&l->count))
+ if (l && percpu_ref_tryget(&l->count))
return l;

return NULL;
@@ -382,7 +382,7 @@ static inline struct aa_label *__aa_get_label(struct aa_label *l)
static inline struct aa_label *aa_get_label(struct aa_label *l)
{
if (l)
- kref_get(&(l->count));
+ percpu_ref_get(&(l->count));

return l;
}
@@ -402,7 +402,7 @@ static inline struct aa_label *aa_get_label_rcu(struct aa_label __rcu **l)
rcu_read_lock();
do {
c = rcu_dereference(*l);
- } while (c && !kref_get_unless_zero(&c->count));
+ } while (c && !percpu_ref_tryget(&c->count));
rcu_read_unlock();

return c;
@@ -442,7 +442,7 @@ static inline struct aa_label *aa_get_newest_label(struct aa_label *l)
static inline void aa_put_label(struct aa_label *l)
{
if (l)
- kref_put(&l->count, aa_label_kref);
+ percpu_ref_put(&l->count);
}


diff --git a/security/apparmor/include/policy.h b/security/apparmor/include/policy.h
index 75088cc310b6..5849b6b94cea 100644
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@ -329,7 +329,7 @@ static inline aa_state_t ANY_RULE_MEDIATES(struct list_head *head,
static inline struct aa_profile *aa_get_profile(struct aa_profile *p)
{
if (p)
- kref_get(&(p->label.count));
+ percpu_ref_get(&(p->label.count));

return p;
}
@@ -343,7 +343,7 @@ static inline struct aa_profile *aa_get_profile(struct aa_profile *p)
*/
static inline struct aa_profile *aa_get_profile_not0(struct aa_profile *p)
{
- if (p && kref_get_unless_zero(&p->label.count))
+ if (p && percpu_ref_tryget(&p->label.count))
return p;

return NULL;
@@ -363,7 +363,7 @@ static inline struct aa_profile *aa_get_profile_rcu(struct aa_profile __rcu **p)
rcu_read_lock();
do {
c = rcu_dereference(*p);
- } while (c && !kref_get_unless_zero(&c->label.count));
+ } while (c && !percpu_ref_tryget(&c->label.count));
rcu_read_unlock();

return c;
@@ -376,7 +376,7 @@ static inline struct aa_profile *aa_get_profile_rcu(struct aa_profile __rcu **p)
static inline void aa_put_profile(struct aa_profile *p)
{
if (p)
- kref_put(&p->label.count, aa_label_kref);
+ percpu_ref_put(&p->label.count);
}

static inline int AUDIT_MODE(struct aa_profile *profile)
diff --git a/security/apparmor/label.c b/security/apparmor/label.c
index c71e4615dd46..aa9e6eac3ecc 100644
--- a/security/apparmor/label.c
+++ b/security/apparmor/label.c
@@ -336,6 +336,7 @@ void aa_label_destroy(struct aa_label *label)
rcu_assign_pointer(label->proxy->label, NULL);
aa_put_proxy(label->proxy);
}
+ percpu_ref_exit(&label->count);
aa_free_secid(label->secid);

label->proxy = (struct aa_proxy *) PROXY_POISON + 1;
@@ -369,9 +370,9 @@ static void label_free_rcu(struct rcu_head *head)
label_free_switch(label);
}

-void aa_label_kref(struct kref *kref)
+void aa_label_percpu_ref(struct percpu_ref *ref)
{
- struct aa_label *label = container_of(kref, struct aa_label, count);
+ struct aa_label *label = container_of(ref, struct aa_label, count);
struct aa_ns *ns = labels_ns(label);

if (!ns) {
@@ -408,7 +409,11 @@ bool aa_label_init(struct aa_label *label, int size, gfp_t gfp)

label->size = size; /* doesn't include null */
label->vec[size] = NULL; /* null terminate */
- kref_init(&label->count);
+ if (percpu_ref_init(&label->count, aa_label_percpu_ref, PERCPU_REF_INIT_ATOMIC, gfp)) {
+ aa_free_secid(label->secid);
+ return false;
+ }
+
RB_CLEAR_NODE(&label->node);

return true;
--
2.34.1


2024-01-10 11:21:27

by Neeraj Upadhyay

Subject: [RFC 1/9] doc: Add document for apparmor refcount management

Add a document to describe refcount management of AppArmor
kernel objects.

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
.../admin-guide/LSM/ApparmorRefcount.rst | 351 ++++++++++++++++++
Documentation/admin-guide/LSM/index.rst | 1 +
2 files changed, 352 insertions(+)
create mode 100644 Documentation/admin-guide/LSM/ApparmorRefcount.rst

diff --git a/Documentation/admin-guide/LSM/ApparmorRefcount.rst b/Documentation/admin-guide/LSM/ApparmorRefcount.rst
new file mode 100644
index 000000000000..8132f1b661bb
--- /dev/null
+++ b/Documentation/admin-guide/LSM/ApparmorRefcount.rst
@@ -0,0 +1,351 @@
+============================
+AppArmor Refcount Management
+============================
+
+Introduction
+============
+
+AppArmor task confinement is based on task profiles. Task profiles
+specify the access rules - list of files which the task is allowed
+to open/read/write, socket bind, mount, signal, ptrace and other
+capabilities of the task.
+
+A sample raw task profile (typically present in /etc/apparmor.d/)
+would look like this:
+
+::
+
+ 1. profile test_app /usr/bin/test_app {
+ 2. @{sys}/devices/** r,
+ 3. /var/log/testapp/access.log w,
+ 4. /var/log/testapp/error.log w,
+ 5.
+ 6. /lib/ld-*.so* mrix,
+ 7.
+ 8. ^hat {
+ 9. /dev/pts/* rw,
+ 10. }
+ 11.
+ 12. change_profile -> restricted_access_profile,
+ 13.
+ 14. /usr/*bash cx -> local_profile,
+ 15.
+ 16. profile local_profile {
+ 17. ...
+ 18. }
+ 19. }
+
+
+The above example shows a sample profile. A quick description of
+each line is given below:
+
+1 Defines a profile with name ``test_app`` and attachment specification
+ ``/usr/bin/test_app``. The attachment specification is used for
+ associating the application with a profile, during launch.
+
+2,3, 4
+ Specifies read and write access to various paths.
+
+6 Read access for the shared object (.so) and inherit profile transition specification.
+
+8 Hat profile. Used for running a portion of the program with different
+ permissions, compared to the other portions of the program. For example,
+ to run unauthenticated traffic and authenticated traffic in separate
+ profiles in OpenSSH; running user supplied CGI scripts in separate
+ profile in Apache.
+
+12 Change profile rules, to switch child process to a profile, different
+ from the parent process, on exec.
+
+14 Profile transition for processes started from the current process. For
+ example, transition to a different profile for ``ls``, which is invoked
+ from a shell program.
+
+
+Objects and their Refcount Management
+=====================================
+
+There are various object resources within AppArmor
+
+- Namespaces
+
+ There is a root namespace associated for apparmorfs. This is the default
+ namespace, to which all profiles are associated with.
+
+ Profiles can be associated with a different namespaces (for chroot,
+ containers env).
+
+ Namespaces are represented using ``struct aa_ns``. Some of the relevant
+ fields are::
+
+ struct aa_policy base
+ struct aa_profile *unconfined
+ struct list_head sub_ns
+ struct aa_ns *parent
+
+ ``struct aa_policy`` contains a list of profiles associated with this ns.
+
+ ``unconfined`` profile manages refcount for this namespace. It is also
+ used as the default profile for tasks in this namespace and a proxy label,
+ when profiles are removed.
+
+ ``sub_ns`` is the list of child namespaces.
+
+ ``parent`` Parent namespace, for this namespace.
+
+ A parent and its child sub namespaces keep reference to each other::
+
+ +---------------------+
+ | |
+ | root_ns |
+ | |
+ +---------------------+
+ ^ / \ ^
+ / / \ \
+ parent / / \ \ parent
+ / / sub_ns \ \
+ / / \ \
+ / / \ \
+ / v v \
+ +-----------+ +-----------+
+ | | | |
+ | ns1 | | ns2 |
+ | | | |
+ +-----------+ +-----------+
+
+ Here, ``root_ns`` is the root apparmor namespace. It maintains a
+ reference to all child namespace which are present in ``->sub_ns``.
+ The child namespaces ``ns1``, ``ns2`` keep a reference to their
+ ``parent``, which is the ``root_ns``.
+
+
+- Profiles
+
+ Profiles are represented as ``struct aa_profile``
+
+ Some of the fields of interest are::
+
+ struct aa_policy base
+ struct aa_profile __rcu *parent
+ struct aa_ns *ns
+ struct aa_loaddata *rawdata
+ struct aa_label label
+
+ ``base`` - Maintains the list of child subprofiles - hats
+
+ ``parent`` - If subprofile, pointer to the parent profile
+
+ ``ns`` - Parent namespace
+
+ ``rawdata`` - Used for matching profile data, for profile updates
+
+ ``label`` - Refcount object
+
+ A profile keeps a reference to the namespace it is associated with.
+ In addition, there is a reference kept for all profiles in
+ ``base.profiles`` of a namespace::
+
+ +-----------------------------+
+ | |
+ | root_ns |
+ | |
+ +-----------------------------+
+ ^ / ^ |
+ / / ns | |
+ parent / / | |
+ / / sub_ns | |base.profiles
+ / / | |
+ / / | |
+ / v | v
+ +-----------+ +-----------+
+ | | | |
+ | ns1 | | A |
+ | | | |
+ +-----------+ +-----------+
+ base | ^
+ .profiles| | parent
+ v |
+ +-----------+
+ | |
+ | P |
+ | |
+ +-----------+
+
+ For subprofiles, a refcount is kept for the ``->parent`` profile.
+ For each child in ``->base.profiles``, a refcount is kept::
+
+ +--------------+
+ | |
+ | root_ns |
+ | |
+ +-------^------+
+ base. | |
+ profiles v |ns
+ +---------------+
+ | |
+ ^| A |^
+ parent / | | \parent
+ / +---------------+ \
+ / / base.profiles\ \
+ / / v \
+ +------v-+ +----\---+
+ | | | |
+ | B | | C |
+ | | | |
+ +--------+ +--------+
+
+
+- Labels
+
+ Label manages refcount for the ``aa_profile`` objects. It is
+ represented as ``struct aa_label``. Some of the fields are::
+
+ struct kref count
+ struct aa_proxy *proxy
+ long flags
+ int size
+ struct aa_profile *vec[]
+
+ ``count`` - Refcounting for the enclosing object.
+ ``proxy`` - Redirection of stale profiles
+ ``flags`` - state (STALE), type (NS, PROFILE)
+ ``vec`` - if ``size`` > 1, for compound labels (for stacked profiles)
+
+
+ For compound/stack labels, there is a reference kept, for all
+ the stack profiles::
+
+ +--------+ +---------+ +-------+
+ | A | | B | | C |
+ | | | | | |
+ +-----^--+ +---------+ +-------+
+ ^ \ ^ ^
+ \ \ | |
+ \ \ +---------------+ |
+ \ \ | A//&:ns1:B | |
+ \ \| | |
+ \ +---------------+ |
+ \ |
+ \ |
+ \ +-------------------+
+ \|A//&:ns1:B//&:ns2:C|
+ | |
+ +-------------------+
+
+- Proxy
+
+ A proxy is associated with a label, and is used for redirecting
+ running tasks to new profile on profile change. Proxies are
+ represented as ``struct aa_proxy``::
+
+ struct aa_label __rcu *label
+
+ ``label`` - Redirect to the latest profile label.
+
+ While a label is not stale, its proxy points to the same label.
+ On profile updates, the label is marked as stale, and the proxy
+ is redirected to the new profile label::
+
+ +------------+ +-----------+
+ | | | |
+ | old | -------->| P1 |
+ | | <--------| |
+ +------------+ +-----------+
+
+
+ +------------+ +------------+
+ | | | |
+ | old |-------->| P1 |
+ | | | |
+ +------------+ +--^---------+
+ | |
+ +------------+ | |
+ | |-----------/ |
+ | new |<-------------/
+ | |
+ +------------+
+
+Lifecycle of the Apparmor Kernel Objects
+========================================
+
+#. Init
+
+ #. During AppArmor init, root ns (RNS:1) and its unconfined
+ profile are created (RNS:1). If initialization completes
+ successfully, the ``root_ns`` initial ref is never destroyed
+ (?).
+
+ #. Userspace init scripts load the current set of defined profiles
+ into kernel, typically through ``apparmor_parser`` command.
+
+ The loaded profiles have an initial refcount (P1:1 , P2:1).
+ A profile P1, which is in default namespace keeps a reference
+ to root ns (RNS:2). If a profile P2 is in a different namespace,
+ NS1, that namespace object is allocated (NS1:1) and the namespace
+ is added to ``sub_ns`` of ``root_ns`` (NS1:2). The child namespace
+ NS1 keeps a reference to parent ``root_ns`` (RNS:3). P2 keeps a
+ reference to NS1 (NS1:2). The root ns keeps a reference to P1 in
+ ``->profiles`` list (P1:2). NS1 keeps a reference to P2 in its
+ ``->profiles`` list (P2:2). In addition, label proxies keep
+ reference to P1 and P2 (P1:3, P2:3).
+
+#. Runtime
+
+ #. As part of the bprm cred updates (``apparmor_bprm_creds_for_exec()``),
+ the current task T1 is attached to a profile (P1), based on the best
+ attachment match rule. T1 keeps a refcount for P1, while the current
+ ``cred`` is active (P1:4).
+
+ #. If P1 is replaced with a new profile P3, P1 is removed from the root
+ ns profiles list (P1:3), proxy is redirected to P3 (P1:2), and the
+ initial label ref is released (P1:1) and P1's label is marked stale.
+
+ #. Any T1 accesses, which have an apparmor hook, would reference the
+ current task's cred label::
+
+ __begin_current_label_crit_section()
+ struct aa_label *label = cred_label(cred);
+
+ if (label_is_stale(label))
+ label = aa_get_newest_label(label);
+
+ return label;
+
+ aa_get_newest_label(struct aa_label __rcu **l)
+ return aa_get_label_rcu(&l->proxy->label);
+
+ aa_get_label_rcu(struct aa_label __rcu **l)
+ rcu_read_lock();
+ do {
+ c = rcu_dereference(*l);
+ } while (c && !kref_get_unless_zero(&c->count));
+ rcu_read_unlock();
+
+ #. On task exit and cred freeing, the last reference for P1 is
+ released (P1:0).
+
+#. Release
+
+ Below is the set of release operations, based on the label's
+ parent object type.
+
+ #. If ns is not assigned (early init error exit), do not wait for
+ RCU grace period. Otherwise use ``call_rcu()``
+
+ #. If label is associated with a namespace (unconfined label)
+ #. Drop Parent ns reference.
+
+ #. If label is associated with a profile
+ #. Drop parent profile reference.
+ #. Drop ns reference.
+
+ #. Drop all vector profile references for stacked profiles.
+
+
+Links
+=====
+
+Userspace tool - https://gitlab.com/apparmor/apparmor
+ Profile syntax - parser/apparmor.d.pod
+ Sample change hats - changehat/
+ Other documentation - libraries/libapparmor/doc
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst
index a6ba95fbaa9f..c608db9e7107 100644
--- a/Documentation/admin-guide/LSM/index.rst
+++ b/Documentation/admin-guide/LSM/index.rst
@@ -41,6 +41,7 @@ subdirectories.
:maxdepth: 1

apparmor
+ ApparmorRefcount
LoadPin
SELinux
Smack
--
2.34.1


2024-01-10 11:23:00

by Neeraj Upadhyay

Subject: [RFC 4/9] apparmor: Add infrastructure to reclaim percpu labels

Nginx performance testing with AppArmor enabled (with nginx
running in the unconfined profile), on kernel versions 6.1 and 6.5,
shows a significant drop in throughput scalability when Nginx
workers are scaled to a higher number of CPUs across various
L3 cache domains.

Below is one sample of the throughput scalability data, based on
results on an AMD Zen4 system with 96 cores and SMT count 2; so,
overall, 192 CPUs:

Config     Cache Domains   apparmor=off      apparmor=on
                           scaling eff (%)   scaling eff (%)
8C16T            1             100%              100%
16C32T           2              95%               94%
24C48T           3              94%               93%
48C96T           6              92%               88%
96C192T         12              85%               68%

There is a significant drop in scaling efficiency when we
move to 96 CPUs/192 SMT threads.

The perf tool shows most of the contention coming from the functions below:
6.56% nginx [kernel.vmlinux] [k] apparmor_current_getsecid_subj
6.22% nginx [kernel.vmlinux] [k] apparmor_file_open

The majority of the CPU cycles are found to be spent on memory contention
in the atomic_fetch_add() and atomic_fetch_sub() operations from kref_get()
and kref_put() calls on the label.

A part of the contention was removed with commit 2516fde1fa00
("apparmor: Optimize retrieving current task secid"), which
is part of the 6.7-rc1 release. After including this commit, the
scaling efficiency improved as shown below:

Config     Cache Domains   apparmor=on       apparmor=on (patched)
                           scaling eff (%)   scaling eff (%)
8C16T            1             100%              100%
16C32T           2              97%               93%
24C48T           3              94%               92%
48C96T           6              88%               88%
96C192T         12              65%               79%

However, the scaling efficiency impact is still significant even
after including the commit, and the performance impact is even
higher when we move to >192 CPUs. In addition, the memory
contention impact would increase when there is a high frequency
of label update operations and labels are marked stale more
frequently.

This patch adds a mechanism to manage reclaim of apparmor labels
when they are operating in percpu mode. Using a percpu refcount
for the apparmor label refcount helps solve the throughput
scalability drop seen in the nginx test:

Config     Cache Domains   apparmor=on (percpuref)
                           scaling eff (%)
8C16T            1             100%
16C32T           2              96%
24C48T           3              94%
48C96T           6              93%
96C192T         12              90%

Below is the sequence of operations, which shows the refcount
management with this approach:

1. During label initialization, the percpu ref is initialized in
atomic mode. This is done to handle the cases where the
label hasn't gone live (->ns isn't assigned), mostly in
initialization error paths.

2. Labels are switched to percpu mode at various points -
when a label is added to the labelset tree, and when an unconfined
profile has been assigned a namespace.

3. As part of the initial prototype, only the in-tree labels
are managed by the kworker. These labels are added to a lockless
list. The unconfined labels invoke a percpu_ref_kill() operation
when the namespace is destroyed.

4. The kworker does a periodic scan of all the labels in the
llist. It does the below sequence of operations:

a. Enqueue a dummy node to mark the start of the scan. This dummy
node is used as the start point of the scan and ensures that
there is no additional synchronization required with new
label node additions to the llist. Any new labels will
be processed in the next run of the kworker.

                       SCAN START PTR
                             |
                             v
 +----------+            +-------+      +-------+      +-------+
 |          |            |       |      |       |      |       |
 |   head   |----------->| dummy |----->| label |----->| label |--->NULL
 |          |            | node  |      |       |      |       |
 +----------+            +-------+      +-------+      +-------+

New label addition:

                                          SCAN START PTR
                                                |
                                                v
 +----------+      +-------+             +-------+      +-------+      +-------+
 |          |      |       |             |       |      |       |      |       |
 |   head   |----->| label |------------>| dummy |----->| label |----->| label |--->NULL
 |          |      |       |             | node  |      |       |      |       |
 +----------+      +-------+             +-------+      +-------+      +-------+

b. Traverse the llist, starting from dummy->next.
If the node is a dummy node, mark it free.
If the node is a label node, do:

i) Switch the label ref to atomic mode. The ref switch waits
for the existing percpu_ref_get() and percpu_ref_put()
operations to complete, by waiting for an RCU grace period.

Once the switch is complete, from this point onwards, any
percpu_ref_get() and percpu_ref_put() operations use
atomic operations.

ii) Drop the initial reference, which was taken while adding
the label node to the llist.

iii) Use a percpu_ref_tryget() increment operation on the
ref, to see if we dropped the last ref count. If we
dropped the last count, we remove the node from the llist.

All of these operations are done inside an RCU critical
section, to avoid a race with the release operation,
which can potentially trigger as soon as we drop
the initial ref count.

iv) If we didn't drop the last ref, switch back the counter
to percpu mode.

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
security/apparmor/include/label.h | 3 +
security/apparmor/lsm.c | 145 ++++++++++++++++++++++++++++++
2 files changed, 148 insertions(+)

diff --git a/security/apparmor/include/label.h b/security/apparmor/include/label.h
index 4b29a4679c74..0fc4879930dd 100644
--- a/security/apparmor/include/label.h
+++ b/security/apparmor/include/label.h
@@ -125,6 +125,7 @@ struct aa_label {
long flags;
struct aa_proxy *proxy;
struct rb_node node;
+ struct llist_node reclaim_node;
struct rcu_head rcu;
__counted char *hname;
u32 secid;
@@ -465,4 +466,6 @@ static inline void aa_put_proxy(struct aa_proxy *proxy)

void __aa_proxy_redirect(struct aa_label *orig, struct aa_label *new);

+void aa_label_reclaim_add_label(struct aa_label *label);
+
#endif /* __AA_LABEL_H */
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index e490a7000408..cf8429f5c88e 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -64,6 +64,143 @@ static LIST_HEAD(aa_global_buffers);
static DEFINE_SPINLOCK(aa_buffers_lock);
static DEFINE_PER_CPU(struct aa_local_cache, aa_local_buffers);

+static struct workqueue_struct *aa_label_reclaim_wq;
+static void aa_label_reclaim_work_fn(struct work_struct *work);
+
+/*
+ * Dummy llist nodes, for lockless list traversal and deletions by
+ * the reclaim worker, while nodes are added from normal label
+ * insertion paths.
+ */
+struct aa_label_reclaim_node {
+ bool inuse;
+ struct llist_node node;
+};
+
+/*
+ * We need two dummy head nodes for lockless list manipulations from reclaim
+ * worker - first dummy node will be used in current reclaim iteration;
+ * the second one will be used in next iteration. Next iteration marks
+ * the first dummy node as free, for use in following iteration.
+ */
+#define AA_LABEL_RECLAIM_NODE_MAX 2
+
+#define AA_MAX_LABEL_RECLAIMS 100
+#define AA_LABEL_RECLAIM_INTERVAL_MS 5000
+
+static LLIST_HEAD(aa_label_reclaim_head);
+static struct llist_node *last_reclaim_label;
+static struct aa_label_reclaim_node aa_label_reclaim_nodes[AA_LABEL_RECLAIM_NODE_MAX];
+static DECLARE_DELAYED_WORK(aa_label_reclaim_work, aa_label_reclaim_work_fn);
+
+void aa_label_reclaim_add_label(struct aa_label *label)
+{
+ percpu_ref_get(&label->count);
+ llist_add(&label->reclaim_node, &aa_label_reclaim_head);
+}
+
+static bool aa_label_is_reclaim_node(struct llist_node *node)
+{
+ return &aa_label_reclaim_nodes[0].node <= node &&
+ node <= &aa_label_reclaim_nodes[AA_LABEL_RECLAIM_NODE_MAX - 1].node;
+}
+
+static struct llist_node *aa_label_get_reclaim_node(void)
+{
+ int i;
+ struct aa_label_reclaim_node *rn;
+
+ for (i = 0; i < AA_LABEL_RECLAIM_NODE_MAX; i++) {
+ rn = &aa_label_reclaim_nodes[i];
+ if (!rn->inuse) {
+ rn->inuse = true;
+ return &rn->node;
+ }
+ }
+
+ return NULL;
+}
+
+static void aa_label_put_reclaim_node(struct llist_node *node)
+{
+ struct aa_label_reclaim_node *rn = container_of(node, struct aa_label_reclaim_node, node);
+
+ rn->inuse = false;
+}
+
+static void aa_put_all_reclaim_nodes(void)
+{
+ int i;
+
+ for (i = 0; i < AA_LABEL_RECLAIM_NODE_MAX; i++)
+ aa_label_reclaim_nodes[i].inuse = false;
+}
+
+static void aa_label_reclaim_work_fn(struct work_struct *work)
+{
+ struct llist_node *pos, *first, *head, *prev, *next;
+ struct llist_node *reclaim_node;
+ struct aa_label *label;
+ int cnt = 0;
+ bool held;
+
+ first = aa_label_reclaim_head.first;
+ if (!first)
+ goto queue_reclaim_work;
+
+ if (last_reclaim_label == NULL || last_reclaim_label->next == NULL) {
+ reclaim_node = aa_label_get_reclaim_node();
+ WARN_ONCE(!reclaim_node, "Reclaim heads exhausted\n");
+ if (unlikely(!reclaim_node)) {
+ head = first->next;
+ if (!head) {
+ aa_put_all_reclaim_nodes();
+ goto queue_reclaim_work;
+ }
+ prev = first;
+ } else {
+ llist_add(reclaim_node, &aa_label_reclaim_head);
+ prev = reclaim_node;
+ head = prev->next;
+ }
+ } else {
+ prev = last_reclaim_label;
+ head = prev->next;
+ }
+
+ last_reclaim_label = NULL;
+ llist_for_each_safe(pos, next, head) {
+ /* Free reclaim node, which is present in the list */
+ if (aa_label_is_reclaim_node(pos)) {
+ prev->next = pos->next;
+ aa_label_put_reclaim_node(pos);
+ continue;
+ }
+
+ label = container_of(pos, struct aa_label, reclaim_node);
+ percpu_ref_switch_to_atomic_sync(&label->count);
+ rcu_read_lock();
+ percpu_ref_put(&label->count);
+ held = percpu_ref_tryget(&label->count);
+ if (!held)
+ prev->next = pos->next;
+ rcu_read_unlock();
+ if (!held)
+ continue;
+ percpu_ref_switch_to_percpu(&label->count);
+ cnt++;
+ if (cnt == AA_MAX_LABEL_RECLAIMS) {
+ last_reclaim_label = pos;
+ break;
+ }
+ prev = pos;
+ }
+
+queue_reclaim_work:
+ queue_delayed_work(aa_label_reclaim_wq, &aa_label_reclaim_work,
+ msecs_to_jiffies(AA_LABEL_RECLAIM_INTERVAL_MS));
+}
+
/*
* LSM hook functions
*/
@@ -2277,6 +2414,14 @@ static int __init apparmor_init(void)
aa_free_root_ns();
goto buffers_out;
}
+
+ aa_label_reclaim_wq = alloc_workqueue("aa_label_reclaim",
+ WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
+ WARN_ON(!aa_label_reclaim_wq);
+ if (aa_label_reclaim_wq)
+ queue_delayed_work(aa_label_reclaim_wq, &aa_label_reclaim_work,
+ AA_LABEL_RECLAIM_INTERVAL_MS);
+
security_add_hooks(apparmor_hooks, ARRAY_SIZE(apparmor_hooks),
&apparmor_lsmid);

--
2.34.1


2024-01-10 11:25:23

by Neeraj Upadhyay

Subject: [RFC 6/9] apparmor: Initial prototype for optimizing ref switch

This patch adds a prototype for optimizing the atomic
window during label scan, by switching to an immortal
percpu ref.

Below is the sequence of operations to do this:

1. Ensure that both the immortal ref and the label ref are in percpu mode.
Reinit the immortal ref in percpu mode.

Swap the percpu and atomic counters of the label refcount and the immortal ref:
                      percpu-ref
                 +-------------------+
 +-------+       | percpu-ctr-addr1  |
 | label |------>|-------------------|      +----------------+
 +-------+       |       data        |----->| Atomic counter1|
                 +-------------------+      +----------------+

 +-------+       +-------------------+
 |ImmLbl |------>| percpu-ctr-addr2  |      +----------------+
 +-------+       |-------------------|----->| Atomic counter2|
                 |       data        |      +----------------+
                 +-------------------+

label ->percpu-ctr-addr = percpu-ctr-addr2
ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
label ->data->count = Atomic counter2
ImmLbl ->data->count = Atomic counter1

2. Check the counters collected in the immortal label, by switching it
to atomic mode.

3. If the count is 0, do:
a. Switch the immortal counter to percpu again, giving it an
initial count of 1.
b. Swap the label and immortal counters again. The immortal
ref now has the counter values from the new percpu ref get
and put operations on the label ref, from the point
when we did the initial swap operation.
c. Transfer the percpu counts in the immortal ref to the atomic
counter of the label percpu refcount.
d. Kill the immortal ref, for reinit on the next iteration.
e. Switch the label percpu ref to atomic mode.
f. If the counter is 1, drop the initial ref.

4. If the count is not 0, terminate the operation and re-swap
the counters:
a. Switch the immortal counter to percpu again, giving it an
initial count of 1.
b. Swap the label and immortal counters again. The immortal
ref now has the counter values from the new percpu ref get
and put operations on the label ref, from the point
when we did the initial swap operation.
c. Transfer the percpu counts in the immortal ref to the atomic
counter of the label percpu refcount.
d. Kill the immortal ref, for reinit on the next iteration.

Using this approach, we ensure that label ref users do not switch
to atomic mode while there are active references on the label.
However, this approach requires multiple percpu ref mode switches
and adds high overhead and complexity to the scanning code.

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
include/linux/percpu-refcount.h | 2 +
lib/percpu-refcount.c | 93 +++++++++++++++++++++++++++++
security/apparmor/lsm.c | 101 ++++++++++++++++++++++++++++----
3 files changed, 185 insertions(+), 11 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..9e30c458cc00 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -131,6 +131,8 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
void percpu_ref_resurrect(struct percpu_ref *ref);
void percpu_ref_reinit(struct percpu_ref *ref);
bool percpu_ref_is_zero(struct percpu_ref *ref);
+void percpu_ref_swap_percpu_sync(struct percpu_ref *ref1, struct percpu_ref *ref2);
+void percpu_ref_transfer_percpu_count(struct percpu_ref *ref1, struct percpu_ref *ref2);

/**
* percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 668f6aa6a75d..36814446db34 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -477,3 +477,96 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}
EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
+
+static void percpu_ref_swap_percpu_rcu(struct rcu_head *rcu)
+{
+ struct percpu_ref_data *data = container_of(rcu,
+ struct percpu_ref_data, rcu);
+ struct percpu_ref *ref = data->ref;
+
+ data->confirm_switch(ref);
+ data->confirm_switch = NULL;
+ wake_up_all(&percpu_ref_switch_waitq);
+
+}
+
+static void __percpu_ref_swap_percpu(struct percpu_ref *ref, percpu_ref_func_t *confirm_switch)
+{
+ ref->data->confirm_switch = confirm_switch ?:
+ percpu_ref_noop_confirm_switch;
+ call_rcu_hurry(&ref->data->rcu,
+ percpu_ref_swap_percpu_rcu);
+}
+
+/**
+ * percpu_ref_swap_percpu_sync - Swap the percpu counter of one ref with the other
+ * @ref1: First percpu_ref to swap the counter
+ * @ref2: Second percpu_ref for counter swap
+ */
+void percpu_ref_swap_percpu_sync(struct percpu_ref *ref1, struct percpu_ref *ref2)
+{
+ unsigned long __percpu *percpu_count;
+ unsigned long flags;
+ struct percpu_ref_data *data1 = ref1->data;
+ struct percpu_ref_data *data2 = ref2->data;
+ unsigned long percpu_cnt_ptr1 = ref1->percpu_count_ptr;
+ unsigned long percpu_cnt_ptr2 = ref2->percpu_count_ptr;
+ atomic_long_t count1 = ref1->data->count;
+ atomic_long_t count2 = ref2->data->count;
+
+ spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+ wait_event_lock_irq(percpu_ref_switch_waitq,
+ !data1->confirm_switch && !data2->confirm_switch,
+ percpu_ref_switch_lock);
+ if (!__ref_is_percpu(ref1, &percpu_count) ||
+ !__ref_is_percpu(ref2, &percpu_count)) {
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ return;
+ }
+ WRITE_ONCE(ref1->percpu_count_ptr, percpu_cnt_ptr2);
+ WRITE_ONCE(ref2->percpu_count_ptr, percpu_cnt_ptr1);
+
+ __percpu_ref_swap_percpu(ref1, NULL);
+ __percpu_ref_swap_percpu(ref2, NULL);
+ ref1->data->count = count2;
+ ref2->data->count = count1;
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ wait_event(percpu_ref_switch_waitq, !ref1->data->confirm_switch &&
+ !ref2->data->confirm_switch);
+}
+
+/**
+ * percpu_ref_transfer_percpu_count - Transfer percpu counts of one ref to other
+ * @ref1: percpu_ref to transfer the counters to
+ * @ref2: percpu_ref to transfer the counters from
+ *
+ * The per cpu counts of ref2 are transferred to the atomic counter of ref1.
+ * The ref2 is expected to be inactive.
+ */
+void percpu_ref_transfer_percpu_count(struct percpu_ref *ref1, struct percpu_ref *ref2)
+{
+ unsigned long __percpu *percpu_count = percpu_count_ptr(ref2);
+ struct percpu_ref_data *data1 = ref1->data;
+ struct percpu_ref_data *data2 = ref2->data;
+ unsigned long count = 0;
+ unsigned long flags;
+ int cpu;
+
+ spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+ wait_event_lock_irq(percpu_ref_switch_waitq,
+ !data1->confirm_switch && !data2->confirm_switch,
+ percpu_ref_switch_lock);
+
+ if (!__ref_is_percpu(ref1, &percpu_count) ||
+ !__ref_is_percpu(ref2, &percpu_count)) {
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ return;
+ }
+
+ for_each_possible_cpu(cpu) {
+ count += *per_cpu_ptr(percpu_count, cpu);
+ *per_cpu_ptr(percpu_count, cpu) = 0;
+ }
+ atomic_long_add((long)count, &ref1->data->count);
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+}
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index cf8429f5c88e..d0d4ebad1e26 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -92,6 +92,7 @@ static LLIST_HEAD(aa_label_reclaim_head);
static struct llist_node *last_reclaim_label;
static struct aa_label_reclaim_node aa_label_reclaim_nodes[AA_LABEL_RECLAIM_NODE_MAX];
static DECLARE_DELAYED_WORK(aa_label_reclaim_work, aa_label_reclaim_work_fn);
+static struct percpu_ref aa_label_reclaim_ref;

void aa_label_reclaim_add_label(struct aa_label *label)
{
@@ -135,14 +136,18 @@ static void aa_put_all_reclaim_nodes(void)
for (i = 0; i < AA_LABEL_RECLAIM_NODE_MAX; i++)
aa_label_reclaim_nodes[i].inuse = false;
}
+static void aa_release_reclaim_ref_noop(struct percpu_ref *ref)
+{
+}

static void aa_label_reclaim_work_fn(struct work_struct *work)
{
struct llist_node *pos, *first, *head, *prev, *next;
+ static bool reclaim_ref_dead_once;
struct llist_node *reclaim_node;
struct aa_label *label;
int cnt = 0;
- bool held;
+ bool held, ref_is_zero = false;

first = aa_label_reclaim_head.first;
if (!first)
@@ -178,16 +183,72 @@ static void aa_label_reclaim_work_fn(struct work_struct *work)
}

label = container_of(pos, struct aa_label, reclaim_node);
- percpu_ref_switch_to_atomic_sync(&label->count);
- rcu_read_lock();
- percpu_ref_put(&label->count);
- held = percpu_ref_tryget(&label->count);
- if (!held)
- prev->next = pos->next;
- rcu_read_unlock();
- if (!held)
- continue;
- percpu_ref_switch_to_percpu(&label->count);
+ if (reclaim_ref_dead_once)
+ percpu_ref_reinit(&aa_label_reclaim_ref);
+
+ /*
+ * Switch counters of label ref and reclaim ref.
+ * Label's refcount becomes 1
+ * Percpu refcount has the current refcount value
+ * of the label percpu_ref.
+ */
+ percpu_ref_swap_percpu_sync(&label->count, &aa_label_reclaim_ref);
+
+ /* Switch reclaim ref to atomic, to check for 0 */
+ percpu_ref_switch_to_atomic_sync(&aa_label_reclaim_ref);
+
+ /*
+ * Release a count (original label percpu ref had an extra count,
+ * from the llist addition).
+ * When all percpu references have been released, this should
+ * be the initial count, which gets dropped.
+ */
+ percpu_ref_put(&aa_label_reclaim_ref);
+ /*
+ * Release function of reclaim ref is noop; we store the result
+ * for later processing after common code.
+ */
+ if (percpu_ref_is_zero(&aa_label_reclaim_ref))
+ ref_is_zero = true;
+
+ /*
+ * Restore back initial count. Switch reclaim ref to
+ * percpu, for switching back the label percpu and
+ * atomic counters.
+ */
+ percpu_ref_get(&aa_label_reclaim_ref);
+ percpu_ref_switch_to_percpu(&aa_label_reclaim_ref);
+ /*
+ * Swap the refs again. Label gets all old counts
+ * in its atomic counter after this operation.
+ */
+ percpu_ref_swap_percpu_sync(&label->count, &aa_label_reclaim_ref);
+
+ /*
+ * Transfer the percpu counts, which got added, while this
+ * switch was going on. The counters are accumulated into
+ * the label ref's atomic counter.
+ */
+ percpu_ref_transfer_percpu_count(&label->count, &aa_label_reclaim_ref);
+
+ /* Kill reclaim ref for reinitialization, for next iteration */
+ percpu_ref_kill(&aa_label_reclaim_ref);
+ reclaim_ref_dead_once = true;
+
+ /* If refcount of label ref was found to be 0, reclaim it now! */
+ if (ref_is_zero) {
+ percpu_ref_switch_to_atomic_sync(&label->count);
+ rcu_read_lock();
+ percpu_ref_put(&label->count);
+ held = percpu_ref_tryget(&label->count);
+ if (!held)
+ prev->next = pos->next;
+ rcu_read_unlock();
+ if (!held)
+ continue;
+ percpu_ref_switch_to_percpu(&label->count);
+ }
+
cnt++;
if (cnt == AA_MAX_LABEL_RECLAIMS) {
last_reclaim_label = pos;
@@ -2136,6 +2197,16 @@ static int __init set_init_ctx(void)
return 0;
}

+static int __init clear_init_ctx(void)
+{
+ struct cred *cred = (__force struct cred *)current->real_cred;
+
+ set_cred_label(cred, NULL);
+ aa_put_label(ns_unconfined(root_ns));
+
+ return 0;
+}
+
static void destroy_buffers(void)
{
union aa_buffer *aa_buf;
@@ -2422,6 +2493,14 @@ static int __init apparmor_init(void)
queue_delayed_work(aa_label_reclaim_wq, &aa_label_reclaim_work,
AA_LABEL_RECLAIM_INTERVAL_MS);

+ if (!percpu_ref_init(&aa_label_reclaim_ref, aa_release_reclaim_ref_noop,
+ PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+ AA_ERROR("Failed to allocate label reclaim percpu ref\n");
+ aa_free_root_ns();
+ clear_init_ctx();
+ goto buffers_out;
+ }
+
security_add_hooks(apparmor_hooks, ARRAY_SIZE(apparmor_hooks),
&apparmor_lsmid);

--
2.34.1


2024-01-10 11:25:51

by Neeraj Upadhyay

Subject: [RFC 7/9] percpu-rcuref: Add basic infrastructure

Add infrastructure for managing reclaims of percpu refs,
which use RCU grace period before reclaiming the referenced
objects.

The refcount management of percpu rcuref is the same as a
normal percpu refcount, with the only difference that,
instead of an explicit shutdown percpu_ref_kill() operation
by the user, the initial ref is managed by a kworker.

The ref can be initialized to start either in managed or
unmanaged mode. In managed mode, the ref is a set of percpu
counters. An extra reference is acquired for the llist node,
which provides the notion of the initial ref of a percpu
refcount.

During normal operation, users' ref get() and put() operations
increment/decrement the percpu counters. There is no check
for drop-to-zero while in percpu mode.

Periodically, the manager kworker thread scans all percpu
rcurefs. It switches a ref to centralized atomic counter mode
and checks whether the object has no references left. The ref is
dropped if there are no references. Otherwise, the ref is switched
back to percpu mode again. During this ref scan, there is a
window where the ref operates in atomic mode. This window spans
one RCU grace period.

There is a provision to start a percpu rcuref in unmanaged mode.
This is provided for cases where there is a need to avoid a
dependency on the kworker and RCU grace periods. In addition,
unmanaged mode can be used for a ref whose release function
initially does not wait for an RCU grace period, for example
when the enclosing object's initialization fails and there is
a rollback operation in error paths. Later, when object
initialization is complete, the ref can be switched to percpu
managed mode.
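
For illustration, a minimal usage sketch of this interface (the object,
its fields and the release function are hypothetical):

    struct obj {
            struct percpu_rcuref ref;
            struct rcu_head rcu;
    };

    static void obj_release(struct percpu_ref *pref)
    {
            struct percpu_rcuref *rcuref = container_of(pref, struct percpu_rcuref, pcpu_ref);
            struct obj *o = container_of(rcuref, struct obj, ref);

            /* Release path uses an RCU grace period before reclaim. */
            kfree_rcu(o, rcu);
    }

    /* Start unmanaged (atomic) while the object is being initialized. */
    if (percpu_rcuref_init_unmanaged(&o->ref, obj_release, GFP_KERNEL))
            return -ENOMEM;

    /* Once the object is live, hand the initial ref over to the manager. */
    percpu_rcuref_manage(&o->ref);

    /* Normal operation: no drop-to-zero checks while in percpu mode. */
    if (percpu_rcuref_tryget(&o->ref)) {
            /* ... use the object ... */
            percpu_rcuref_put(&o->ref);
    }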

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 8 +
include/linux/percpu-rcurefcount.h | 115 ++++++
lib/Makefile | 2 +-
lib/percpu-rcurefcount.c | 336 ++++++++++++++++++
4 files changed, 460 insertions(+), 1 deletion(-)
create mode 100644 include/linux/percpu-rcurefcount.h
create mode 100644 lib/percpu-rcurefcount.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e0891ac76ab3..b2536c4223c1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4576,6 +4576,14 @@
allocator. This parameter is primarily for debugging
and performance comparison.

+ percpu-rcurefcount.ref_scan_interval= [KNL]
+ Interval (in ms) between 2 scans of percpu rcu ref
+ managed refs.
+
+ percpu-rcurefcount.max_ref_scan_count= [KNL]
+ Count of the maximum number of pcpu refs scanned during
+ one scan of managed refs.
+
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/arch/x86/i386/IO-APIC.rst.

diff --git a/include/linux/percpu-rcurefcount.h b/include/linux/percpu-rcurefcount.h
new file mode 100644
index 000000000000..6022aee1f76e
--- /dev/null
+++ b/include/linux/percpu-rcurefcount.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Percpu refcounts with RCU protected release operation.
+ *
+ * Percpu rcuref is similar to percpu refs. However, they are specialized for
+ * use cases, where the release of the object is protected by a RCU grace
+ * period.
+ *
+ * The initial ref is managed by the reclaim logic; so, users do not need to
+ * keep track of their initial ref. This is particularly useful when the object
+ * has references active beyond the release of the initial reference.
+ *
+ * The current implementation is just a wrapper around the percpu refcount
+ * implementation, to reuse the existing percpu and atomic ref switch
+ * management. Switching to a standalone implementation might be required
+ * if percpuref implementation switches to a non-rcu managed read sections.
+ */
+
+#ifndef _LINUX_PERCPU_RCUREFCOUNT_H
+#define _LINUX_PERCPU_RCUREFCOUNT_H
+
+#include <linux/percpu-refcount.h>
+
+struct percpu_rcuref;
+
+struct percpu_rcuref {
+ struct percpu_ref pcpu_ref;
+ struct llist_node node;
+};
+
+int __must_check percpu_rcuref_init(struct percpu_rcuref *rcuref,
+ percpu_ref_func_t *release, gfp_t gfp);
+int __must_check percpu_rcuref_init_unmanaged(struct percpu_rcuref *rcuref,
+ percpu_ref_func_t *release, gfp_t gfp);
+int percpu_rcuref_manage(struct percpu_rcuref *rcuref);
+bool percpu_rcuref_is_zero(struct percpu_rcuref *rcuref);
+void percpu_rcuref_exit(struct percpu_rcuref *rcuref);
+
+/**
+ * percpu_rcuref_get_many - increment a percpu rcuref count
+ * @rcuref: percpu_rcuref to get
+ * @nr: number of references to get
+ *
+ * Analogous to percpu_ref_get_many().
+ */
+static inline void percpu_rcuref_get_many(struct percpu_rcuref *rcuref, unsigned long nr)
+{
+ percpu_ref_get_many(&rcuref->pcpu_ref, nr);
+}
+
+/**
+ * percpu_rcuref_get - increment a percpu rcuref count
+ * @rcuref: percpu_rcuref to get
+ *
+ * Analogous to percpu_ref_get().
+ *
+ */
+static inline void percpu_rcuref_get(struct percpu_rcuref *rcuref)
+{
+ percpu_rcuref_get_many(rcuref, 1);
+}
+
+/**
+ * percpu_rcuref_tryget_many - try to increment a percpu rcuref count
+ * @rcuref: percpu_rcuref to try-get
+ * @nr: number of references to get
+ *
+ * Increment a percpu rcuref count by @nr unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ */
+static inline bool percpu_rcuref_tryget_many(struct percpu_rcuref *rcuref,
+ unsigned long nr)
+{
+ return percpu_ref_tryget_many(&rcuref->pcpu_ref, nr);
+}
+
+/**
+ * percpu_rcuref_tryget - try to increment a percpu rcuref count
+ * @rcuref: percpu_rcuref to try-get
+ *
+ * Increment a percpu rcurefcount unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ */
+static inline bool percpu_rcuref_tryget(struct percpu_rcuref *rcuref)
+{
+ return percpu_rcuref_tryget_many(rcuref, 1);
+}
+
+/**
+ * percpu_rcuref_put_many - decrement a percpu rcuref count
+ * @rcuref: percpu_rcuref to put
+ * @nr: number of references to put
+ *
+ * Decrement the refcount, and if 0, call the release function (which was passed
+ * to percpu_rcuref_init())
+ */
+static inline void percpu_rcuref_put_many(struct percpu_rcuref *rcuref, unsigned long nr)
+{
+ percpu_ref_put_many(&rcuref->pcpu_ref, nr);
+}
+
+/**
+ * percpu_rcuref_put - decrement a percpu rcuref count
+ * @rcuref: percpu_rcuref to put
+ *
+ * Decrement the refcount, and if 0, call the release function (which was passed
+ * to percpu_rcuref_init())
+ */
+static inline void percpu_rcuref_put(struct percpu_rcuref *rcuref)
+{
+ percpu_rcuref_put_many(rcuref, 1);
+}
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 6b09731d8e61..11da2c586591 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -46,7 +46,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \
list_sort.o uuid.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
- percpu-refcount.o rhashtable.o base64.o \
+ percpu-refcount.o percpu-rcurefcount.o rhashtable.o base64.o \
once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
generic-radix-tree.o bitmap-str.o
obj-$(CONFIG_STRING_SELFTEST) += test_string.o
diff --git a/lib/percpu-rcurefcount.c b/lib/percpu-rcurefcount.c
new file mode 100644
index 000000000000..d0f2d5e88f98
--- /dev/null
+++ b/lib/percpu-rcurefcount.c
@@ -0,0 +1,336 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/moduleparam.h>
+#include <linux/percpu-rcurefcount.h>
+
+static LLIST_HEAD(pcpu_rcuref_head);
+
+/*
+ * The refcount management of percpu rcuref is the same as for a
+ * normal percpu refcount, with the only difference that, instead
+ * of an explicit shutdown percpu_ref_kill() operation by the user,
+ * the initial ref is managed by a kworker.
+ *
+ * The ref can be initialized to start either in managed or
+ * unmanaged mode. In managed mode, the ref is a set of percpu
+ * counters. An extra reference is acquired for the llist node;
+ * it provides the notion of the initial ref in percpu refcount.
+ *
+ * During normal operation, users ref get() and put() operations
+ * increment/decrement the percpu counters. There is no check
+ * for drop-to-zero while in percpu mode.
+ *
+ * Periodically, the manager kworker thread scans all percpu
+ * rcurefs. It switches ref to centralized atomic counter mode
+ * and checks whether the object has no references left. The ref is
+ * dropped if there are no references. Otherwise, the ref is switched
+ * back to percpu mode again. During this ref scan, there is a
+ * window where ref operates in atomic mode. This window spans
+ * one RCU grace period.
+ *
+ * There is a provision to start a percpu rcuref in unmanaged mode.
+ * This is provided for cases, where there is a need to avoid
+ * dependency on kworker and RCU grace period. In addition,
+ * unmanaged mode can be used for a ref, for which the release
+ * function initially does not wait for RCU grace period, for
+ * example when the enclosing object initialization fails, and
+ * there is a rollback operation in error paths. Later, when
+ * object initialization is complete, ref can be switched to
+ * percpu managed mode.
+ */
+/**
+ * percpu_rcuref_init - initialize a percpu rcuref count
+ * @rcuref: percpu_rcuref to initialize
+ * @release: function which will be called when refcount hits 0
+ * @gfp: allocation mask to use
+ *
+ * Initializes @rcuref. @rcuref starts out in percpu mode with a refcount of 2.
+ * The initial ref is managed by the pcpu rcuref release worker kthread.
+ * The second reference is for the user.
+ *
+ * Note that @release must not sleep - it can block release of other
+ * pcpu rcurefs.
+ */
+int percpu_rcuref_init(struct percpu_rcuref *rcuref, percpu_ref_func_t *release, gfp_t gfp)
+{
+ int ret;
+
+ ret = percpu_ref_init(&rcuref->pcpu_ref, release,
+ PERCPU_REF_ALLOW_REINIT, gfp);
+ if (ret)
+ return ret;
+ percpu_ref_get(&rcuref->pcpu_ref);
+ llist_add(&rcuref->node, &pcpu_rcuref_head);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(percpu_rcuref_init);
+
+/**
+ * percpu_rcuref_init_unmanaged - initialize a percpu rcuref count in
+ * unmanaged (atomic) mode.
+ * @rcuref: percpu_rcuref to initialize
+ * @release: function which will be called when refcount hits 0
+ * @gfp: allocation mask to use
+ *
+ * Initializes @rcuref. @rcuref starts out in unmanaged/atomic mode
+ * with a refcount of 1.
+ * The initial ref is owned by the caller and ref management is
+ * automatic: the last put operation releases the ref.
+ * The ref may be initialized in this mode to avoid a dependency
+ * on workqueues and RCU, for early boot code, and for cases where
+ * a ref starts with non-RCU release and later switches to RCU grace
+ * period based release of the reference. The percpu_rcuref_manage()
+ * call can be used to switch this ref to managed mode while the ref
+ * is active. This operation is not reversible; the ref remains
+ * in managed mode, for its lifetime, until it is released by the
+ * pcpu release kworker.
+ *
+ * Note that @release must not sleep - if the ref is switched to
+ * managed mode, it can block release of other pcpu rcurefs.
+ */
+int percpu_rcuref_init_unmanaged(struct percpu_rcuref *rcuref,
+ percpu_ref_func_t *release, gfp_t gfp)
+{
+ int ret;
+
+ ret = percpu_ref_init(&rcuref->pcpu_ref, release, PERCPU_REF_INIT_ATOMIC, gfp);
+ if (!ret)
+ init_llist_node(&rcuref->node);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(percpu_rcuref_init_unmanaged);
+
+/**
+ * percpu_rcuref_manage - switch an unmanaged ref to managed percpu mode
+ *
+ * @rcuref: percpu_rcuref to switch to managed mode
+ *
+ * Returns 0 on success, or a negative value if the ref is not active or
+ * is already managed.
+ */
+int percpu_rcuref_manage(struct percpu_rcuref *rcuref)
+{
+ if (WARN_ONCE(!percpu_rcuref_tryget(rcuref), "Percpu rcuref is not active\n"))
+ return -1;
+ if (WARN_ONCE(llist_on_list(&rcuref->node), "Percpu rcuref already managed\n")) {
+ percpu_rcuref_put(rcuref);
+ return -2;
+ }
+ percpu_ref_switch_to_percpu(&rcuref->pcpu_ref);
+ /* Ensure ordering of percpu mode switch and node scan */
+ smp_mb();
+ llist_add(&rcuref->node, &pcpu_rcuref_head);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(percpu_rcuref_manage);
+
+/**
+ * percpu_rcuref_is_zero - test whether a percpu rcuref count reached zero
+ * @rcuref: percpu_rcuref to test
+ *
+ * Returns %true if @ref reached zero.
+ */
+bool percpu_rcuref_is_zero(struct percpu_rcuref *rcuref)
+{
+ return percpu_ref_is_zero(&rcuref->pcpu_ref);
+}
+EXPORT_SYMBOL_GPL(percpu_rcuref_is_zero);
+
+/**
+ * percpu_rcuref_exit - undo percpu_rcuref_init()
+ * @rcuref: percpu_rcuref to exit
+ *
+ * This function exits @rcuref. The caller is responsible for ensuring that
+ * @rcuref is no longer in active use. The usual places to invoke this
+ * function from are the @rcuref->release() callback or in init failure path
+ * where percpu_rcuref_init() succeeded but other parts of the initialization
+ * of the embedding object failed.
+ */
+void percpu_rcuref_exit(struct percpu_rcuref *rcuref)
+{
+ percpu_ref_exit(&rcuref->pcpu_ref);
+ init_llist_node(&rcuref->node);
+}
+
+#define DEFAULT_PCPU_RCUREF_SCAN_INTERVAL_MS 5000
+/* Interval duration between two ref scans. */
+static ulong ref_scan_interval = DEFAULT_PCPU_RCUREF_SCAN_INTERVAL_MS;
+module_param(ref_scan_interval, ulong, 0444);
+
+#define DEFAULT_PCPU_RCUREF_MAX_SCAN_COUNT 100
+/* Number of pcpu refs scanned in one iteration of worker execution. */
+static int max_ref_scan_count = DEFAULT_PCPU_RCUREF_MAX_SCAN_COUNT;
+module_param(max_ref_scan_count, int, 0444);
+
+static void percpu_rcuref_release_work_fn(struct work_struct *work);
+
+/*
+ * Sentinel llist nodes, for lockless list traversal and deletions by
+ * the pcpu rcuref release worker, while nodes are added concurrently
+ * from percpu_rcuref_init() and percpu_rcuref_manage().
+ *
+ * Sentinel node marks the head of list traversal for the current
+ * iteration of kworker execution.
+ */
+struct pcpu_rcuref_sen_node {
+ bool inuse;
+ struct llist_node node;
+};
+
+/*
+ * We need two sentinel nodes for lockless list manipulations from release
+ * worker - the first node is used in the current reclaim iteration. The second
+ * node will be used in next iteration. Next iteration marks the first node
+ * as free, for use in following iteration.
+ */
+#define PCPU_RCUREF_SEN_NODES_COUNT 2
+
+/* Track last processed percpu rcuref node */
+static struct llist_node *last_pcpu_rcuref_node;
+
+static struct pcpu_rcuref_sen_node
+ pcpu_rcuref_sen_nodes[PCPU_RCUREF_SEN_NODES_COUNT];
+
+static DECLARE_DELAYED_WORK(percpu_rcuref_release_work,
+ percpu_rcuref_release_work_fn);
+
+static bool percpu_rcuref_is_sen_node(struct llist_node *node)
+{
+ return &pcpu_rcuref_sen_nodes[0].node <= node &&
+ node <= &pcpu_rcuref_sen_nodes[PCPU_RCUREF_SEN_NODES_COUNT - 1].node;
+}
+
+static struct llist_node *percpu_rcuref_get_sen_node(void)
+{
+ int i;
+ struct pcpu_rcuref_sen_node *sn;
+
+ for (i = 0; i < PCPU_RCUREF_SEN_NODES_COUNT; i++) {
+ sn = &pcpu_rcuref_sen_nodes[i];
+ if (!sn->inuse) {
+ sn->inuse = true;
+ return &sn->node;
+ }
+ }
+
+ return NULL;
+}
+
+static void percpu_rcuref_put_sen_node(struct llist_node *node)
+{
+ struct pcpu_rcuref_sen_node *sn = container_of(node, struct pcpu_rcuref_sen_node, node);
+
+ sn->inuse = false;
+}
+
+static void percpu_rcuref_put_all_sen_nodes_except(struct llist_node *node)
+{
+ int i;
+
+ for (i = 0; i < PCPU_RCUREF_SEN_NODES_COUNT; i++) {
+ if (&pcpu_rcuref_sen_nodes[i].node == node)
+ continue;
+ pcpu_rcuref_sen_nodes[i].inuse = false;
+ init_llist_node(&pcpu_rcuref_sen_nodes[i].node);
+ }
+}
+
+static struct workqueue_struct *percpu_rcuref_wq;
+
+static void percpu_rcuref_release_work_fn(struct work_struct *work)
+{
+ struct llist_node *pos, *first, *head, *prev, *next;
+ struct percpu_rcuref *rcuref;
+ struct llist_node *sen_node;
+ int count = 0;
+ bool held;
+
+ first = READ_ONCE(pcpu_rcuref_head.first);
+ if (!first)
+ goto queue_release_work;
+
+ if (last_pcpu_rcuref_node == NULL || last_pcpu_rcuref_node->next == NULL) {
+retry_sentinel_get:
+ sen_node = percpu_rcuref_get_sen_node();
+ /*
+ * All sentinel nodes are in use? This should not happen, as we
+ * require only one sentinel for the start of list traversal, and the
+ * other sentinel node is freed during the traversal.
+ */
+ if (WARN_ONCE(!sen_node, "Percpu RCU ref sentinel nodes exhausted\n")) {
+ /* Use first node as the sentinel node */
+ head = first->next;
+ if (!head) {
+ struct llist_node *ign_node = NULL;
+ /*
+ * We exhausted sentinel nodes. However, there aren't
+ * enough nodes in the llist. So, we have leaked
+ * sentinel nodes. Reclaim sentinels and retry.
+ */
+ if (percpu_rcuref_is_sen_node(first))
+ ign_node = first;
+ percpu_rcuref_put_all_sen_nodes_except(ign_node);
+ goto retry_sentinel_get;
+ }
+ prev = first;
+ } else {
+ llist_add(sen_node, &pcpu_rcuref_head);
+ prev = sen_node;
+ head = prev->next;
+ }
+ } else {
+ prev = last_pcpu_rcuref_node;
+ head = prev->next;
+ }
+
+ last_pcpu_rcuref_node = NULL;
+ llist_for_each_safe(pos, next, head) {
+ /* Free sentinel node which is present in the list */
+ if (percpu_rcuref_is_sen_node(pos)) {
+ prev->next = pos->next;
+ percpu_rcuref_put_sen_node(pos);
+ continue;
+ }
+
+ rcuref = container_of(pos, struct percpu_rcuref, node);
+ percpu_ref_switch_to_atomic_sync(&rcuref->pcpu_ref);
+ /*
+ * Drop the ref while in an RCU read-side critical section, to
+ * prevent the object from being freed while we manipulate the node.
+ */
+ rcu_read_lock();
+ percpu_ref_put(&rcuref->pcpu_ref);
+ held = percpu_ref_tryget(&rcuref->pcpu_ref);
+ if (!held) {
+ prev->next = pos->next;
+ init_llist_node(pos);
+ }
+ rcu_read_unlock();
+ if (!held)
+ continue;
+ percpu_ref_switch_to_percpu(&rcuref->pcpu_ref);
+ count++;
+ if (count == max_ref_scan_count) {
+ last_pcpu_rcuref_node = pos;
+ break;
+ }
+ prev = pos;
+ }
+
+queue_release_work:
+ queue_delayed_work(percpu_rcuref_wq, &percpu_rcuref_release_work,
+ msecs_to_jiffies(ref_scan_interval));
+}
+
+static __init int percpu_rcuref_setup(void)
+{
+ percpu_rcuref_wq = alloc_workqueue("percpu_rcuref",
+ WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
+ if (!percpu_rcuref_wq)
+ return -ENOMEM;
+
+ queue_delayed_work(percpu_rcuref_wq, &percpu_rcuref_release_work,
+ msecs_to_jiffies(ref_scan_interval));
+ return 0;
+}
+early_initcall(percpu_rcuref_setup);
--
2.34.1


2024-01-10 11:26:19

by Neeraj Upadhyay

[permalink] [raw]
Subject: [RFC 3/9] apparmor: Switch unconfined namespaces refcount to percpu mode

Switch unconfined labels to percpu mode, to avoid high
memory contention on refcount get and put operations when
multiple CPUs perform these operations at the same time.
The unconfined label refs for the root and sub namespaces
are killed when those namespaces are destroyed, in
aa_free_root_ns() and __aa_remove_ns() respectively. Though
labels/profiles in various namespaces could potentially still
be active at that point, aa_free_root_ns() is not typically
called while apparmor enforcement is enabled.
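
As background for the shutdown scheme relied upon here, below is a
minimal sketch using the generic percpu_ref API on a hypothetical
object (struct obj, obj_release() and obj_create()/obj_destroy() are
illustrative only; the actual AppArmor call sites are in the diff
below):

  struct obj {
          struct percpu_ref ref;
  };

  static void obj_release(struct percpu_ref *ref)
  {
          kfree(container_of(ref, struct obj, ref));
  }

  static struct obj *obj_create(void)
  {
          struct obj *o = kzalloc(sizeof(*o), GFP_KERNEL);

          if (!o)
                  return NULL;
          /* ALLOW_REINIT permits switching between atomic and percpu modes. */
          if (percpu_ref_init(&o->ref, obj_release,
                              PERCPU_REF_INIT_ATOMIC | PERCPU_REF_ALLOW_REINIT,
                              GFP_KERNEL)) {
                  kfree(o);
                  return NULL;
          }
          percpu_ref_switch_to_percpu(&o->ref);   /* cheap gets/puts while live */
          return o;
  }

  static void obj_destroy(struct obj *o)
  {
          /*
           * Drop the initial ref and switch to atomic mode; the last
           * remaining percpu_ref_put() then observes zero and invokes
           * obj_release().
           */
          percpu_ref_kill(&o->ref);
  }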

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
security/apparmor/include/policy.h | 24 ++++++++++++++++++++++++
security/apparmor/include/policy_ns.h | 24 ++++++++++++++++++++++++
security/apparmor/policy_ns.c | 6 ++++--
3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/security/apparmor/include/policy.h b/security/apparmor/include/policy.h
index 5849b6b94cea..1e3b29ba6c03 100644
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@ -379,6 +379,30 @@ static inline void aa_put_profile(struct aa_profile *p)
percpu_ref_put(&p->label.count);
}

+/**
+ * aa_switch_ref_profile - switch percpu-ref mode for profile @p
+ * @p: profile (MAYBE NULL)
+ */
+static inline void aa_switch_ref_profile(struct aa_profile *p, bool percpu)
+{
+ if (p) {
+ if (percpu)
+ percpu_ref_switch_to_percpu(&p->label.count);
+ else
+ percpu_ref_switch_to_atomic_sync(&p->label.count);
+ }
+}
+
+/**
+ * aa_kill_ref_profile - percpu-ref kill for profile @p
+ * @p: profile (MAYBE NULL)
+ */
+static inline void aa_kill_ref_profile(struct aa_profile *p)
+{
+ if (p)
+ percpu_ref_kill(&p->label.count);
+}
+
static inline int AUDIT_MODE(struct aa_profile *profile)
{
if (aa_g_audit != AUDIT_NORMAL)
diff --git a/security/apparmor/include/policy_ns.h b/security/apparmor/include/policy_ns.h
index d646070fd966..f3db01c5e193 100644
--- a/security/apparmor/include/policy_ns.h
+++ b/security/apparmor/include/policy_ns.h
@@ -127,6 +127,30 @@ static inline void aa_put_ns(struct aa_ns *ns)
aa_put_profile(ns->unconfined);
}

+/**
+ * aa_switch_ref_ns - switch percpu-ref mode for @ns
+ * @ns: namespace to switch percpu-ref mode of
+ *
+ * Switch percpu-ref mode of @ns between percpu and atomic
+ */
+static inline void aa_switch_ref_ns(struct aa_ns *ns, bool percpu)
+{
+ if (ns)
+ aa_switch_ref_profile(ns->unconfined, percpu);
+}
+
+/**
+ * aa_kill_ref_ns - do percpu-ref kill for @ns
+ * @ns: namespace to perform percpu-ref kill for
+ *
+ * Do percpu-ref kill of @ns refcount
+ */
+static inline void aa_kill_ref_ns(struct aa_ns *ns)
+{
+ if (ns)
+ aa_kill_ref_profile(ns->unconfined);
+}
+
/**
* __aa_findn_ns - find a namespace on a list by @name
* @head: list to search for namespace on (NOT NULL)
diff --git a/security/apparmor/policy_ns.c b/security/apparmor/policy_ns.c
index 1f02cfe1d974..ca633cfbd936 100644
--- a/security/apparmor/policy_ns.c
+++ b/security/apparmor/policy_ns.c
@@ -124,6 +124,7 @@ static struct aa_ns *alloc_ns(const char *prefix, const char *name)
goto fail_unconfined;
/* ns and ns->unconfined share ns->unconfined refcount */
ns->unconfined->ns = ns;
+ aa_switch_ref_ns(ns, true);

atomic_set(&ns->uniq_null, 0);

@@ -336,7 +337,7 @@ void __aa_remove_ns(struct aa_ns *ns)
/* remove ns from namespace list */
list_del_rcu(&ns->base.list);
destroy_ns(ns);
- aa_put_ns(ns);
+ aa_kill_ref_ns(ns);
}

/**
@@ -377,6 +378,7 @@ int __init aa_alloc_root_ns(void)
}
kernel_t = &kernel_p->label;
root_ns->unconfined->ns = aa_get_ns(root_ns);
+ aa_switch_ref_ns(root_ns, true);

return 0;
}
@@ -392,5 +394,5 @@ void __init aa_free_root_ns(void)

aa_label_free(kernel_t);
destroy_ns(ns);
- aa_put_ns(ns);
+ aa_kill_ref_ns(ns);
}
--
2.34.1


2024-01-10 11:26:49

by Neeraj Upadhyay

[permalink] [raw]
Subject: [RFC 8/9] apparmor: Switch labels to percpu rcurefcount in unmanaged mode

Replace the label's percpu_ref refcount with a percpu rcurefcount.
The percpu rcuref is initialized in unmanaged/atomic mode,
as labels which do not yet have a namespace associated
with them do not use RCU grace period based release.
A subsequent patch switches labels to managed/percpu
mode at the points where RCU grace period based cleanup
is guaranteed.

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
include/linux/percpu-refcount.h | 2 -
lib/percpu-refcount.c | 93 -----------
security/apparmor/include/label.h | 14 +-
security/apparmor/include/policy.h | 32 +---
security/apparmor/include/policy_ns.h | 24 ---
security/apparmor/label.c | 8 +-
security/apparmor/lsm.c | 224 --------------------------
security/apparmor/policy_ns.c | 6 +-
8 files changed, 15 insertions(+), 388 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 9e30c458cc00..d73a1c08c3e3 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -131,8 +131,6 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
void percpu_ref_resurrect(struct percpu_ref *ref);
void percpu_ref_reinit(struct percpu_ref *ref);
bool percpu_ref_is_zero(struct percpu_ref *ref);
-void percpu_ref_swap_percpu_sync(struct percpu_ref *ref1, struct percpu_ref *ref2);
-void percpu_ref_transfer_percpu_count(struct percpu_ref *ref1, struct percpu_ref *ref2);

/**
* percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 36814446db34..668f6aa6a75d 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -477,96 +477,3 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}
EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
-
-static void percpu_ref_swap_percpu_rcu(struct rcu_head *rcu)
-{
- struct percpu_ref_data *data = container_of(rcu,
- struct percpu_ref_data, rcu);
- struct percpu_ref *ref = data->ref;
-
- data->confirm_switch(ref);
- data->confirm_switch = NULL;
- wake_up_all(&percpu_ref_switch_waitq);
-
-}
-
-static void __percpu_ref_swap_percpu(struct percpu_ref *ref, percpu_ref_func_t *confirm_switch)
-{
- ref->data->confirm_switch = confirm_switch ?:
- percpu_ref_noop_confirm_switch;
- call_rcu_hurry(&ref->data->rcu,
- percpu_ref_swap_percpu_rcu);
-}
-
-/**
- * percpuref_swap_percpu_sync - Swap percpu counter of one ref with other
- * @ref1: First perpcu_ref to swap the counter
- * @ref2: Second percpu_ref for counter swap
- */
-void percpu_ref_swap_percpu_sync(struct percpu_ref *ref1, struct percpu_ref *ref2)
-{
- unsigned long __percpu *percpu_count;
- unsigned long flags;
- struct percpu_ref_data *data1 = ref1->data;
- struct percpu_ref_data *data2 = ref2->data;
- unsigned long percpu_cnt_ptr1 = ref1->percpu_count_ptr;
- unsigned long percpu_cnt_ptr2 = ref2->percpu_count_ptr;
- atomic_long_t count1 = ref1->data->count;
- atomic_long_t count2 = ref2->data->count;
-
- spin_lock_irqsave(&percpu_ref_switch_lock, flags);
- wait_event_lock_irq(percpu_ref_switch_waitq,
- !data1->confirm_switch && !data2->confirm_switch,
- percpu_ref_switch_lock);
- if (!__ref_is_percpu(ref1, &percpu_count) ||
- !__ref_is_percpu(ref2, &percpu_count)) {
- spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
- return;
- }
- WRITE_ONCE(ref1->percpu_count_ptr, percpu_cnt_ptr2);
- WRITE_ONCE(ref2->percpu_count_ptr, percpu_cnt_ptr1);
-
- __percpu_ref_swap_percpu(ref1, NULL);
- __percpu_ref_swap_percpu(ref2, NULL);
- ref1->data->count = count2;
- ref2->data->count = count1;
- spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
- wait_event(percpu_ref_switch_waitq, !ref1->data->confirm_switch &&
- !ref2->data->confirm_switch);
-}
-
-/**
- * percpu_ref_transfer_percpu_count - Transfer percpu counts of one ref to other
- * @ref1: perpcu_ref to transfer the counters to
- * @ref2: percpu_ref to transfer the counters from
- *
- * The per cpu counts of ref2 are transferred to the atomic counter of ref1.
- * The ref2 is expected to be inactive.
- */
-void percpu_ref_transfer_percpu_count(struct percpu_ref *ref1, struct percpu_ref *ref2)
-{
- unsigned long __percpu *percpu_count = percpu_count_ptr(ref2);
- struct percpu_ref_data *data1 = ref1->data;
- struct percpu_ref_data *data2 = ref2->data;
- unsigned long count = 0;
- unsigned long flags;
- int cpu;
-
- spin_lock_irqsave(&percpu_ref_switch_lock, flags);
- wait_event_lock_irq(percpu_ref_switch_waitq,
- !data1->confirm_switch && !data2->confirm_switch,
- percpu_ref_switch_lock);
-
- if (!__ref_is_percpu(ref1, &percpu_count) ||
- !__ref_is_percpu(ref2, &percpu_count)) {
- spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
- return;
- }
-
- for_each_possible_cpu(cpu) {
- count += *per_cpu_ptr(percpu_count, cpu);
- *per_cpu_ptr(percpu_count, cpu) = 0;
- }
- atomic_long_add((long)count, &ref1->data->count);
- spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
-}
diff --git a/security/apparmor/include/label.h b/security/apparmor/include/label.h
index 0fc4879930dd..3feb3a65a00c 100644
--- a/security/apparmor/include/label.h
+++ b/security/apparmor/include/label.h
@@ -14,6 +14,7 @@
#include <linux/audit.h>
#include <linux/rbtree.h>
#include <linux/rcupdate.h>
+#include <linux/percpu-rcurefcount.h>

#include "apparmor.h"
#include "lib.h"
@@ -121,11 +122,10 @@ struct label_it {
* @ent: set of profiles for label, actual size determined by @size
*/
struct aa_label {
- struct percpu_ref count;
+ struct percpu_rcuref count;
long flags;
struct aa_proxy *proxy;
struct rb_node node;
- struct llist_node reclaim_node;
struct rcu_head rcu;
__counted char *hname;
u32 secid;
@@ -374,7 +374,7 @@ int aa_label_match(struct aa_profile *profile, struct aa_ruleset *rules,
*/
static inline struct aa_label *__aa_get_label(struct aa_label *l)
{
- if (l && percpu_ref_tryget(&l->count))
+ if (l && percpu_rcuref_tryget(&l->count))
return l;

return NULL;
@@ -383,7 +383,7 @@ static inline struct aa_label *__aa_get_label(struct aa_label *l)
static inline struct aa_label *aa_get_label(struct aa_label *l)
{
if (l)
- percpu_ref_get(&(l->count));
+ percpu_rcuref_get(&(l->count));

return l;
}
@@ -403,7 +403,7 @@ static inline struct aa_label *aa_get_label_rcu(struct aa_label __rcu **l)
rcu_read_lock();
do {
c = rcu_dereference(*l);
- } while (c && !percpu_ref_tryget(&c->count));
+ } while (c && !percpu_rcuref_tryget(&c->count));
rcu_read_unlock();

return c;
@@ -443,7 +443,7 @@ static inline struct aa_label *aa_get_newest_label(struct aa_label *l)
static inline void aa_put_label(struct aa_label *l)
{
if (l)
- percpu_ref_put(&l->count);
+ percpu_rcuref_put(&l->count);
}


@@ -466,6 +466,4 @@ static inline void aa_put_proxy(struct aa_proxy *proxy)

void __aa_proxy_redirect(struct aa_label *orig, struct aa_label *new);

-void aa_label_reclaim_add_label(struct aa_label *label);
-
#endif /* __AA_LABEL_H */
diff --git a/security/apparmor/include/policy.h b/security/apparmor/include/policy.h
index 1e3b29ba6c03..5b2473a09103 100644
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@ -329,7 +329,7 @@ static inline aa_state_t ANY_RULE_MEDIATES(struct list_head *head,
static inline struct aa_profile *aa_get_profile(struct aa_profile *p)
{
if (p)
- percpu_ref_get(&(p->label.count));
+ percpu_rcuref_get(&(p->label.count));

return p;
}
@@ -343,7 +343,7 @@ static inline struct aa_profile *aa_get_profile(struct aa_profile *p)
*/
static inline struct aa_profile *aa_get_profile_not0(struct aa_profile *p)
{
- if (p && percpu_ref_tryget(&p->label.count))
+ if (p && percpu_rcuref_tryget(&p->label.count))
return p;

return NULL;
@@ -363,7 +363,7 @@ static inline struct aa_profile *aa_get_profile_rcu(struct aa_profile __rcu **p)
rcu_read_lock();
do {
c = rcu_dereference(*p);
- } while (c && !percpu_ref_tryget(&c->label.count));
+ } while (c && !percpu_rcuref_tryget(&c->label.count));
rcu_read_unlock();

return c;
@@ -376,31 +376,7 @@ static inline struct aa_profile *aa_get_profile_rcu(struct aa_profile __rcu **p)
static inline void aa_put_profile(struct aa_profile *p)
{
if (p)
- percpu_ref_put(&p->label.count);
-}
-
-/**
- * aa_switch_ref_profile - switch percpu-ref mode for profile @p
- * @p: profile (MAYBE NULL)
- */
-static inline void aa_switch_ref_profile(struct aa_profile *p, bool percpu)
-{
- if (p) {
- if (percpu)
- percpu_ref_switch_to_percpu(&p->label.count);
- else
- percpu_ref_switch_to_atomic_sync(&p->label.count);
- }
-}
-
-/**
- * aa_kill_ref_profile - percpu-ref kill for profile @p
- * @p: profile (MAYBE NULL)
- */
-static inline void aa_kill_ref_profile(struct aa_profile *p)
-{
- if (p)
- percpu_ref_kill(&p->label.count);
+ percpu_rcuref_put(&p->label.count);
}

static inline int AUDIT_MODE(struct aa_profile *profile)
diff --git a/security/apparmor/include/policy_ns.h b/security/apparmor/include/policy_ns.h
index f3db01c5e193..d646070fd966 100644
--- a/security/apparmor/include/policy_ns.h
+++ b/security/apparmor/include/policy_ns.h
@@ -127,30 +127,6 @@ static inline void aa_put_ns(struct aa_ns *ns)
aa_put_profile(ns->unconfined);
}

-/**
- * aa_switch_ref_ns - switch percpu-ref mode for @ns
- * @ns: namespace to switch percpu-ref mode of
- *
- * Switch percpu-ref mode of @ns between percpu and atomic
- */
-static inline void aa_switch_ref_ns(struct aa_ns *ns, bool percpu)
-{
- if (ns)
- aa_switch_ref_profile(ns->unconfined, percpu);
-}
-
-/**
- * aa_kill_ref_ns - do percpu-ref kill for @ns
- * @ns: namespace to perform percpu-ref kill for
- *
- * Do percpu-ref kill of @ns refcount
- */
-static inline void aa_kill_ref_ns(struct aa_ns *ns)
-{
- if (ns)
- aa_kill_ref_profile(ns->unconfined);
-}
-
/**
* __aa_findn_ns - find a namespace on a list by @name
* @head: list to search for namespace on (NOT NULL)
diff --git a/security/apparmor/label.c b/security/apparmor/label.c
index 1299262f54e1..f28dec1c3e70 100644
--- a/security/apparmor/label.c
+++ b/security/apparmor/label.c
@@ -336,7 +336,7 @@ void aa_label_destroy(struct aa_label *label)
rcu_assign_pointer(label->proxy->label, NULL);
aa_put_proxy(label->proxy);
}
- percpu_ref_exit(&label->count);
+ percpu_rcuref_exit(&label->count);
aa_free_secid(label->secid);

label->proxy = (struct aa_proxy *) PROXY_POISON + 1;
@@ -372,7 +372,7 @@ static void label_free_rcu(struct rcu_head *head)

void aa_label_percpu_ref(struct percpu_ref *ref)
{
- struct aa_label *label = container_of(ref, struct aa_label, count);
+ struct aa_label *label = container_of(ref, struct aa_label, count.pcpu_ref);
struct aa_ns *ns = labels_ns(label);

if (!ns) {
@@ -409,7 +409,7 @@ bool aa_label_init(struct aa_label *label, int size, gfp_t gfp)

label->size = size; /* doesn't include null */
label->vec[size] = NULL; /* null terminate */
- if (percpu_ref_init(&label->count, aa_label_percpu_ref, PERCPU_REF_INIT_ATOMIC, gfp)) {
+ if (percpu_rcuref_init_unmanaged(&label->count, aa_label_percpu_ref, gfp)) {
aa_free_secid(label->secid);
return false;
}
@@ -710,8 +710,6 @@ static struct aa_label *__label_insert(struct aa_labelset *ls,
rb_link_node(&label->node, parent, new);
rb_insert_color(&label->node, &ls->root);
label->flags |= FLAG_IN_TREE;
- percpu_ref_switch_to_percpu(&label->count);
- aa_label_reclaim_add_label(label);

return aa_get_label(label);
}
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index d0d4ebad1e26..e490a7000408 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -64,204 +64,6 @@ static LIST_HEAD(aa_global_buffers);
static DEFINE_SPINLOCK(aa_buffers_lock);
static DEFINE_PER_CPU(struct aa_local_cache, aa_local_buffers);

-static struct workqueue_struct *aa_label_reclaim_wq;
-static void aa_label_reclaim_work_fn(struct work_struct *work);
-
-/*
- * Dummy llist nodes, for lockless list traveral and deletions by
- * the reclaim worker, while nodes are added from normal label
- * insertion paths.
- */
-struct aa_label_reclaim_node {
- bool inuse;
- struct llist_node node;
-};
-
-/*
- * We need two dummy head nodes for lockless list manipulations from reclaim
- * worker - first dummy node will be used in current reclaim iteration;
- * the second one will be used in next iteration. Next iteration marks
- * the first dummy node as free, for use in following iteration.
- */
-#define AA_LABEL_RECLAIM_NODE_MAX 2
-
-#define AA_MAX_LABEL_RECLAIMS 100
-#define AA_LABEL_RECLAIM_INTERVAL_MS 5000
-
-static LLIST_HEAD(aa_label_reclaim_head);
-static struct llist_node *last_reclaim_label;
-static struct aa_label_reclaim_node aa_label_reclaim_nodes[AA_LABEL_RECLAIM_NODE_MAX];
-static DECLARE_DELAYED_WORK(aa_label_reclaim_work, aa_label_reclaim_work_fn);
-static struct percpu_ref aa_label_reclaim_ref;
-
-void aa_label_reclaim_add_label(struct aa_label *label)
-{
- percpu_ref_get(&label->count);
- llist_add(&label->reclaim_node, &aa_label_reclaim_head);
-}
-
-static bool aa_label_is_reclaim_node(struct llist_node *node)
-{
- return &aa_label_reclaim_nodes[0].node <= node &&
- node <= &aa_label_reclaim_nodes[AA_LABEL_RECLAIM_NODE_MAX - 1].node;
-}
-
-static struct llist_node *aa_label_get_reclaim_node(void)
-{
- int i;
- struct aa_label_reclaim_node *rn;
-
- for (i = 0; i < AA_LABEL_RECLAIM_NODE_MAX; i++) {
- rn = &aa_label_reclaim_nodes[i];
- if (!rn->inuse) {
- rn->inuse = true;
- return &rn->node;
- }
- }
-
- return NULL;
-}
-
-static void aa_label_put_reclaim_node(struct llist_node *node)
-{
- struct aa_label_reclaim_node *rn = container_of(node, struct aa_label_reclaim_node, node);
-
- rn->inuse = false;
-}
-
-static void aa_put_all_reclaim_nodes(void)
-{
- int i;
-
- for (i = 0; i < AA_LABEL_RECLAIM_NODE_MAX; i++)
- aa_label_reclaim_nodes[i].inuse = false;
-}
-static void aa_release_reclaim_ref_noop(struct percpu_ref *ref)
-{
-}
-
-static void aa_label_reclaim_work_fn(struct work_struct *work)
-{
- struct llist_node *pos, *first, *head, *prev, *next;
- static bool reclaim_ref_dead_once;
- struct llist_node *reclaim_node;
- struct aa_label *label;
- int cnt = 0;
- bool held, ref_is_zero;
-
- first = aa_label_reclaim_head.first;
- if (!first)
- goto queue_reclaim_work;
-
- if (last_reclaim_label == NULL || last_reclaim_label->next == NULL) {
- reclaim_node = aa_label_get_reclaim_node();
- WARN_ONCE(!reclaim_node, "Reclaim heads exhausted\n");
- if (unlikely(!reclaim_node)) {
- head = first->next;
- if (!head) {
- aa_put_all_reclaim_nodes();
- goto queue_reclaim_work;
- }
- prev = first;
- } else {
- llist_add(reclaim_node, &aa_label_reclaim_head);
- prev = reclaim_node;
- head = prev->next;
- }
- } else {
- prev = last_reclaim_label;
- head = prev->next;
- }
-
- last_reclaim_label = NULL;
- llist_for_each_safe(pos, next, head) {
- /* Free reclaim node, which is present in the list */
- if (aa_label_is_reclaim_node(pos)) {
- prev->next = pos->next;
- aa_label_put_reclaim_node(pos);
- continue;
- }
-
- label = container_of(pos, struct aa_label, reclaim_node);
- if (reclaim_ref_dead_once)
- percpu_ref_reinit(&aa_label_reclaim_ref);
-
- /*
- * Switch counters of label ref and reclaim ref.
- * Label's refcount becomes 1
- * Percpu refcount has the current refcount value
- * of the label percpu_ref.
- */
- percpu_ref_swap_percpu_sync(&label->count, &aa_label_reclaim_ref);
-
- /* Switch reclaim ref to percpu, to check for 0 */
- percpu_ref_switch_to_atomic_sync(&aa_label_reclaim_ref);
-
- /*
- * Release a count (original label percpu ref had an extra count,
- * from the llist addition).
- * When all percpu references have been released, this should
- * be the initial count, which gets dropped.
- */
- percpu_ref_put(&aa_label_reclaim_ref);
- /*
- * Release function of reclaim ref is noop; we store the result
- * for later processing after common code.
- */
- if (percpu_ref_is_zero(&aa_label_reclaim_ref))
- ref_is_zero = true;
-
- /*
- * Restore back initial count. Switch reclaim ref to
- * percpu, for switching back the label percpu and
- * atomic counters.
- */
- percpu_ref_get(&aa_label_reclaim_ref);
- percpu_ref_switch_to_percpu(&aa_label_reclaim_ref);
- /*
- * Swap the refs again. Label gets all old counts
- * in its atomic counter after this operation.
- */
- percpu_ref_swap_percpu_sync(&label->count, &aa_label_reclaim_ref);
-
- /*
- * Transfer the percpu counts, which got added, while this
- * switch was going on. The counters are accumulated into
- * the label ref's atomic counter.
- */
- percpu_ref_transfer_percpu_count(&label->count, &aa_label_reclaim_ref);
-
- /* Kill reclaim ref for reinitialization, for next iteration */
- percpu_ref_kill(&aa_label_reclaim_ref);
- reclaim_ref_dead_once = true;
-
- /* If refcount of label ref was found to be 0, reclaim it now! */
- if (ref_is_zero) {
- percpu_ref_switch_to_atomic_sync(&label->count);
- rcu_read_lock();
- percpu_ref_put(&label->count);
- held = percpu_ref_tryget(&label->count);
- if (!held)
- prev->next = pos->next;
- rcu_read_unlock();
- if (!held)
- continue;
- percpu_ref_switch_to_percpu(&label->count);
- }
-
- cnt++;
- if (cnt == AA_MAX_LABEL_RECLAIMS) {
- last_reclaim_label = pos;
- break;
- }
- prev = pos;
- }
-
-queue_reclaim_work:
- queue_delayed_work(aa_label_reclaim_wq, &aa_label_reclaim_work,
- msecs_to_jiffies(AA_LABEL_RECLAIM_INTERVAL_MS));
-}
-
/*
* LSM hook functions
*/
@@ -2197,16 +1999,6 @@ static int __init set_init_ctx(void)
return 0;
}

-static int __init clear_init_ctx(void)
-{
- struct cred *cred = (__force struct cred *)current->real_cred;
-
- set_cred_label(cred, NULL);
- aa_put_label(ns_unconfined(root_ns));
-
- return 0;
-}
-
static void destroy_buffers(void)
{
union aa_buffer *aa_buf;
@@ -2485,22 +2277,6 @@ static int __init apparmor_init(void)
aa_free_root_ns();
goto buffers_out;
}
-
- aa_label_reclaim_wq = alloc_workqueue("aa_label_reclaim",
- WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
- WARN_ON(!aa_label_reclaim_wq);
- if (aa_label_reclaim_wq)
- queue_delayed_work(aa_label_reclaim_wq, &aa_label_reclaim_work,
- AA_LABEL_RECLAIM_INTERVAL_MS);
-
- if (!percpu_ref_init(&aa_label_reclaim_ref, aa_release_reclaim_ref_noop,
- PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
- AA_ERROR("Failed to allocate label reclaim percpu ref\n");
- aa_free_root_ns();
- clear_init_ctx();
- goto buffers_out;
- }
-
security_add_hooks(apparmor_hooks, ARRAY_SIZE(apparmor_hooks),
&apparmor_lsmid);

diff --git a/security/apparmor/policy_ns.c b/security/apparmor/policy_ns.c
index ca633cfbd936..1f02cfe1d974 100644
--- a/security/apparmor/policy_ns.c
+++ b/security/apparmor/policy_ns.c
@@ -124,7 +124,6 @@ static struct aa_ns *alloc_ns(const char *prefix, const char *name)
goto fail_unconfined;
/* ns and ns->unconfined share ns->unconfined refcount */
ns->unconfined->ns = ns;
- aa_switch_ref_ns(ns, true);

atomic_set(&ns->uniq_null, 0);

@@ -337,7 +336,7 @@ void __aa_remove_ns(struct aa_ns *ns)
/* remove ns from namespace list */
list_del_rcu(&ns->base.list);
destroy_ns(ns);
- aa_kill_ref_ns(ns);
+ aa_put_ns(ns);
}

/**
@@ -378,7 +377,6 @@ int __init aa_alloc_root_ns(void)
}
kernel_t = &kernel_p->label;
root_ns->unconfined->ns = aa_get_ns(root_ns);
- aa_switch_ref_ns(root_ns, true);

return 0;
}
@@ -394,5 +392,5 @@ void __init aa_free_root_ns(void)

aa_label_free(kernel_t);
destroy_ns(ns);
- aa_kill_ref_ns(ns);
+ aa_put_ns(ns);
}
--
2.34.1


2024-01-10 11:27:17

by Neeraj Upadhyay

[permalink] [raw]
Subject: [RFC 9/9] apparmor: Switch unconfined and in tree labels to managed ref mode

Switch unconfined and in-tree labels to percpu managed
mode of percpu rcuref. This helps avoid memory contention
in ref get and put operations.

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
security/apparmor/label.c | 1 +
security/apparmor/policy_ns.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/security/apparmor/label.c b/security/apparmor/label.c
index f28dec1c3e70..57fcd5b3e48a 100644
--- a/security/apparmor/label.c
+++ b/security/apparmor/label.c
@@ -710,6 +710,7 @@ static struct aa_label *__label_insert(struct aa_labelset *ls,
rb_link_node(&label->node, parent, new);
rb_insert_color(&label->node, &ls->root);
label->flags |= FLAG_IN_TREE;
+ percpu_rcuref_manage(&label->count);

return aa_get_label(label);
}
diff --git a/security/apparmor/policy_ns.c b/security/apparmor/policy_ns.c
index 1f02cfe1d974..ff261b119c53 100644
--- a/security/apparmor/policy_ns.c
+++ b/security/apparmor/policy_ns.c
@@ -124,6 +124,7 @@ static struct aa_ns *alloc_ns(const char *prefix, const char *name)
goto fail_unconfined;
/* ns and ns->unconfined share ns->unconfined refcount */
ns->unconfined->ns = ns;
+ percpu_rcuref_manage(&ns->unconfined->label.count);

atomic_set(&ns->uniq_null, 0);

@@ -377,6 +378,7 @@ int __init aa_alloc_root_ns(void)
}
kernel_t = &kernel_p->label;
root_ns->unconfined->ns = aa_get_ns(root_ns);
+ percpu_rcuref_manage(&root_ns->unconfined->label.count);

return 0;
}
--
2.34.1


2024-01-10 11:30:23

by Neeraj Upadhyay

[permalink] [raw]
Subject: [RFC 5/9] apparmor: Switch intree labels to percpu mode

Now that we have the infrastructure to reclaim
percpu labels, switch in-tree labels to percpu
mode, to improve refcount scalability.

Signed-off-by: Neeraj Upadhyay <[email protected]>
---
security/apparmor/label.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/security/apparmor/label.c b/security/apparmor/label.c
index aa9e6eac3ecc..1299262f54e1 100644
--- a/security/apparmor/label.c
+++ b/security/apparmor/label.c
@@ -710,6 +710,8 @@ static struct aa_label *__label_insert(struct aa_labelset *ls,
rb_link_node(&label->node, parent, new);
rb_insert_color(&label->node, &ls->root);
label->flags |= FLAG_IN_TREE;
+ percpu_ref_switch_to_percpu(&label->count);
+ aa_label_reclaim_add_label(label);

return aa_get_label(label);
}
--
2.34.1


2024-02-07 04:42:57

by Neeraj Upadhyay

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

Gentle ping.

John,

Could you please confirm that:

a. The AppArmor refcount usage described in the RFC is correct?
b. The approach taken to fix the scalability issue is valid/correct?


Thanks
Neeraj

On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
> Problem Statement
> =================
>
> Nginx performance testing with Apparmor enabled (with nginx
> running in unconfined profile), on kernel versions 6.1 and 6.5
> show significant drop in throughput scalability, when Nginx
> workers are scaled to higher number of CPUs across various
> L3 cache domains.
>
> Below is one sample data on the throughput scalability loss,
> based on results on AMD Zen4 system with 96 CPUs with SMT
> core count 2; so, overall, 192 CPUs:
>
> Config Cache Domains apparmor=off apparmor=on
> scaling eff (%) scaling eff (%)
> 8C16T 1 100% 100%
> 16C32T 2 95% 94%
> 24C48T 3 94% 93%
> 48C96T 6 92% 88%
> 96C192T 12 85% 68%
>
> If we look at above data, there is a significant drop in
> scaling efficiency, when we move to 96 CPUs/192 SMT threads.
>
> Perf tool shows most of the contention coming from below
> 6.56% nginx [kernel.vmlinux] [k] apparmor_current_getsecid_subj
> 6.22% nginx [kernel.vmlinux] [k] apparmor_file_open
>
> The majority of the CPU cycles is found to be due to memory contention
> in atomic_fetch_add and atomic_fetch_sub operations from kref_get() and
> kref_put() operations on label.
>
> Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
> from 6.7 alleviates the issue to an extent, but not completely:
>
> Config Cache Domains apparmor=on apparmor=on (patched)
> scaling eff (%) scaling eff (%)
> 8C16T 1 100% 100%
> 16C32T 2 97% 93%
> 24C48T 3 94% 92%
> 48C96T 6 88% 88%
> 96C192T 12 65% 79%
>
> This adverse impact gets more pronounced when we move to >192 CPUs.
> The memory contention and impact increases with high frequency label
> update operations and labels are marked stale more frequently.
>
>
> Label Refcount Management
> =========================
>
> Apparmor uses label objects (struct aa_label) to manage refcounts for
> below set of objects:
>
> - Applicable profiles
> - Namespaces (unconfined profile)
> - Other non-profile references
>
> These label references are acquired on various apparmor lsm hooks,
> on operations such as file open, task kill operations, socket bind,
> and other file, socket, misc operations which use current task's cred,
> when the label for the current cred, has been marked stale. This is
> done to check these operations against the set of allowed operations
> for the task performing them.
>
> Use Percpu refcount for ref management?
> =======================================
>
> The ref put operations (percpu_ref_put()) in percpu refcount,
> in active mode, do not check whether ref count has dropped to
> 0. The users of the percpu_ref need to explicitly invoke
> a percpu_ref_kill() operation, to drop the initial reference,
> at shutdown paths. After the percpu_ref_kill() operation, ref
> switches to atomic mode and any new percpu_ref_put() operation
> checks for the drop to 0 case and invokes the release operation
> on that label.
>
> Labels are marked stale is situations like profile removal,
> profile updates. For a namespace, the unconfined label reference
> is dropped when the namespace is destroyed. These points
> are potential shutdown points for labels. However, killing
> the percpu ref from these points has few issues:
>
> - The label could still be referenced by tasks, which are
> still holding the reference to the now stale label.
> Killing the label ref while these operations are in progress
> will make all subsequent ref-put operations on the stale label
> to be atomic, which would still result in memory contention.
> Also, any new reference to the stale label, which is acquired
> with the elevated refcount will have atomic op contention.
>
> - The label is marked stale using a non-atomic write operation.
> It is possible that new operations do not observe this flag
> and still reference it for quite some time.
>
> - Explicitly tracking the shutdown points might not be maintainable
> at a per label granularity, as there can be various paths where
> label reference could get dropped, such as, before the label has
> gone live - object initialization error paths. Also, tracking
> the shutdown points for labels which reference other labels -
> subprofiles, merged labels requires careful analysis, and adds
> heavy burden on ensuring the memory contention is not introduced
> by these ref kill points.
>
>
> Proposed Solution
> =================
>
> One potential solution to the refcount scalability problem is to
> convert the label refcount to a percpu refcount, and manage
> the initial reference from kworker context. The kworker
> keeps an extra reference to the label and periodically scans
> labels and release them if their refcount drops to 0.
>
> Below is the sequence of operations, which shows the refcount
> management with this approach:
>
> 1. During label initialization, the percpu ref is initialized in
> atomic mode. This is done to ensure that, for cases where the
> label hasn't gone live (->ns isn't assigned), mostly during
> initialization error paths.
>
> 2. Labels are switched to percpu mode at various points -
> when a label is added to labelset tree, when a unconfined profile
> has been assigned a namespace.
>
> 3. As part of the initial prototype, only the in tree labels
> are managed by the kworker. These labels are added to a lockless
> list. The unconfined labels invoke a percpu_ref_kill() operation
> when the namespace is destroyed.
>
> 4. The kworker does a periodic scan of all the labels in the
> llist. It does below sequence of operations:
>
> a. Enqueue a dummy node to mark the start of scan. This dummy
> node is used as the start point of the scan and ensures that
> there is no additional synchronization required with new
> label node additions to the llist. Any new labels will
> be processed in next run of the kworker.
>
> SCAN START PTR
> |
> v
> +----------+ +------+ +------+ +------+
> | | | | | | | |
> | head ------> dummy|--->|label |--->| label|--->NULL
> | | | node | | | | |
> +----------+ +------+ +------+ +------+
>
>
> New label addition:
>
> SCAN START PTR
> |
> v
> +----------+ +------+ +------+ +------+ +------+
> | | | | | | | | | |
> | head |--> label|--> dummy|--->|label |--->| label|--->NULL
> | | | | | node | | | | |
> +----------+ +------+ +------+ +------+ +------+
>
> b. Traverse through the llist, starting from dummy->next.
> If the node is a dummy node, mark it free.
> If the node is a label node, do,
>
> i) Switch the label ref to atomic mode. The ref switch wait
> for the existing percpu_ref_get() and percpu_ref_put()
> operations to complete, by waiting for a RCU grace period.
>
> Once the switch is complete, from this point onwards, any
> percpu_ref_get(), percpu_ref_put() operations use
> atomic operations.
>
> ii) Drop the initial reference, which was taken while adding
> the label node to the llist.
>
> iii) Use a percpu_ref_tryget() increment operation on the
> ref, to see if we dropped the last ref count. if we
> dropped the last count, we remove the node from the llist.
>
> All of these operations are done inside a RCU critical
> section, to avoid race with the release operations,
> which can potentially trigger, as soon as we drop
> the initial ref count.
>
> iv) If we didn't drop the last ref, switch back the counter
> to percpu mode.
>
> Using this approach, to move the atomic refcount manipulation out of the
> contended paths, there is a significant scalability improvement seen on
> nginx test, and scalability efficiency is close to apparmor-off case.
>
> Config Cache Domains apparmor=on (percpuref)
> scaling eff (%)
> 8C16T 1 100%
> 16C32T 2 96%
> 24C48T 3 94%
> 48C96T 6 93%
> 96C192T 12 90%
>
> Limitations
> ===========
>
> 1. Switching to percpu refcount increases memory size overhead, as
> percpu memory is allocated for all labels.
>
> 2. Deferring labels reclaim could potentially result in memory
> pressure, when there are high frequency of label update operations.
>
> 3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
> These can impact energy efficiency, due to back to back hurry
> callbacks. Using deferrable workqueue partly mitigates this.
> However, deferring kworker can delay reclaims.
>
> 4. Back to back label switches can delay other percpu users, as
> there is a single global switch spinlock used by percpu refcount
> lib.
>
> 5. Long running kworker can delay other use cases like system suspend.
> This is mitigated using a freezable workqueue and limiting node
> scans to a max count.
>
> 6. There is a window where the label operates in atomic mode, when its
> counter is being checked for zero. This can potentially result
> in high memory contention, during this window which spans RCU
> grace period (plus callback execution). For example, when
> scanning label corresponding to unconfined profile, all
> applications which use unconfined profile would be using
> atomic ref increment and decrement operations.
>
> There are a few approaches which were tried to mitigate this issue:
>
> a. At a lower time interval, check if scanned label's counter
> has changed since the start of label scan. If there is a change
> in count, terminate the switch to atomic mode. Below shows the
> approach using rcuwait.
>
> static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
> {
> WRITE_ONCE(aa_atomic_switch_complete, true);
> rcuwait_wake_up(&aa_reclaim_rcuwait);
> }
>
> rcuwait_init(&aa_reclaim_rcuwait);
> percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);
>
> atomic_count = percpu_ref_count_read(&label->count);
> do {
> rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
> (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
> TASK_IDLE,
> msecs_to_jiffies(5));
> if (percpu_ref_count_read(&label->count) != atomic_count)
> break;
> } while (!READ_ONCE(switch_complete));
>
> However, this approach does not work, as percpu refcount lib does not
> allow termination of an ongoing switch operation. Also, the counter
> can return to the original value with set of get() and put() operations
> before we check the current value.
>
> b. Approaches to notify the reclaim kworker from ref get and put operations
> can potentially disturb cache line state between the various CPU
> contexts, which are referencing the label, and can potentially impact
> scalability again.
>
> c. Switch the label to an immortal percpu ref, while the scan operates
> on the current counter.
>
> Below is the sequence of operations to do this:
>
> 1. Ensure that both immortal ref and label ref are in percpu mode.
> Reinit the immortal ref in percpu mode.
>
> Swap percpu and atomic counters of label refcount and immortal ref
> percpu-ref
> +-------------------+
> +-------+ | percpu-ctr-addr1 |
> | label | --------->|-------------------| +----------------+
> +-------+ | data |--->| Atomic counter1|
> +-------------------+ +----------------+
> +-------+ +-------------------+
> |ImmLbl |---------->| percpu-ctr-addr2 | +----------------+
> +-------+ |-------------------|--->| Atomic counter2|
> | data | +----------------+
> +-------------------+
>
> label ->percpu-ctr-addr = percpu-ctr-addr2
> ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
> label ->data->count = Atomic counter2
> ImmLbl ->data->count = Atomic counter1
>
>
> 2. Check the counters collected in immortal label, by switch it
> to atomic mode.
>
> 3. If the count is 0, do,
> a. Switch immortal counter to percpu again, giving it an
> initial count of 1.
> b. Swap the label and immortal counters again. The immortal
> ref now has the counter values from new percpu ref get
> and put operations on the label ref, from the point
> when we did the initial swap operation.
> c. Transfer the percpu counts in immortal ref to atomic
> counter of label percpu refcount.
> d. Kill immortal ref, for reinit on next iteration.
> e. Switch label percpu ref to atomic mode.
> f. If the counter is 1, drop the initial ref.
>
> 4. If the count is not 0, re-swap the counters.
> a. Switch immortal counter to percpu again, giving it an
> initial count of 1.
> b. Swap the label and immortal counters again. The immortal
> ref now has the counter values from new percpu ref get
> and put operations on the label ref, from the point
> when we did the initial swap operation.
> c. Transfer the percpu counts in immortal ref to atomic
> counter of label percpu refcount.
> d. Kill immortal ref, for reinit on next iteration.
>
>
> Using this approach, we ensure that, label ref users do not switch
> to atomic mode, while there are active references on the label.
> However, this approach requires multiple percpu ref mode switches
> and adds high overhead and complexity to the scanning code.
>
> Extended/Future Work
> ====================
>
> 1. Look for ways to fix the limitations, as described in the "Limitations"
> section.
>
> 2. Generalize the approach to percpu rcuref, which is used for contexts
> where release path uses RCU grace period for release operations. Patch
> 7 creates an initial prototype for this.
>
> 3. Explore hazard pointers for scalable refcounting of labels.
>
> Highly appreciate any feedback/suggestions on the design approach.
>
> The patches of this patchset introduce following changes:
>
> 1. Documentation of Apparmor Refcount management.
>
> 2. Switch labels to percpu refcount in atomic mode.
>
> Use percpu refcount for apparmor labels. Initial patch to init
> the percpu ref in atomic mode, to evaluate the potential
> impact of percpuref on top of kref based implementation.
>
> 3. Switch unconfined namespaces refcount to percpu mode.
>
> Switch unconfined ns labels to percpu mode, and kill the
> initial refcount from namespace destroy path.
>
> 4. Add infrastructure to reclaim percpu labels.
>
> Add a label reclaim infrastructure for labels which are
> in percpu mode, for managing their inital refcount.
>
> 5. Switch intree labels to percpu mode.
>
> Use label reclaim infrastructure to manage intree labels.
>
> 6. Initial prototype for optimizing ref switch.
>
> Prototype for reducing the time window when a label
> scan switches the label ref to atomic mode.
>
> 7. percpu-rcuref: Add basic infrastructure.
>
> Prototype for Percpu refcounts for objects, which protect
> their object reclaims using RCU grace period.
>
> 8. Switch labels to percpu rcurefcount in unmanaged mode.
>
> Use percpu rcuref for labels. Start with unmanaged/atomic
> mode.
>
> 9. Switch unconfined and in tree labels to managed ref mode.
>
> Use percpu mode with manager worker for unconfined and intree
> labels.
>
>
> ------------------------------------------------------------------------
>
> b/Documentation/admin-guide/LSM/ApparmorRefcount.rst | 351 ++++++++++++++++++++++++++++++++++++++++++++++++++
> b/Documentation/admin-guide/LSM/index.rst | 1
> b/Documentation/admin-guide/kernel-parameters.txt | 8 +
> b/include/linux/percpu-rcurefcount.h | 115 ++++++++++++++++
> b/include/linux/percpu-refcount.h | 2
> b/lib/Makefile | 2
> b/lib/percpu-rcurefcount.c | 336 +++++++++++++++++++++++++++++++++++++++++++++++
> b/lib/percpu-refcount.c | 93 +++++++++++++
> b/security/apparmor/include/label.h | 16 +-
> b/security/apparmor/include/policy.h | 8 -
> b/security/apparmor/include/policy_ns.h | 24 +++
> b/security/apparmor/label.c | 11 +
> b/security/apparmor/lsm.c | 145 ++++++++++++++++++++
> b/security/apparmor/policy_ns.c | 6
> include/linux/percpu-refcount.h | 2
> lib/percpu-refcount.c | 93 -------------
> security/apparmor/include/label.h | 17 +-
> security/apparmor/include/policy.h | 56 +++----
> security/apparmor/include/policy_ns.h | 24 ---
> security/apparmor/label.c | 11 -
> security/apparmor/lsm.c | 325 ++++++++++++----------------------------------
> security/apparmor/policy_ns.c | 8 -
> 22 files changed, 1237 insertions(+), 417 deletions(-)
>
> base-commit: ab27740f7665

2024-02-09 17:33:48

by John Johansen

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 2/6/24 20:40, Neeraj Upadhyay wrote:
> Gentle ping.
>
> John,
>
> Could you please confirm that:
>
> a. The AppArmor refcount usage described in the RFC is correct?
> b. Approach taken to fix the scalability issue is valid/correct?
>

Hi Neeraj,

I know your patchset has been waiting on review for a long time.
Unfortunately I have been very, very busy lately. I will try to
get to it this weekend, but I can't promise that I will be able
to get the review fully done.

john


>
> On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
>> Problem Statement
>> =================
>>
>> Nginx performance testing with Apparmor enabled (with nginx
>> running in unconfined profile), on kernel versions 6.1 and 6.5
>> show significant drop in throughput scalability, when Nginx
>> workers are scaled to higher number of CPUs across various
>> L3 cache domains.
>>
>> Below is one sample data on the throughput scalability loss,
>> based on results on AMD Zen4 system with 96 CPUs with SMT
>> core count 2; so, overall, 192 CPUs:
>>
>> Config Cache Domains apparmor=off apparmor=on
>> scaling eff (%) scaling eff (%)
>> 8C16T 1 100% 100%
>> 16C32T 2 95% 94%
>> 24C48T 3 94% 93%
>> 48C96T 6 92% 88%
>> 96C192T 12 85% 68%
>>
>> If we look at above data, there is a significant drop in
>> scaling efficiency, when we move to 96 CPUs/192 SMT threads.
>>
>> Perf tool shows most of the contention coming from below
>> 6.56% nginx [kernel.vmlinux] [k] apparmor_current_getsecid_subj
>> 6.22% nginx [kernel.vmlinux] [k] apparmor_file_open
>>
>> The majority of the CPU cycles is found to be due to memory contention
>> in atomic_fetch_add and atomic_fetch_sub operations from kref_get() and
>> kref_put() operations on label.
>>
>> Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
>> from 6.7 alleviates the issue to an extent, but not completely:
>>
>> Config Cache Domains apparmor=on apparmor=on (patched)
>> scaling eff (%) scaling eff (%)
>> 8C16T 1 100% 100%
>> 16C32T 2 97% 93%
>> 24C48T 3 94% 92%
>> 48C96T 6 88% 88%
>> 96C192T 12 65% 79%
>>
>> This adverse impact gets more pronounced when we move to >192 CPUs.
>> The memory contention and impact increases with high frequency label
>> update operations and labels are marked stale more frequently.
>>
>>
>> Label Refcount Management
>> =========================
>>
>> Apparmor uses label objects (struct aa_label) to manage refcounts for
>> below set of objects:
>>
>> - Applicable profiles
>> - Namespaces (unconfined profile)
>> - Other non-profile references
>>
>> These label references are acquired on various apparmor lsm hooks,
>> on operations such as file open, task kill operations, socket bind,
>> and other file, socket, misc operations which use current task's cred,
>> when the label for the current cred, has been marked stale. This is
>> done to check these operations against the set of allowed operations
>> for the task performing them.
>>
>> Use Percpu refcount for ref management?
>> =======================================
>>
>> The ref put operations (percpu_ref_put()) in percpu refcount,
>> in active mode, do not check whether ref count has dropped to
>> 0. The users of the percpu_ref need to explicitly invoke
>> a percpu_ref_kill() operation, to drop the initial reference,
>> at shutdown paths. After the percpu_ref_kill() operation, ref
>> switches to atomic mode and any new percpu_ref_put() operation
>> checks for the drop to 0 case and invokes the release operation
>> on that label.
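As a reference point, a minimal sketch of that stock percpu_ref lifecycle, assuming the label's kref is replaced by a percpu_ref field named count and a release callback named aa_label_percpu_release (both names are assumptions, not the patchset's code):

    #include <linux/percpu-refcount.h>

    /* assumed layout: the kref in struct aa_label replaced by a percpu_ref */
    struct aa_label {
            struct percpu_ref count;
            /* ... */
    };

    static void aa_label_percpu_release(struct percpu_ref *ref)
    {
            /* free the enclosing label, typically container_of() + RCU free */
    }

    static int label_ref_lifecycle_example(struct aa_label *label)
    {
            int err;

            /* flags = 0: start out in (fast) percpu mode */
            err = percpu_ref_init(&label->count, aa_label_percpu_release, 0,
                                  GFP_KERNEL);
            if (err)
                    return err;

            percpu_ref_get(&label->count);  /* cheap per-CPU increment */
            percpu_ref_put(&label->count);  /* no zero check while in percpu mode */

            /*
             * Shutdown: drop the initial reference; the ref switches to atomic
             * mode and the release callback runs once the count hits zero.
             */
            percpu_ref_kill(&label->count);
            return 0;
    }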
>>
>> Labels are marked stale in situations like profile removal and
>> profile updates. For a namespace, the unconfined label reference
>> is dropped when the namespace is destroyed. These points
>> are potential shutdown points for labels. However, killing
>> the percpu ref from these points has a few issues:
>>
>> - The label could still be referenced by tasks, which are
>> still holding the reference to the now stale label.
>> Killing the label ref while these operations are in progress
>> will make all subsequent ref-put operations on the stale label
>> to be atomic, which would still result in memory contention.
>> Also, any new reference to the stale label, which is acquired
>> with the elevated refcount will have atomic op contention.
>>
>> - The label is marked stale using a non-atomic write operation.
>> It is possible that new operations do not observe this flag
>> and still reference it for quite some time.
>>
>> - Explicitly tracking the shutdown points might not be maintainable
>> at a per-label granularity, as there are various paths where the
>> label reference could get dropped, such as before the label has
>> gone live (object initialization error paths). Also, tracking
>> the shutdown points for labels which reference other labels
>> (subprofiles, merged labels) requires careful analysis, and adds
>> a heavy burden of ensuring that memory contention is not introduced
>> by these ref kill points.
>>
>>
>> Proposed Solution
>> =================
>>
>> One potential solution to the refcount scalability problem is to
>> convert the label refcount to a percpu refcount, and manage
>> the initial reference from kworker context. The kworker
>> keeps an extra reference to the label, periodically scans
>> labels, and releases them if their refcount drops to 0.
>>
>> Below is the sequence of operations, which shows the refcount
>> management with this approach:
>>
>> 1. During label initialization, the percpu ref is initialized in
>> atomic mode. This is done to handle cases where the
>> label hasn't gone live (->ns isn't assigned), mostly during
>> initialization error paths.
>>
>> 2. Labels are switched to percpu mode at various points -
>> when a label is added to the labelset tree, or when an unconfined profile
>> has been assigned a namespace.
>>
>> 3. As part of the initial prototype, only the in tree labels
>> are managed by the kworker. These labels are added to a lockless
>> list. The unconfined labels invoke a percpu_ref_kill() operation
>> when the namespace is destroyed.
>>
>> 4. The kworker does a periodic scan of all the labels in the
>> llist. It performs the below sequence of operations:
>>
>> a. Enqueue a dummy node to mark the start of scan. This dummy
>> node is used as the start point of the scan and ensures that
>> there is no additional synchronization required with new
>> label node additions to the llist. Any new labels will
>> be processed in the next run of the kworker.
>>
>> SCAN START PTR
>> |
>> v
>> +----------+ +------+ +------+ +------+
>> | | | | | | | |
>> | head ------> dummy|--->|label |--->| label|--->NULL
>> | | | node | | | | |
>> +----------+ +------+ +------+ +------+
>>
>>
>> New label addition:
>>
>> SCAN START PTR
>> |
>> v
>> +----------+ +------+ +------+ +------+ +------+
>> | | | | | | | | | |
>> | head |--> label|--> dummy|--->|label |--->| label|--->NULL
>> | | | | | node | | | | |
>> +----------+ +------+ +------+ +------+ +------+
>>
>> b. Traverse through the llist, starting from dummy->next.
>> If the node is a dummy node, mark it free.
>> If the node is a label node, do,
>>
>> i) Switch the label ref to atomic mode. The ref switch waits
>> for the existing percpu_ref_get() and percpu_ref_put()
>> operations to complete, by waiting for an RCU grace period.
>>
>> Once the switch is complete, from this point onwards, any
>> percpu_ref_get(), percpu_ref_put() operations use
>> atomic operations.
>>
>> ii) Drop the initial reference, which was taken while adding
>> the label node to the llist.
>>
>> iii) Use a percpu_ref_tryget() increment operation on the
>> ref, to see if we dropped the last ref count. If we
>> dropped the last count, we remove the node from the llist.
>>
>> All of these operations are done inside an RCU critical
>> section, to avoid racing with the release operation,
>> which can potentially trigger as soon as we drop
>> the initial ref count.
>>
>> iv) If we didn't drop the last ref, switch back the counter
>> to percpu mode.
>>
>> Using this approach to move the atomic refcount manipulation out of the
>> contended paths, there is a significant scalability improvement seen on
>> the nginx test, and scaling efficiency is close to the apparmor-off case.
>>
>> Config Cache Domains apparmor=on (percpuref)
>> scaling eff (%)
>> 8C16T 1 100%
>> 16C32T 2 96%
>> 24C48T 3 94%
>> 48C96T 6 93%
>> 96C192T 12 90%
>>
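To make the scan in step 4 above more concrete, a minimal, hedged sketch of the kworker loop follows; the reclaim-node structure, the list head name and the label->count percpu_ref field are illustrative assumptions rather than the patchset's code, and unlinking/freeing of processed nodes is elided:

    #include <linux/llist.h>
    #include <linux/percpu-refcount.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* assumed: the label's kref is replaced by a percpu_ref named count */
    struct aa_label {
            struct percpu_ref count;
            /* ... */
    };

    /* one node per queued label; dummy nodes have label == NULL */
    struct aa_label_reclaim_node {
            struct llist_node node;
            struct aa_label *label;
    };

    static LLIST_HEAD(aa_label_reclaim_head);

    static void aa_label_reclaim_scan(void)
    {
            struct aa_label_reclaim_node *dummy, *rn;
            struct llist_node *pos;

            dummy = kzalloc(sizeof(*dummy), GFP_KERNEL);
            if (!dummy)
                    return;

            /* 4a: mark the scan start; labels queued after this point are
             * simply picked up by the next scan, no extra locking needed */
            llist_add(&dummy->node, &aa_label_reclaim_head);

            /* 4b: walk the nodes that were queued before the dummy node */
            for (pos = dummy->node.next; pos; pos = pos->next) {
                    rn = llist_entry(pos, struct aa_label_reclaim_node, node);
                    if (!rn->label)
                            continue;       /* dummy left over from an earlier scan */

                    /* i) wait for in-flight percpu gets/puts, then go atomic */
                    percpu_ref_switch_to_atomic_sync(&rn->label->count);

                    rcu_read_lock();
                    /* ii) drop the initial ref taken when the label was queued */
                    percpu_ref_put(&rn->label->count);

                    /* iii) was that the last reference? */
                    if (!percpu_ref_tryget(&rn->label->count)) {
                            /* release has been triggered; node removal elided */
                            rcu_read_unlock();
                            continue;
                    }
                    rcu_read_unlock();

                    /* iv) still referenced: the tryget becomes the new initial
                     * ref, switch the counter back to percpu mode */
                    percpu_ref_switch_to_percpu(&rn->label->count);
            }
    }

As noted in the limitations below, the actual series additionally bounds the number of nodes scanned per run and uses a deferrable, freezable workqueue.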
>> Limitations
>> ===========
>>
>> 1. Switching to percpu refcount increases memory size overhead, as
>> percpu memory is allocated for all labels.
>>
>> 2. Deferring label reclaim could potentially result in memory
>> pressure when there is a high frequency of label update operations.
>>
>> 3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
>> These can impact energy efficiency, due to back-to-back hurry
>> callbacks. Using a deferrable workqueue partly mitigates this.
>> However, deferring the kworker can delay reclaims.
>>
>> 4. Back-to-back label switches can delay other percpu users, as
>> there is a single global switch spinlock used by the percpu refcount
>> lib.
>>
>> 5. A long-running kworker can delay other use cases like system suspend.
>> This is mitigated by using a freezable workqueue and limiting node
>> scans to a max count.
>>
>> 6. There is a window where the label operates in atomic mode, while its
>> counter is being checked for zero. This can potentially result
>> in high memory contention during this window, which spans an RCU
>> grace period (plus callback execution). For example, when
>> scanning the label corresponding to the unconfined profile, all
>> applications which use the unconfined profile would be using
>> atomic ref increment and decrement operations.
>>
>> There are a few approaches which were tried to mitigate this issue:
>>
>> a. At a lower time interval, check if the scanned label's counter
>> has changed since the start of the label scan. If there is a change
>> in count, terminate the switch to atomic mode. Below is the
>> approach using rcuwait.
>>
>> static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
>> {
>> WRITE_ONCE(aa_atomic_switch_complete, true);
>> rcuwait_wake_up(&aa_reclaim_rcuwait);
>> }
>>
>> rcuwait_init(&aa_reclaim_rcuwait);
>> percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);
>>
>> atomic_count = percpu_ref_count_read(&label->count);
>> do {
>> rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
>> (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
>> TASK_IDLE,
>> msecs_to_jiffies(5));
>> if (percpu_ref_count_read(&label->count) != atomic_count)
>> break;
>> } while (!READ_ONCE(switch_complete));
>>
>> However, this approach does not work, as the percpu refcount lib does not
>> allow termination of an ongoing switch operation. Also, the counter
>> can return to the original value with a set of get() and put() operations
>> before we check the current value.
>>
>> b. Approaches to notify the reclaim kworker from ref get and put operations
>> can potentially disturb cache line state between the various CPU
>> contexts which are referencing the label, and can potentially impact
>> scalability again.
>>
>> c. Switch the label to an immortal percpu ref, while the scan operates
>> on the current counter.
>>
>> Below is the sequence of operations to do this:
>>
>> 1. Ensure that both immortal ref and label ref are in percpu mode.
>> Reinit the immortal ref in percpu mode.
>>
>> Swap percpu and atomic counters of label refcount and immortal ref
>> percpu-ref
>> +-------------------+
>> +-------+ | percpu-ctr-addr1 |
>> | label | --------->|-------------------| +----------------+
>> +-------+ | data |--->| Atomic counter1|
>> +-------------------+ +----------------+
>> +-------+ +-------------------+
>> |ImmLbl |---------->| percpu-ctr-addr2 | +----------------+
>> +-------+ |-------------------|--->| Atomic counter2|
>> | data | +----------------+
>> +-------------------+
>>
>> label ->percpu-ctr-addr = percpu-ctr-addr2
>> ImmLbl ->percpu-ctr-addr = percpu-ctr-addr1
>> label ->data->count = Atomic counter2
>> ImmLbl ->data->count = Atomic counter1
>>
>>
>> 2. Check the counters collected in the immortal label, by switching it
>> to atomic mode.
>>
>> 3. If the count is 0, do,
>> a. Switch immortal counter to percpu again, giving it an
>> initial count of 1.
>> b. Swap the label and immortal counters again. The immortal
>> ref now has the counter values from new percpu ref get
>> and put operations on the label ref, from the point
>> when we did the initial swap operation.
>> c. Transfer the percpu counts in immortal ref to atomic
>> counter of label percpu refcount.
>> d. Kill immortal ref, for reinit on next iteration.
>> e. Switch label percpu ref to atomic mode.
>> f. If the counter is 1, drop the initial ref.
>>
>> 4. If the count is not 0, re-swap the counters.
>> a. Switch immortal counter to percpu again, giving it an
>> initial count of 1.
>> b. Swap the label and immortal counters again. The immortal
>> ref now has the counter values from new percpu ref get
>> and put operations on the label ref, from the point
>> when we did the initial swap operation.
>> c. Transfer the percpu counts in immortal ref to atomic
>> counter of label percpu refcount.
>> d. Kill immortal ref, for reinit on next iteration.
>>
>>
>> Using this approach, we ensure that label ref users do not switch
>> to atomic mode while there are active references on the label.
>> However, this approach requires multiple percpu ref mode switches
>> and adds high overhead and complexity to the scanning code.
>>
>> Extended/Future Work
>> ====================
>>
>> 1. Look for ways to fix the limitations, as described in the "Limitations"
>> section.
>>
>> 2. Generalize the approach to percpu rcuref, which is used for contexts
>> where release path uses RCU grace period for release operations. Patch
>> 7 creates an initial prototype for this.
>>
>> 3. Explore hazard pointers for scalable refcounting of labels.
>>
>> Highly appreciate any feedback/suggestions on the design approach.
>>
>> The patches of this patchset introduce the following changes:
>>
>> 1. Documentation of Apparmor Refcount management.
>>
>> 2. Switch labels to percpu refcount in atomic mode.
>>
>> Use percpu refcount for apparmor labels. Initial patch to init
>> the percpu ref in atomic mode, to evaluate the potential
>> impact of percpuref on top of kref based implementation.
>>
>> 3. Switch unconfined namespaces refcount to percpu mode.
>>
>> Switch unconfined ns labels to percpu mode, and kill the
>> initial refcount from namespace destroy path.
>>
>> 4. Add infrastructure to reclaim percpu labels.
>>
>> Add a label reclaim infrastructure for labels which are
>> in percpu mode, for managing their initial refcount.
>>
>> 5. Switch intree labels to percpu mode.
>>
>> Use label reclaim infrastructure to manage intree labels.
>>
>> 6. Initial prototype for optimizing ref switch.
>>
>> Prototype for reducing the time window when a label
>> scan switches the label ref to atomic mode.
>>
>> 7. percpu-rcuref: Add basic infrastructure.
>>
>> Prototype for Percpu refcounts for objects, which protect
>> their object reclaims using RCU grace period.
>>
>> 8. Switch labels to percpu rcurefcount in unmanaged mode.
>>
>> Use percpu rcuref for labels. Start with unmanaged/atomic
>> mode.
>>
>> 9. Switch unconfined and in tree labels to managed ref mode.
>>
>> Use percpu mode with manager worker for unconfined and intree
>> labels.
>>
>>
>> ------------------------------------------------------------------------
>>
>> b/Documentation/admin-guide/LSM/ApparmorRefcount.rst | 351 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> b/Documentation/admin-guide/LSM/index.rst | 1
>> b/Documentation/admin-guide/kernel-parameters.txt | 8 +
>> b/include/linux/percpu-rcurefcount.h | 115 ++++++++++++++++
>> b/include/linux/percpu-refcount.h | 2
>> b/lib/Makefile | 2
>> b/lib/percpu-rcurefcount.c | 336 +++++++++++++++++++++++++++++++++++++++++++++++
>> b/lib/percpu-refcount.c | 93 +++++++++++++
>> b/security/apparmor/include/label.h | 16 +-
>> b/security/apparmor/include/policy.h | 8 -
>> b/security/apparmor/include/policy_ns.h | 24 +++
>> b/security/apparmor/label.c | 11 +
>> b/security/apparmor/lsm.c | 145 ++++++++++++++++++++
>> b/security/apparmor/policy_ns.c | 6
>> include/linux/percpu-refcount.h | 2
>> lib/percpu-refcount.c | 93 -------------
>> security/apparmor/include/label.h | 17 +-
>> security/apparmor/include/policy.h | 56 +++----
>> security/apparmor/include/policy_ns.h | 24 ---
>> security/apparmor/label.c | 11 -
>> security/apparmor/lsm.c | 325 ++++++++++++----------------------------------
>> security/apparmor/policy_ns.c | 8 -
>> 22 files changed, 1237 insertions(+), 417 deletions(-)
>>
>> base-commit: ab27740f7665


2024-03-02 10:23:17

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 2/9/24, John Johansen <[email protected]> wrote:
> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>> Gentle ping.
>>
>> John,
>>
>> Could you please confirm that:
>>
>> a. The AppArmor refcount usage described in the RFC is correct?
>> b. Approach taken to fix the scalability issue is valid/correct?
>>
>
> Hi Neeraj,
>
> I know your patchset has been waiting on review for a long time.
> Unfortunately I have been very, very busy lately. I will try to
> get to it this weekend, but I can't promise that I will be able
> to get the review fully done.
>

Gentle prod.

Any chances of this getting reviewed in the foreseeable future? Would
be a real bummer if the patchset fell through the cracks.

--
Mateusz Guzik <mjguzik gmail.com>

2024-03-08 20:10:12

by John Johansen

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 3/2/24 02:23, Mateusz Guzik wrote:
> On 2/9/24, John Johansen <[email protected]> wrote:
>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>>> Gentle ping.
>>>
>>> John,
>>>
>>> Could you please confirm that:
>>>
>>> a. The AppArmor refcount usage described in the RFC is correct?
>>> b. Approach taken to fix the scalability issue is valid/correct?
>>>
>>
>> Hi Neeraj,
>>
>> I know your patchset has been waiting on review for a long time.
>> Unfortunately I have been very, very busy lately. I will try to
>> get to it this weekend, but I can't promise that I will be able
>> to get the review fully done.
>>
>
> Gentle prod.
>
> Any chances of this getting reviewed in the foreseeable future? Would
> be a real bummer if the patchset fell through the cracks.
>

Yes, sorry, I have been unavailable for the last couple of weeks. I am
now back. I have a rather large backlog to try catching up on, but this
has an entry on the list.



2024-05-24 21:10:41

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On Fri, Mar 8, 2024 at 9:09 PM John Johansen
<[email protected]> wrote:
>
> On 3/2/24 02:23, Mateusz Guzik wrote:
> > On 2/9/24, John Johansen <[email protected]> wrote:
> >> On 2/6/24 20:40, Neeraj Upadhyay wrote:
> >>> Gentle ping.
> >>>
> >>> John,
> >>>
> >>> Could you please confirm that:
> >>>
> >>> a. The AppArmor refcount usage described in the RFC is correct?
> >>> b. Approach taken to fix the scalability issue is valid/correct?
> >>>
> >>
> >> Hi Neeraj,
> >>
> >> I know your patchset has been waiting on review for a long time.
> >> Unfortunately I have been very, very busy lately. I will try to
> >> get to it this weekend, but I can't promise that I will be able
> >> to get the review fully done.
> >>
> >
> > Gentle prod.
> >
> > Any chances of this getting reviewed in the foreseeable future? Would
> > be a real bummer if the patchset fell through the cracks.
> >
>
> yes, sorry I have been unavailable for the last couple of weeks. I am
> now back, I have a rather large backlog to try catching up on but this
> is has an entry on the list.
>

So where do we stand here?

--
Mateusz Guzik <mjguzik gmail.com>

2024-05-24 21:52:20

by John Johansen

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 5/24/24 14:10, Mateusz Guzik wrote:
> On Fri, Mar 8, 2024 at 9:09 PM John Johansen
> <[email protected]> wrote:
>>
>> On 3/2/24 02:23, Mateusz Guzik wrote:
>>> On 2/9/24, John Johansen <[email protected]> wrote:
>>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>>>>> Gentle ping.
>>>>>
>>>>> John,
>>>>>
>>>>> Could you please confirm that:
>>>>>
>>>>> a. The AppArmor refcount usage described in the RFC is correct?
>>>>> b. Approach taken to fix the scalability issue is valid/correct?
>>>>>
>>>>
>>>> Hi Neeraj,
>>>>
>>>> I know your patchset has been waiting on review for a long time.
>>>> Unfortunately I have been very, very busy lately. I will try to
>>>> get to it this weekend, but I can't promise that I will be able
>>>> to get the review fully done.
>>>>
>>>
>>> Gentle prod.
>>>
>>> Any chances of this getting reviewed in the foreseeable future? Would
>>> be a real bummer if the patchset fell through the cracks.
>>>
>>
>> yes, sorry I have been unavailable for the last couple of weeks. I am
>> now back, I have a rather large backlog to try catching up on but this
>> is has an entry on the list.
>>
>
> So where do we stand here?
>
Sorry, I am still trying to dig out of my backlog. I will look at this
this weekend.


2024-05-28 13:30:22

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On Fri, May 24, 2024 at 11:52 PM John Johansen
<[email protected]> wrote:
>
> On 5/24/24 14:10, Mateusz Guzik wrote:
> > On Fri, Mar 8, 2024 at 9:09 PM John Johansen
> > <[email protected]> wrote:
> >>
> >> On 3/2/24 02:23, Mateusz Guzik wrote:
> >>> On 2/9/24, John Johansen <[email protected]> wrote:
> >>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
> >>>>> Gentle ping.
> >>>>>
> >>>>> John,
> >>>>>
> >>>>> Could you please confirm that:
> >>>>>
> >>>>> a. The AppArmor refcount usage described in the RFC is correct?
> >>>>> b. Approach taken to fix the scalability issue is valid/correct?
> >>>>>
> >>>>
> >>>> Hi Neeraj,
> >>>>
> >>>> I know your patchset has been waiting on review for a long time.
> >>>> Unfortunately I have been very, very busy lately. I will try to
> >>>> get to it this weekend, but I can't promise that I will be able
> >>>> to get the review fully done.
> >>>>
> >>>
> >>> Gentle prod.
> >>>
> >>> Any chances of this getting reviewed in the foreseeable future? Would
> >>> be a real bummer if the patchset fell through the cracks.
> >>>
> >>
> >> yes, sorry I have been unavailable for the last couple of weeks. I am
> >> now back, I have a rather large backlog to try catching up on but this
> >> is has an entry on the list.
> >>
> >
> > So where do we stand here?
> >
> sorry I am still trying to dig out of my backlog, I will look at this,
> this weekend.
>

How was the weekend? ;)

--
Mateusz Guzik <mjguzik gmail.com>

2024-05-29 00:45:22

by John Johansen

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 5/28/24 06:29, Mateusz Guzik wrote:
> On Fri, May 24, 2024 at 11:52 PM John Johansen
> <[email protected]> wrote:
>>
>> On 5/24/24 14:10, Mateusz Guzik wrote:
>>> On Fri, Mar 8, 2024 at 9:09 PM John Johansen
>>> <[email protected]> wrote:
>>>>
>>>> On 3/2/24 02:23, Mateusz Guzik wrote:
>>>>> On 2/9/24, John Johansen <[email protected]> wrote:
>>>>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>>>>>>> Gentle ping.
>>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> Could you please confirm that:
>>>>>>>
>>>>>>> a. The AppArmor refcount usage described in the RFC is correct?
>>>>>>> b. Approach taken to fix the scalability issue is valid/correct?
>>>>>>>
>>>>>>
>>>>>> Hi Neeraj,
>>>>>>
>>>>>> I know your patchset has been waiting on review for a long time.
>>>>>> Unfortunately I have been very, very busy lately. I will try to
>>>>>> get to it this weekend, but I can't promise that I will be able
>>>>>> to get the review fully done.
>>>>>>
>>>>>
>>>>> Gentle prod.
>>>>>
>>>>> Any chances of this getting reviewed in the foreseeable future? Would
>>>>> be a real bummer if the patchset fell through the cracks.
>>>>>
>>>>
>>>> yes, sorry I have been unavailable for the last couple of weeks. I am
>>>> now back, I have a rather large backlog to try catching up on but this
>>>> is has an entry on the list.
>>>>
>>>
>>> So where do we stand here?
>>>
>> sorry I am still trying to dig out of my backlog, I will look at this,
>> this weekend.
>>
>
> How was the weekend? ;)
>

Let's say it was busy. Have I looked at this? Yes. I am still digesting it.
I don't have objections to moving towards percpu refcounts, but the overhead
of a percpu struct per label is a problem when we have thousands of labels
on the system. That is to say, this would have to be a config option. We
moved buffers from kmalloc to percpu to reduce memory overhead and
contention, then from percpu to a global pool because the percpu overhead
was too high for some machines, and then from a global pool to a hybrid
scheme because of global lock contention. I don't see a way of doing that
with the label, which means a config would be the next best thing.

Not part of your patch, but something to be considered is that the label tree
needs a rework: its locking needs to move to a read-side lockless
scheme, and the plan was to make it also use a linked list such that new
labels are always queued at the end, allowing dynamically created labels to
be lazily added to the tree.

I see the use of the kworker as problematic as well, especially if we are
talking about using kconfig to switch reference counting modes. I am futzing
with some ideas on how to deal with this.

Like I said I am still digesting.


2024-05-29 09:39:16

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On Wed, May 29, 2024 at 2:37 AM John Johansen
<[email protected]> wrote:
> I don't have objections to moving towards percpu refcounts, but the overhead
> of a percpu stuct per label is a problem when we have thousands of labels
> on the system. That is to say, this would have to be a config option. We
> moved buffers from kmalloc to percpu to reduce memory overhead to reduce
> contention. The to percpu, to a global pool because the percpu overhead was
> too high for some machines, and then from a global pool to a hybrid scheme
> because of global lock contention. I don't see a way of doing that with the
> label, which means a config would be the next best thing.
>

There was a patchset somewhere which adds counters starting as atomic
and automagically converting themselves to per-cpu if there is enough
load applied to them. The general point being it is plausible this may
autotune itself.

Another option would be a boot-time tunable.
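For reference, a boot-time tunable could be as small as the following sketch; the parameter name, and it living in security/apparmor/lsm.c (so that it would show up as apparmor.percpu_labelref= on the kernel command line when built in), are assumptions:

    #include <linux/moduleparam.h>

    /* select percpu (1) or plain atomic/kref (0) label refcounting at boot */
    static bool percpu_labelref;
    module_param(percpu_labelref, bool, 0444);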

> Not part of your patch but something to be considered is that the label tree
> needs a rework, its locking needs to move to read side a read side lock less
> scheme, and the plan was to make it also use a linked list such that new
> labels are always queued at the end, allowing dynamically created labels to
> be lazily added to the tree.
>

It's not *my* patchset. ;)

> I see the use of the kworker as problematic as well, especially if we are
> talking using kconfig to switch reference counting modes. I am futzing with
> some ideas, on how to deal with this.
>

Thanks for the update. Hopefully this is going to get sorted out in
the foreseeable future.

--
Mateusz Guzik <mjguzik gmail.com>

2024-05-30 04:20:26

by Neeraj Upadhyay

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

Hi John,

Thanks for taking a look at the series!

On 5/29/2024 6:07 AM, John Johansen wrote:
> On 5/28/24 06:29, Mateusz Guzik wrote:
>> On Fri, May 24, 2024 at 11:52 PM John Johansen
>> <[email protected]> wrote:
>>>
>>> On 5/24/24 14:10, Mateusz Guzik wrote:
>>>> On Fri, Mar 8, 2024 at 9:09 PM John Johansen
>>>> <[email protected]> wrote:
>>>>>
>>>>> On 3/2/24 02:23, Mateusz Guzik wrote:
>>>>>> On 2/9/24, John Johansen <[email protected]> wrote:
>>>>>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>>>>>>>> Gentle ping.
>>>>>>>>
>>>>>>>> John,
>>>>>>>>
>>>>>>>> Could you please confirm that:
>>>>>>>>
>>>>>>>> a. The AppArmor refcount usage described in the RFC is correct?
>>>>>>>> b. Approach taken to fix the scalability issue is valid/correct?
>>>>>>>>
>>>>>>>
>>>>>>> Hi Neeraj,
>>>>>>>
>>>>>>> I know your patchset has been waiting on review for a long time.
>>>>>>> Unfortunately I have been very, very busy lately. I will try to
>>>>>>> get to it this weekend, but I can't promise that I will be able
>>>>>>> to get the review fully done.
>>>>>>>
>>>>>>
>>>>>> Gentle prod.
>>>>>>
>>>>>> Any chances of this getting reviewed in the foreseeable future? Would
>>>>>> be a real bummer if the patchset fell through the cracks.
>>>>>>
>>>>>
>>>>> yes, sorry I have been unavailable for the last couple of weeks. I am
>>>>> now back, I have a rather large backlog to try catching up on but this
>>>>> is has an entry on the list.
>>>>>
>>>>
>>>> So where do we stand here?
>>>>
>>> sorry I am still trying to dig out of my backlog, I will look at this,
>>> this weekend.
>>>
>>
>> How was the weekend? ;)
>>
>
> lets say it was busy. Have I looked at this, yes. I am still digesting it.
> I don't have objections to moving towards percpu refcounts, but the overhead
> of a percpu stuct per label is a problem when we have thousands of labels
> on the system. That is to say, this would have to be a config option. We
> moved buffers from kmalloc to percpu to reduce memory overhead to reduce
> contention. The to percpu, to a global pool because the percpu overhead was
> too high for some machines, and then from a global pool to a hybrid scheme
> because of global lock contention. I don't see a way of doing that with the
> label, which means a config would be the next best thing.
>

For the buffers, what was the percpu overhead roughly? For
thousands of labels, I think, the extra memory overhead would roughly
be in the range of a few MBs (needs to be profiled though). Would this
extra label overhead be considered high for the machines where the percpu
buffer overhead was considered high?

Please correct me here: so you are proposing that we use a kconfig to
select either a 'struct percpu_ref' or a 'struct kref' (using a union maybe)
inside 'struct aa_label' and update the refcount operations accordingly?
If yes, I will work on a patch with this kconfig-based selection of
refcounting mode to see how it pans out.
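A minimal sketch of what such a Kconfig-selected refcount could look like is below; the config symbol, wrapper names and release callback are assumptions, not from the posted series (with a compile-time switch a union would not strictly be needed):

    #include <linux/kref.h>
    #include <linux/percpu-refcount.h>

    struct aa_label {
            /* ... */
    #ifdef CONFIG_SECURITY_APPARMOR_PERCPU_LABELREF
            struct percpu_ref count;
    #else
            struct kref count;
    #endif
            /* ... */
    };

    static void aa_label_kref_release(struct kref *kref);  /* existing release, assumed */

    static inline bool aa_label_tryget(struct aa_label *l)
    {
    #ifdef CONFIG_SECURITY_APPARMOR_PERCPU_LABELREF
            return percpu_ref_tryget(&l->count);
    #else
            return kref_get_unless_zero(&l->count);
    #endif
    }

    static inline void aa_label_putref(struct aa_label *l)
    {
    #ifdef CONFIG_SECURITY_APPARMOR_PERCPU_LABELREF
            percpu_ref_put(&l->count);
    #else
            kref_put(&l->count, aa_label_kref_release);
    #endif
    }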

@Mateusz can you share the dynamic switching counter mode patch series please?

In addition, for the long term, there is ongoing work (by Paul, Boqun and myself)
on implementing hazard pointers as a scalable refcounting scheme [1] in the kernel,
which would not have the memory usage overhead of percpu refcount. At this point the
API design/implementation is at an early prototype stage.


[1] https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/edit?usp=sharing

> Not part of your patch but something to be considered is that the label tree
> needs a rework, its locking needs to move to read side a read side lock less
> scheme, and the plan was to make it also use a linked list such that new
> labels are always queued at the end, allowing dynamically created labels to
> be lazily added to the tree.
>

Read side would be rcu read lock protected in this scheme?
The linked list would store the dynamically created compound labels?
What is the advantage of using this lazy addition to the tree? We optimize
on the label search, addition/deletion for dynamic labels? The lazy addition
to the tree is done when a label find operation on the list succeeds?

> I see the use of the kworker as problematic as well, especially if we are
> talking using kconfig to switch reference counting modes. I am futzing with
> some ideas, on how to deal with this.
>

We can disable queuing of label reclaim work for non-percpu case?

> Like I said I am still digesting.
>

Thank you!


Thanks
Neeraj


2024-05-30 05:59:51

by John Johansen

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 5/29/24 21:19, Neeraj Upadhyay wrote:
> Hi John,
>
> Thanks for taking a look at the series!
>
> On 5/29/2024 6:07 AM, John Johansen wrote:
>> On 5/28/24 06:29, Mateusz Guzik wrote:
>>> On Fri, May 24, 2024 at 11:52 PM John Johansen
>>> <[email protected]> wrote:
>>>>
>>>> On 5/24/24 14:10, Mateusz Guzik wrote:
>>>>> On Fri, Mar 8, 2024 at 9:09 PM John Johansen
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 3/2/24 02:23, Mateusz Guzik wrote:
>>>>>>> On 2/9/24, John Johansen <[email protected]> wrote:
>>>>>>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>>>>>>>>> Gentle ping.
>>>>>>>>>
>>>>>>>>> John,
>>>>>>>>>
>>>>>>>>> Could you please confirm that:
>>>>>>>>>
>>>>>>>>> a. The AppArmor refcount usage described in the RFC is correct?
>>>>>>>>> b. Approach taken to fix the scalability issue is valid/correct?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Neeraj,
>>>>>>>>
>>>>>>>> I know your patchset has been waiting on review for a long time.
>>>>>>>> Unfortunately I have been very, very busy lately. I will try to
>>>>>>>> get to it this weekend, but I can't promise that I will be able
>>>>>>>> to get the review fully done.
>>>>>>>>
>>>>>>>
>>>>>>> Gentle prod.
>>>>>>>
>>>>>>> Any chances of this getting reviewed in the foreseeable future? Would
>>>>>>> be a real bummer if the patchset fell through the cracks.
>>>>>>>
>>>>>>
>>>>>> yes, sorry I have been unavailable for the last couple of weeks. I am
>>>>>> now back, I have a rather large backlog to try catching up on but this
>>>>>> is has an entry on the list.
>>>>>>
>>>>>
>>>>> So where do we stand here?
>>>>>
>>>> sorry I am still trying to dig out of my backlog, I will look at this,
>>>> this weekend.
>>>>
>>>
>>> How was the weekend? ;)
>>>
>>
>> lets say it was busy. Have I looked at this, yes. I am still digesting it.
>> I don't have objections to moving towards percpu refcounts, but the overhead
>> of a percpu stuct per label is a problem when we have thousands of labels
>> on the system. That is to say, this would have to be a config option. We
>> moved buffers from kmalloc to percpu to reduce memory overhead to reduce
>> contention. The to percpu, to a global pool because the percpu overhead was
>> too high for some machines, and then from a global pool to a hybrid scheme
>> because of global lock contention. I don't see a way of doing that with the
>> label, which means a config would be the next best thing.
>>
>
> For the buffers, what was the percpu overhead roughly? For
> thousands of labels, I think, the extra memory overhead roughly would
> be in the range of few MBs (need to be profiled though). This extra
> label overhead would be considered high for the machines where percpu
> buffer overhead was considered high?
>

It of course varies. It was fixed at 2-8K per cpu core depending on the buffer
size. So on a 192 cpu machine we are talking a couple of MBs. Obviously more
on bigger machines. The problem here is that the percpu refcount, while smaller
per label, will be more in situations with lots of cpus. Which is fine if that
is what it needs to be, but for other use cases tuning it to be smaller would
be nice.
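Back of the envelope (assuming the stock percpu_ref, which allocates roughly one unsigned long of per-CPU data per counter):

    buffers:    ~8 KB per CPU              x 192 CPUs ~= 1.5 MB total
    label refs: ~8 bytes per CPU per label x 192 CPUs ~= 1.5 KB per label,
                i.e. ~15 MB for 10,000 labels, and growing with CPU count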


> Please correct me here, so you are proposing that we use a kconfig to
> use either 'struct percpu_ref' or a 'struct kref' (using a union maybe)
> inside the 'struct aa_label' and update the refcount operations accordingly?
> If yes, I will work on a patch with this kconfig based selection of
> refcounting mode to see how it pans out.
>
possibly, I am still mulling over how we want to approach this

> @Mateusz can you share the dynamic switching counter mode patch series please?
>
yes I am interested in looking at this as well.

> In addition, for long term, there is an ongoing work (by Paul, Boqun and myself)
> on implementing hazard pointers as a scalable refcounting scheme [1] in kernel,
> which would not have memory usage overhead as in percpu refcount. At this point the
> API design/implementation is in early prototype stage.
>
>
> [1] https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/edit?usp=sharing

okay, I will take a look

>
>> Not part of your patch but something to be considered is that the label tree
>> needs a rework, its locking needs to move to read side a read side lock less
>> scheme, and the plan was to make it also use a linked list such that new
>> labels are always queued at the end, allowing dynamically created labels to
>> be lazily added to the tree.
>>
>
> Read side would be rcu read lock protected in this scheme?
> The linked list would store the dynamically created compound labels?
> What is the advantage of using this lazy addition to the tree? We optimize
> on the label search, addition/deletion for dynamic labels? The lazy addition
> to the tree is done when a label find operation on the list succeeds?
>
There are contexts where we are creating labels, and do not want to wait on
some of the longer tree walk profile updates/replacements. If a replacement is
ongoing, the idea is to just add the label to the end of a list and let the
process that is doing the tree update take the hit of inserting and rebalancing
the tree.


>> I see the use of the kworker as problematic as well, especially if we are
>> talking using kconfig to switch reference counting modes. I am futzing with
>> some ideas, on how to deal with this.
>>
>
> We can disable queuing of label reclaim work for non-percpu case?
>
Maybe. I am pondering ways we can deal with this. I have been pondering
whether we might be able to leverage a seqlock here, but I will also take a look
at hazard pointers.

>> Like I said I am still digesting.
>>
>
> Thank you!
>
>
> Thanks
> Neeraj
>


2024-05-30 09:48:05

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On Thu, May 30, 2024 at 7:59 AM John Johansen
<[email protected]> wrote:
>
> On 5/29/24 21:19, Neeraj Upadhyay wrote:
> > Hi John,
> >
> > Thanks for taking a look at the series!
> >
> > On 5/29/2024 6:07 AM, John Johansen wrote:
> >> On 5/28/24 06:29, Mateusz Guzik wrote:
> >>> On Fri, May 24, 2024 at 11:52 PM John Johansen
> >>> <[email protected]> wrote:
> >>>>
> >>>> On 5/24/24 14:10, Mateusz Guzik wrote:
> >>>>> On Fri, Mar 8, 2024 at 9:09 PM John Johansen
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On 3/2/24 02:23, Mateusz Guzik wrote:
> >>>>>>> On 2/9/24, John Johansen <[email protected]> wrote:
> >>>>>>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
> >>>>>>>>> Gentle ping.
> >>>>>>>>>
> >>>>>>>>> John,
> >>>>>>>>>
> >>>>>>>>> Could you please confirm that:
> >>>>>>>>>
> >>>>>>>>> a. The AppArmor refcount usage described in the RFC is correct?
> >>>>>>>>> b. Approach taken to fix the scalability issue is valid/correct?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Neeraj,
> >>>>>>>>
> >>>>>>>> I know your patchset has been waiting on review for a long time.
> >>>>>>>> Unfortunately I have been very, very busy lately. I will try to
> >>>>>>>> get to it this weekend, but I can't promise that I will be able
> >>>>>>>> to get the review fully done.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Gentle prod.
> >>>>>>>
> >>>>>>> Any chances of this getting reviewed in the foreseeable future? Would
> >>>>>>> be a real bummer if the patchset fell through the cracks.
> >>>>>>>
> >>>>>>
> >>>>>> yes, sorry I have been unavailable for the last couple of weeks. I am
> >>>>>> now back, I have a rather large backlog to try catching up on but this
> >>>>>> is has an entry on the list.
> >>>>>>
> >>>>>
> >>>>> So where do we stand here?
> >>>>>
> >>>> sorry I am still trying to dig out of my backlog, I will look at this,
> >>>> this weekend.
> >>>>
> >>>
> >>> How was the weekend? ;)
> >>>
> >>
> >> lets say it was busy. Have I looked at this, yes. I am still digesting it.
> >> I don't have objections to moving towards percpu refcounts, but the overhead
> >> of a percpu stuct per label is a problem when we have thousands of labels
> >> on the system. That is to say, this would have to be a config option. We
> >> moved buffers from kmalloc to percpu to reduce memory overhead to reduce
> >> contention. The to percpu, to a global pool because the percpu overhead was
> >> too high for some machines, and then from a global pool to a hybrid scheme
> >> because of global lock contention. I don't see a way of doing that with the
> >> label, which means a config would be the next best thing.
> >>
> >
> > For the buffers, what was the percpu overhead roughly? For
> > thousands of labels, I think, the extra memory overhead roughly would
> > be in the range of few MBs (need to be profiled though). This extra
> > label overhead would be considered high for the machines where percpu
> > buffer overhead was considered high?
> >
>
> It of course varies. It was fixed at 2-8K per cpu core depending on the buffer
> size. So on a 192 cpu machine you we are talking a couple MBs. Obviously more
> on bigger machines. The problem here is say the percpu refcount while smaller
> per label, will be more in situations with lots of cpus. Which is fine if that
> is what it needs to be, but for other use cases tuning it to be smaller would
> be nice.
>
>
> > Please correct me here, so you are proposing that we use a kconfig to
> > use either 'struct percpu_ref' or a 'struct kref' (using a union maybe)
> > inside the 'struct aa_label' and update the refcount operations accordingly?
> > If yes, I will work on a patch with this kconfig based selection of
> > refcounting mode to see how it pans out.
> >
> possibly, I am still mulling over how we want to approach this
>
> > @Mateusz can you share the dynamic switching counter mode patch series please?
> >
> yes I am interested in looking at this as well.
>

https://lore.kernel.org/lkml/[email protected]/

> > In addition, for long term, there is an ongoing work (by Paul, Boqun and myself)
> > on implementing hazard pointers as a scalable refcounting scheme [1] in kernel,
> > which would not have memory usage overhead as in percpu refcount. At this point the
> > API design/implementation is in early prototype stage.
> >
> >
> > [1] https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/edit?usp=sharing
>
> okay, I will take a look
>
> >
> >> Not part of your patch but something to be considered is that the label tree
> >> needs a rework, its locking needs to move to read side a read side lock less
> >> scheme, and the plan was to make it also use a linked list such that new
> >> labels are always queued at the end, allowing dynamically created labels to
> >> be lazily added to the tree.
> >>
> >
> > Read side would be rcu read lock protected in this scheme?
> > The linked list would store the dynamically created compound labels?
> > What is the advantage of using this lazy addition to the tree? We optimize
> > on the label search, addition/deletion for dynamic labels? The lazy addition
> > to the tree is done when a label find operation on the list succeeds?
> >
> there are contexts where we are creating labels, and do not want to wait on
> some of the longer tree walk profile updates/replacements. If a replacement is
> on going the idea is to just add the label to the end of a list and let the
> process that is doing the tree update take the hit of inserting and rebalancing
> the tree.
>
>
> >> I see the use of the kworker as problematic as well, especially if we are
> >> talking using kconfig to switch reference counting modes. I am futzing with
> >> some ideas, on how to deal with this.
> >>
> >
> > We can disable queuing of label reclaim work for non-percpu case?
> >
> maybe, I am pondering ways we can deal with this. I have been pondering the
> if we might be able to leverage a seqlock here, but I will also take a look
> at hazard pointers.
>

Since there is some elaborate talk going on about this, let me throw in
my own $0,03 -- I may happen to have a simple solution which will sort
it out, and it boils down to storing local ref updates in task_struct.

Here is the context: creds are always refed and unrefed when creating
and destroying a file object. Should you have one instance of
credentials for a given user across different processes they would
serialize on updating the ref. Linux mostly dodges the problem by
always creating a copy on fork, thus only serializing within threads
of a given process. Even then that induces avoidable overhead if only
from a single-threaded standpoint -- that's 2 atomics slapped for every
open/close cycle.

$elsewhere I devised an idea where cred ref updates to creds matching
current->cred only modify a local counter. They get rolled up when
current->cred is changed. That is to say there are 0 atomics or
modifications to memory seen by other cpus as long as the process does
not change creds, which almost never happens compared to how often
refing and unrefing is performed.

In struct cred, apart from regular refs, you would have a "user" counter
and a "distributed" counter. Switching user to > 0 grabs a normal ref on
the creds; the value of the "distributed" counter is arbitrary as long as
user > 0. The user counter going back to 0 means we can release the
special ref held for that purpose.

I was considering implementing this for Linux. In my original code all
cred handling is augmented like this, but for Linux one could add a
dedicated get_cred_localref or similar machinery.
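A minimal sketch of the task-local scheme described above might look as follows; the cred_local_refs field and every helper name are assumptions (not an existing kernel API), and the "user"/"distributed" counters are left out:

    #include <linux/cred.h>
    #include <linux/sched.h>

    /* assumed new member of struct task_struct:  long cred_local_refs; */

    static inline const struct cred *get_current_cred_local(void)
    {
            current->cred_local_refs++;     /* task-local, no shared cachelines */
            return current_cred();
    }

    static inline void put_cred_local(const struct cred *cred)
    {
            if (cred == current_cred() && current->cred_local_refs > 0) {
                    current->cred_local_refs--;
                    return;
            }
            put_cred(cred);                 /* slow path: the usual atomic put */
    }

    /* called just before current->cred is replaced, e.g. from commit_creds() */
    static inline void roll_up_cred_local_refs(void)
    {
            while (current->cred_local_refs > 0) {
                    get_current_cred();     /* convert a local ref into a real one */
                    current->cred_local_refs--;
            }
    }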

Skimming through apparmor suggests the bit which does cause
performance problems can be sorted out in the same manner.

Maybe i'll hack it up as a demo just for apparmor.

This would come with some extra space usage in task_struct which on
the surface may sound like a non-starter. However, should you take a
look at the struct with pahole you will find it is riddled with
padding. If I wanted to I could add all the fields I need to the struct
and not grow it on LP64.

--
Mateusz Guzik <mjguzik gmail.com>

2024-05-31 00:17:42

by John Johansen

[permalink] [raw]
Subject: Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

On 5/30/24 02:47, Mateusz Guzik wrote:
> On Thu, May 30, 2024 at 7:59 AM John Johansen
> <[email protected]> wrote:
>>
>> On 5/29/24 21:19, Neeraj Upadhyay wrote:
>>> Hi John,
>>>
>>> Thanks for taking a look at the series!
>>>
>>> On 5/29/2024 6:07 AM, John Johansen wrote:
>>>> On 5/28/24 06:29, Mateusz Guzik wrote:
>>>>> On Fri, May 24, 2024 at 11:52 PM John Johansen
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 5/24/24 14:10, Mateusz Guzik wrote:
>>>>>>> On Fri, Mar 8, 2024 at 9:09 PM John Johansen
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 3/2/24 02:23, Mateusz Guzik wrote:
>>>>>>>>> On 2/9/24, John Johansen <[email protected]> wrote:
>>>>>>>>>> On 2/6/24 20:40, Neeraj Upadhyay wrote:
>>>>>>>>>>> Gentle ping.
>>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> Could you please confirm that:
>>>>>>>>>>>
>>>>>>>>>>> a. The AppArmor refcount usage described in the RFC is correct?
>>>>>>>>>>> b. Approach taken to fix the scalability issue is valid/correct?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Neeraj,
>>>>>>>>>>
>>>>>>>>>> I know your patchset has been waiting on review for a long time.
>>>>>>>>>> Unfortunately I have been very, very busy lately. I will try to
>>>>>>>>>> get to it this weekend, but I can't promise that I will be able
>>>>>>>>>> to get the review fully done.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gentle prod.
>>>>>>>>>
>>>>>>>>> Any chances of this getting reviewed in the foreseeable future? Would
>>>>>>>>> be a real bummer if the patchset fell through the cracks.
>>>>>>>>>
>>>>>>>>
>>>>>>>> yes, sorry I have been unavailable for the last couple of weeks. I am
>>>>>>>> now back, I have a rather large backlog to try catching up on but this
>>>>>>>> is has an entry on the list.
>>>>>>>>
>>>>>>>
>>>>>>> So where do we stand here?
>>>>>>>
>>>>>> sorry I am still trying to dig out of my backlog, I will look at this,
>>>>>> this weekend.
>>>>>>
>>>>>
>>>>> How was the weekend? ;)
>>>>>
>>>>
>>>> lets say it was busy. Have I looked at this, yes. I am still digesting it.
>>>> I don't have objections to moving towards percpu refcounts, but the overhead
>>>> of a percpu stuct per label is a problem when we have thousands of labels
>>>> on the system. That is to say, this would have to be a config option. We
>>>> moved buffers from kmalloc to percpu to reduce memory overhead to reduce
>>>> contention. The to percpu, to a global pool because the percpu overhead was
>>>> too high for some machines, and then from a global pool to a hybrid scheme
>>>> because of global lock contention. I don't see a way of doing that with the
>>>> label, which means a config would be the next best thing.
>>>>
>>>
>>> For the buffers, what was the percpu overhead roughly? For
>>> thousands of labels, I think, the extra memory overhead roughly would
>>> be in the range of few MBs (need to be profiled though). This extra
>>> label overhead would be considered high for the machines where percpu
>>> buffer overhead was considered high?
>>>
>>
>> It of course varies. It was fixed at 2-8K per cpu core depending on the buffer
>> size. So on a 192 cpu machine you we are talking a couple MBs. Obviously more
>> on bigger machines. The problem here is say the percpu refcount while smaller
>> per label, will be more in situations with lots of cpus. Which is fine if that
>> is what it needs to be, but for other use cases tuning it to be smaller would
>> be nice.
>>
>>
>>> Please correct me here, so you are proposing that we use a kconfig to
>>> use either 'struct percpu_ref' or a 'struct kref' (using a union maybe)
>>> inside the 'struct aa_label' and update the refcount operations accordingly?
>>> If yes, I will work on a patch with this kconfig based selection of
>>> refcounting mode to see how it pans out.
>>>
>> possibly, I am still mulling over how we want to approach this
>>
>>> @Mateusz can you share the dynamic switching counter mode patch series please?
>>>
>> yes I am interested in looking at this as well.
>>
>
> https://lore.kernel.org/lkml/[email protected]/
>
>>> In addition, for long term, there is an ongoing work (by Paul, Boqun and myself)
>>> on implementing hazard pointers as a scalable refcounting scheme [1] in kernel,
>>> which would not have memory usage overhead as in percpu refcount. At this point the
>>> API design/implementation is in early prototype stage.
>>>
>>>
>>> [1] https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/edit?usp=sharing
>>
>> okay, I will take a look
>>
>>>
>>>> Not part of your patch but something to be considered is that the label tree
>>>> needs a rework, its locking needs to move to read side a read side lock less
>>>> scheme, and the plan was to make it also use a linked list such that new
>>>> labels are always queued at the end, allowing dynamically created labels to
>>>> be lazily added to the tree.
>>>>
>>>
>>> Read side would be rcu read lock protected in this scheme?
>>> The linked list would store the dynamically created compound labels?
>>> What is the advantage of using this lazy addition to the tree? We optimize
>>> on the label search, addition/deletion for dynamic labels? The lazy addition
>>> to the tree is done when a label find operation on the list succeeds?
>>>
>> there are contexts where we are creating labels, and do not want to wait on
>> some of the longer tree walk profile updates/replacements. If a replacement is
>> on going the idea is to just add the label to the end of a list and let the
>> process that is doing the tree update take the hit of inserting and rebalancing
>> the tree.
>>
>>
>>>> I see the use of the kworker as problematic as well, especially if we are
>>>> talking using kconfig to switch reference counting modes. I am futzing with
>>>> some ideas, on how to deal with this.
>>>>
>>>
>>> We can disable queuing of label reclaim work for non-percpu case?
>>>
>> maybe, I am pondering ways we can deal with this. I have been pondering the
>> if we might be able to leverage a seqlock here, but I will also take a look
>> at hazard pointers.
>>
>
> Since there is some elaborate talk going about this, let me throw in
> my own $0,03 -- I may happen to have a simple solution which will sort
> it out and it boils down to storing local ref updates in task_struct.
>
> Here is the context: creds are always refed and unrefed when creating
> and destroying a file object. Should you have one instance of
> credentials for a given user across different processes they would
> serialize on updating the ref. Linux mostly dodges the problem by
> always creating a copy on fork, thus only serializing within threads
> of a given process. Even then that induces avoidable overhead if only
> from single-threaded standpoint -- that's 2 atomics slapped for every
> open/close cycle.
>
So the apparmor label can and will update beyond the open/close cycle.
Yes, they are used in the cred as well, but for more than that. The
apparmor label on a file can be updated by other tasks, for various
reasons.

> $elsewhere I devised an idea where cred ref updates to creds matching
> current->cred only modify a local counter. They get rolled up when
> current->creds is changed. That is to say there is 0 atomics or
> modifying memory seen by other cpus as long as the process does not
> change creds, which almost never happens compared to how often refing
> and unrefing is implemented.
>
right, we do something like this for the task cred with a crit section
marked out by

label = begin_current_label_crit_section()

end_current_label_crit_section(label);

If everything works out, no reference counts are taken. The purpose
of the fns is to deal with the cases where, for one reason or another,
a refcount needs to be taken (generally because of live policy
replacement, when the task hasn't been able to update its cred yet).
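A minimal sketch of that pattern, as it would appear in a hook (the hook and check function names are made up; the crit section helpers are the existing AppArmor ones):

    static int apparmor_example_hook(void)
    {
            struct aa_label *label;
            int error;

            /* fast path: no refcount is taken unless the cred's label is stale */
            label = begin_current_label_crit_section();
            error = aa_example_check(label);        /* hypothetical permission check */
            end_current_label_crit_section(label);  /* puts the ref only if one was taken */

            return error;
    }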

> In struct cred apart from regular refs you would have "user" counter
> and "distributed" counter. switching user to > 0 grabs a normal ref on
> creds, the value of the "distributed" counter is arbitrary as long as
> user > 0. users going back to 0 means we can release the special ref
> held for that purpose.
>
So I don't see how this will generally help for labels which exist
on many different objects.

> I was considering implementing this for Linux. In my original code all
> cred handling is augmented like this, but for Linux one could add a
> dedicated get_cred_localref or similar machinery.
>

Sure, but I am not sure it's needed. The rule for task creds is that only
the task can update its cred. The task can look at its cred and do most
things without having to take a count. Most cred refs should just
be taken for objects.

> Skimming through apparmor suggests the bit which does cause
> performance problems can be sorted out in the same manner.
>
I don't see it. The file cred is very much updated live, async to
the task cred. And while currently it always starts as the task
cred, that won't even be true much longer.

> Maybe i'll hack it up as a demo just for apparmor.
>
> This would come with some extra space usage in task_struct which on
> the surface may sounds like a non-starter. However, should you take a
> look at the struct with pahole you will find it is riddles with
> padding. If I wanted to I could add all fields I need to the struct
> and not grow it on LP64.
>