2022-09-27 23:10:45

by Ankur Arora

[permalink] [raw]
Subject: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()

ctx->major contains the current syscall number. This is, of course, a
constant for the duration of the syscall. Unfortunately, GCC's alias
analysis cannot prove that it is not modified via a pointer in the
audit_filter_syscall() loop, and so always loads it from memory.

In and of itself the load isn't very expensive (ops dependent on the
ctx->major load are only used to determine the direction of control flow
and have short dependence chains and, in any case the related branches
get predicted perfectly in the fastpath) but still cache ctx->major
in a local for two reasons:

* ctx->major is in the first cacheline of struct audit_context and has
similar alignment as audit_entry::list audit_entry. For cases
with a lot of audit rules, doing this reduces one source of contention
from a potentially busy cache-set.

* audit_in_mask() (called in the hot loop in audit_filter_syscall())
does cast manipulation and error checking on ctx->major:

audit_in_mask(const struct audit_krule *rule, unsigned long val):
if (val > 0xffffffff)
return false;

word = AUDIT_WORD(val);
if (word >= AUDIT_BITMASK_SIZE)
return false;

bit = AUDIT_BIT(val);

return rule->mask[word] & bit;

The clauses related to the rule need to be evaluated in the loop, but
the rest is unnecessarily re-evaluated for every loop iteration.
(Note, however, that most of these are cheap ALU ops and the branches
are perfectly predicted. However, see discussion on cycles
improvement below for more on why it is still worth hoisting.)

On a Skylakex system change in getpid() latency (aggregated over
12 boot cycles):

Min Mean Median Max pstdev
(ns) (ns) (ns) (ns)

- 201.30 216.14 216.22 228.46 (+- 1.45%)
+ 196.63 207.86 206.60 230.98 (+- 3.92%)

Performance counter stats for 'bin/getpid' (3 runs) go from:
cycles 836.89 ( +- .80% )
instructions 2000.19 ( +- .03% )
IPC 2.39 ( +- .83% )
branches 430.14 ( +- .03% )
branch-misses 1.48 ( +- 3.37% )
L1-dcache-loads 471.11 ( +- .05% )
L1-dcache-load-misses 7.62 ( +- 46.98% )

to:
cycles 805.58 ( +- 4.11% )
instructions 1654.11 ( +- .05% )
IPC 2.06 ( +- 3.39% )
branches 430.02 ( +- .05% )
branch-misses 1.55 ( +- 7.09% )
L1-dcache-loads 440.01 ( +- .09% )
L1-dcache-load-misses 9.05 ( +- 74.03% )

(Both aggregated over 12 boot cycles.)

instructions: we reduce around 8 instructions/iteration because some of
the computation is now hoisted out of the loop (branch count does not
change because GCC, for reasons unclear, only hoists the computations
while keeping the basic-blocks.)

cycles: improve by about 5% (in aggregate and looking at individual run
numbers.) This is likely because we now waste fewer pipeline resources
on unnecessary instructions which allows the control flow to
speculatively execute further ahead shortening the execution of the loop
a little. The final gating factor on the performance of this loop
remains the long dependence chain due to the linked-list load.

Signed-off-by: Ankur Arora <[email protected]>
---
kernel/auditsc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 79a5da1bc5bb..533b087c3c02 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
{
struct audit_entry *e;
enum audit_state state;
+ unsigned long major = ctx->major;

if (auditd_test_task(tsk))
return;

rcu_read_lock();
list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
- if (audit_in_mask(&e->rule, ctx->major) &&
+ if (audit_in_mask(&e->rule, major) &&
audit_filter_rules(tsk, &e->rule, ctx, NULL,
&state, false)) {
rcu_read_unlock();
--
2.31.1


2022-09-28 22:28:50

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()

On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <[email protected]> wrote:
>
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
>
> ...
>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/auditsc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)

This looks pretty trivial to me, but it's too late in the current -rc
cycle for this to be merged, I'll queue it up for after the upcoming
merge window closes. Thanks.

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
> {
> struct audit_entry *e;
> enum audit_state state;
> + unsigned long major = ctx->major;
>
> if (auditd_test_task(tsk))
> return;
>
> rcu_read_lock();
> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
> - if (audit_in_mask(&e->rule, ctx->major) &&
> + if (audit_in_mask(&e->rule, major) &&
> audit_filter_rules(tsk, &e->rule, ctx, NULL,
> &state, false)) {
> rcu_read_unlock();
> --
> 2.31.1

--
paul-moore.com

2022-09-29 20:38:42

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()


Paul Moore <[email protected]> writes:

> On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <[email protected]> wrote:
>>
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> ...
>>
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/auditsc.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> This looks pretty trivial to me, but it's too late in the current -rc
> cycle for this to be merged, I'll queue it up for after the upcoming
> merge window closes. Thanks.

Thanks.

Ankur

>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct *tsk,
>> {
>> struct audit_entry *e;
>> enum audit_state state;
>> + unsigned long major = ctx->major;
>>
>> if (auditd_test_task(tsk))
>> return;
>>
>> rcu_read_lock();
>> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
>> - if (audit_in_mask(&e->rule, ctx->major) &&
>> + if (audit_in_mask(&e->rule, major) &&
>> audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> &state, false)) {
>> rcu_read_unlock();
>> --
>> 2.31.1


--
ankur

2022-09-30 18:47:52

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()

On Fri, Sep 30, 2022 at 1:45 PM Steve Grubb <[email protected]> wrote:
> Hello,
>
> Thanks for the detailed notes on this investigation. It really is a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Please do not send a v2 with the changes Steve is suggesting, I think
this patch is fine as-is.

--
paul-moore.com

2022-09-30 19:09:35

by Steve Grubb

[permalink] [raw]
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()

Hello,

Thanks for the detailed notes on this investigation. It really is a lot of
good information backing this up. However, there will come a day when someone
sees this "major = ctx->major" and they will send a patch to "fix" this
unnecessary assignment. If you are sending a V2 of this set, I would suggest
adding some comment in the code that this is for a performance improvement
and to see the commit message for additional info.

Thanks,
-Steve

On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
>
> In and of itself the load isn't very expensive (ops dependent on the
> ctx->major load are only used to determine the direction of control flow
> and have short dependence chains and, in any case the related branches
> get predicted perfectly in the fastpath) but still cache ctx->major
> in a local for two reasons:
>
> * ctx->major is in the first cacheline of struct audit_context and has
> similar alignment as audit_entry::list audit_entry. For cases
> with a lot of audit rules, doing this reduces one source of contention
> from a potentially busy cache-set.
>
> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
> does cast manipulation and error checking on ctx->major:
>
> audit_in_mask(const struct audit_krule *rule, unsigned long val):
> if (val > 0xffffffff)
> return false;
>
> word = AUDIT_WORD(val);
> if (word >= AUDIT_BITMASK_SIZE)
> return false;
>
> bit = AUDIT_BIT(val);
>
> return rule->mask[word] & bit;
>
> The clauses related to the rule need to be evaluated in the loop, but
> the rest is unnecessarily re-evaluated for every loop iteration.
> (Note, however, that most of these are cheap ALU ops and the branches
> are perfectly predicted. However, see discussion on cycles
> improvement below for more on why it is still worth hoisting.)
>
> On a Skylakex system change in getpid() latency (aggregated over
> 12 boot cycles):
>
> Min Mean Median Max pstdev
> (ns) (ns) (ns) (ns)
>
> - 201.30 216.14 216.22 228.46 (+- 1.45%)
> + 196.63 207.86 206.60 230.98 (+- 3.92%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
> cycles 836.89 ( +- .80% )
> instructions 2000.19 ( +- .03% )
> IPC 2.39 ( +- .83% )
> branches 430.14 ( +- .03% )
> branch-misses 1.48 ( +- 3.37% )
> L1-dcache-loads 471.11 ( +- .05% )
> L1-dcache-load-misses 7.62 ( +- 46.98% )
>
> to:
> cycles 805.58 ( +- 4.11% )
> instructions 1654.11 ( +- .05% )
> IPC 2.06 ( +- 3.39% )
> branches 430.02 ( +- .05% )
> branch-misses 1.55 ( +- 7.09% )
> L1-dcache-loads 440.01 ( +- .09% )
> L1-dcache-load-misses 9.05 ( +- 74.03% )
>
> (Both aggregated over 12 boot cycles.)
>
> instructions: we reduce around 8 instructions/iteration because some of
> the computation is now hoisted out of the loop (branch count does not
> change because GCC, for reasons unclear, only hoists the computations
> while keeping the basic-blocks.)
>
> cycles: improve by about 5% (in aggregate and looking at individual run
> numbers.) This is likely because we now waste fewer pipeline resources
> on unnecessary instructions which allows the control flow to
> speculatively execute further ahead shortening the execution of the loop
> a little. The final gating factor on the performance of this loop
> remains the long dependence chain due to the linked-list load.
>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/auditsc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
> *tsk, {
> struct audit_entry *e;
> enum audit_state state;
> + unsigned long major = ctx->major;
>
> if (auditd_test_task(tsk))
> return;
>
> rcu_read_lock();
> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
list) {
> - if (audit_in_mask(&e->rule, ctx->major) &&
> + if (audit_in_mask(&e->rule, major) &&
> audit_filter_rules(tsk, &e->rule, ctx, NULL,
> &state, false)) {
> rcu_read_unlock();




2022-10-07 01:02:34

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()


Steve Grubb <[email protected]> writes:

> Hello,
>
> Thanks for the detailed notes on this investigation. It really is a lot of
> good information backing this up. However, there will come a day when someone
> sees this "major = ctx->major" and they will send a patch to "fix" this
> unnecessary assignment. If you are sending a V2 of this set, I would suggest
> adding some comment in the code that this is for a performance improvement
> and to see the commit message for additional info.

Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.

Thanks
Ankur

>
> Thanks,
> -Steve
>
> On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
>> ctx->major contains the current syscall number. This is, of course, a
>> constant for the duration of the syscall. Unfortunately, GCC's alias
>> analysis cannot prove that it is not modified via a pointer in the
>> audit_filter_syscall() loop, and so always loads it from memory.
>>
>> In and of itself the load isn't very expensive (ops dependent on the
>> ctx->major load are only used to determine the direction of control flow
>> and have short dependence chains and, in any case the related branches
>> get predicted perfectly in the fastpath) but still cache ctx->major
>> in a local for two reasons:
>>
>> * ctx->major is in the first cacheline of struct audit_context and has
>> similar alignment as audit_entry::list audit_entry. For cases
>> with a lot of audit rules, doing this reduces one source of contention
>> from a potentially busy cache-set.
>>
>> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>> does cast manipulation and error checking on ctx->major:
>>
>> audit_in_mask(const struct audit_krule *rule, unsigned long val):
>> if (val > 0xffffffff)
>> return false;
>>
>> word = AUDIT_WORD(val);
>> if (word >= AUDIT_BITMASK_SIZE)
>> return false;
>>
>> bit = AUDIT_BIT(val);
>>
>> return rule->mask[word] & bit;
>>
>> The clauses related to the rule need to be evaluated in the loop, but
>> the rest is unnecessarily re-evaluated for every loop iteration.
>> (Note, however, that most of these are cheap ALU ops and the branches
>> are perfectly predicted. However, see discussion on cycles
>> improvement below for more on why it is still worth hoisting.)
>>
>> On a Skylakex system change in getpid() latency (aggregated over
>> 12 boot cycles):
>>
>> Min Mean Median Max pstdev
>> (ns) (ns) (ns) (ns)
>>
>> - 201.30 216.14 216.22 228.46 (+- 1.45%)
>> + 196.63 207.86 206.60 230.98 (+- 3.92%)
>>
>> Performance counter stats for 'bin/getpid' (3 runs) go from:
>> cycles 836.89 ( +- .80% )
>> instructions 2000.19 ( +- .03% )
>> IPC 2.39 ( +- .83% )
>> branches 430.14 ( +- .03% )
>> branch-misses 1.48 ( +- 3.37% )
>> L1-dcache-loads 471.11 ( +- .05% )
>> L1-dcache-load-misses 7.62 ( +- 46.98% )
>>
>> to:
>> cycles 805.58 ( +- 4.11% )
>> instructions 1654.11 ( +- .05% )
>> IPC 2.06 ( +- 3.39% )
>> branches 430.02 ( +- .05% )
>> branch-misses 1.55 ( +- 7.09% )
>> L1-dcache-loads 440.01 ( +- .09% )
>> L1-dcache-load-misses 9.05 ( +- 74.03% )
>>
>> (Both aggregated over 12 boot cycles.)
>>
>> instructions: we reduce around 8 instructions/iteration because some of
>> the computation is now hoisted out of the loop (branch count does not
>> change because GCC, for reasons unclear, only hoists the computations
>> while keeping the basic-blocks.)
>>
>> cycles: improve by about 5% (in aggregate and looking at individual run
>> numbers.) This is likely because we now waste fewer pipeline resources
>> on unnecessary instructions which allows the control flow to
>> speculatively execute further ahead shortening the execution of the loop
>> a little. The final gating factor on the performance of this loop
>> remains the long dependence chain due to the linked-list load.
>>
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/auditsc.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 79a5da1bc5bb..533b087c3c02 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
>> *tsk, {
>> struct audit_entry *e;
>> enum audit_state state;
>> + unsigned long major = ctx->major;
>>
>> if (auditd_test_task(tsk))
>> return;
>>
>> rcu_read_lock();
>> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
> list) {
>> - if (audit_in_mask(&e->rule, ctx->major) &&
>> + if (audit_in_mask(&e->rule, major) &&
>> audit_filter_rules(tsk, &e->rule, ctx, NULL,
>> &state, false)) {
>> rcu_read_unlock();


--
ankur

2022-10-17 18:58:51

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()

On Wed, Sep 28, 2022 at 6:03 PM Paul Moore <[email protected]> wrote:
> On Tue, Sep 27, 2022 at 6:59 PM Ankur Arora <[email protected]> wrote:
> >
> > ctx->major contains the current syscall number. This is, of course, a
> > constant for the duration of the syscall. Unfortunately, GCC's alias
> > analysis cannot prove that it is not modified via a pointer in the
> > audit_filter_syscall() loop, and so always loads it from memory.
> >
> > ...
> >
> > Signed-off-by: Ankur Arora <[email protected]>
> > ---
> > kernel/auditsc.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
>
> This looks pretty trivial to me, but it's too late in the current -rc
> cycle for this to be merged, I'll queue it up for after the upcoming
> merge window closes. Thanks.

I just merged this into audit/next, thanks again!

--
paul-moore.com