2019-05-24 16:57:19

by Waiman Long

Subject: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

The kernel test robot has reported that the use of __this_cpu_add()
causes bug messages like:

BUG: using __this_cpu_add() in preemptible [00000000] code: ...

This is only an issue on a preempt kernel, where preemption can happen in
the middle of a percpu operation. We still use __this_cpu_*() on a
!preempt kernel to avoid additional overhead in case CONFIG_PREEMPT_COUNT
is set.
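
For illustration, on an architecture without a single-instruction percpu
add, __this_cpu_add() boils down to a plain read-modify-write. A rough
sketch of the failure mode (not any particular architecture's code):

    unsigned long *p;

    p = raw_cpu_ptr(&lockevents[event]);    /* this CPU's counter */
    /*
     * On a preempt kernel the task can be preempted and migrated to
     * another CPU right here; the store below then clobbers whatever
     * the original CPU's counter has become in the meantime, losing
     * updates.
     */
    *p = *p + inc;

this_cpu_add() performs the whole update in a preemption-safe way.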

v2: Simplify the condition to just preempt or !preempt.

Fixes: a8654596f0371 ("locking/rwsem: Enable lock event counting")
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events.h | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/lock_events.h b/kernel/locking/lock_events.h
index feb1acc54611..05f34068ec06 100644
--- a/kernel/locking/lock_events.h
+++ b/kernel/locking/lock_events.h
@@ -30,13 +30,32 @@ enum lock_events {
*/
DECLARE_PER_CPU(unsigned long, lockevents[lockevent_num]);

+/*
+ * The purpose of the lock event counting subsystem is to provide a low
+ * overhead way to record the number of specific locking events by using
+ * percpu counters. It is the percpu sum that matters, not specifically
+ * how many of them happen on each CPU.
+ *
+ * In a !preempt kernel, we can just use __this_cpu_*() as preemption
+ * won't happen in the middle of a percpu operation. In a preempt kernel,
+ * preemption in the middle of a percpu operation may produce an
+ * incorrect result.
+ */
+#ifdef CONFIG_PREEMPT
+#define lockevent_percpu_inc(x) this_cpu_inc(x)
+#define lockevent_percpu_add(x, v) this_cpu_add(x, v)
+#else
+#define lockevent_percpu_inc(x) __this_cpu_inc(x)
+#define lockevent_percpu_add(x, v) __this_cpu_add(x, v)
+#endif
+
/*
* Increment the PV qspinlock statistical counters
*/
static inline void __lockevent_inc(enum lock_events event, bool cond)
{
if (cond)
- __this_cpu_inc(lockevents[event]);
+ lockevent_percpu_inc(lockevents[event]);
}

#define lockevent_inc(ev) __lockevent_inc(LOCKEVENT_ ##ev, true)
@@ -44,7 +63,7 @@ static inline void __lockevent_inc(enum lock_events event, bool cond)

static inline void __lockevent_add(enum lock_events event, int inc)
{
- __this_cpu_add(lockevents[event], inc);
+ lockevent_percpu_add(lockevents[event], inc);
}

#define lockevent_add(ev, c) __lockevent_add(LOCKEVENT_ ##ev, c)
--
2.18.1


2019-05-24 17:23:16

by Will Deacon

Subject: Re: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

On Fri, May 24, 2019 at 12:53:46PM -0400, Waiman Long wrote:
> The kernel test robot has reported that the use of __this_cpu_add()
> causes bug messages like:
>
> BUG: using __this_cpu_add() in preemptible [00000000] code: ...
>
> This is only an issue on a preempt kernel, where preemption can happen in
> the middle of a percpu operation. We still use __this_cpu_*() on a
> !preempt kernel to avoid additional overhead in case CONFIG_PREEMPT_COUNT
> is set.
>
> v2: Simplify the condition to just preempt or !preempt.
>
> Fixes: a8654596f0371 ("locking/rwsem: Enable lock event counting")
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/locking/lock_events.h | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/lock_events.h b/kernel/locking/lock_events.h
> index feb1acc54611..05f34068ec06 100644
> --- a/kernel/locking/lock_events.h
> +++ b/kernel/locking/lock_events.h
> @@ -30,13 +30,32 @@ enum lock_events {
> */
> DECLARE_PER_CPU(unsigned long, lockevents[lockevent_num]);
>
> +/*
> + * The purpose of the lock event counting subsystem is to provide a low
> + * overhead way to record the number of specific locking events by using
> + * percpu counters. It is the percpu sum that matters, not specifically
> + * how many of them happen on each CPU.
> + *
> + * In a !preempt kernel, we can just use __this_cpu_*() as preemption
> + * won't happen in the middle of a percpu operation. In a preempt kernel,
> + * preemption in the middle of a percpu operation may produce an
> + * incorrect result.
> + */
> +#ifdef CONFIG_PREEMPT
> +#define lockevent_percpu_inc(x) this_cpu_inc(x)
> +#define lockevent_percpu_add(x, v) this_cpu_add(x, v)
> +#else
> +#define lockevent_percpu_inc(x) __this_cpu_inc(x)
> +#define lockevent_percpu_add(x, v) __this_cpu_add(x, v)

Are you sure this works wrt IRQs? For example, if I take an interrupt when
trying to update the counter, and then the irq handler takes a qspinlock
which in turn tries to update the counter. Would I lose an update in that
scenario?
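
Roughly this interleaving, I mean (sketch, treating the percpu update as
a plain load/add/store and eliding the percpu indirection):

    val = lockevents[ev];           /* task reads N                          */
        <interrupt>
        lockevents[ev]++;           /* irq handler: N -> N + 1               */
        <return from interrupt>
    lockevents[ev] = val + 1;       /* task stores N + 1: irq's update lost  */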

Will

2019-05-24 17:29:59

by Linus Torvalds

Subject: Re: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

On Fri, May 24, 2019 at 10:19 AM Will Deacon <[email protected]> wrote:
>
> Are you sure this works wrt IRQs? For example, if I take an interrupt when
> trying to update the counter, and then the irq handler takes a qspinlock
> which in turn tries to update the counter. Would I lose an update in that
> scenario?

Sounds about right.

We might decide that the lock event counters are not necessarily
precise, but just rough guideline statistics ("close enough in
practice").

But that would imply that it shouldn't be dependent on CONFIG_PREEMPT
at all, and we should always use the double-underscore version, except
without the debug checking.

Maybe the #ifdef should just be CONFIG_DEBUG_PREEMPT, with a comment
saying "we're not exact, but debugging complains, so if you enable
debugging it will be slower and precise". Because I don't think we
have a "do this unsafely and without any debugging" option.

And the whole "not precise" thing should be documented, of course.

I can't imagine that people would rely on _exact_ lock statistics, but
hey, there are a lot of things people do that I can't fathom, so
that's not necessarily a strong argument.

Comments?

Linus

2019-05-24 17:30:50

by Waiman Long

Subject: Re: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

On 5/24/19 1:19 PM, Will Deacon wrote:
> On Fri, May 24, 2019 at 12:53:46PM -0400, Waiman Long wrote:
>> The kernel test robot has reported that the use of __this_cpu_add()
>> causes bug messages like:
>>
>> BUG: using __this_cpu_add() in preemptible [00000000] code: ...
>>
>> This is only an issue on a preempt kernel, where preemption can happen in
>> the middle of a percpu operation. We still use __this_cpu_*() on a
>> !preempt kernel to avoid additional overhead in case CONFIG_PREEMPT_COUNT
>> is set.
>>
>> v2: Simplify the condition to just preempt or !preempt.
>>
>> Fixes: a8654596f0371 ("locking/rwsem: Enable lock event counting")
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> kernel/locking/lock_events.h | 23 +++++++++++++++++++++--
>> 1 file changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/lock_events.h b/kernel/locking/lock_events.h
>> index feb1acc54611..05f34068ec06 100644
>> --- a/kernel/locking/lock_events.h
>> +++ b/kernel/locking/lock_events.h
>> @@ -30,13 +30,32 @@ enum lock_events {
>> */
>> DECLARE_PER_CPU(unsigned long, lockevents[lockevent_num]);
>>
>> +/*
>> + * The purpose of the lock event counting subsystem is to provide a low
>> + * overhead way to record the number of specific locking events by using
>> + * percpu counters. It is the percpu sum that matters, not specifically
>> + * how many of them happen on each CPU.
>> + *
>> + * In a !preempt kernel, we can just use __this_cpu_*() as preemption
>> + * won't happen in the middle of a percpu operation. In a preempt kernel,
>> + * preemption in the middle of a percpu operation may produce an
>> + * incorrect result.
>> + */
>> +#ifdef CONFIG_PREEMPT
>> +#define lockevent_percpu_inc(x) this_cpu_inc(x)
>> +#define lockevent_percpu_add(x, v) this_cpu_add(x, v)
>> +#else
>> +#define lockevent_percpu_inc(x) __this_cpu_inc(x)
>> +#define lockevent_percpu_add(x, v) __this_cpu_add(x, v)
> Are you sure this works wrt IRQs? For example, if I take an interrupt when
> trying to update the counter, and then the irq handler takes a qspinlock
> which in turn tries to update the counter. Would I lose an update in that
> scenario?
>
> Will

Good point! But this will be an issue even if we use the non-underscore
version, as I don't think it disables interrupts. Also, it is only a
problem if the percpu operation takes more than one instruction. It is a
single instruction on x86; other architectures may require more than one
instruction. In those cases, we may lose a count, but that is still
better than having the count from one CPU put into another CPU's counter.

Cheers,
Longman

2019-05-24 17:37:35

by Waiman Long

Subject: Re: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

On 5/24/19 1:27 PM, Linus Torvalds wrote:
> On Fri, May 24, 2019 at 10:19 AM Will Deacon <[email protected]> wrote:
>> Are you sure this works wrt IRQs? For example, if I take an interrupt when
>> trying to update the counter, and then the irq handler takes a qspinlock
>> which in turn tries to update the counter. Would I lose an update in that
>> scenario?
> Sounds about right.
>
> We might decide that the lock event counters are not necessarily
> precise, but just rough guideline statistics ("close enough in
> practice").
>
> But that would imply that it shouldn't be dependent on CONFIG_PREEMPT
> at all, and we should always use the double-underscore version, except
> without the debug checking.
>
> Maybe the #ifdef should just be CONFIG_DEBUG_PREEMPT, with a comment
> saying "we're not exact, but debugging complains, so if you enable
> debugging it will be slower and precise". Because I don't think we
> have a "do this unsafely and without any debugging" option.

I am not too worried about losing a count here and there once in a while
because of interrupts, but the possibility of the count from one CPU
being put into another CPU in a preempt kernel may distort the total
count significantly. That is what I want to avoid.


>
> And the whole "not precise" thing should be documented, of course.

Yes, I will update the patch to document the fact that the count may
not be precise. Anyway, even if we have a 1-2% error, it is not a big
deal in terms of presenting a global picture of what operations are
being done.

Cheers,
Longman

2019-05-24 17:41:39

by Will Deacon

Subject: Re: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

On Fri, May 24, 2019 at 01:35:39PM -0400, Waiman Long wrote:
> On 5/24/19 1:27 PM, Linus Torvalds wrote:
> > On Fri, May 24, 2019 at 10:19 AM Will Deacon <[email protected]> wrote:
> >> Are you sure this works wrt IRQs? For example, if I take an interrupt when
> >> trying to update the counter, and then the irq handler takes a qspinlock
> >> which in turn tries to update the counter. Would I lose an update in that
> >> scenario?
> > Sounds about right.
> >
> > We might decide that the lock event counters are not necessarily
> > precise, but just rough guideline statistics ("close enough in
> > practice").
> >
> > But that would imply that it shouldn't be dependent on CONFIG_PREEMPT
> > at all, and we should always use the double-underscore version, except
> > without the debug checking.
> >
> > Maybe the #ifdef should just be CONFIG_DEBUG_PREEMPT, with a comment
> > saying "we're not exact, but debugging complains, so if you enable
> > debugging it will be slower and precise". Because I don't think we
> > have a "do this unsafely and without any debugging" option.
>
> I am not too worried about losing a count here and there once in a while
> because of interrupts, but the possibility of the count from one CPU
> being put into another CPU in a preempt kernel may distort the total
> count significantly. That is what I want to avoid.
>
>
> >
> > And the whole "not precise" thing should be documented, of course.
>
> Yes, I will update the patch to document the fact that the count may
> not be precise. Anyway, even if we have a 1-2% error, it is not a big
> deal in terms of presenting a global picture of what operations are
> being done.

I suppose one alternative would be to have a per-cpu local_t variable,
and do the increments on that. However, that's probably worse than the
current approach for x86.

Will

2019-05-24 18:34:26

by Will Deacon

Subject: Re: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

On Fri, May 24, 2019 at 02:11:23PM -0400, Waiman Long wrote:
> On 5/24/19 1:39 PM, Will Deacon wrote:
>
> And the whole "not precise" thing should be documented, of course.
>
> Yes, I will update the patch to document the fact that the count may
> not be precise. Anyway, even if we have a 1-2% error, it is not a big
> deal in terms of presenting a global picture of what operations are
> being done.
>
> I suppose one alternative would be to have a per-cpu local_t variable,
> and do the increments on that. However, that's probably worse than the
> current approach for x86.
>
> I don't quite understand what you mean by a per-cpu local_t variable. A per-cpu
> variable is either statically allocated or dynamically allocated. Even with
> dynamic allocation, the same problem exists, I think, unless you differentiate
> between irq context and process context. That will make it a lot messier,
> I think.

So I haven't actually tried this to see if it works, but all I meant was
that you could replace the current:

DECLARE_PER_CPU(unsigned long, lockevents[lockevent_num]);

with:

DECLARE_PER_CPU(local_t, lockevents[lockevent_num]);

and then rework the inc/add macros to use a combination of raw_cpu_ptr
and local_inc().
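
Something like this, I mean (completely untested sketch; would need
<asm/local.h>):

    DECLARE_PER_CPU(local_t, lockevents[lockevent_num]);

    static inline void __lockevent_inc(enum lock_events event, bool cond)
    {
            if (cond)
                    local_inc(raw_cpu_ptr(&lockevents[event]));
    }

    static inline void __lockevent_add(enum lock_events event, int inc)
    {
            local_add(inc, raw_cpu_ptr(&lockevents[event]));
    }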

I think that would allow you to get rid of the #ifdeffery, but it may
introduce a small overhead for x86.

Will

2019-05-24 18:54:14

by Waiman Long

Subject: Re: [PATCH v2] locking/lock_events: Use this_cpu_add() when necessary

On 5/24/19 2:32 PM, Will Deacon wrote:
> On Fri, May 24, 2019 at 02:11:23PM -0400, Waiman Long wrote:
>> On 5/24/19 1:39 PM, Will Deacon wrote:
>>
>> And the whole "not precise" thing should be documented, of course.
>>
>> Yes, I will update the patch to document the fact that the count may
>> not be precise. Anyway, even if we have a 1-2% error, it is not a big
>> deal in terms of presenting a global picture of what operations are
>> being done.
>>
>> I suppose one alternative would be to have a per-cpu local_t variable,
>> and do the increments on that. However, that's probably worse than the
>> current approach for x86.
>>
>> I don't quite understand what you mean by a per-cpu local_t variable. A per-cpu
>> variable is either statically allocated or dynamically allocated. Even with
>> dynamic allocation, the same problem exists, I think, unless you differentiate
>> between irq context and process context. That will make it a lot messier,
>> I think.
> So I haven't actually tried this to see if it works, but all I meant was
> that you could replace the current:
>
> DECLARE_PER_CPU(unsigned long, lockevents[lockevent_num]);
>
> with:
>
> DECLARE_PER_CPU(local_t, lockevents[lockevent_num]);
>
> and then rework the inc/add macros to use a combination of raw_cpu_ptr
> and local_inc().
>
> I think that would allow you to get rid of the #ifdeffery, but it may
> introduce a small overhead for x86.

OK, I was not aware of the local_t type. Anyway, the x86 local_t type
performs a similar single-instruction update. On other architectures that
can't do that, it will be a real atomic operation, which will be more
costly.

-Longman