2023-11-22 20:03:33

by Juntong Deng

[permalink] [raw]
Subject: [PATCH] kfence: Replace local_clock() with ktime_get_boot_fast_ns()

The time obtained by local_clock() is the local CPU time, which may
drift between CPUs and is not suitable for comparison across CPUs.

It is possible for allocation and free to occur on different CPUs,
and using local_clock() to record timestamps may cause confusion.

ktime_get_boot_fast_ns() is based on clock sources and can be used
reliably and accurately for comparison across CPUs.

Signed-off-by: Juntong Deng <[email protected]>
---
mm/kfence/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 3872528d0963..041c03394193 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -295,7 +295,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex
track->num_stack_entries = num_stack_entries;
track->pid = task_pid_nr(current);
track->cpu = raw_smp_processor_id();
- track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
+ track->ts_nsec = ktime_get_boot_fast_ns();

/*
* Pairs with READ_ONCE() in
--
2.39.2


2023-11-22 20:36:51

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH] kfence: Replace local_clock() with ktime_get_boot_fast_ns()

On Wed, 22 Nov 2023 at 21:01, Juntong Deng <[email protected]> wrote:
>
> The time obtained by local_clock() is the local CPU time, which may
> drift between CPUs and is not suitable for comparison across CPUs.
>
> It is possible for allocation and free to occur on different CPUs,
> and using local_clock() to record timestamps may cause confusion.

The same problem exists with printk logging.

> ktime_get_boot_fast_ns() is based on clock sources and can be used
> reliably and accurately for comparison across CPUs.

You may be right here, however, the choice of local_clock() was
deliberate: it's the same timestamp source that printk uses.

Also, on systems where there is drift, the arch selects
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK (like on x86) and the drift is
generally bounded.

> Signed-off-by: Juntong Deng <[email protected]>
> ---
> mm/kfence/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index 3872528d0963..041c03394193 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -295,7 +295,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex
> track->num_stack_entries = num_stack_entries;
> track->pid = task_pid_nr(current);
> track->cpu = raw_smp_processor_id();
> - track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
> + track->ts_nsec = ktime_get_boot_fast_ns();

You have ignored the comment placed here - now it's no longer the same
source as printk timestamps. I think not being able to correlate
information from KFENCE reports with timestamps in lines from printk
is worse.

For now, I have to Nack: Unless you can prove that
ktime_get_boot_fast_ns() can still be correlated with timestamps from
printk timestamps, I think this change only trades one problem for
another.

Thanks,
-- Marco

2023-11-22 21:37:14

by Juntong Deng

[permalink] [raw]
Subject: Re: [PATCH] kfence: Replace local_clock() with ktime_get_boot_fast_ns()

On 2023/11/23 4:35, Marco Elver wrote:
> On Wed, 22 Nov 2023 at 21:01, Juntong Deng <[email protected]> wrote:
>>
>> The time obtained by local_clock() is the local CPU time, which may
>> drift between CPUs and is not suitable for comparison across CPUs.
>>
>> It is possible for allocation and free to occur on different CPUs,
>> and using local_clock() to record timestamps may cause confusion.
>
> The same problem exists with printk logging.
>
>> ktime_get_boot_fast_ns() is based on clock sources and can be used
>> reliably and accurately for comparison across CPUs.
>
> You may be right here, however, the choice of local_clock() was
> deliberate: it's the same timestamp source that printk uses.
>
> Also, on systems where there is drift, the arch selects
> CONFIG_HAVE_UNSTABLE_SCHED_CLOCK (like on x86) and the drift is
> generally bounded.
>
>> Signed-off-by: Juntong Deng <[email protected]>
>> ---
>> mm/kfence/core.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
>> index 3872528d0963..041c03394193 100644
>> --- a/mm/kfence/core.c
>> +++ b/mm/kfence/core.c
>> @@ -295,7 +295,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex
>> track->num_stack_entries = num_stack_entries;
>> track->pid = task_pid_nr(current);
>> track->cpu = raw_smp_processor_id();
>> - track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
>> + track->ts_nsec = ktime_get_boot_fast_ns();
>
> You have ignored the comment placed here - now it's no longer the same
> source as printk timestamps. I think not being able to correlate
> information from KFENCE reports with timestamps in lines from printk
> is worse.
>
> For now, I have to Nack: Unless you can prove that
> ktime_get_boot_fast_ns() can still be correlated with timestamps from
> printk timestamps, I think this change only trades one problem for
> another.
>
> Thanks,
> -- Marco

Honestly, the possibility of accurately matching a message in the printk
log by the timestamp in the kfence report is very low, since allocation
and free do not directly correspond to a certain event.

Since time drifts across CPUs, timestamps may be different even if
allocation and free can correspond to a certain event.

If we really need to find the relevant printk logs by the timestamps in
the kfence report, all we can do is to look for messages that are within
a certain time range.

If we are looking for messages in a certain time range, there is not
much difference between local_clock() and ktime_get_boot_fast_ns().

Also, this patch is in preparation for my next patch.

My next patch is to show the PID, CPU number, and timestamp when the
error occurred, in this case time drift from different CPUs can
cause confusion.

For example, use-after-free caused by a subtle race condition, in which
the time between the free and the error occur will be very close.

Time drift from different CPUs may cause it to appear in the report that
the error timestamp precedes the free timestamp.

2023-11-22 22:21:21

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH] kfence: Replace local_clock() with ktime_get_boot_fast_ns()

On Wed, 22 Nov 2023 at 22:36, Juntong Deng <[email protected]> wrote:
>
> On 2023/11/23 4:35, Marco Elver wrote:
> > On Wed, 22 Nov 2023 at 21:01, Juntong Deng <[email protected]> wrote:
> >>
> >> The time obtained by local_clock() is the local CPU time, which may
> >> drift between CPUs and is not suitable for comparison across CPUs.
> >>
> >> It is possible for allocation and free to occur on different CPUs,
> >> and using local_clock() to record timestamps may cause confusion.
> >
> > The same problem exists with printk logging.
> >
> >> ktime_get_boot_fast_ns() is based on clock sources and can be used
> >> reliably and accurately for comparison across CPUs.
> >
> > You may be right here, however, the choice of local_clock() was
> > deliberate: it's the same timestamp source that printk uses.
> >
> > Also, on systems where there is drift, the arch selects
> > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK (like on x86) and the drift is
> > generally bounded.
> >
> >> Signed-off-by: Juntong Deng <[email protected]>
> >> ---
> >> mm/kfence/core.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> >> index 3872528d0963..041c03394193 100644
> >> --- a/mm/kfence/core.c
> >> +++ b/mm/kfence/core.c
> >> @@ -295,7 +295,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex
> >> track->num_stack_entries = num_stack_entries;
> >> track->pid = task_pid_nr(current);
> >> track->cpu = raw_smp_processor_id();
> >> - track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
> >> + track->ts_nsec = ktime_get_boot_fast_ns();
> >
> > You have ignored the comment placed here - now it's no longer the same
> > source as printk timestamps. I think not being able to correlate
> > information from KFENCE reports with timestamps in lines from printk
> > is worse.
> >
> > For now, I have to Nack: Unless you can prove that
> > ktime_get_boot_fast_ns() can still be correlated with timestamps from
> > printk timestamps, I think this change only trades one problem for
> > another.
> >
> > Thanks,
> > -- Marco
>
> Honestly, the possibility of accurately matching a message in the printk
> log by the timestamp in the kfence report is very low, since allocation
> and free do not directly correspond to a certain event.

It's about being able to compare the timestamps. I don't want to match
an exact event, but be able to figure out which event happened
before/after an allocation or free, i.e. the logical ordering of
events.

With CONFIG_PRINTK_CALLER we can see the CPU ID in printk lines and
are therefore able to accurately compare printk lines with information
given by KFENCE alloc/free info.

> Since time drifts across CPUs, timestamps may be different even if
> allocation and free can correspond to a certain event.

This is not a problem with CONFIG_PRINTK_CALLER.

> If we really need to find the relevant printk logs by the timestamps in
> the kfence report, all we can do is to look for messages that are within
> a certain time range.
>
> If we are looking for messages in a certain time range, there is not
> much difference between local_clock() and ktime_get_boot_fast_ns().
>
> Also, this patch is in preparation for my next patch.
>
> My next patch is to show the PID, CPU number, and timestamp when the
> error occurred, in this case time drift from different CPUs can
> cause confusion.

It's not quite clear how there's a dependency between this patch and a
later patch, but generally it's good practice to send related patches
as a patch series. That way it's easier to see what the overall
changes are and provide feedback as a whole - as is, it's difficult to
provide feedback.

However, from what you say this information is already given.
dump_stack_print_info() shows this - e.g this bit here is printed by
where the error occurred:

| CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2
04/01/2014

And if the printk log has timestamps, then these lines are prefixed
with the timestamp where the error occurred.

> For example, use-after-free caused by a subtle race condition, in which
> the time between the free and the error occur will be very close.
>
> Time drift from different CPUs may cause it to appear in the report that
> the error timestamp precedes the free timestamp.

That doesn't matter. I recommend that you go through a hypothetical
debugging scenario:
1. We are not interested in the absolute timings of events, but the
logical ordering between them.

2. The logical ordering of events is inherent from how KFENCE
operates: an error _always_ follows an allocation and/or free. From a
debugging point of view, the timestamps do not have any value here.

3. The timestamps _do_ add value when trying to figure out the logical
ordering between allocation, free, or the erroneous access _with
other_ events in the system. A stream of other events is always shown
in the kernel log (printk). Other streams of events can be obtained
via e.g. ftrace (which also uses local_clock(), see
kernel/trace/trace_clock.c).

So, the timestamp that KFENCE should show is the one that most likely
allows us to deduce the logical ordering with other events in the
system.

2023-11-23 09:31:06

by Juntong Deng

[permalink] [raw]
Subject: Re: [PATCH] kfence: Replace local_clock() with ktime_get_boot_fast_ns()

On 2023/11/23 6:19, Marco Elver wrote:
> On Wed, 22 Nov 2023 at 22:36, Juntong Deng <[email protected]> wrote:
>>
>> On 2023/11/23 4:35, Marco Elver wrote:
>>> On Wed, 22 Nov 2023 at 21:01, Juntong Deng <[email protected]> wrote:
>>>>
>>>> The time obtained by local_clock() is the local CPU time, which may
>>>> drift between CPUs and is not suitable for comparison across CPUs.
>>>>
>>>> It is possible for allocation and free to occur on different CPUs,
>>>> and using local_clock() to record timestamps may cause confusion.
>>>
>>> The same problem exists with printk logging.
>>>
>>>> ktime_get_boot_fast_ns() is based on clock sources and can be used
>>>> reliably and accurately for comparison across CPUs.
>>>
>>> You may be right here, however, the choice of local_clock() was
>>> deliberate: it's the same timestamp source that printk uses.
>>>
>>> Also, on systems where there is drift, the arch selects
>>> CONFIG_HAVE_UNSTABLE_SCHED_CLOCK (like on x86) and the drift is
>>> generally bounded.
>>>
>>>> Signed-off-by: Juntong Deng <[email protected]>
>>>> ---
>>>> mm/kfence/core.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
>>>> index 3872528d0963..041c03394193 100644
>>>> --- a/mm/kfence/core.c
>>>> +++ b/mm/kfence/core.c
>>>> @@ -295,7 +295,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex
>>>> track->num_stack_entries = num_stack_entries;
>>>> track->pid = task_pid_nr(current);
>>>> track->cpu = raw_smp_processor_id();
>>>> - track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
>>>> + track->ts_nsec = ktime_get_boot_fast_ns();
>>>
>>> You have ignored the comment placed here - now it's no longer the same
>>> source as printk timestamps. I think not being able to correlate
>>> information from KFENCE reports with timestamps in lines from printk
>>> is worse.
>>>
>>> For now, I have to Nack: Unless you can prove that
>>> ktime_get_boot_fast_ns() can still be correlated with timestamps from
>>> printk timestamps, I think this change only trades one problem for
>>> another.
>>>
>>> Thanks,
>>> -- Marco
>>
>> Honestly, the possibility of accurately matching a message in the printk
>> log by the timestamp in the kfence report is very low, since allocation
>> and free do not directly correspond to a certain event.
>
> It's about being able to compare the timestamps. I don't want to match
> an exact event, but be able to figure out which event happened
> before/after an allocation or free, i.e. the logical ordering of
> events.
>
> With CONFIG_PRINTK_CALLER we can see the CPU ID in printk lines and
> are therefore able to accurately compare printk lines with information
> given by KFENCE alloc/free info.
>


That makes sense.


>> Since time drifts across CPUs, timestamps may be different even if
>> allocation and free can correspond to a certain event.
>
> This is not a problem with CONFIG_PRINTK_CALLER.
>
>> If we really need to find the relevant printk logs by the timestamps in
>> the kfence report, all we can do is to look for messages that are within
>> a certain time range.
>>
>> If we are looking for messages in a certain time range, there is not
>> much difference between local_clock() and ktime_get_boot_fast_ns().
>>
>> Also, this patch is in preparation for my next patch.
>>
>> My next patch is to show the PID, CPU number, and timestamp when the
>> error occurred, in this case time drift from different CPUs can
>> cause confusion.
>
> It's not quite clear how there's a dependency between this patch and a
> later patch, but generally it's good practice to send related patches
> as a patch series. That way it's easier to see what the overall
> changes are and provide feedback as a whole - as is, it's difficult to
> provide feedback.
>
> However, from what you say this information is already given.
> dump_stack_print_info() shows this - e.g this bit here is printed by
> where the error occurred:
>
> | CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2
> 04/01/2014
>
> And if the printk log has timestamps, then these lines are prefixed
> with the timestamp where the error occurred.
>


Thanks, I found that information.

Since this information is at the bottom of the report, I had previously
ignored them.

I would suggest considering moving this information to the top of
the report, for example

BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234

CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2
04/01/2014

Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
...

This more clearly correlates this information with the occurrence of
the error.

If the timestamp of the printk is used as the timestamp of when the
error occurred, then my patch is unnecessary.


>> For example, use-after-free caused by a subtle race condition, in which
>> the time between the free and the error occur will be very close.
>>
>> Time drift from different CPUs may cause it to appear in the report that
>> the error timestamp precedes the free timestamp.
>
> That doesn't matter. I recommend that you go through a hypothetical
> debugging scenario:
> 1. We are not interested in the absolute timings of events, but the
> logical ordering between them.
>
> 2. The logical ordering of events is inherent from how KFENCE
> operates: an error _always_ follows an allocation and/or free. From a
> debugging point of view, the timestamps do not have any value here.
>
> 3. The timestamps _do_ add value when trying to figure out the logical
> ordering between allocation, free, or the erroneous access _with
> other_ events in the system. A stream of other events is always shown
> in the kernel log (printk). Other streams of events can be obtained
> via e.g. ftrace (which also uses local_clock(), see
> kernel/trace/trace_clock.c).
>
> So, the timestamp that KFENCE should show is the one that most likely
> allows us to deduce the logical ordering with other events in the
> system.


Thanks for the detailed explanation.

I now have a better understanding of the purpose of timestamps in
KFENCE reports.


2023-11-23 09:48:06

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH] kfence: Replace local_clock() with ktime_get_boot_fast_ns()

On Thu, 23 Nov 2023 at 10:29, Juntong Deng <[email protected]> wrote:
>
> On 2023/11/23 6:19, Marco Elver wrote:
> > On Wed, 22 Nov 2023 at 22:36, Juntong Deng <[email protected]> wrote:
> >>
> >> On 2023/11/23 4:35, Marco Elver wrote:
> >>> On Wed, 22 Nov 2023 at 21:01, Juntong Deng <[email protected]> wrote:
> >>>>
> >>>> The time obtained by local_clock() is the local CPU time, which may
> >>>> drift between CPUs and is not suitable for comparison across CPUs.
> >>>>
> >>>> It is possible for allocation and free to occur on different CPUs,
> >>>> and using local_clock() to record timestamps may cause confusion.
> >>>
> >>> The same problem exists with printk logging.
> >>>
> >>>> ktime_get_boot_fast_ns() is based on clock sources and can be used
> >>>> reliably and accurately for comparison across CPUs.
> >>>
> >>> You may be right here, however, the choice of local_clock() was
> >>> deliberate: it's the same timestamp source that printk uses.
> >>>
> >>> Also, on systems where there is drift, the arch selects
> >>> CONFIG_HAVE_UNSTABLE_SCHED_CLOCK (like on x86) and the drift is
> >>> generally bounded.
> >>>
> >>>> Signed-off-by: Juntong Deng <[email protected]>
> >>>> ---
> >>>> mm/kfence/core.c | 2 +-
> >>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> >>>> index 3872528d0963..041c03394193 100644
> >>>> --- a/mm/kfence/core.c
> >>>> +++ b/mm/kfence/core.c
> >>>> @@ -295,7 +295,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex
> >>>> track->num_stack_entries = num_stack_entries;
> >>>> track->pid = task_pid_nr(current);
> >>>> track->cpu = raw_smp_processor_id();
> >>>> - track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
> >>>> + track->ts_nsec = ktime_get_boot_fast_ns();
> >>>
> >>> You have ignored the comment placed here - now it's no longer the same
> >>> source as printk timestamps. I think not being able to correlate
> >>> information from KFENCE reports with timestamps in lines from printk
> >>> is worse.
> >>>
> >>> For now, I have to Nack: Unless you can prove that
> >>> ktime_get_boot_fast_ns() can still be correlated with timestamps from
> >>> printk timestamps, I think this change only trades one problem for
> >>> another.
> >>>
> >>> Thanks,
> >>> -- Marco
> >>
> >> Honestly, the possibility of accurately matching a message in the printk
> >> log by the timestamp in the kfence report is very low, since allocation
> >> and free do not directly correspond to a certain event.
> >
> > It's about being able to compare the timestamps. I don't want to match
> > an exact event, but be able to figure out which event happened
> > before/after an allocation or free, i.e. the logical ordering of
> > events.
> >
> > With CONFIG_PRINTK_CALLER we can see the CPU ID in printk lines and
> > are therefore able to accurately compare printk lines with information
> > given by KFENCE alloc/free info.
> >
>
>
> That makes sense.
>
>
> >> Since time drifts across CPUs, timestamps may be different even if
> >> allocation and free can correspond to a certain event.
> >
> > This is not a problem with CONFIG_PRINTK_CALLER.
> >
> >> If we really need to find the relevant printk logs by the timestamps in
> >> the kfence report, all we can do is to look for messages that are within
> >> a certain time range.
> >>
> >> If we are looking for messages in a certain time range, there is not
> >> much difference between local_clock() and ktime_get_boot_fast_ns().
> >>
> >> Also, this patch is in preparation for my next patch.
> >>
> >> My next patch is to show the PID, CPU number, and timestamp when the
> >> error occurred, in this case time drift from different CPUs can
> >> cause confusion.
> >
> > It's not quite clear how there's a dependency between this patch and a
> > later patch, but generally it's good practice to send related patches
> > as a patch series. That way it's easier to see what the overall
> > changes are and provide feedback as a whole - as is, it's difficult to
> > provide feedback.
> >
> > However, from what you say this information is already given.
> > dump_stack_print_info() shows this - e.g this bit here is printed by
> > where the error occurred:
> >
> > | CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
> > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2
> > 04/01/2014
> >
> > And if the printk log has timestamps, then these lines are prefixed
> > with the timestamp where the error occurred.
> >
>
>
> Thanks, I found that information.
>
> Since this information is at the bottom of the report, I had previously
> ignored them.
>
> I would suggest considering moving this information to the top of
> the report, for example
>
> BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234
>
> CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2
> 04/01/2014
>
> Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
> ...
>
> This more clearly correlates this information with the occurrence of
> the error.

Most kernel warnings/bugs/etc. show this information at the bottom of
the report, hence KFENCE also showing it there. If you look at
kfence_report_error() where it prints this info, there is a mode where
KFENCE also dumps all registers via show_regs(). show_regs() itself
displays this information at the bottom as well, but showing a dump of
registers at the start of the KFENCE report is pretty distracting and
looks ugly.

The placement of this information is not the best, but at the same
time I found it to be the least bad compromise (when also considering
the mode where it dumps registers). We could of course untangle some
of these functions and e.g. have a show_regs() version that doesn't
show that info, but I find that to add more interfaces to the kernel
with unclear gains - overall probably not worth the time effort.

At least that's the reasoning for why things are the way they are
today. If there is an easier way I missed, any clear improvements are
of course welcome.

Thanks,
-- Marco