Introduce a fast/NMI-safe accessor to clock tai for tracing. The Linux kernel
tracing infrastructure has support for using different clocks to generate
timestamps for trace events. Especially in TSN (Time-Sensitive Networking)
networks it's useful to have TAI as the trace clock, because application
scheduling is done in accordance with the network time, which is based on TAI.
With a tai trace_clock in place, it becomes very convenient to correlate
network activity with Linux kernel application traces.
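As an aside, the TAI/UTC relationship referred to here is directly observable
from user space. A minimal sketch, assuming the kernel's TAI offset has
already been set (e.g. by ptp4l/phc2sys or an NTP daemon; with the default
offset of 0 both clocks read the same):

  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
          struct timespec tai, real;

          clock_gettime(CLOCK_TAI, &tai);
          clock_gettime(CLOCK_REALTIME, &real);
          /* roughly 37 s in 2022 once the offset is set; approximate,
           * as the two reads are not taken atomically */
          printf("TAI - REALTIME = %ld s\n", (long)(tai.tv_sec - real.tv_sec));
          return 0;
  }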
Use the same implementation as ktime_get_boot_fast_ns() does by reading the
monotonic time and adding the TAI offset. The same limitations as for the fast
boot implementation apply. The TAI offset may change at run time, e.g. by
setting the time or using adjtimex() with an offset. However, these kinds of
offset changes are rare events. Nevertheless, the user has to be aware of this
and deal with it in post processing.
An alternative approach would be to use the same implementation as
ktime_get_real_fast_ns() does. However, that would require adding an
additional u64 member to the tk_read_base struct. This struct, together with a
seqcount, is designed to fit into a single cache line on 64-bit architectures.
Adding a new member would violate this constraint.
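For reference, the 5.18-era struct being discussed looks roughly like this
(field sizes annotated from reading the source; a sketch, not part of the
patch):

  struct tk_read_base {
          struct clocksource      *clock;         /*  8 bytes */
          u64                     mask;           /*  8 bytes */
          u64                     cycle_last;     /*  8 bytes */
          u32                     mult;           /*  4 bytes */
          u32                     shift;          /*  4 bytes */
          u64                     xtime_nsec;     /*  8 bytes */
          ktime_t                 base;           /*  8 bytes */
          u64                     base_real;      /*  8 bytes */
  };                                              /* 56 bytes total */

At 56 bytes plus the seqcount it fills the 64-byte cache line; another u64
member would push it over.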
Signed-off-by: Kurt Kanzenbach <[email protected]>
---
Documentation/core-api/timekeeping.rst | 1 +
include/linux/timekeeping.h | 1 +
kernel/time/timekeeping.c | 17 +++++++++++++++++
3 files changed, 19 insertions(+)
diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst
index 729e24864fe7..22ec68f24421 100644
--- a/Documentation/core-api/timekeeping.rst
+++ b/Documentation/core-api/timekeeping.rst
@@ -132,6 +132,7 @@ Some additional variants exist for more specialized cases:
.. c:function:: u64 ktime_get_mono_fast_ns( void )
u64 ktime_get_raw_fast_ns( void )
u64 ktime_get_boot_fast_ns( void )
+ u64 ktime_get_tai_fast_ns( void )
u64 ktime_get_real_fast_ns( void )
These variants are safe to call from any context, including from
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 78a98bdff76d..fe1e467ba046 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -177,6 +177,7 @@ static inline u64 ktime_get_raw_ns(void)
extern u64 ktime_get_mono_fast_ns(void);
extern u64 ktime_get_raw_fast_ns(void);
extern u64 ktime_get_boot_fast_ns(void);
+extern u64 ktime_get_tai_fast_ns(void);
extern u64 ktime_get_real_fast_ns(void);
/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index dcdcb85121e4..2c22023fbf5f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -532,6 +532,23 @@ u64 notrace ktime_get_boot_fast_ns(void)
}
EXPORT_SYMBOL_GPL(ktime_get_boot_fast_ns);
+/**
+ * ktime_get_tai_fast_ns - NMI safe and fast access to tai clock.
+ *
+ * The same limitations as described for ktime_get_boot_fast_ns() apply. The
+ * mono time and the TAI offset are not read atomically, which may yield wrong
+ * readouts. However, an update of the TAI offset is a rare event, e.g. caused
+ * by settime or adjtimex with an offset. The user of this function has to deal
+ * with the possibility of wrong timestamps in post processing.
+ */
+u64 notrace ktime_get_tai_fast_ns(void)
+{
+        struct timekeeper *tk = &tk_core.timekeeper;
+
+        return (ktime_get_mono_fast_ns() + ktime_to_ns(data_race(tk->offs_tai)));
+}
+EXPORT_SYMBOL_GPL(ktime_get_tai_fast_ns);
+
static __always_inline u64 __ktime_get_real_fast(struct tk_fast *tkf, u64 *mono)
{
struct tk_read_base *tkr;
--
2.30.2
The following commit has been merged into the timers/core branch of tip:
Commit-ID: 3dc6ffae2da201284cb24af66af77ee0bbb2efaa
Gitweb: https://git.kernel.org/tip/3dc6ffae2da201284cb24af66af77ee0bbb2efaa
Author: Kurt Kanzenbach <[email protected]>
AuthorDate: Thu, 14 Apr 2022 11:18:03 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Thu, 14 Apr 2022 16:19:30 +02:00
timekeeping: Introduce fast accessor to clock tai
Introduce a fast/NMI-safe accessor to clock tai for tracing. The Linux kernel
tracing infrastructure has support for using different clocks to generate
timestamps for trace events. Especially in TSN (Time-Sensitive Networking)
networks it's useful to have TAI as the trace clock, because application
scheduling is done in accordance with the network time, which is based on TAI.
With a tai trace_clock in place, it becomes very convenient to correlate
network activity with Linux kernel application traces.
Use the same implementation as ktime_get_boot_fast_ns() does by reading the
monotonic time and adding the TAI offset. The same limitations as for the fast
boot implementation apply. The TAI offset may change at run time, e.g. by
setting the time or using adjtimex() with an offset. However, these kinds of
offset changes are rare events. Nevertheless, the user has to be aware of this
and deal with it in post processing.
An alternative approach would be to use the same implementation as
ktime_get_real_fast_ns() does. However, that would require adding an
additional u64 member to the tk_read_base struct. This struct, together with a
seqcount, is designed to fit into a single cache line on 64-bit architectures.
Adding a new member would violate this constraint.
Signed-off-by: Kurt Kanzenbach <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/core-api/timekeeping.rst | 1 +
include/linux/timekeeping.h | 1 +
kernel/time/timekeeping.c | 17 +++++++++++++++++
3 files changed, 19 insertions(+)
diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst
index 729e248..22ec68f 100644
--- a/Documentation/core-api/timekeeping.rst
+++ b/Documentation/core-api/timekeeping.rst
@@ -132,6 +132,7 @@ Some additional variants exist for more specialized cases:
.. c:function:: u64 ktime_get_mono_fast_ns( void )
u64 ktime_get_raw_fast_ns( void )
u64 ktime_get_boot_fast_ns( void )
+ u64 ktime_get_tai_fast_ns( void )
u64 ktime_get_real_fast_ns( void )
These variants are safe to call from any context, including from
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 78a98bd..fe1e467 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -177,6 +177,7 @@ static inline u64 ktime_get_raw_ns(void)
extern u64 ktime_get_mono_fast_ns(void);
extern u64 ktime_get_raw_fast_ns(void);
extern u64 ktime_get_boot_fast_ns(void);
+extern u64 ktime_get_tai_fast_ns(void);
extern u64 ktime_get_real_fast_ns(void);
/*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index dcdcb85..2c22023 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -532,6 +532,23 @@ u64 notrace ktime_get_boot_fast_ns(void)
}
EXPORT_SYMBOL_GPL(ktime_get_boot_fast_ns);
+/**
+ * ktime_get_tai_fast_ns - NMI safe and fast access to tai clock.
+ *
+ * The same limitations as described for ktime_get_boot_fast_ns() apply. The
+ * mono time and the TAI offset are not read atomically, which may yield wrong
+ * readouts. However, an update of the TAI offset is a rare event, e.g. caused
+ * by settime or adjtimex with an offset. The user of this function has to deal
+ * with the possibility of wrong timestamps in post processing.
+ */
+u64 notrace ktime_get_tai_fast_ns(void)
+{
+        struct timekeeper *tk = &tk_core.timekeeper;
+
+        return (ktime_get_mono_fast_ns() + ktime_to_ns(data_race(tk->offs_tai)));
+}
+EXPORT_SYMBOL_GPL(ktime_get_tai_fast_ns);
+
static __always_inline u64 __ktime_get_real_fast(struct tk_fast *tkf, u64 *mono)
{
struct tk_read_base *tkr;
On Thu, Apr 14 2022 at 11:18, Kurt Kanzenbach wrote:
> Introduce a fast/NMI-safe accessor to clock tai for tracing. The Linux kernel
> tracing infrastructure has support for using different clocks to generate
> timestamps for trace events. Especially in TSN (Time-Sensitive Networking)
> networks it's useful to have TAI as the trace clock, because application
> scheduling is done in accordance with the network time, which is based on TAI.
> With a tai trace_clock in place, it becomes very convenient to correlate
> network activity with Linux kernel application traces.
Steven, this is available for you to pull via:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip tai-for-tracing
It's a single tagged commit on top of 5.18-rc2
Thanks,
tglx
On Thu, 14 Apr 2022 16:59:24 +0200
Thomas Gleixner <[email protected]> wrote:
> Steven, this is available for you to pull via:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip tai-for-tracing
>
> It's a single tagged commit on top of 5.18-rc2
Thanks. As I have not started putting together my next merge window series
yet, I can start off of this commit.
-- Steve
On Thu, 14 Apr 2022 11:18:03 +0200
Kurt Kanzenbach <[email protected]> wrote:
I finally ran this series through my tests, and it has some issues.
> Introduce a fast/NMI-safe accessor to clock tai for tracing. The Linux kernel
> tracing infrastructure has support for using different clocks to generate
> timestamps for trace events. Especially in TSN (Time-Sensitive Networking)
> networks it's useful to have TAI as the trace clock, because application
> scheduling is done in accordance with the network time, which is based on TAI.
> With a tai trace_clock in place, it becomes very convenient to correlate
> network activity with Linux kernel application traces.
>
> Use the same implementation as ktime_get_boot_fast_ns() does by reading the
> monotonic time and adding the TAI offset. The same limitations as for the fast
> boot implementation apply. The TAI offset may change at run time, e.g. by
> setting the time or using adjtimex() with an offset. However, these kinds of
> offset changes are rare events. Nevertheless, the user has to be aware of this
> and deal with it in post processing.
>
> An alternative approach would be to use the same implementation as
> ktime_get_real_fast_ns() does. However, that would require adding an
> additional u64 member to the tk_read_base struct. This struct, together with a
> seqcount, is designed to fit into a single cache line on 64-bit architectures.
> Adding a new member would violate this constraint.
>
> Signed-off-by: Kurt Kanzenbach <[email protected]>
> ---
> Documentation/core-api/timekeeping.rst | 1 +
> include/linux/timekeeping.h | 1 +
> kernel/time/timekeeping.c | 17 +++++++++++++++++
> 3 files changed, 19 insertions(+)
>
> diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst
> index 729e24864fe7..22ec68f24421 100644
> --- a/Documentation/core-api/timekeeping.rst
> +++ b/Documentation/core-api/timekeeping.rst
> @@ -132,6 +132,7 @@ Some additional variants exist for more specialized cases:
> .. c:function:: u64 ktime_get_mono_fast_ns( void )
> u64 ktime_get_raw_fast_ns( void )
> u64 ktime_get_boot_fast_ns( void )
> + u64 ktime_get_tai_fast_ns( void )
> u64 ktime_get_real_fast_ns( void )
>
> These variants are safe to call from any context, including from
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index 78a98bdff76d..fe1e467ba046 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -177,6 +177,7 @@ static inline u64 ktime_get_raw_ns(void)
> extern u64 ktime_get_mono_fast_ns(void);
> extern u64 ktime_get_raw_fast_ns(void);
> extern u64 ktime_get_boot_fast_ns(void);
> +extern u64 ktime_get_tai_fast_ns(void);
> extern u64 ktime_get_real_fast_ns(void);
>
> /*
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index dcdcb85121e4..2c22023fbf5f 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -532,6 +532,23 @@ u64 notrace ktime_get_boot_fast_ns(void)
> }
> EXPORT_SYMBOL_GPL(ktime_get_boot_fast_ns);
>
> +/**
> + * ktime_get_tai_fast_ns - NMI safe and fast access to tai clock.
> + *
> + * The same limitations as described for ktime_get_boot_fast_ns() apply. The
> + * mono time and the TAI offset are not read atomically, which may yield wrong
> + * readouts. However, an update of the TAI offset is a rare event, e.g. caused
> + * by settime or adjtimex with an offset. The user of this function has to deal
> + * with the possibility of wrong timestamps in post processing.
> + */
> +u64 notrace ktime_get_tai_fast_ns(void)
> +{
> + struct timekeeper *tk = &tk_core.timekeeper;
> +
> + return (ktime_get_mono_fast_ns() + ktime_to_ns(data_race(tk->offs_tai)));
As you are using this for tracing, can you open code the
ktime_get_mono_fast_ns(), otherwise we need to mark that function as
notrace. Not to mention, this is a fast path and using the noinline of
__ktime_get_fast_ns() should be less overhead.
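A minimal sketch of the open-coded variant being suggested (my reading of it,
untested; it assumes the function sits below the tk_fast_mono definition so
the __always_inline helper is visible):

  u64 notrace ktime_get_tai_fast_ns(void)
  {
          struct timekeeper *tk = &tk_core.timekeeper;

          /* inlines the mono readout instead of calling the traced symbol */
          return __ktime_get_fast_ns(&tk_fast_mono) +
                 ktime_to_ns(data_race(tk->offs_tai));
  }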
That said, I hit this too:
less-5071 [000] d.h2. 498087876.351330: do_raw_spin_trylock <-_raw_spin_lock
less-5071 [000] d.h4. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
less-5071 [000] d.h5. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
less-5071 [000] d.h3. 498087876.351334: rcu_read_lock_sched_held <-lock_acquired
less-5071 [000] d.h5. 498087876.351337: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
kworker/u8:1-45 [003] d.h5. 1651009380.982749: rcu_read_lock_held_common <-rcu_read_lock_sched_held
kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
kworker/u8:1-45 [003] d.h5. 498087876.375905: update_cfs_group <-task_tick_fair
kworker/u8:1-45 [003] d.h7. 498087876.375909: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
The clock seems to be toggling between 1651009380 and 498087876, causing the
ftrace ring buffer to shut down (it doesn't allow time to go backwards).
This is running on a 32 bit x86.
-- Steve
> +}
> +EXPORT_SYMBOL_GPL(ktime_get_tai_fast_ns);
> +
> static __always_inline u64 __ktime_get_real_fast(struct tk_fast *tkf, u64 *mono)
> {
> struct tk_read_base *tkr;
On Tue Apr 26 2022, Steven Rostedt wrote:
> On Thu, 14 Apr 2022 11:18:03 +0200
> Kurt Kanzenbach <[email protected]> wrote:
>
> I finally ran this series through my tests, and it has some issues.
Thanks!
[snip]
>> +/**
>> + * ktime_get_tai_fast_ns - NMI safe and fast access to tai clock.
>> + *
>> + * The same limitations as described for ktime_get_boot_fast_ns() apply. The
>> + * mono time and the TAI offset are not read atomically, which may yield wrong
>> + * readouts. However, an update of the TAI offset is a rare event, e.g. caused
>> + * by settime or adjtimex with an offset. The user of this function has to deal
>> + * with the possibility of wrong timestamps in post processing.
>> + */
>> +u64 notrace ktime_get_tai_fast_ns(void)
>> +{
>> + struct timekeeper *tk = &tk_core.timekeeper;
>> +
>> + return (ktime_get_mono_fast_ns() + ktime_to_ns(data_race(tk->offs_tai)));
>
> As you are using this for tracing, can you open code the
> ktime_get_mono_fast_ns(), otherwise we need to mark that function as
> notrace. Not to mention, this is a fast path and using the noinline of
> __ktime_get_fast_ns() should be less overhead.
Looking at the other functions used for tracing:
- ktime_get_mono_fast_ns - no annotations
- ktime_get_raw_fast_ns - no annotations
- ktime_get_boot_fast_ns - notrace
- ktime_get_tai_fast_ns - notrace
Seems a little bit inconsistent.
>
> That said, I hit this too:
>
> less-5071 [000] d.h2. 498087876.351330: do_raw_spin_trylock <-_raw_spin_lock
> less-5071 [000] d.h4. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> less-5071 [000] d.h5. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> less-5071 [000] d.h3. 498087876.351334: rcu_read_lock_sched_held <-lock_acquired
> less-5071 [000] d.h5. 498087876.351337: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h5. 1651009380.982749: rcu_read_lock_held_common <-rcu_read_lock_sched_held
> kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h5. 498087876.375905: update_cfs_group <-task_tick_fair
> kworker/u8:1-45 [003] d.h7. 498087876.375909: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>
> The clock seems to be toggling between 1651009380 and 498087876, causing the
> ftrace ring buffer to shut down (it doesn't allow time to go backwards).
>
> This is running on a 32 bit x86.
Hm. The only problem is that the TAI offset may change. That should only
happen if the time is set or similar. For the TSN use case I've run
ptp4l and phc2sys. phc2sys in the default configuration sets the offset
hard once and uses frequency adjustments from that point on. I didn't
observe any time jumps. Nevertheless, my test system is based on
arm64. I'll do some testing on x86.
Thanks,
Kurt
On Tue, Apr 26 2022 at 17:53, Steven Rostedt wrote:
>> +/**
>> + * ktime_get_tai_fast_ns - NMI safe and fast access to tai clock.
>> + *
>> + * The same limitations as described for ktime_get_boot_fast_ns() apply. The
>> + * mono time and the TAI offset are not read atomically, which may yield wrong
>> + * readouts. However, an update of the TAI offset is a rare event, e.g. caused
>> + * by settime or adjtimex with an offset. The user of this function has to deal
>> + * with the possibility of wrong timestamps in post processing.
>> + */
>> +u64 notrace ktime_get_tai_fast_ns(void)
>> +{
>> + struct timekeeper *tk = &tk_core.timekeeper;
>> +
>> + return (ktime_get_mono_fast_ns() + ktime_to_ns(data_race(tk->offs_tai)));
>
> As you are using this for tracing, can you open code the
> ktime_get_mono_fast_ns(), otherwise we need to mark that function as
> notrace.
which we should do anyway....
kernel/trace/trace.c: { ktime_get_mono_fast_ns, "mono", 1 },
> Not to mention, this is a fast path and using the noinline of
> __ktime_get_fast_ns() should be less overhead.
>
> That said, I hit this too:
>
> less-5071 [000] d.h2. 498087876.351330: do_raw_spin_trylock <-_raw_spin_lock
> less-5071 [000] d.h4. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> less-5071 [000] d.h5. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> less-5071 [000] d.h3. 498087876.351334: rcu_read_lock_sched_held <-lock_acquired
> less-5071 [000] d.h5. 498087876.351337: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h5. 1651009380.982749: rcu_read_lock_held_common <-rcu_read_lock_sched_held
> kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> kworker/u8:1-45 [003] d.h5. 498087876.375905: update_cfs_group <-task_tick_fair
> kworker/u8:1-45 [003] d.h7. 498087876.375909: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>
> The clock seems to be toggling between 1651009380 and 498087876, causing the
> ftrace ring buffer to shut down (it doesn't allow time to go backwards).
That does not make sense. offs_tai changes only when clock is set via
adjtimex() or settimeofday().
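For reference, the update path in kernel/time/timekeeping.c looks like this in
the 5.18 tree (quoted as a sketch; tk_set_wall_to_mono() recomputes offs_tai
the same way whenever offs_real changes):

  static void __timekeeping_set_tai_offset(struct timekeeper *tk, s32 tai_offset)
  {
          tk->tai_offset = tai_offset;
          tk->offs_tai = ktime_add(tk->offs_real, ktime_set(tai_offset, 0));
  }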
Thanks,
tglx
On Wed, 27 Apr 2022 10:38:59 +0200
Kurt Kanzenbach <[email protected]> wrote:
> Looking at the other functions used for tracing:
>
> - ktime_get_mono_fast_ns - no annotations
> - ktime_get_raw_fast_ns - no annotations
> - ktime_get_boot_fast_ns - notrace
> - ktime_get_tai_fast_ns - notrace
Yeah, all should be notrace.
>
> Seems a little bit inconsistent.
>
> >
> > That said, I hit this too:
> >
> > less-5071 [000] d.h2. 498087876.351330: do_raw_spin_trylock <-_raw_spin_lock
> > less-5071 [000] d.h4. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> > less-5071 [000] d.h5. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> > less-5071 [000] d.h3. 498087876.351334: rcu_read_lock_sched_held <-lock_acquired
> > less-5071 [000] d.h5. 498087876.351337: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> > kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> > kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> > kworker/u8:1-45 [003] d.h5. 1651009380.982749: rcu_read_lock_held_common <-rcu_read_lock_sched_held
> > kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> > kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> > kworker/u8:1-45 [003] d.h5. 498087876.375905: update_cfs_group <-task_tick_fair
> > kworker/u8:1-45 [003] d.h7. 498087876.375909: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
> >
> > The clock seems to be toggling between 1651009380 and 498087876, causing the
> > ftrace ring buffer to shut down (it doesn't allow time to go backwards).
> >
> > This is running on a 32 bit x86.
>
> Hm. The only problem is that the TAI offset may change. That should only
> happen if the time is set or similar. For the TSN use case I've run
> ptp4l and phc2sys. phc2sys in the default configuration sets the offset
> hard once and uses frequency adjustments from that point on. I didn't
> observe any time jumps. Nevertheless, my test system is based on
> arm64. I'll do some testing on x86.
I'm able to trigger this on x86 64bit too.
One thing I noticed is that the two numbers I have (from a different
trace, but very similar to the above)
$ printf "%llx\n" 498151194674148935
6e9c9df4afd3647
$ printf "%llx\n" 1651072699280995911
16e9c9df4afd3647
That is, the top nibble is either 0 or 1, causing the change?
06e9c9df4afd3647
16e9c9df4afd3647
The numbers are the same except for that last nibble.
-- Steve
On Wed, 27 Apr 2022 11:27:59 -0400
Steven Rostedt <[email protected]> wrote:
> I'm able to trigger this on x86 64bit too.
>
> One thing I noticed is that the two numbers I have (from a different
> trace, but very similar to the above)
>
> $ printf "%llx\n" 498151194674148935
> 6e9c9df4afd3647
>
> $ printf "%llx\n" 1651072699280995911
> 16e9c9df4afd3647
>
> That is, the top nibble is either 0 or 1, causing the change?
>
> 06e9c9df4afd3647
> 16e9c9df4afd3647
>
> The numbers are the same except for that last nibble.
And I just realized what the issue is. The ring buffer only allows for 59
bits of timestamp: there are times it needs to inject a timestamp event into
the buffer, which is 16 bytes, and the first 5 bits are meta data, leaving 59
bits for the timestamp. Since no timestamp came close to that before this
clock, everything worked.
Thus, when a full timestamp is injected, the 5 MSBs of the timestamp are
stripped (which dropped that leading 1), and that is what is happening.
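A quick sanity check on the numbers (my own arithmetic, using the full
timestamp from the trace above):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t ts = 1651072699280995911ULL;   /* ~Apr 2022, ns since 1970 */

          /* 2^59 ns is only ~18.3 years, so any wall-clock-derived
           * timestamp past ~1988 has bits set above bit 58 */
          printf("2^59 ns = %.1f years\n",
                 (double)(1ULL << 59) / 1e9 / (365.25 * 24 * 3600));
          printf("ts >> 59 = %llu (nonzero: the MSBs get truncated)\n",
                 (unsigned long long)(ts >> 59));
          return 0;
  }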
I can see if I can play some games to detect this and replace the top 5
bits with the saved timestamp at the head of the sub buffer.
-- Steve
On Wed Apr 27 2022, Steven Rostedt wrote:
> I'm able to trigger this on x86 64bit too.
OK, reproduced it by using the function tracer (instead of events).
Anyway, with your patch applied [1] and the one below it looks way
better:
| bash-4252 [003] ...1. 1651091299.936121: file_ra_state_init <-do_dentry_open
| bash-4252 [003] ...1. 1651091299.936121: inode_to_bdi <-file_ra_state_init
| bash-4252 [003] ...1. 1651091299.936121: do_truncate <-path_openat
| bash-4252 [003] ...1. 1651091299.936122: dentry_needs_remove_privs <-do_truncate
| bash-4252 [003] ...1. 1651091299.936122: should_remove_suid <-dentry_needs_remove_privs
| bash-4252 [003] ...1. 1651091299.936122: security_inode_need_killpriv <-dentry_needs_remove_privs
| bash-4252 [003] ...1. 1651091299.936122: cap_inode_need_killpriv <-security_inode_need_killpriv
| bash-4252 [003] ...1. 1651091299.936123: __vfs_getxattr <-cap_inode_need_killpriv
| bash-4252 [003] ...1. 1651091299.936123: xattr_resolve_name <-__vfs_getxattr
| bash-4252 [003] ...1. 1651091299.936123: is_bad_inode <-xattr_resolve_name
| bash-4252 [003] ...1. 1651091299.936123: down_write <-do_truncate
| bash-4252 [003] ...1. 1651091299.936123: notify_change <-do_truncate
| bash-4252 [003] ...1. 1651091299.936123: may_setattr <-notify_change
| bash-4252 [003] ...1. 1651091299.936123: current_time <-notify_change
| bash-4252 [003] ...1. 1651091299.936124: ktime_get_coarse_real_ts64 <-current_time
| bash-4252 [003] ...1. 1651091299.936124: security_inode_setattr <-notify_change
| bash-4252 [003] ...1. 1651091299.936124: selinux_inode_setattr <-security_inode_setattr
| bash-4252 [003] ...1. 1651091299.936125: inode_has_perm <-selinux_inode_setattr
|/sys/kernel/debug/tracing # phc_ctl /dev/ptp0 get
|phc_ctl[760.604]: clock time is 1651091302.512977591 or Wed Apr 27 22:28:22 2022
Thanks for debugging it!
[1] - https://lore.kernel.org/all/[email protected]/
Thanks,
Kurt
commit 81c4f2de420cc4ac08efc39e78ffd80e146bfbd7
Author: Kurt Kanzenbach <[email protected]>
Date: Wed Apr 27 21:59:58 2022 +0200
timekeeping: Mark mono fast time accessors as notrace
Mark the CLOCK_MONOTONIC fast time accessors as notrace. These functions are
used in tracing to retrieve timestamps.
Signed-off-by: Kurt Kanzenbach <[email protected]>
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 2c22023fbf5f..dddc37035215 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -482,7 +482,7 @@ static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
* of the following timestamps. Callers need to be aware of that and
* deal with it.
*/
-u64 ktime_get_mono_fast_ns(void)
+u64 notrace ktime_get_mono_fast_ns(void)
{
return __ktime_get_fast_ns(&tk_fast_mono);
}
@@ -494,7 +494,7 @@ EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
* Contrary to ktime_get_mono_fast_ns() this is always correct because the
* conversion factor is not affected by NTP/PTP correction.
*/
-u64 ktime_get_raw_fast_ns(void)
+u64 notrace ktime_get_raw_fast_ns(void)
{
return __ktime_get_fast_ns(&tk_fast_raw);
}
On Wed, 27 Apr 2022 22:06:49 +0200
Thomas Gleixner <[email protected]> wrote:
> Haven't found where ftrace loses the upper bit, but it does so
> somewhere.
https://lore.kernel.org/all/[email protected]/
;-)
-- Steve
On Wed, 27 Apr 2022 13:22:05 -0400
Steven Rostedt <[email protected]> wrote:
> I can see if I can play some games to detect this and replace the top 5
> bits with the saved timestamp at the head of the sub buffer.
This appears to fix the issue. It adds a function that, when reading an
absolute time stamp, compares it to a previous time stamp. If the previous
time stamp has any of the 5 MSBs set, they are ORed into the absolute time
stamp, which is then compared against the previous time stamp to check for
overflow.
It also no longer complains about "big deltas" if the update is an absolute
time stamp and the current time stamp has any of the 5 MSBs set.
I will need to update libtraceevent to handle this case as well.
But it looks like there's nothing on your end that needs to be done. Probably
just set those other functions you found to notrace. But that can be a
separate patch.
-- Steve
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 655d6db3e3c3..3a0c7ed0e93f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -29,6 +29,14 @@
#include <asm/local.h>
+/*
+ * The "absolute" timestamp in the buffer is only 59 bits.
+ * If a clock has the 5 MSBs set, it needs to be saved and
+ * reinserted.
+ */
+#define TS_MSB (0xf8ULL << 56)
+#define ABS_TS_MASK (~TS_MSB)
+
static void update_pages_handler(struct work_struct *work);
/*
@@ -783,6 +791,24 @@ static inline void verify_event(struct ring_buffer_per_cpu *cpu_buffer,
}
#endif
+/*
+ * The absolute time stamp drops the 5 MSBs and some clocks may
+ * require them. The rb_fix_abs_ts() will take a previous full
+ * time stamp, and add the 5 MSB of that time stamp on to the
+ * saved absolute time stamp. Then they are compared in case of
+ * the unlikely event that the latest time stamp incremented
+ * the 5 MSB.
+ */
+static inline u64 rb_fix_abs_ts(u64 abs, u64 save_ts)
+{
+        if (save_ts & TS_MSB) {
+                abs |= save_ts & TS_MSB;
+                /* Check for overflow */
+                if (unlikely(abs < save_ts))
+                        abs += 1ULL << 59;
+        }
+        return abs;
+}
static inline u64 rb_time_stamp(struct trace_buffer *buffer);
@@ -811,8 +837,10 @@ u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer,
u64 ts;
/* If the event includes an absolute time, then just use that */
- if (event->type_len == RINGBUF_TYPE_TIME_STAMP)
- return rb_event_time_stamp(event);
+ if (event->type_len == RINGBUF_TYPE_TIME_STAMP) {
+ ts = rb_event_time_stamp(event);
+ return rb_fix_abs_ts(ts, cpu_buffer->tail_page->page->time_stamp);
+ }
nest = local_read(&cpu_buffer->committing);
verify_event(cpu_buffer, event);
@@ -2754,8 +2782,15 @@ static void rb_add_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
(RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE);
if (unlikely(info->delta > (1ULL << 59))) {
+ /*
+ * Some timers can use more than 59 bits, and when a timestamp
+ * is added to the buffer, it will lose those bits.
+ */
+ if (abs && (info->ts & TS_MSB)) {
+ info->delta &= ABS_TS_MASK;
+
/* did the clock go backwards */
- if (info->before == info->after && info->before > info->ts) {
+ } else if (info->before == info->after && info->before > info->ts) {
/* not interrupted */
static int once;
@@ -3304,7 +3339,7 @@ static void dump_buffer_page(struct buffer_data_page *bpage,
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
- ts = delta;
+ ts = rb_fix_abs_ts(delta, ts);
pr_warn(" [%lld] absolute:%lld TIME STAMP\n", ts, delta);
break;
@@ -3380,7 +3415,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
- ts = delta;
+ ts = rb_fix_abs_ts(delta, ts);
break;
case RINGBUF_TYPE_PADDING:
@@ -4367,6 +4402,7 @@ rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
+ delta = rb_fix_abs_ts(delta, cpu_buffer->read_stamp);
cpu_buffer->read_stamp = delta;
return;
@@ -4397,6 +4433,7 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
+ delta = rb_fix_abs_ts(delta, iter->read_stamp);
iter->read_stamp = delta;
return;
@@ -4650,6 +4687,7 @@ rb_buffer_peek(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts,
case RINGBUF_TYPE_TIME_STAMP:
if (ts) {
*ts = rb_event_time_stamp(event);
+ *ts = rb_fix_abs_ts(*ts, reader->page->time_stamp);
ring_buffer_normalize_time_stamp(cpu_buffer->buffer,
cpu_buffer->cpu, ts);
}
@@ -4741,6 +4779,7 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
case RINGBUF_TYPE_TIME_STAMP:
if (ts) {
*ts = rb_event_time_stamp(event);
+ *ts = rb_fix_abs_ts(*ts, iter->head_page->page->time_stamp);
ring_buffer_normalize_time_stamp(cpu_buffer->buffer,
cpu_buffer->cpu, ts);
}
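To see the restore logic in action, here is a standalone user-space rendering
of rb_fix_abs_ts() fed with the two timestamps from the earlier traces (an
illustrative sketch, not kernel code):

  #include <stdio.h>
  #include <stdint.h>

  #define TS_MSB  (0xf8ULL << 56)

  static uint64_t rb_fix_abs_ts(uint64_t abs, uint64_t save_ts)
  {
          if (save_ts & TS_MSB) {
                  abs |= save_ts & TS_MSB;
                  /* check for overflow into the stripped bits */
                  if (abs < save_ts)
                          abs += 1ULL << 59;
          }
          return abs;
  }

  int main(void)
  {
          uint64_t stored = 498151194674148935ULL;  /* truncated, reads ~Oct 1985 */
          uint64_t prev   = 1651072699280995911ULL; /* full stamp, ~Apr 2022 */

          /* prints 1651072699280995911: the 5 MSBs come back from prev */
          printf("%llu\n", (unsigned long long)rb_fix_abs_ts(stored, prev));
          return 0;
  }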
On Wed, 27 Apr 2022 22:42:57 +0200
Kurt Kanzenbach <[email protected]> wrote:
> Thanks for debugging it!
Unfortunately, I found that 32 bit is still broken. Will have that fixed
soon.
-- Steve
On Wed, Apr 27 2022 at 22:42, Kurt Kanzenbach wrote:
> Anyway, with your patch applied [1] and the one below it looks way
> better:
>
> commit 81c4f2de420cc4ac08efc39e78ffd80e146bfbd7
> Author: Kurt Kanzenbach <[email protected]>
> Date: Wed Apr 27 21:59:58 2022 +0200
>
> timekeeping: Mark mono fast time accessors as notrace
Can you please post that properly?
> Mark the CLOCK_MONOTONIC fast time accessors as notrace. These functions are
> used in tracing to retrieve timestamps.
>
That lacks a Fixes: tag. These accessors should have been notrace from
day one.
Thanks,
tglx
On Wed, Apr 27 2022 at 11:27, Steven Rostedt wrote:
>> > less-5071 [000] d.h2. 498087876.351330: do_raw_spin_trylock <-_raw_spin_lock
>> > less-5071 [000] d.h4. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> > less-5071 [000] d.h5. 498087876.351334: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> > less-5071 [000] d.h3. 498087876.351334: rcu_read_lock_sched_held <-lock_acquired
>> > less-5071 [000] d.h5. 498087876.351337: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> > kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> > kworker/u8:1-45 [003] d.h7. 1651009380.982749: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> > kworker/u8:1-45 [003] d.h5. 1651009380.982749: rcu_read_lock_held_common <-rcu_read_lock_sched_held
>> > kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> > kworker/u8:1-45 [003] d.h7. 498087876.375905: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> > kworker/u8:1-45 [003] d.h5. 498087876.375905: update_cfs_group <-task_tick_fair
>> > kworker/u8:1-45 [003] d.h7. 498087876.375909: ktime_get_mono_fast_ns <-ktime_get_tai_fast_ns
>> >
>> > The clock seems to be toggling between 1651009380 and 498087876, causing the
>> > ftrace ring buffer to shut down (it doesn't allow time to go backwards).
>> >
>> > This is running on a 32 bit x86.
>>
>> Hm. The only problem is that the TAI offset may change. That should only
>> happen if the time is set or similar. For the TSN use case I've run
>> ptp4l and phc2sys. phc2sys in the default configuration sets the offset
>> hard once and uses frequency adjustments from that point on. I didn't
>> observe any time jumps. Nevertheless, my test system is based on
>> arm64. I'll do some testing on x86.
>
> I'm able to trigger this on x86 64bit too.
>
> One thing I noticed is that the two numbers I have (from a different
> trace, but very similar to the above)
>
> $ printf "%llx\n" 498151194674148935
> 6e9c9df4afd3647
>
> $ printf "%llx\n" 1651072699280995911
> 16e9c9df4afd3647
TAI offset after boot is somewhere in this region:
It's 16e9d5dca2e7ac57 here. Converted to ctime:
Wed Apr 27 20:58:02 2022
and that does not change after booting on the VM I'm testing. But then
ftrace suddenly throws this warning:
[ 456.444350] Delta way too big! 1152921504606861182
ts=1651087617720667106 before=498166113113805924
after=498166113113820130 write stamp=498166113113820130
I first set tai as trace clock and then enabled a trace point. It seems
the first hit on this trace point ended up there.
> That is, the top nibble is either 0 or 1, causing the change?
>
> 06e9c9df4afd3647
> 16e9c9df4afd3647
I added debug into ktime_get_tai_fast_ns() to emit a pr_warn() when
either the TAI offset or the result is less than 1 << 60.
That does not trigger, but ftrace still confuses itself....
There is some mumble about timestamps only requiring 59 bits and the
check which triggers the warning has:
if (unlikely(info->delta > (1ULL << 59))) {
So something cuts off bit 60 somewhere and then the whole thing goes
into confusion mode.
When I set wall time to something before 1985, it works.
ctime(0x06e9c9df4afd3647/1e9) = Mon Oct 14 16:19:54 1985
Haven't found where ftrace loses the upper bit, but it does so
somewhere.
Thanks,
tglx