2019-11-01 23:43:00

by Jan Stancek

[permalink] [raw]
Subject: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

fill_ac() calculates process creation time from current time as:
ac->ac_btime = get_seconds() - elapsed

get_seconds() doesn't accumulate nanoseconds as regular time getters.
This creates race for user-space (e.g. LTP acct02), which typically
uses gettimeofday(), because process creation time sometimes appear
to be dated 'in past':

acct("myfile");
time_t start_time = time(NULL);
if (fork()==0) {
sleep(1);
exit(0);
}
waitpid(NULL);
acct(NULL);

// acct.ac_btime == 1572616777
// start_time == 1572616778

Testing: 10 hours of LTP acct02 on s390x with CONFIG_HZ=100,
test failed on unpatched kernel in 15 minutes

Signed-off-by: Jan Stancek <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kate Stewart <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Richard Fontana <[email protected]>
---
kernel/acct.c | 4 +++-
kernel/tsacct.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/acct.c b/kernel/acct.c
index 81f9831a7859..991c898160cd 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -417,6 +417,7 @@ static void fill_ac(acct_t *ac)
struct pacct_struct *pacct = &current->signal->pacct;
u64 elapsed, run_time;
struct tty_struct *tty;
+ struct timespec64 ts;

/*
* Fill the accounting struct with the needed info as recorded
@@ -448,7 +449,8 @@ static void fill_ac(acct_t *ac)
}
#endif
do_div(elapsed, AHZ);
- ac->ac_btime = get_seconds() - elapsed;
+ ktime_get_real_ts64(&ts);
+ ac->ac_btime = ts.tv_sec - elapsed;
#if ACCT_VERSION==2
ac->ac_ahz = AHZ;
#endif
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 7be3e7530841..4d10854255ab 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -24,6 +24,7 @@ void bacct_add_tsk(struct user_namespace *user_ns,
const struct cred *tcred;
u64 utime, stime, utimescaled, stimescaled;
u64 delta;
+ struct timespec64 ts;

BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);

@@ -34,7 +35,8 @@ void bacct_add_tsk(struct user_namespace *user_ns,
stats->ac_etime = delta;
/* Convert to seconds for btime */
do_div(delta, USEC_PER_SEC);
- stats->ac_btime = get_seconds() - delta;
+ ktime_get_real_ts64(&ts);
+ stats->ac_btime = ts.tv_sec - delta;
if (thread_group_leader(tsk)) {
stats->ac_exitcode = tsk->exit_code;
if (tsk->flags & PF_FORKNOEXEC)
--
1.8.3.1


2019-11-07 08:57:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

Jan,

The subsystem prefix for acct is surely not 'kernel.'. Try:

git log --oneline kernel/acct.c

On Sat, 2 Nov 2019, Jan Stancek wrote:
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 81f9831a7859..991c898160cd 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -417,6 +417,7 @@ static void fill_ac(acct_t *ac)
> struct pacct_struct *pacct = &current->signal->pacct;
> u64 elapsed, run_time;
> struct tty_struct *tty;
> + struct timespec64 ts;
>
> /*
> * Fill the accounting struct with the needed info as recorded
> @@ -448,7 +449,8 @@ static void fill_ac(acct_t *ac)
> }
> #endif
> do_div(elapsed, AHZ);
> - ac->ac_btime = get_seconds() - elapsed;
> + ktime_get_real_ts64(&ts);
> + ac->ac_btime = ts.tv_sec - elapsed;

So the calculation goes like this:

runtime = ktime_get_ns() - current->...->start_time;

elapsed = ns_to_ahz(runtime)

elapsed /= AHZ -> runtime in seconds

And then you retrieve the current wall time and just use the seconds
portion for building the delta. That still can fail in corner cases when
the runtime to seconds conversion does not have truncation in the
conversions and the timespec nanoseconds part is close to 1e9.

There is another issue related to suspend. If the system suspends then
runtime is accurate vs. clock MONOTONIC, but the offset between clock
MONOTONIC and clock REALTIME is not longer the same due to the
suspend/resume which has the same issue as what you are trying to solve
because

runtime = totaltime - time_in_suspend

so your ac_btime will be off by time_in_suspend.

Something like this should work:

runtime = ktime_get_ns() - current->...->start_time;
....
runtime_boot = ktime_get_boot_ns() - current->...->real_start_time;
start_ns = ktime_get_real_ns() - runtime_boot;
start_s = startns / NSEC_PER_SEC;

current->...->real_start_time is clearly a misnomer as it suggests
CLOCK_REALTIME at first sight ...

But it would be way simpler just to store the CLOCK_REALTIME start time
along with BOOT and MONOTONIC and just get rid of all these horrible
calculations which are bound to be wrong.

Peter?

Thanks,

tglx




2019-11-07 12:33:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

On Sat, Nov 02, 2019 at 12:39:24AM +0100, Jan Stancek wrote:
> fill_ac() calculates process creation time from current time as:
> ac->ac_btime = get_seconds() - elapsed
>
> get_seconds() doesn't accumulate nanoseconds as regular time getters.
> This creates race for user-space (e.g. LTP acct02), which typically
> uses gettimeofday(), because process creation time sometimes appear
> to be dated 'in past':
>
> acct("myfile");
> time_t start_time = time(NULL);
> if (fork()==0) {
> sleep(1);
> exit(0);
> }
> waitpid(NULL);
> acct(NULL);
>
> // acct.ac_btime == 1572616777
> // start_time == 1572616778
>

Lets start by saying this accounting stuff is terrible crap and it
deserves to fail and burn.

The only spec I found for what it actually wants in those fields is
Documentation/accounting/taskstats-struct.rst, that states:

/* The time when a task begins, in [secs] since 1970. */
__u32 ac_btime; /* Begin time [sec since 1970] */

/* The elapsed time of a task, in [usec]. */
__u64 ac_etime; /* Elapsed time [usec] */

But that is not really well defined. As implemented etime does not
include suspend time (maybe on purpose, maybe not).

And what does btime want? As implemented it jumps around if you ask the
question twice with an adjtime() call or suspend in between. Of course,
if we take an actual CLOCK_REALTIME timestamp at fork() the value
doesn't change, but then it can be in the future (DST,adjtime()), which
is exactly the reason why CLOCK_REALTIME is absolute shit for timestamps
(logging, accounting, etc.).

And your 'fix' is pretty terible too. Arguably ktime_get_seconds() wants
fixing for not having the ns accumulation and actually differing from
tv_sec, but now you accrue one source of ns while still disregarding
another (also, I friggin hate timespec, it's a terrible interface for
time).

All in all, I'm tempted to just declare this stuff broken and -EWONTFIX,
but if we have to do something, something like the below is at least
internally consistent.

---
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 7be3e7530841..76d6325c2724 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -23,18 +23,31 @@ void bacct_add_tsk(struct user_namespace *user_ns,
{
const struct cred *tcred;
u64 utime, stime, utimescaled, stimescaled;
- u64 delta;
+ u64 mono, real, btime;

BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN);

+ mono = ktime_get_ns();
+ real = ktime_get_real_ns();
+
/* calculate task elapsed time in nsec */
- delta = ktime_get_ns() - tsk->start_time;
+ delta = mono - tsk->start_time;
/* Convert to micro seconds */
do_div(delta, NSEC_PER_USEC);
stats->ac_etime = delta;
- /* Convert to seconds for btime */
- do_div(delta, USEC_PER_SEC);
- stats->ac_btime = get_seconds() - delta;
+
+ /*
+ * Compute btime by subtracting the elapsed time from the current
+ * CLOCK_REALTIME.
+ *
+ * XXX totally buggered, because it changes results across
+ * adjtime() calls and suspend/resume.
+ */
+ delta = mono - tsk->start_time; // elapsed in ns
+ btime = real - delta; // real ns - elapsed ns
+ do_div(btime, NSEC_PER_SEC); // truncated to seconds
+ stats->ac_btime = btime;
+
if (thread_group_leader(tsk)) {
stats->ac_exitcode = tsk->exit_code;
if (tsk->flags & PF_FORKNOEXEC)

2019-11-07 12:42:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

On Thu, 7 Nov 2019, Peter Zijlstra wrote:
> Lets start by saying this accounting stuff is terrible crap and it
> deserves to fail and burn.

No argument about that.

> And what does btime want? As implemented it jumps around if you ask the
> question twice with an adjtime() call or suspend in between. Of course,
> if we take an actual CLOCK_REALTIME timestamp at fork() the value
> doesn't change, but then it can be in the future (DST,adjtime()), which
> is exactly the reason why CLOCK_REALTIME is absolute shit for timestamps
> (logging, accounting, etc.).
>
> And your 'fix' is pretty terible too. Arguably ktime_get_seconds() wants
> fixing for not having the ns accumulation and actually differing from
> tv_sec, but now you accrue one source of ns while still disregarding
> another (also, I friggin hate timespec, it's a terrible interface for
> time).
>
> All in all, I'm tempted to just declare this stuff broken and -EWONTFIX,
> but if we have to do something, something like the below is at least
> internally consistent.

Kinda :)

> + mono = ktime_get_ns();
> + real = ktime_get_real_ns();
> + /*
> + * Compute btime by subtracting the elapsed time from the current
> + * CLOCK_REALTIME.
> + *
> + * XXX totally buggered, because it changes results across
> + * adjtime() calls and suspend/resume.
> + */
> + delta = mono - tsk->start_time; // elapsed in ns
> + btime = real - delta; // real ns - elapsed ns
> + do_div(btime, NSEC_PER_SEC); // truncated to seconds
> + stats->ac_btime = btime;

That has pretty much the same problem as just storing the CLOCK_REALTIME
start time at fork and additionally it is wreckaged vs. suspend resume.

So a CLOCK_REALTIME time stamp at fork would at least be correct
vs. suspend resume.

The same result is achieved by:

boot = ktime_get_boot_ns();
delta = boot = tsk->real_start_time;

Typing real_start_time makes me really cringe.

Thanks,

tglx

2019-11-07 12:50:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

On Thu, Nov 07, 2019 at 01:40:47PM +0100, Thomas Gleixner wrote:
> Typing real_start_time makes me really cringe.

I have a patch fixing that...

---
Subject: kernel: Rename tsk->real_start_time
From: Peter Zijlstra <[email protected]>
Date: Thu Nov 7 11:07:58 CET 2019

Since it stores CLOCK_BOOTTIME, not, as the name suggests,
CLOCK_REALTIME, let's rename real_start_time.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1132,7 +1132,7 @@ static int de_thread(struct task_struct
* also take its birthdate (always earlier than our own).
*/
tsk->start_time = leader->start_time;
- tsk->real_start_time = leader->real_start_time;
+ tsk->start_boottime = leader->start_boottime;

BUG_ON(!same_thread_group(leader, tsk));
BUG_ON(has_group_leader_pid(tsk));
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -533,7 +533,7 @@ static int do_task_stat(struct seq_file
nice = task_nice(task);

/* convert nsec -> ticks */
- start_time = nsec_to_clock_t(task->real_start_time);
+ start_time = nsec_to_clock_t(task->start_boottime);

seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
seq_puts(m, " (");
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,7 +879,7 @@ struct task_struct {
u64 start_time;

/* Boot based time in nsecs: */
- u64 real_start_time;
+ u64 start_boottime;

/* MM fault and swap info: this can arguably be seen as either mm-specific or thread-specific: */
unsigned long min_flt;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2130,7 +2130,7 @@ static __latent_entropy struct task_stru
*/

p->start_time = ktime_get_ns();
- p->real_start_time = ktime_get_boottime_ns();
+ p->start_boottime = ktime_get_boottime_ns();

/*
* Make it visible to the rest of the system, but dont wake it up yet.

2019-11-07 12:51:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

On Thu, Nov 07, 2019 at 09:56:12AM +0100, Thomas Gleixner wrote:
> But it would be way simpler just to store the CLOCK_REALTIME start time
> along with BOOT and MONOTONIC and just get rid of all these horrible
> calculations which are bound to be wrong.
>
> Peter?

So I'd really hate to do that, as that gives the impression REALTIME is
a sane clock to use. As I argued in that other email, REALTIME is
horrible crap.

2019-11-07 12:57:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

On Thu, Nov 07, 2019 at 01:40:47PM +0100, Thomas Gleixner wrote:
> On Thu, 7 Nov 2019, Peter Zijlstra wrote:

> > + mono = ktime_get_ns();
> > + real = ktime_get_real_ns();
> > + /*
> > + * Compute btime by subtracting the elapsed time from the current
> > + * CLOCK_REALTIME.
> > + *
> > + * XXX totally buggered, because it changes results across
> > + * adjtime() calls and suspend/resume.
> > + */
> > + delta = mono - tsk->start_time; // elapsed in ns
> > + btime = real - delta; // real ns - elapsed ns
> > + do_div(btime, NSEC_PER_SEC); // truncated to seconds
> > + stats->ac_btime = btime;
>
> That has pretty much the same problem as just storing the CLOCK_REALTIME
> start time at fork and additionally it is wreckaged vs. suspend resume.

It's wrecked in general. It also jumps around for any REALTIME
adjustment.

> So a CLOCK_REALTIME time stamp at fork would at least be correct
> vs. suspend resume.

But still wrecked vs REALTIME jumps, as in, when DST flips the clock
back an hour, your timestamp is in the future.

Any which way around the whole thing is buggered. The only real fix is
not using REALTIME anything. Which is why I'm loath to add that REALTIME
timestamp at fork(), it just encourages more use.

2019-11-07 14:38:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime

On Thu, 7 Nov 2019, Peter Zijlstra wrote:
> On Thu, Nov 07, 2019 at 01:40:47PM +0100, Thomas Gleixner wrote:
> > On Thu, 7 Nov 2019, Peter Zijlstra wrote:
>
> > > + mono = ktime_get_ns();
> > > + real = ktime_get_real_ns();
> > > + /*
> > > + * Compute btime by subtracting the elapsed time from the current
> > > + * CLOCK_REALTIME.
> > > + *
> > > + * XXX totally buggered, because it changes results across
> > > + * adjtime() calls and suspend/resume.
> > > + */
> > > + delta = mono - tsk->start_time; // elapsed in ns
> > > + btime = real - delta; // real ns - elapsed ns
> > > + do_div(btime, NSEC_PER_SEC); // truncated to seconds
> > > + stats->ac_btime = btime;
> >
> > That has pretty much the same problem as just storing the CLOCK_REALTIME
> > start time at fork and additionally it is wreckaged vs. suspend resume.
>
> It's wrecked in general. It also jumps around for any REALTIME
> adjustment.
>
> > So a CLOCK_REALTIME time stamp at fork would at least be correct
> > vs. suspend resume.
>
> But still wrecked vs REALTIME jumps, as in, when DST flips the clock
> back an hour, your timestamp is in the future.
>
> Any which way around the whole thing is buggered. The only real fix is
> not using REALTIME anything. Which is why I'm loath to add that REALTIME
> timestamp at fork(), it just encourages more use.

Fair enough. You have a sane alternative though: CLOCK_TAI

Thanks,

tglx


2019-11-11 09:44:48

by Jan Stancek

[permalink] [raw]
Subject: Re: [PATCH] kernel: use ktime_get_real_ts64() to calculate acct.ac_btime


----- Original Message -----
> It's wrecked in general. It also jumps around for any REALTIME
> adjustment.
>
> > So a CLOCK_REALTIME time stamp at fork would at least be correct
> > vs. suspend resume.
>
> But still wrecked vs REALTIME jumps, as in, when DST flips the clock
> back an hour, your timestamp is in the future.
>
> Any which way around the whole thing is buggered. The only real fix is
> not using REALTIME anything. Which is why I'm loath to add that REALTIME
> timestamp at fork(), it just encourages more use.

Thank you for feedback and listing all other problems.
I'll adjust test expectations.

Regards,
Jan