BSD accounting currently reports only the lower 16 bits of 32 bit uids,
reports times in units of 1/HZ seconds while it announces a unit of
1/100 seconds in the header file (which hurts especially on i386 where
HZ changed since 2.4) and cannot report times longer that 49 days.
We might want to fix that, but since we are in a stable kernel series
we need to insert some glue for binary compatibility.
This, of course, is nicer if we have a plot in place to zap such crap in
the long term.
As only Arthur Corliss commented on my previous post (BSD accounting
dosn't seem that fancy these days...), I made up such a plot myself:
- store the upper 16 bits of high uid/gids in the (currently zeroed)
padding bytes of struct acct.
- comp_t used for time values is actually able to hold up to 34 bit
values, so use that to allow times up to 194 days (with HZ=1024) to
be reported. The GNU acct tools are able to deal with the extra bits.
- units of time are the hard part. As userspace programs have no way
to determine which kernel version wrote the accounting file, and
as the units of time have changed between 2.4 and 2.6, we have already
messed up. It gets worse since we announce the constant (and thus
useless) value of AHZ=100 in <linux/acct.h> for all archs since ages.
My proposed solution: keep backwards compatibility with 2.4 by using
a unit of 1/USER_HZ, announce that correctly in the header file,
and simply admit that BSD accounting on i386 is broken in 2.6.0-2.6.4.
- store a version number in the last byte of struct acct, which allows
for a smooth transition to a new binary format when 2.7 comes out.
For 2.7, extend uid/gid fields to 32 bit, report times in terms
of AHZ=100 on all platforms (thus allowing to report times up to 1988
days), and remove the compatibility stuff from the kernel.
- introduce some compatibility glue into the userland tools, which can be
removed when 2.6 is history.
This would leave the user with two choices:
- Don't mess with userland tools. Binary compatibility holds from
2.0 to 2.6, but breaks at the 2.6/2.7 border. For 2.7., userspace
needs to be recompiled, but yields support for 32 bit uid/gids
in return.
- Install an updated userland accounting package. Immediately yields
32 bit uid/gid support on kernels 2.6.5 and up.
User may switch forth and back between kernels from 2.0 up to 2.7+,
even within the same accounting file, without breaking compatibility.
Patch for 2.6 kernel is below. Proposed cleanup patch for 2.7 is at
http://www.physik3.uni-rostock.de/tim/kernel/2.7/acct-cleanup-01.patch
Compatibility glue for GNU acct package is still to be done.
Comments?
Any other suggestions for incompatible changes to struct acct in 2.7?
Tim
--- linux-2.6.4-rc1/include/linux/acct.h 2004-02-04 04:43:17.000000000 +0100
+++ linux-2.6.4-rc1-acct/include/linux/acct.h 2004-03-08 22:19:21.000000000 +0100
@@ -16,6 +16,8 @@
#define _LINUX_ACCT_H
#include <linux/types.h>
+#include <linux/version.h>
+#include <asm/param.h>
/*
* comp_t is a 16-bit "floating" point number with a 3-bit base 8
@@ -34,12 +36,18 @@ typedef __u16 comp_t;
#define ACCT_COMM 16
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,7,0)
+
+#define ACCT_VERSION 1
+#define AHZ (USER_HZ)
+
+
struct acct
{
char ac_flag; /* Accounting Flags */
/*
- * No binary format break with 2.0 - but when we hit 32bit uid we'll
- * have to bite one
+ * No binary format break with 2.0 - but 32 bit uid/gid support is
+ * a kludge.
*/
__u16 ac_uid; /* Accounting Real User ID */
__u16 ac_gid; /* Accounting Real Group ID */
@@ -56,9 +64,46 @@ struct acct
comp_t ac_swaps; /* Accounting Number of Swaps */
__u32 ac_exitcode; /* Accounting Exitcode */
char ac_comm[ACCT_COMM + 1]; /* Accounting Command Name */
- char ac_pad[10]; /* Accounting Padding Bytes */
+ __u16 ac_uid_hi; /* Accounting Real User ID */
+ __u16 ac_gid_hi; /* Accounting Real Group ID */
+ char ac_pad[5]; /* Accounting Padding Bytes */
+ char ac_version; /* Always set to ACCT_VERSION */
+};
+
+#else
+
+#define ACCT_VERSION 2
+#define AHZ 100
+
+/*
+ * New binary format with 32 bit uid/gid.
+ * Scheduled for 2.7.0
+ */
+
+struct acct
+{
+ __u32 ac_uid; /* Accounting Real User ID */
+ __u32 ac_gid; /* Accounting Real Group ID */
+ __u32 ac_exitcode; /* Accounting Exitcode */
+ __u32 ac_btime; /* Accounting Process Creation Time */
+ comp_t ac_utime; /* Accounting User Time */
+ comp_t ac_stime; /* Accounting System Time */
+ comp_t ac_etime; /* Accounting Elapsed Time */
+ comp_t ac_mem; /* Accounting Average Memory Usage */
+ comp_t ac_io; /* Accounting Chars Transferred */
+ comp_t ac_rw; /* Accounting Blocks Read or Written */
+ comp_t ac_minflt; /* Accounting Minor Pagefaults */
+ comp_t ac_majflt; /* Accounting Major Pagefaults */
+ comp_t ac_swaps; /* Accounting Number of Swaps */
+ __u16 ac_tty; /* Accounting Control Terminal */
+ char ac_pad[9]; /* Accounting Padding Bytes */
+ char ac_flag; /* Accounting Flags */
+ char ac_comm[ACCT_COMM + 1]; /* Accounting Command Name */
+ char ac_version; /* Always set to ACCT_VERSION */
};
+#endif
+
/*
* accounting flags
*/
@@ -69,8 +114,6 @@ struct acct
#define ACORE 0x08 /* ... dumped core */
#define AXSIG 0x10 /* ... was killed by a signal */
-#define AHZ 100
-
#ifdef __KERNEL__
#include <linux/config.h>
@@ -84,6 +127,28 @@ extern void acct_process(long exitcode);
#define acct_process(x) do { } while (0)
#endif
+/*
+ * Yet another HZ to *HZ helper function.
+ * See <linux/timer.h> for the original.
+ */
+
+#if (HZ % AHZ)==0
+# define jiffies_to_AHZ(x) ((x) / (HZ / AHZ))
+#else
+# define jiffies_to_AHZ(x) ((clock_t) jiffies_64_to_AHZ((u64) x))
+#endif
+
+static inline u64 jiffies_64_to_AHZ(u64 x)
+{
+#if (HZ % AHZ)==0
+ do_div(x, HZ / AHZ);
+#else
+ x *= AHZ;
+ do_div(x, HZ);
+#endif
+ return x;
+}
+
#endif /* __KERNEL */
#endif /* _LINUX_ACCT_H */
--- linux-2.6.4-rc1/kernel/acct.c 2004-02-04 04:43:56.000000000 +0100
+++ linux-2.6.4-rc1-acct/kernel/acct.c 2004-03-08 22:31:26.000000000 +0100
@@ -265,13 +265,16 @@ void acct_auto_close(struct super_block
* This routine has been adopted from the encode_comp_t() function in
* the kern_acct.c file of the FreeBSD operating system. The encoding
* is a 13-bit fraction with a 3-bit (base 8) exponent.
+ *
+ * Bumped up to encode 64 bit values. Unfortunately the result may
+ * overflow now.
*/
#define MANTSIZE 13 /* 13 bit mantissa. */
#define EXPSIZE 3 /* Base 8 (3 bit) exponent. */
#define MAXFRACT ((1 << MANTSIZE) - 1) /* Maximum fractional value. */
-static comp_t encode_comp_t(unsigned long value)
+static comp_t encode_comp_t(u64 value)
{
int exp, rnd;
@@ -293,9 +296,14 @@ static comp_t encode_comp_t(unsigned lon
/*
* Clean it up and polish it off.
*/
- exp <<= MANTSIZE; /* Shift the exponent into place */
- exp += value; /* and add on the mantissa. */
- return exp;
+ if (exp >= (1 << EXPSIZE)) {
+ /* Overflow. Return largest representable number instead. */
+ return (1ul << (MANTSIZE + EXPSIZE)) - 1;
+ } else {
+ exp <<= MANTSIZE; /* Shift the exponent into place */
+ exp += value; /* and add on the mantissa. */
+ return exp;
+ }
}
/*
@@ -334,15 +342,21 @@ static void do_acct_process(long exitcod
strlcpy(ac.ac_comm, current->comm, sizeof(ac.ac_comm));
elapsed = get_jiffies_64() - current->start_time;
- ac.ac_etime = encode_comp_t(elapsed < (unsigned long) -1l ?
- (unsigned long) elapsed : (unsigned long) -1l);
+ ac.ac_etime = encode_comp_t(jiffies_64_to_AHZ(elapsed));
do_div(elapsed, HZ);
ac.ac_btime = xtime.tv_sec - elapsed;
ac.ac_utime = encode_comp_t(current->utime);
ac.ac_stime = encode_comp_t(current->stime);
- /* we really need to bite the bullet and change layout */
ac.ac_uid = current->uid;
ac.ac_gid = current->gid;
+#if ACCT_VERSION==1
+ /*
+ * A change in layout to 32 bit uid/gid fields is scheduled for 2.7.
+ * But let's keep it compatible for now.
+ */
+ ac.ac_uid_hi = current->uid >> 16;
+ ac.ac_gid_hi = current->gid >> 16;
+#endif
ac.ac_tty = current->tty ? old_encode_dev(tty_devnum(current->tty)) : 0;
ac.ac_flag = 0;
@@ -374,6 +388,7 @@ static void do_acct_process(long exitcod
ac.ac_majflt = encode_comp_t(current->maj_flt);
ac.ac_swaps = encode_comp_t(current->nswap);
ac.ac_exitcode = exitcode;
+ ac.ac_version = ACCT_VERSION;
/*
* Kernel segment override to datasegment and write it
> Patch for 2.6 kernel is below.
Whoops, missed the incremental hunk below. Corrected full patch is at
http://www.physik3.uni-rostock.de/tim/kernel/2.6/acct-06.patch
Tim
--- linux-2.6.4-rc1-acct/kernel/acct.c 2004-03-08 22:31:26.000000000 +0100
+++ linux-2.6.4-rc1-acct1/kernel/acct.c 2004-03-09 00:25:50.000000000 +0100
@@ -345,8 +345,8 @@ static void do_acct_process(long exitcod
ac.ac_etime = encode_comp_t(jiffies_64_to_AHZ(elapsed));
do_div(elapsed, HZ);
ac.ac_btime = xtime.tv_sec - elapsed;
- ac.ac_utime = encode_comp_t(current->utime);
- ac.ac_stime = encode_comp_t(current->stime);
+ ac.ac_utime = encode_comp_t(jiffies_to_AHZ(current->utime));
+ ac.ac_stime = encode_comp_t(jiffies_to_AHZ(current->stime));
ac.ac_uid = current->uid;
ac.ac_gid = current->gid;
#if ACCT_VERSION==1
On Tue, 9 Mar 2004, Tim Schmielau wrote:
> Comments?
The glue for the GNU accounting tools are already on the way, should this be
the patch accepted for 2.6. If you would, please provide a similar patch
against 2.4 (or I'll do it, which ever is more convenient). Great job on your
patch, BTW, I should have snarfed those padding bytes myself, but I wasn't
giving as much thought as you did to backwards compatibility.
> Any other suggestions for incompatible changes to struct acct in 2.7?
Taking a note from the BSD camp, instead of fixing the field size in acct.h
why not cast ac_uid/ac_gid as uid_t/gid_t and let the kernel determine the
size to log in 2.7 onwards? Yes, this will break binary compatibility for
those upgrading older systems with a new stable series kernel. However, this
would obviate the need in the future to provide a migration path. Outside of
a recompile it would be transparent to the userland tools.
For fields working within a well-defined range I don't think we should be
second-guessing the kernel, and just use what the kernel dictates. Not doing
so means we are acknowledging that we don't care about logging incorrect data.
Being the whiny little bitch that I am I'm going to repeat my mantra: we
should not be in the business of logging bad data, regardless of how rare we
think we might actually do it.
Besides which, with your great idea of putting a version field in the struct
we can now put some intelligence into the GNU utils to read the different
versions of the records in the same file.
--Arthur Corliss
Bolverk's Lair -- http://arthur.corlissfamily.org/
Digital Mages -- http://www.digitalmages.com/
"Live Free or Die, the Only Way to Live" -- NH State Motto
[ CCing Sam Watters and Linda Walsh at SGI for feedback about using
pid/ppid/pgid/sid/paggid/whatever to track jobs in accounting-logs ]
On Tue, Mar 09, 2004 at 12:03:27AM +0100, Tim Schmielau wrote:
> BSD accounting currently reports only the lower 16 bits of 32 bit uids,
> reports times in units of 1/HZ seconds while it announces a unit of
> 1/100 seconds in the header file (which hurts especially on i386 where
> HZ changed since 2.4) and cannot report times longer that 49 days.
Red Hat uses a patch that adds 32 bit uids and 32 bit gids at the end of
struct acct, reducing the size of the padding to 2 bytes. If a simular
extension is added to the official kernel it would be nice if they were
compatible.
There is also a few other problems with the current implementation that
we would like to see fixed if there is going to be a format-change
anyway:
- Architecture indepencence. It would be nice if the binary format was
independent of the architecture so files can be processed on other
hosts. This would involve defining padding and byte-ordering, as well
as either stop using 1/HZ units in the file or add AHZ to the record.
- Better accuracy. Specificly the records don't include an exact
termination-time for the process. One can estimate termination-time
based on creation time and elapsed time but since elapsed time is
truncated into comp_t it is not accurate for long-lived processes.
- More information. Accounting is typically used to keep track of
what resources different jobs have used, where the definition of
"job" may differ from site to site. Quite often "job" is a group
of related processes, so including pid/ppid would allow
post-processing to rebuild the process-hierarchy and deduce what
processes were part of the same job. Even better, if usespace
could notify the kernel what job a process belongs to, it would be
easier to track jobs across multiple nodes in a cluster and so on.
Maybe adding pid, ppid and jobid to the format and then use the sid
as the jobid for now? And then it can be replaced by a proper jobid
from e.g. PAGG later?
> My proposed solution: keep backwards compatibility with 2.4 by using
> a unit of 1/USER_HZ, announce that correctly in the header file,
> and simply admit that BSD accounting on i386 is broken in 2.6.0-2.6.4.
Sounds good.
> - store a version number in the last byte of struct acct, which allows
> for a smooth transition to a new binary format when 2.7 comes out.
> For 2.7, extend uid/gid fields to 32 bit, report times in terms
> of AHZ=100 on all platforms (thus allowing to report times up to 1988
> days), and remove the compatibility stuff from the kernel.
It may be useful to have the version-information in the beginning of the
struct to allow future versions to use a struct of a different size.
There should be some bits available in the flag-field, maybe that's
enough?
--
Ragnar Kj?rstad
Software Engineer
Scali - http://www.scali.com
High Performance Clustering
> store a version number in the last byte of struct acct, which allows
> for a smooth transition to a new binary format when 2.7 comes out.
> For 2.7, extend uid/gid fields to 32 bit, report times in terms
> of AHZ=100 on all platforms (thus allowing to report times up to 1988
> days), and remove the compatibility stuff from the kernel.
This is no good. The struct size varies. It is 64 bytes on
most architectures. It is likely to be larger on hppa, and
smaller on m68k or arm.
The second byte is available, as padding on every arch.
Use that instead.