2005-03-25 10:05:13

by Guillaume Thouvenin

Subject: [patch 1/2] fork_connector: add a fork connector

This patch adds a fork connector to the do_fork() routine. When
enabled, it sends a netlink datagram that can be read by a user space
application; in this way, the user space application is alerted when a
fork occurs.

It uses the userspace <-> kernelspace connector that works on top of
the netlink protocol. The fork connector is enabled or disabled by
sending a message to the connector. This operation should be done by
only one application. Such an application can be downloaded from
http://cvs.sourceforge.net/viewcvs.py/elsa/elsa_project/utils/fcctl.c

Each message carries a unique sequence number that can be used to
detect lost messages. The sequence number is maintained per CPU.

The lmbench shows that the overhead (the construction and the sending
of the message) in the fork() routine is around 7%.

This patch applies to 2.6.12-rc1-mm3. Some other patches that fix
problems in the connector.c file are needed; at a minimum, you need to
apply the patch provided in the second email.

Signed-off-by: Guillaume Thouvenin <[email protected]>
---

drivers/connector/Kconfig | 11 ++++
drivers/connector/Makefile | 1
drivers/connector/cn_fork.c | 104 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/connector.h | 8 +++
kernel/fork.c | 44 ++++++++++++++++++
5 files changed, 168 insertions(+)


Index: linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Kconfig
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/drivers/connector/Kconfig 2005-03-25 09:47:09.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Kconfig 2005-03-25 10:14:21.000000000 +0100
@@ -10,4 +10,15 @@ config CONNECTOR
Connector support can also be built as a module. If so, the module
will be called cn.ko.

+config FORK_CONNECTOR
+ bool "Enable fork connector"
+ depends on CONNECTOR=y
+ default y
+ ---help---
+ It adds a connector to the kernel/fork.c:do_fork() function. When a fork
+ occurs, netlink is used to transfer information about the parent and
+ its child. This information can be used by a user space application.
+ The fork connector can be enabled/disabled by sending a message to the
+ connector with the corresponding group id.
+
endmenu
Index: linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Makefile
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/drivers/connector/Makefile 2005-03-25 09:47:09.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/drivers/connector/Makefile 2005-03-25 10:14:21.000000000 +0100
@@ -1,2 +1,3 @@
obj-$(CONFIG_CONNECTOR) += cn.o
+obj-$(CONFIG_FORK_CONNECTOR) += cn_fork.o
cn-objs := cn_queue.o connector.o
Index: linux-2.6.12-rc1-mm3-cnfork/drivers/connector/cn_fork.c
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/drivers/connector/cn_fork.c 2003-01-30 11:24:37.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/drivers/connector/cn_fork.c 2005-03-25 10:14:21.000000000 +0100
@@ -0,0 +1,104 @@
+/*
+ * cn_fork.c
+ *
+ * 2005 Copyright (c) Guillaume Thouvenin <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+
+#include <linux/connector.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Guillaume Thouvenin <[email protected]>");
+MODULE_DESCRIPTION("Enable or disable the usage of the fork connector");
+
+int cn_fork_enable = 0;
+struct cb_id cb_fork_id = { CN_IDX_FORK, CN_VAL_FORK };
+
+static inline void cn_fork_send_status(void)
+{
+ /* TODO */
+ printk(KERN_INFO "cn_fork_enable == %d\n", cn_fork_enable);
+}
+
+/**
+ * cn_fork_callback - enable or disable the fork connector
+ * @data: message sent by the connector
+ *
+ * The callback allows the sending of fork information in the do_fork()
+ * routine to be enabled or disabled. To enable the fork connector, the
+ * user space application must send the integer 1 in the data part of
+ * the message. To disable it, it must send the integer 0.
+ */
+static void cn_fork_callback(void *data)
+{
+ struct cn_msg *msg = (struct cn_msg *)data;
+ int action;
+
+ if (cn_already_initialized && (msg->len == sizeof(cn_fork_enable))) {
+ memcpy(&action, msg->data, sizeof(cn_fork_enable));
+ switch(action) {
+ case FORK_CN_START:
+ cn_fork_enable = 1;
+ break;
+ case FORK_CN_STOP:
+ cn_fork_enable = 0;
+ break;
+ case FORK_CN_STATUS:
+ cn_fork_send_status();
+ break;
+ }
+ }
+}
+
+/**
+ * cn_fork_init - initialization entry point
+ *
+ * This routine will be run at kernel boot time because this driver is
+ * built into the kernel. It adds the connector callback to the connector
+ * driver.
+ */
+static int cn_fork_init(void)
+{
+ int err;
+
+ err = cn_add_callback(&cb_fork_id, "cn_fork", &cn_fork_callback);
+ if (err) {
+ printk(KERN_WARNING "Failed to register cn_fork\n");
+ return -EINVAL;
+ }
+
+ printk(KERN_NOTICE "cn_fork is registered\n");
+ return 0;
+}
+
+/**
+ * cn_fork_exit - exit entry point
+ *
+ * As this driver is always statically compiled into the kernel,
+ * cn_fork_exit() has no effect.
+ */
+static void cn_fork_exit(void)
+{
+ cn_del_callback(&cb_fork_id);
+}
+
+module_init(cn_fork_init);
+module_exit(cn_fork_exit);
Index: linux-2.6.12-rc1-mm3-cnfork/include/linux/connector.h
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/include/linux/connector.h 2005-03-25 09:47:11.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/include/linux/connector.h 2005-03-25 10:14:21.000000000 +0100
@@ -28,10 +28,16 @@
#define CN_VAL_KOBJECT_UEVENT 0x0000
#define CN_IDX_SUPERIO 0xaabb /* SuperIO subsystem */
#define CN_VAL_SUPERIO 0xccdd
+#define CN_IDX_FORK 0xfeed /* fork events */
+#define CN_VAL_FORK 0xbeef


#define CONNECTOR_MAX_MSG_SIZE 1024

+#define FORK_CN_STOP 0
+#define FORK_CN_START 1
+#define FORK_CN_STATUS 2
+
struct cb_id
{
__u32 idx;
@@ -133,6 +139,8 @@ struct cn_dev
};

extern int cn_already_initialized;
+extern int cn_fork_enable;
+extern struct cb_id cb_fork_id;

int cn_add_callback(struct cb_id *, char *, void (* callback)(void *));
void cn_del_callback(struct cb_id *);
Index: linux-2.6.12-rc1-mm3-cnfork/kernel/fork.c
===================================================================
--- linux-2.6.12-rc1-mm3-cnfork.orig/kernel/fork.c 2005-03-25 09:47:11.000000000 +0100
+++ linux-2.6.12-rc1-mm3-cnfork/kernel/fork.c 2005-03-25 10:14:21.000000000 +0100
@@ -41,6 +41,7 @@
#include <linux/profile.h>
#include <linux/rmap.h>
#include <linux/acct.h>
+#include <linux/connector.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -63,6 +64,47 @@ DEFINE_PER_CPU(unsigned long, process_co

EXPORT_SYMBOL(tasklist_lock);

+#ifdef CONFIG_FORK_CONNECTOR
+
+#define CN_FORK_INFO_SIZE 64
+#define CN_FORK_MSG_SIZE (sizeof(struct cn_msg) + CN_FORK_INFO_SIZE)
+
+static DEFINE_PER_CPU(unsigned long, fork_counts);
+
+static inline void fork_connector(pid_t parent, pid_t child)
+{
+ if (cn_fork_enable) {
+ struct cn_msg *msg;
+ __u8 buffer[CN_FORK_MSG_SIZE];
+
+ msg = (struct cn_msg *)buffer;
+
+ memcpy(&msg->id, &cb_fork_id, sizeof(msg->id));
+
+ msg->ack = 0; /* not used */
+ msg->seq = get_cpu_var(fork_counts)++;
+
+ /*
+ * size of data is the number of characters
+ * printed plus one for the trailing '\0'
+ */
+ memset(msg->data, '\0', CN_FORK_INFO_SIZE);
+ msg->len = scnprintf(msg->data, CN_FORK_INFO_SIZE-1,
+ "%i %i %i",
+ smp_processor_id(), parent, child) + 1;
+
+ put_cpu_var(fork_counts);
+
+ cn_netlink_send(msg, CN_IDX_FORK);
+ }
+}
+#else
+static inline void fork_connector(pid_t parent, pid_t child)
+{
+ return;
+}
+#endif /* CONFIG_FORK_CONNECTOR */
+
int nr_processes(void)
{
int cpu;
@@ -1253,6 +1295,8 @@ long do_fork(unsigned long clone_flags,
if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE))
ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
}
+
+ fork_connector(current->pid, p->pid);
} else {
free_pidmap(pid);
pid = PTR_ERR(p);



2005-03-25 22:48:51

by dean gaudet

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Fri, 25 Mar 2005, Guillaume Thouvenin wrote:

...
> The lmbench shows that the overhead (the construction and the sending
> of the message) in the fork() routine is around 7%.
...
> + /*
> + * size of data is the number of characters
> + * printed plus one for the trailing '\0'
> + */
> + memset(msg->data, '\0', CN_FORK_INFO_SIZE);
> + msg->len = scnprintf(msg->data, CN_FORK_INFO_SIZE-1,
> + "%i %i %i",
> + smp_processor_id(), parent, child) + 1;

i'm certain that if you used a struct {} and filled in 3 fields, rather
than zeroing 64 bytes of memory and doing 3 conversions to decimal ascii,
then you'd see a marked decrease in the overhead of this. it's not clear
to me why you need ascii here -- the rest of the existing bsd accounting
code is not ascii (i'm assuming the purpose of the fork connector is for
accounting).

-dean

2005-03-28 21:45:34

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Guillaume wrote:
> The lmbench shows that the overhead (the construction and the sending
> of the message) in the fork() routine is around 7%.

Thanks for including the numbers. The 7% seems a bit costly, for a bit
more accounting information. Perhaps dean's suggestion, to not use
ascii, will help. I hope so, though I doubt it will make a huge
difference. Was this 7% loss with or without a user level program
consuming the sent messages? I would think that the number of interest
would include a minimal consumer task.

I don't see a good reason to make fork_connector() inline. Since it
calls other subroutines and is not just a few lines, perhaps better to
make it a real routine, so we can see it in "nm --print-size" output and
debug stacks.

Having the "#ifdef CONFIG_FORK_CONNECTOR" chunk of code right in fork.c
seems unfortunate. Can the real fork_connector() be put elsewhere, and
the ifdef put in a header file that makes it a no-op if not configured,
or simply a function declaration, if configured?

What's the status of the connector driver patch? I perhaps wasn't
paying close enough attention, but all I see of it now is a couple of
patches sent to lkml, from Evgeniy Polyakov, in September and January.
I don't see it in my copies of *-mm or recent Linus bk trees. Am I
missing something?

This still seems to me like more apparatus than is desirable, just to
get another form of session id, as best as I can figure it. However
we've already been there, and apparently my concerns were not
persuasive. If one does go down this path, then using this connector
patch is as good an alternative as any I know of. Well, that or relayfs.
My uneducated assumption is that relayfs might at least batch data
packets up into big buffer chunks better, but someone more knowledgeable
than me needs to consider that.

It's a little sad, when almost all the required accounting information
comes out in packed 64 byte records, carefully buffered and sent in
big chunks, to minimize per-task costs. Then this one extra detail,
of <parent-pid, child-pid> requires an entire netlink packet of
its own of what size -- another 50 or 100 bytes? Is this packet
received as a separate data packet, on its own recv(2) system call,
by the user task, not in a big block of packets? The efficiency
of getting this one extra <parent-pid, child-pid> out of the kernel
seems to be one or two orders of magnitude worse than the rest of
the accounting data.

===

Hmmm ... perhaps one could add a _second_ accounting file, cutting and
pasting code in kernel/acct.c and enabling writing additional
information to that second file, using the same mechanisms as now used
for the primary file. Use a more extensible record format for the
second file (say start each record with a magic cookie, a byte record
type and a byte record length, then that many bytes). That way, we have
an escape valve for adding additional record types in the future.
And that way we can efficiently write short records, with just say
a couple of interesting values, and minimal overhead.

Don't worry if the magic cookie appears as part of the raw data. If one
has to resync such a data stream, one can look for a series of records,
each starting with the magic cookie, sensible record type byte, and a
length that ends right at the next such valid record. The occasional
duplication of the same cookie within the data stream would not thwart a
resync for long. And the main purpose of the magic cookie is to make
sure you are still in sync, not reverting to garbage-in, garbage-out,
mode. Almost any magic value other than 0x0000 will suffice for that
purpose.

I just ran a silly little test on my PC desktop Linux box, scanning
/proc/kcore. The _least_ common 2 byte word seen was 0x2B91, with 31
instances in a half-billion words scanned, so I nominate that value for
the magic cookie ;).

The key reason that it might make sense here to adapt the existing
accounting file direct write mechanism, rather than using "connector" or
"relayfs", is that we really do want to get this data to disk initially.
Relayfs is optimized for getting a lot of data to a user daemon, and the
connector for sending smaller packets of data to a user daemon. But
accounting processing is sometimes done out of a cron job off-hours.
During the day (the busy hours) you might just want to stash the stuff
with as little performance impact as possible. If one can avoid _any_
other task having to context switch in, in order to get this data on its
way, that is a huge win.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-29 07:09:04

by Evgeniy Polyakov

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Mon, 2005-03-28 at 13:42 -0800, Paul Jackson wrote:
> Guillaume wrote:
> > The lmbench shows that the overhead (the construction and the sending
> > of the message) in the fork() routine is around 7%.
>
> Thanks for including the numbers. The 7% seems a bit costly, for a bit
> more accounting information. Perhaps dean's suggestion, to not use
> ascii, will help. I hope so, though I doubt it will make a huge
> difference. Was this 7% loss with or without a user level program
> consuming the sent messages? I would think that the number of interest
> would include a minimal consumer task.

There is no overhead at all using CBUS.
On my old P2/256mb SMP machine it took about 950 usec to create+exit
a process, both with the fork connector turned on and without it even
compiled in.
A direct connector method call took about 1000-1100 usec.
The current fork connector does not use CBUS [yet, I hope].

> I don't see a good reason to make fork_connector() inline. Since it
> calls other subroutines and is not just a few lines, perhaps better to
> make it a real routine, so we can see it in "nm --print-size" output and
> debug stacks.
>
> Having the "#ifdef CONFIG_FORK_CONNECTOR" chunk of code right in fork.c
> seems unfortunate. Can the real fork_connector() be put elsewhere, and
> the ifdef put in a header file that makes it a no-op if not configured,
> or simply a function declaration, if configured?
>
> What's the status of the connector driver patch? I perhaps wasn't
> paying close enough attention, but all I see of it now is a couple of
> patches sent to lkml, from Evgeniy Polyakov, in September and January.
> I don't see it in my copies of *-mm or recent Linus bk trees. Am I
> missing something?

It was dropped from the -mm tree, since the bk tree where it lives
was in maintenance mode.
I think connector will appear in the next -mm release.

> This still seems to me like more apparatus than is desirable, just to
> get another form of session id, as best as I can figure it. However
> we've already been there, and apparently my concerns were not
> persuasive. If one does go down this path, then using this connector
> patch is a good an alternative as any I know of. Well, that or relayfs.
> My uneducated assumption is that relayfs might at least batch data
> packets up into big buffer chunks better, but someone more knowledgeable
> than me needs to consider that.
>
> It's a little sad, when almost all the required accounting information
> comes out in packed 64 byte records, carefully buffered and sent in
> big chunks, to minimize per-task costs. Then this one extra detail,
> of <parent-pid, child-pid> requires an entire netlink packet of
> its own of what size -- another 50 or 100 bytes? Is this packet
> received as a separate data packet, on its own recv(2) system call,
> by the user task, not in a big block of packets? The efficiency
> of getting this one extra <parent-pid, child-pid> out of the kernel
> seems to be one or two orders of magnitude worse than the rest of
> the accounting data.

It can be easily changed.
One may send the kernel/acct.c acct_t structure out of the kernel -
the overhead will be the same: kmalloc will probably allocate from the
same 256-byte pool, and the skb is still in cache.

> ===
>
> Hmmm ... perhaps one could add a _second_ accounting file, cutting and
> pasting code in kernel/acct.c and enabling writing additional
> information to that second file, using the same mechanisms as now used
> for the primary file. Use a more extensible record format for the
> second file (say start each record with a magic cookie, a byte record
> type and a byte record length, then that many bytes). That way, we have
> an escape valve for adding additional record types in the future.
> And that way we can efficiently write short records, with just say
> a couple of interesting values, and minimal overhead.
>
> Don't worry if the magic cookie appears as part of the raw data. If one
> has to resync such a data stream, one can look for a series of records,
> each starting with the magic cookie, sensible record type byte, and a
> length that ends right at the next such valid record. The occasional
> duplication of the same cookie within the data stream would not thwart a
> resync for long. And the main purpose of the magic cookie is to make
> sure you are still in sync, not reverting to garbage-in, garbage-out,
> mode. Almost any magic value other than 0x0000 will suffice for that
> purpose.
>
> I just ran a silly little test on my PC desktop Linux box, scanning
> /proc/kcore. The _least_ common 2 byte word seen was 0x2B91, with 31
> instances in a half-billion words scanned, so I nominate that value for
> the magic cookie ;).
>
> The key reason that it might make sense here to adapt the existing
> accounting file direct write mechanism, rather than using "connector" or
> "relayfs", is that we really do want to get this data to disk initially.
> Relayfs is optimized for getting a lot of data to a user daemon, and the
> connector for sending smaller packets of data to a user daemon. But
> accounting processing is sometimes done out of a cron job off-hours.
> During the day (the busy hours) you might just want to stash the stuff
> with as little performance impact as possible. If one can avoid _any_
> other task having to context switch in, in order to get this data on its
> way, that is a huge win.

File writing accounting [kernel/acct.c] is slower: it takes global
locks and requires process context to work with system calls.
relayfs is an interesting project, but it has different aims as far as
I can see - it was created for transferring huge amounts of data, and
it succeeds at that, while connector is purely a control/notification
mechanism, for example for gathering short-lived per-process
accounting data.

--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski



2005-03-29 07:20:50

by Greg KH

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, Mar 29, 2005 at 11:04:16AM +0400, Evgeniy Polyakov wrote:
> On Mon, 2005-03-28 at 13:42 -0800, Paul Jackson wrote:
> > I don't see it in my copies of *-mm or recent Linus bk trees. Am I
> > missing something?
>
> It was dropped from the -mm tree, since the bk tree where it lives
> was in maintenance mode.
> I think connector will appear in the next -mm release.

Should have been in the last -mm release. If not, please let me know.

thanks,

greg k-h

2005-03-29 08:14:02

by Guillaume Thouvenin

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Mon, 2005-03-28 at 13:42 -0800, Paul Jackson wrote:
> Guillaume wrote:
> > The lmbench shows that the overhead (the construction and the sending
> > of the message) in the fork() routine is around 7%.
>
> Thanks for including the numbers. The 7% seems a bit costly, for a bit
> more accounting information. Perhaps dean's suggestion, to not use
> ascii, will help. I hope so, though I doubt it will make a huge
> difference. Was this 7% loss with or without a user level program
> consuming the sent messages? I would think that the number of interest
> would include a minimal consumer task.

Yes, dean's suggestion helps. The overhead is now around 4%:

fork_connector disabled:
Process fork+exit: 149.4444 microseconds

fork_connector enabled:
Process fork+exit: 154.9167 microseconds

> Having the "#ifdef CONFIG_FORK_CONNECTOR" chunk of code right in fork.c
> seems unfortunate. Can the real fork_connector() be put elsewhere, and
> the ifdef put in a header file that makes it a no-op if not configured,
> or simply a function declaration, if configured?

I think that it can be moved into include/linux/connector.h

Guillaume

2005-03-29 08:09:59

by Evgeniy Polyakov

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Mon, 2005-03-28 at 23:02 -0800, Greg KH wrote:
> On Tue, Mar 29, 2005 at 11:04:16AM +0400, Evgeniy Polyakov wrote:
> > On Mon, 2005-03-28 at 13:42 -0800, Paul Jackson wrote:
> > > I don't see it in my copies of *-mm or recent Linus bk trees. Am I
> > > missing something?
> >
> > It was dropped from the -mm tree, since the bk tree where it lives
> > was in maintenance mode.
> > I think connector will appear in the next -mm release.
>
> Should have been in the last -mm release. If not, please let me know.

Thank you.
If you are not going to sleep right now, I will recreate the rejected
NLMSGSPACE patch in a few minutes.

> thanks,
>
> greg k-h
--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski



2005-03-29 08:56:25

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Evgeniy wrote:
> There is no overhead at all using CBUS.

This is unlikely. Very unlikely.

Please understand that I am not trying to critique CBUS or connector in
isolation, but rather trying to determine what mechanism is best suited
for getting this accounting data written to disk, which is where I
assume it has to go until some non-real time job gets around to
analyzing it. We already have the rest of the BSD Accounting
information taking this batched up path directly to the disk. There is
nothing (that I know of) to be gained from delivering this new fork data
with any higher quality of service, or to any other place.

From what I can understand, correct me if I'm wrong, we have two
alternatives in front of us (ignoring relayfs for a second):

1) Using fork_connector (presumably soon to include use of CBUS):
- forking process enqueues data plus header for single fork
- context switch
- daemon process dequeues single fork data (is this a read or recv?)
- daemon process merges multiple fork data into a single buffer
- daemon process writes buffer for multiple forks (a write)
- disk driver pushes buffer with data for multiple forks to disk

2) Using a modified form of what BSD ACCOUNTING does now:
- forking process appends single fork data to in-kernel buffer
- disk driver pushes buffer with data for multiple forks to disk

It seems to me to be rather unlikely that (1) is cheaper than (2). It
is no particular fault of connector or CBUS that this is so. Even if
there were no overhead at all using CBUS (which I don't believe), (1)
still costs more, because it passes the data, with an added packet
header, into a separate process, and into user space, before it is
combined with other accounting information, and written back down into
the kernel to go to the disk.


> > ... The efficiency
> > of getting this one extra <parent-pid, child-pid> out of the kernel
> > seems to be one or two orders of magnitude worse than the rest of
> > the accounting data.
>
> It can be easily changed.
> One may send the kernel/acct.c acct_t structure out of the kernel -
> the overhead will be the same: kmalloc will probably allocate from the
> same 256-byte pool, and the skb is still in cache.

I have no idea what you just said.


> File writing accounting [kernel/acct.c] is slower,

Sure file writing is slower than queuing on an internal list. I don't
care that getting the data where I want it is slower than getting it
some other place that's only part way.


> and requires process context to work with system calls.

For some connector uses, that might matter. For hooks in fork,
that is no problem - we have all the process context one could
want - two of them if that helps ;).


> relayfs is an interesting project, but it has different aims,

That could well be ... I can't claim to know which of relayfs or
connector would be better here, of the two.


> while connector is purely a control/notification mechanism

However connector is, in this regard, overkill. We don't need a single
event notification mechanism here. One of the key ways in which
accounting such as this has historically minimized system load is to
forgo any effort to provide any notification or data packet per event,
and instead immediately work to batch the data up in bulk form, with one
buffer containing the concatenated data for multiple events. This
amortizes the cost of almost all the handling, and of all the disk i/o,
over many data collection events. Correct me if I'm wrong, but
fork_connector doesn't do this merging of events into a consolidated
data buffer, so is at a distinct disadvantage, for this use, because the
data merging is delayed, and a separate, user level process, is required
to accomplish the merging and conversion to writable blocks of data
suitable for storing on the disk.

Nothing wrong with a good screwdriver. But if you want to pound nails,
hammers, even rocks, work better.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-29 09:17:31

by Guillaume Thouvenin

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, 2005-03-29 at 00:49 -0800, Paul Jackson wrote:
> This
> amortizes the cost of almost all the handling, and of all the disk i/o,
> over many data collection events. Correct me if I'm wrong, but
> fork_connector doesn't do this merging of events into a consolidated
> data buffer, so is at a distinct disadvantage, for this use, because the
> data merging is delayed, and a separate, user level process, is required
> to accomplish the merging and conversion to writable blocks of data
> suitable for storing on the disk.

The goal of the fork connector is to inform a user space application
that a fork occurs in the kernel. This information (cpu ID, parent PID
and child PID) can be used by several user space applications. It's not
only for accounting. Accounting and fork_connector are two different
things and thus fork_connector doesn't merge any kind of data (and it
never will).

One difference with relayfs is the amount of data that is
transferred. relayfs is designed, as Evgeniy said, for large amounts of
data. So I think that it's not suitable for what we want to achieve
with the fork connector.


I hope this helps,
Regards,
Guillaume

2005-03-29 10:28:17

by Evgeniy Polyakov

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, 2005-03-29 at 00:49 -0800, Paul Jackson wrote:
> Evgeniy wrote:
> > There is no overhead at all using CBUS.
>
> This is unlikely. Very unlikely.
>
> Please understand that I am not trying to critique CBUS or connector in
> isolation, but rather trying to determine what mechanism is best suited
> for getting this accounting data written to disk, which is where I
> assume it has to go until some non-real time job gets around to
> analyzing it. We already have the rest of the BSD Accounting
> information taking this batched up path directly to the disk. There is
> nothing (that I know of) to be gained from delivering this new fork data
> with any higher quality of service, or to any other place.
>
> From what I can understand, correct me if I'm wrong, we have two
> alternatives in front of us (ignoring relayfs for a second):
>
> 1) Using fork_connector (presumably soon to include use of CBUS):
> - forking process enqueues data plus header for single fork

Here the forking connector module "exits" and can handle the next
fork() on the same CPU.
Different CPUs are handled independently thanks to the use of per-cpu
variables.

That is why it is very fast in "fast-path".

> - context switch
> - daemon process dequeues single fork data (is this a read or recv?)
> - daemon process merges multiple fork data into a single buffer
> - daemon process writes buffer for multiple forks (a write)
> - disk driver pushes buffer with data for multiple forks to disk

Not exactly.

- context switch
- the CBUS daemon, which runs with nice +19, gets a bunch of requests
[10 currently; the nice value and queue "length" were determined
experimentally] from each CPU queue and sends them using connector's
cn_netlink_send.
- cn_netlink_send reallocates a buffer for each message
[skb + allocation from the 256-byte pool in kmalloc], walks through
the list of registered sockets and links the given skb to the
requested socket queues.
- context switch
- the userspace daemon is awakened from its recv() syscall and does
what it wants with the provided data.
It can write it to disk, but it may also process it in real time or
send it to the network.

The most expensive part is cn_netlink_send()/netlink_broadcast();
with CBUS it is deferred to a safe time, so fork() itself is not
affected (it only costs per-cpu locking + linking + an atomic
allocation). Since the deferred message will be processed at a "safe"
time with low priority, it should not affect fork() either (but it
can).

> 2) Using a modified form of what BSD ACCOUNTING does now:
> - forking process appends single fork data to in-kernel buffer

It is not that simple.
It takes global locks several times and accesses a bunch of data
shared between CPUs.
It calls ->stat() and ->write(), which may sleep.

> - disk driver pushes buffer with data for multiple forks to disk


Here is the same deferring as in connector; only the preparation is
different.
And the acct.c preparation may sleep and works with shared objects
and locks, so it is slower, but it has an advantage - the data is
already written to the storage.

> It seems to me to be rather unlikely that (1) is cheaper than (2). It
> is no particular fault of connector or CBUS that this is so. Even if
> there were no overhead at all using CBUS (which I don't believe), (1)
> still costs more, because it passes the data, with an added packet
> header, into a separate process, and into user space, before it is
> combined with other accounting information, and written back down into
> the kernel to go to the disk.

That work is deferred and does not affect in-kernel processes.
And why should the userspace fork connector write data to the disk?
It can process it in flight and write only the results.
An acct.c processing daemon needs to read the data, i.e. transfer it
from kernelspace.
But again, all that work is deferred and does not affect fork()
performance.

>
> > > ... The efficiency
> > > of getting this one extra <parent-pid, child-pid> out of the kernel
> > > seems to be one or two orders of magnitude worse than the rest of
> > > the accounting data.
> >
> > It can be easily changed.
> > One may send the kernel/acct.c acct_t structure out of the kernel -
> > the overhead will be the same: kmalloc will probably allocate from the
> > same 256-byte pool, and the skb is still in cache.
>
> I have no idea what you just said.

Connector's overhead may come from memory allocation -
currently it calls alloc_skb(size, GFP_ATOMIC); skb allocation
calls kmem_cache_alloc() for the skb itself and kmalloc() for
size + sizeof(struct skb_shared_info), which is essentially an
allocation from the 256-byte slab pool.

>
> > File writing accounting [kernel/acct.c] is slower,
>
> Sure file writing is slower than queuing on an internal list. I don't
> care that getting the data where I want it is slower than getting it
> some other place that's only part way.

One needs to pay for speed.
In the case of acct.c the price is high, since writing is slow,
but one does not need to care about the receiving part.
In the case of connector, the price is low, but it requires
some additional process to fetch the data.

For some tasks one may be better than the other; for others it is not.

> > and requires process context to work with system calls.
>
> For some connector uses, that might matter. For hooks in fork,
> that is no problem - we have all the process context one could
> want - two of them if that helps ;).
>
> > relayfs is an interesting project, but it has different aims,
>
> That could well be ... I can't claim to know which of relayfs or
> connector would be better here, of the two.
>
>
> > while connector is purely a control/notification mechanism
>
> However connector is, in this regard, overkill. We don't need a single
> event notification mechanism here. One of the key ways in which
> accounting such as this has historically minimized system load is to
> forgo any effort to provide any notification or data packet per event,
> and instead immediately work to batch the data up in bulk form, with one
> buffer containing the concatenated data for multiple events. This
> amortizes the cost of almost all the handling, and of all the disk i/o,
> over many data collection events. Correct me if I'm wrong, but
> fork_connector doesn't do this merging of events into a consolidated
> data buffer, so is at a distinct disadvantage, for this use, because the
> data merging is delayed, and a separate, user level process, is required
> to accomplish the merging and conversion to writable blocks of data
> suitable for storing on the disk.

It is a design decision:
one may want to write all the data, even at the cost of slowing down
the system, and process it all at once later,
but one may instead want to process it in real time in small pieces
and have a direct view of how the system behaves, which requires very
small overhead.
A userspace fork connector daemon may send the data to the network or
transfer it using some other mechanism without touching the disk IO
subsystem, and that is faster than writing it to disk, then reading
it (and maybe seeking), and transferring it again.

One instrument is better for one type of task; others are suitable
for different ones.

> Nothing wrong with a good screwdriver. But if you want to pound nails,
> hammers, even rocks, work better.

He-he, while you lift your rock hammer, others can finish all the work
with their small instruments. :)

> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski



2005-03-29 12:51:32

by Guillaume Thouvenin

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Mon, 2005-03-28 at 13:42 -0800, Paul Jackson wrote:
> Guillaume wrote:
> > The lmbench shows that the overhead (the construction and the sending
> > of the message) in the fork() routine is around 7%.
>
> Thanks for including the numbers. The 7% seems a bit costly, for a bit
> more accounting information. Perhaps dean's suggestion, to not use
> ascii, will help. I hope so, though I doubt it will make a huge
> difference. Was this 7% loss with or without a user level program
> consuming the sent messages? I would think that the number of interest
> would include a minimal consumer task.

I ran some tests using the CBUS instead of the cn_netlink_send() routine
and the overhead is nearly 0%:

fork connector disabled:
Process fork+exit: 148.1429 microseconds

fork connector enabled:
Process fork+exit: 148.4595 microseconds

Regards,
Guillaume

2005-03-29 14:49:48

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Guillaume wrote:
> Yes, dean's suggestion helps. The overhead is now around 4%

More improvement than I expected (and I see a CBUS result further
down in my inbox).

Does this include a minimal consumer task of the data that writes
it to disk?

> I think that it can be moved in include/linux/connector.h

Good.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-29 15:25:59

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Guillaume wrote:
> The goal of the fork connector is to inform a user space application
> that a fork occurs in the kernel. This information (cpu ID, parent PID
> and child PID) can be used by several user space applications. It's not
> only for accounting. Accounting and fork_connector are two different
> things and thus fork_connector doesn't merge any kind of data (and it
> never will).

Yes - it is clear that the fork_connector does this - inform user space
of fork information <cpu, parent, child>. I'm not saying that
fork_connector should merge data; I'm observing that it doesn't, and
that this would seem to serve the needs of accounting poorly.

Out of curiosity, what are these 'several user space applications?' The
only one I know of is this extension to bsd accounting to include
capturing parent and child pid at fork. Probably you've mentioned some
other uses of fork_connector before here, but I missed it.

> relayfs is designed, as Evgeniy said, for large amounts of
> data. So I think that it's not suitable for what we want to achieve
> with the fork connector.

I never claimed that relayfs was appropriate for fork_connector.

I'm not trying to tape a rock to Evgeniy's screwdriver. I'm saying that
accounting looks like a nail to me, so let us see what rocks and hammers
we have in our tool box.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-29 15:38:11

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Guillaume wrote:
> I ran some tests using the CBUS instead of the cn_netlink_send() routine
> and the overhead is nearly 0%:

Overhead of what? Does this include merging the data and getting it to
disk?

Am I even asking the right question here - is it true that this data,
when collected for accounting purposes, needs to go to disk, and that
summarizing and analyzing the data is done 'off-line', perhaps hours
later? That's the way it was 25 years ago ... but perhaps the basic
data flow appropriate for accounting has changed since then.

And if the data flow has changed, then how do you account for the fact
that the rest of the accounting data, under the CONFIG_BSD_PROCESS_ACCT
option, is still collected the 'old fashioned way' (direct kernel write
to disk)?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-29 17:08:38

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Evgeniy writes:
> Here the forking connector module "exits" and can handle the next
> fork() on the same CPU.

Fine ... but it's not about what the fork_connector does. It's about
getting the accounting data to disk, if I understand correctly.

> That is why it is very fast in "fast-path".

I don't care how fast a tool is. I care how fast the job gets done. If
a tool is only doing part of the job, then we can't decide whether to
use that tool just based on how fast that part of the job gets done.

> The most expensive part is cn_netlink_send()/netlink_broadcast(),
> with CBUS it is deferred to a safe time,

This is "safe time" for the immediate purpose of letting the forking
process continue on its way. But the deferred work of buffering up the
data and writing it to disk still needs to be done, pretty soon. When
sizing a system to see how many users or jobs I can run on it at a time,
I will have to include sufficient cpu, memory and disk i/o to handle
getting this accounting data to disk, right?

> > 2) Using a modified form of what BSD ACCOUNTING does now:
> > - forking process appends single fork data to in-kernel buffer
>
> It is not that simple.
> It takes global locks several times and accesses a bunch of data
> shared between CPUs.
> It calls ->stat() and ->write(), which may sleep.

Hmmm ... good points. The mechanisms in the kernel now (and for the
last 25 years) to write out BSD ACCOUNTING data may not be numa friendly.

Perhaps there should be a per-cpu 512 byte buffer, which can gather up 8
accounting records (64 bytes each) and only call the file system write
once every 8 task exits. Or perhaps a per-node buffer, with a spinlock
to serialize access by the CPUs on that node. Or perhaps per-node
accounting files. Or something like that.

Guillaume, Jay - do we (you ?) need to make classic BSD ACCOUNTING data
collection numa friendly? Based on the various frustrated comments at
the top of kernel/acct.c, this could be a non-trivial effort to get
right. Maybe we need it, but can't afford it.

And perhaps my proposed variable length records for supplementary
accounting, such as <parent pid, child pid> from fork, need to allow
for some way to pad out the rest of a buffer, when the next record
won't fit entirely.

> That work is deferred and does not affect in-kernel processes.

The accounting data collection cannot be deferred for long, perhaps
just a few minutes. Not until the data hits the disk can we rest
indefinitely. Unless, that is, I don't understand what problem is
being solved here (quite possible ;).

> And why should the userspace fork connector write data to the disk?

I NEVER said it should. I am NOT trying to redesign fork_connector.

Good grief ... how many times and ways do I have to say this ;)?

I am asking what is the best tool for accounting data collection,
which, if I understand correctly, does need to write to disk.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-29 18:44:32

by Jay Lan

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Paul Jackson wrote:
> Guillaume wrote:
>
>> The goal of the fork connector is to inform a user space application
>>that a fork occurs in the kernel. This information (cpu ID, parent PID
>>and child PID) can be used by several user space applications. It's not
>>only for accounting. Accounting and fork_connector are two different
>>things and thus fork_connector doesn't merge any kind of data (and it
>>never will).
>
>
> Yes - it is clear that the fork_connector does this - inform user space
> of fork information <cpu, parent, child>. I'm not saying that
> fork_connector should merge data; I'm observing that it doesn't, and
> that this would seem to serve the needs of accounting poorly.

Paul,

You probably can look at it this way: the accounting data being
written out by BSD are per process data and the fork connector
provides information needed to group processes into process
aggregates.

Thanks,
- jay

>
> Out of curiosity, what are these 'several user space applications?' The
> only one I know of is this extension to bsd accounting to include
> capturing parent and child pid at fork. Probably you've mentioned some
> other uses of fork_connector before here, but I missed it.
>
>
>>relayfs is designed, as Evgeniy said, for large amounts of
>>data. So I think that it's not suitable for what we want to achieve
>>with the fork connector.
>
>
> I never claimed that relayfs was appropriate for fork_connector.
>
> I'm not trying to tape a rock to Evgeniy's screwdriver. I'm saying that
> accounting looks like a nail to me, so let us see what rocks and hammers
> we have in our tool box.
>

2005-03-29 21:21:06

by Jay Lan

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Paul,

The fork_connector is not designed to solve the accounting data
collection problem.

The accounting data collection must be done via a hook from do_exit().
The acct_process() hook invokes do_acct_process() to write BSD
accounting data to disk. CSA needs a similar hook off do_exit() to
collect more accounting data and write it to disk in a different
accounting record format. This part is not part of fork_connector.

ELSA does not care how accounting data is written to disk. However,
it needs the accounting data to be reliably and accurately collected
and written to disk by BSD and/or CSA, and it needs the <ppid, pid>
information to aggregate processes. It was never the fork_connector's
intention to piggyback the data onto the accounting file.

Thanks,
- jay



Paul Jackson wrote:
> Evgeniy writes:
>
>>Here the forking connector module "exits" and can handle the next
>>fork() on the same CPU.
>
>
> Fine ... but it's not about what the fork_connector does. It's about
> getting the accounting data to disk, if I understand correctly.
>
>
>>That is why it is very fast in "fast-path".
>
>
> I don't care how fast a tool is. I care how fast the job gets done. If
> a tool is only doing part of the job, then we can't decide whether to
> use that tool just based on how fast that part of the job gets done.
>
>
>>The most expensive part is cn_netlink_send()/netlink_broadcast(),
>>with CBUS it is deferred to a safe time,
>
>
> This is "safe time" for the immediate purpose of letting the forking
> process continue on its way. But the deferred work of buffering up the
> data and writing it to disk still needs to be done, pretty soon. When
> sizing a system to see how many users or jobs I can run on it at a time,
> I will have to include sufficient cpu, memory and disk i/o to handle
> getting this accounting data to disk, right?
>
>
>>> 2) Using a modified form of what BSD ACCOUNTING does now:
>>> - forking process appends single fork data to in-kernel buffer
>>
>>It is not that simple.
>>It takes global locks several times and accesses a bunch of data
>>shared between CPUs.
>>It calls ->stat() and ->write(), which may sleep.
>
>
> Hmmm ... good points. The mechanisms in the kernel now (and for the
> last 25 years) to write out BSD ACCOUNTING data may not be numa friendly.
>
> Perhaps there should be a per-cpu 512 byte buffer, which can gather up 8
> accounting records (64 bytes each) and only call the file system write
> once every 8 task exits. Or perhaps a per-node buffer, with a spinlock
> to serialize access by the CPUs on that node. Or perhaps per-node
> accounting files. Or something like that.
>
> Guillaume, Jay - do we (you ?) need to make classic BSD ACCOUNTING data
> collection numa friendly? Based on the various frustrated comments at
> the top of kernel/acct.c, this could be a non-trivial effort to get
> right. Maybe we need it, but can't afford it.
>
> And perhaps my proposed variable length records for supplementary
> accounting, such as <parent pid, child pid> from fork, need to allow
> for some way to pad out the rest of a buffer, when the next record
> won't fit entirely.
>
>
>>That work is deferred and does not affect in-kernel processes.
>
>
> The accounting data collection cannot be deferred for long, perhaps
> just a few minutes. Not until the data hits the disk can we rest
> indefinitely. Unless, that is, I don't understand what problem is
> being solved here (quite possible ;).
>
>
>>And why should the userspace fork connector write data to the disk?
>
>
> I NEVER said it should. I am NOT trying to redesign fork_connector.
>
> Good grief ... how many times and ways do I have to say this ;)?
>
> I am asking what is the best tool for accounting data collection,
> which, if I understand correctly, does need to write to disk.
>

2005-03-29 22:04:03

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

Jay wrote:
> The fork_connector is not designed to solve accounting data collection
> problem.

I don't think I ever said it was designed for that purpose.

Indeed, I will confess to not yet knowing the 'real' purpose of its
design.


> It was never the fork_connector's
> intention to piggyback the data onto the accounting file.

I am sure that was not its intention, and I can't imagine I would ever
have said it was.


> CSA needs a similar hook off do_exit() to
> collect more accounting data and write it to disk in a different
> accounting record format.

Aha - as I suspected - there will be more data to collect, in addition
to both the classic bsd accounting records at exit, and the <parent pid,
child pid> at fork.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-30 01:06:17

by Paul Jackson

Subject: Re: [patch 1/2] fork_connector: add a fork connector

[ Hmmm .. the following pertains more to accounting than to fork_connector,
as have my other remarks earlier today. I notice just now I am on a thread
whose Subject is "fork_connector". Oh well. Sorry. - pj ]

Jay wrote:
> You probably can look at it this way: the accounting data being
> written out by BSD are per process data and the fork connector
> provides information needed to group processes into process
> aggregates.

I guess so. Though that doesn't provide any explicit guidance as to
what the necessary dataflow must be -- who (which essential piece(s) of
software) needs the data when, to accomplish what purposes some
Linux users will desire.

Well, maybe to someone expert in Process Aggregates, it provides
such guidance, implicitly. That's definitely not I.

Let me step back a minute here.

What's needed is to work from the actual user requirements down
to what technical pieces are needed. There's an old saying
that if you want something done bad enough, do it yourself.

Or, on usenet and now on mailing lists, this has become:
if you want something done, post a sufficiently botched
example yourself, and someone who actually knows will become
sufficiently annoyed to post a useful answer.

So here goes my botched effort to work from user requirements
down to actual technical pieces needed. I look forward to
being shot down in flames.

My current understanding of the 'system accounting' requirement
is that users of large shared resource servers want to determine,
after the fact, what was the usage by or for various
tasks/jobs/users/groups/time-periods of various compute
resources, in order to perform such tasks as billing and sizing
of future equipment needs, and to identify patterns of over or
under utilized system resources that might present other
opportunities for useful action, or causes for remedial action.

I am working under the assumption that there is some accounting
(of computer users and resources, not of money ;) software
(runacct, CSA, and ELSA, for example) that runs, after the fact
in some post-processing mode, that reads records of actual usage
details from disk files and does useful stuff (like generate
reports useful to the above requirements) with what it can glean
from those records and from other configuration information
it can find about the current system (by reading other disk
files, typically). This processing can be and often is done
in batch mode, and is often scheduled out of a cron job for some
time when the system is normally under relatively lighter load,
such as late at night.

I assume that the information needed by this accounting software
includes both the classic BSD accounting records and the
<parent_pid, child_pid> information at fork.

I am not aware of any other uses of the <parent_pid, child_pid>
information from fork, though it would not surprise me to learn
that there other such uses - you're welcome to educate me on
this matter.

I suspect that there is other information, or will be, in
addition to the specific details collected by the classic bsd
accounting kernel hooks, and in addition to the <parent_pid,
child_pid> information at fork, which will also be needed by CSA
and/or ELSA, and which also needs to be written to disk files
as the data is collected, for subsequent processing by such
accounting software as CSA and ELSA, or the classic runacct(1M)
daily accounting software and variants.

If the above is all true, then the basic problem to solve
regarding the <parent_pid, child_pid> information collected at
fork is how to get it into a disk file, with close to minimum
impact on the system.

Since the data is not needed in anything like realtime (or
if it is, I don't realize that yet) therefore there is an
opportunity to combine the data records into buffers of data,
so as to amortize some of the costs of writing the data to
disk over several records. The classic bsd accounting hooks
do this merging aggressively, in the context of the process
doing the exit.

The classic accounting hooks may have a problem that they are not
NUMA friendly - having all the nodes in a big system trying to
simultaneously add small (64 bytes, typically) snippets to the
same shared file buffers at the same time might not scale well.
These hooks were designed over 25 years ago, when multiprocessing
was in its infancy, and may need overhaul.

The fork_connector mechanism is being proposed to get the
particular bit of information <parent_pid, child_pid> from
fork moved to what I presume is a data collector daemon user
process, which will I presume then write merged records of
this data to disk. This may have the problem that it moves
the individual records between various contexts on the system,
more than is necessary, before it can be merged into buffers
and written. While such data motion does not happen inline
to the fork itself, it still has to occur in near realtime
(minutes) of the fork event, so still impacts system performance
(both CPU cycles and memory footprint) during peak usage hours.
Performance impact numbers have been presented to show that
this impact is minimal, but my direct question as to whether
these numbers include the load of the data collector and writer
daemon has gone unanswered, so far as I know.

So this is what I understand the problems and the requirements
to be.

Most likely I misunderstand important parts of this - I invite
corrections. I don't actually have a horse in this race; I'm just
doing the color commentary. So if someone wants to rip this apart,
then go for it. I will enjoy the show along with everyone else.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-30 05:40:15

by Guillaume Thouvenin

Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, 2005-03-29 at 07:23 -0800, Paul Jackson wrote:
> Guillaume wrote:
> > The goal of the fork connector is to inform a user space application
> > that a fork occurs in the kernel. This information (cpu ID, parent PID
> > and child PID) can be used by several user space applications. It's not
> > only for accounting. Accounting and fork_connector are two different
> > things and thus fork_connector doesn't merge any kind of data (and it
> > never will).
>
> Yes - it is clear that the fork_connector does this - inform user space
> of fork information <cpu, parent, child>. I'm not saying that
> fork_connector should merge data; I'm observing that it doesn't, and
> that this would seem to serve the needs of accounting poorly.
>
> Out of curiosity, what are these 'several user space applications?' The
> only one I know of is this extension to bsd accounting to include
> capturing parent and child pid at fork. Probably you've mentioned some
> other uses of fork_connector before here, but I missed it.

During the discussion some people like Erich Focht and Ram mentioned
that this information can be useful for them. I remember that Erich had
in mind something like cluster-wide pid tracking in user space.

When I wrote "several user space applications" it was just to say
that the fork connector is not designed only for ELSA; the fork
information is available to all listeners.

Regards,
Guillaume

2005-03-30 05:52:37

by Guillaume Thouvenin

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, 2005-03-29 at 07:35 -0800, Paul Jackson wrote:
> Guillaume wrote:
> > I ran some test using the CBUS instead of the cn_netlink_send() routine
> > and the overhead is nearly 0%:
>
> Overhead of what? Does this include merging the data and getting it to
> disk?

I tested the overhead of sending the fork information to a user
space application. The merging of the data is done later and has
nothing to do with the fork connector...

> Am I even asking the right question here - is it true that this data,
> when collected for accounting purposes, needs to go to disk, and that
> summarizing and analyzing the data is done 'off-line', perhaps hours
> later? That's the way it was 25 years ago ... but perhaps the basic
> data flow appropriate for accounting has changed since then.

Accounting is another problem and, as you said previously, summarizing
and analyzing the data is done later.

I'm sorry, but I really don't understand why you're speaking about
accounting when I present results about the fork connector. I agree
that ELSA uses the fork connector, but the fork connector itself
has nothing to do with accounting.

Regards,
Guillaume


2005-03-30 06:06:40

by dean gaudet

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, 29 Mar 2005, Jay Lan wrote:

> The fork_connector is not designed to solve accounting data collection
> problem.
>
> The accounting data collection must be done via a hook from do_exit().

by the time do_exit() occurs the parent may have disappeared... you do
need to record something at fork() time so that you can account to the
correct ancestor.

an example of where this ancestry is useful would be the summation of all
cpu time spent by children of apache, spamd, clamd, ...
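
a tiny userspace illustration of the point - nothing kernel-specific,
just showing that once the parent exits, the original ppid is gone
unless something recorded it at fork time:

/*
 * The child outlives its parent; by the time the child gets around
 * to "exiting" here, getppid() no longer names the original parent
 * (the child has been reparented, typically to init), so the
 * ancestry must have been captured at fork time.
 */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t ppid_at_fork = getpid();	/* what a fork-time hook records */

	if (fork() == 0) {
		sleep(2);		/* let the parent die first */
		printf("recorded at fork: %d, getppid() now: %d\n",
		       (int)ppid_at_fork, (int)getppid());
		return 0;
	}
	return 0;			/* parent exits immediately */
}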

-dean

2005-03-30 06:27:34

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

Dean wrote:
> by the time do_exit() occurs the parent may have disappeared

I don't think Jay was disagreeing with this. I think he agrees
that there are three things to be collected:
1) the classic bsd accounting data, in do_exit
2) the <parent pid, child pid> pair, by some mechanism at fork
time (perhaps just not the fork_connector mechanism)
3) some additional data to be harvested at exit time, for CSA

I suspect you two are just tripping over words to describe this.

However, this does expose another possibility. Record the original
forking parent pid in another task_struct field at fork time (didn't
someone else have a 'bio_pid' patch to this effect?), and add that
task struct value to the list of additional items to be written out
at exit time.
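
In sketch form - against no particular tree, and with 'fork_ppid'
and acct_write_fork_ppid() being names I just made up for
illustration:

/* include/linux/sched.h: one more word in struct task_struct */
	pid_t fork_ppid;	/* pid of the task that forked us */

/* kernel/fork.c, copy_process(): essentially free at fork time */
	p->fork_ppid = current->pid;

/* exit path: written out with the rest of the accounting record */
	acct_write_fork_ppid(tsk->fork_ppid);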

I was skeptical that CBUS could have zero impact on fork, but recording
one more word in the task struct at fork gets about as close to zero
impact as one can get on fork.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-30 06:37:50

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

Guillaume wrote:
> When I wrote "several user space applications" it was just to say
> that the fork connector is not designed only for ELSA; the fork
> information is available to all listeners.

So I suppose if fork_connector were not used to collect <parent pid,
child pid> information for accounting, then someone would have to make
the case that there were enough other uses, of sufficient value, to add
fork_connector. We have to be a bit careful, in the kernel, to avoid
adding mechanisms until we have the immediate use in hand. If we don't
do this, then the kernel ends up looking like the Gargoyles on a
Renaissance church - burdened with overly ornate features serving no
earthly purpose.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-30 06:39:23

by Guillaume Thouvenin

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, 2005-03-29 at 22:06 -0800, dean gaudet wrote:
> On Tue, 29 Mar 2005, Jay Lan wrote:
>
> > The fork_connector is not designed to solve accounting data collection
> > problem.
> >
> > The accounting data collection must be done via a hook from do_exit().
>
> by the time do_exit() occurs the parent may have disappeared... you do
> need to record something at fork() time so that you can account to the
> correct ancestor.

You're right. At fork(), the "job daemon", provided by ELSA, records
information about parent PID, child PID and also about the group of
processes they belong to. At exit(), accounting data are recorded by CSA
or BSD-like accounting.

> an example of where this ancestry is useful would be the summation of all
> cpu time spent by children of apache, spamd, clamd, ...

You're right. One usage can be: apache, spamd and clamd are each
put in a job (a group of processes) by using the "job daemon", and
all their children automatically belong to the same jobs. So the
goal here is really to perform accounting per group of processes,
using ELSA and the CSA accounting data.

Best Regards,
Guillaume

2005-03-30 06:44:40

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

Guillaume wrote:
> I'm sorry but I really don't understand why you're speaking about
> accounting when I present results about fork connector. I agree that
> ELSA is using the fork connector but the fork connector has nothing to
> do with accounting.

True - sorry. I kinda hijacked your thread. I had fork_connector
associated in my mind with process accounting, so I made the leap
from analyzing the fork_connector mechanism on its own merits to
analyzing whether it was useful for collecting the new process
accounting information that was needed from forks.

In my own defense, I don't see where the motivations for
fork_connector are spelled out in the presentation of this patch,
and it seems that its other potential uses are less well explored
at this point.

So I think my leap was a small one ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-03-30 10:28:50

by Herbert Xu

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

Paul Jackson <[email protected]> wrote:
>
> So I suppose if fork_connector were not used to collect <parent pid,
> child pid> information for accounting, then someone would have to make
> the case that there were enough other uses, of sufficient value, to add
> fork_connector. We have to be a bit careful, in the kernel, to avoid
> adding mechanisms until we have the immediate use in hand. If we don't
> do this, then the kernel ends up looking like the Gargoyles on a
> Renaissance church - burdened with overly ornate features serving no
> earthly purpose.

I agree completely. In fact the whole drivers/connector directory
looks pretty suspect. Are there any in-kernel users of it at all?
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-03-30 10:53:09

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Wed, 2005-03-30 at 20:25 +1000, Herbert Xu wrote:
> Paul Jackson <[email protected]> wrote:
> >
> > So I suppose if fork_connector were not used to collect <parent pid,
> > child pid> information for accounting, then someone would have to make
> > the case that there were enough other uses, of sufficient value, to add
> > fork_connector. We have to be a bit careful, in the kernel, to avoid
> > adding mechanisms until we have the immediate use in hand. If we don't
> > do this, then the kernel ends up looking like the Gargoyles on a
> > Renaissance church - burdened with overly ornate features serving no
> > earthly purpose.
>
> I agree completely. In fact the whole drivers/connector directory
> looks pretty suspect. Are there any in-kernel users of it at all?

The SuperIO subsystem.
On the agenda are w1 and acrypto [though it already looks like
acrypto will not be included :) ].

--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski


2005-03-30 11:01:42

by Guillaume Thouvenin

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Wed, 2005-03-30 at 20:25 +1000, Herbert Xu wrote:
> Paul Jackson <[email protected]> wrote:
> >
> > So I suppose if fork_connector were not used to collect <parent pid,
> > child pid> information for accounting, then someone would have to make
> > the case that there were enough other uses, of sufficient value, to add
> > fork_connector. We have to be a bit careful, in the kernel, to avoid
> > adding mechanisms until we have the immediate use in hand. If we don't
> > do this, then the kernel ends up looking like the Gargoyles on a
> > Renaissance church - burdened with overly ornate features serving no
> > earthly purpose.
>
> I agree completely. In fact the whole drivers/connector directory
> looks pretty suspect. Are there any in-kernel users of it at all?

There is the Enhanced Linux System Accounting project
http://elsa.sourceforge.net

Guillaume

2005-03-30 14:15:15

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

Sorry for the long delay - I was quite far from my test machines.
Here are the results:

fork connector with disk writes turned off and direct calls to the
connector's methods.

pcix$ ./fork_test 100000
Average per process fork+exit time is 505 usecs [diff=50567251, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 512 usecs [diff=51248174, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 504 usecs [diff=50469379, max=100000].

fork connector with disk writes turned on and direct calls to the
connector's methods.
Each disk write is about 80 bytes, produced like this:

time_t tm;

time(&tm);
fprintf(out,
	"%.24s : [%x.%x] [seq=%08x, ack=%08x] %s.\n",
	ctime(&tm), data->id.idx, data->id.val,
	data->seq, data->ack, (char *)data->data);

pcix$ ./fork_test 100000
Average per process fork+exit time is 539 usecs [diff=53944663, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 523 usecs [diff=52378314, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 540 usecs [diff=54078648, max=100000].

CBUS results.

Writing disabled:

pcix$ ./fork_test 100000
Average per process fork+exit time is 451 usecs [diff=45194377, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 454 usecs [diff=45416470, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 448 usecs [diff=44863153, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 453 usecs [diff=45312870, max=100000].
pcix$

Writing enabled, as described above:

pcix$ ./fork_test 100000
Average per process fork+exit time is 456 usecs [diff=45680384, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 455 usecs [diff=45590682, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 453 usecs [diff=45376436, max=100000].
pcix$


fork connector is not compiled in:

pcix$ ./fork_test 100000
Average per process fork+exit time is 452 usecs [diff=45280538, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 446 usecs [diff=44687388, max=100000].
pcix$ ./fork_test 100000
Average per process fork+exit time is 445 usecs [diff=44505999, max=100000].


With an 80-byte write per fork, CBUS takes from 0.5% to 2.5%.
So it still can be used for accounting :)
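
(For reference, a minimal harness of the fork_test kind - this is a
reconstruction matching the output format above, not necessarily the
actual tool:)

/*
 * Fork 'max' children that exit immediately; report the average
 * wall-clock cost of one fork+exit pair.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	long i, max = (argc > 1) ? atol(argv[1]) : 100000;
	struct timeval start, stop;
	long long diff;

	gettimeofday(&start, NULL);
	for (i = 0; i < max; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);
		if (pid > 0)
			waitpid(pid, NULL, 0);
	}
	gettimeofday(&stop, NULL);

	diff = (stop.tv_sec - start.tv_sec) * 1000000LL +
	       (stop.tv_usec - start.tv_usec);
	printf("Average per process fork+exit time is %lld usecs "
	       "[diff=%lld, max=%ld].\n", diff / max, diff, max);
	return 0;
}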

--
Evgeniy Polyakov

2005-03-30 18:15:02

by Jay Lan

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

The parent information (the (ppid,pid) pair) is useful for process
group aggregation, while the do_exit() hook is needed to save
per-task accounting data before the task's data is disposed of.

Thanks,
- jay


dean gaudet wrote:
> On Tue, 29 Mar 2005, Jay Lan wrote:
>
>
>>The fork_connector is not designed to solve accounting data collection
>>problem.
>>
>>The accounting data collection must be done via a hook from do_exit().
>
>
> by the time do_exit() occurs the parent may have disappeared... you do
> need to record something at fork() time so that you can account to the
> correct ancestor.
>
> an example of where this ancestry is useful would be the summation of all
> cpu time spent by children of apache, spamd, clamd, ...
>
> -dean

2005-03-30 21:00:02

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

> So it still can be used for accounting :)

No ... so these results don't show that it shouldn't be used for
accounting.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-01 03:26:10

by Drew Hess

[permalink] [raw]
Subject: Re: [patch 1/2] fork_connector: add a fork connector

On Tue, 29 Mar 2005 07:23:35 -0800, Paul Jackson <[email protected]> wrote:

> Out of curiosity, what are these 'several user space applications?' The
> only one I know of is this extension to bsd accounting to include
> capturing parent and child pid at fork. Probably you've mentioned some
> other uses of fork_connector before here, but I missed it.


I have a user-space batch job scheduler that could use fork_connector
to track which processes belong to a job. It looks perfect for what I
need.

I would also like to see a do_exit hook, but only as a convenience. I
can probably scrape the BSD accounting files in lieu of a do_exit
hook, but if I had one, I wouldn't need to touch disk for my job
accounting.
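
Something like this minimal reader would do for the scraping - it
assumes the classic struct acct layout from <sys/acct.h> and the
usual pacct location, which may differ per distro:

/*
 * Dump a few fields from each BSD accounting record; accounting
 * must have been enabled with acct(2) for the file to be populated.
 */
#include <stdio.h>
#include <sys/acct.h>

int main(void)
{
	struct acct rec;
	FILE *f = fopen("/var/account/pacct", "r");	/* assumed path */

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fread(&rec, sizeof(rec), 1, f) == 1)
		printf("%-16.16s uid=%u btime=%u\n",
		       rec.ac_comm, (unsigned)rec.ac_uid,
		       (unsigned)rec.ac_btime);
	fclose(f);
	return 0;
}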

d