2002-09-22 05:35:28

by Karim Yaghmour

Subject: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


D: The core tracing infrastructure serves as the main rallying point for
D: all the tracing activity in the kernel. (Tracing here isn't meant in
D: the ptrace sense, but in the sense of recording key kernel events along
D: with a time-stamp in order to reconstruct the system's behavior post-
D: mortem.) Whether the trace driver (which buffers the data collected
D: and provides it to the user-space trace daemon via a char dev) is loaded
D: or not, the kernel sees a unique tracing function: trace_event().
D: Basically, this provides a trace driver register/unregister service.
D: When a trace driver registers, it is forwarded all the events generated
D: by the kernel. If no trace driver is registered, then the events go
D: nowhere.
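
The register/unregister behaviour described above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not the code in kernel/trace.c: the model_* names, counting_tracer() and model_demo() are hypothetical, locking is left out, and returning -ENOMEDIUM from the dispatch function when no tracer is present is an assumption of the sketch. The -EBUSY, -ENOMEDIUM and -ENXIO results of register/unregister follow the return values documented in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ENOMEDIUM
#define ENOMEDIUM 123 /* Linux value, for hosts whose errno.h lacks it */
#endif

/* Same shape as the patch's tracer_call: (event ID, event data). */
typedef int (*tracer_call)(uint8_t, void *);

/* Hypothetical stand-in for the kernel's single tracer slot. */
static tracer_call model_tracer = NULL;

/* At most one tracer may be registered at a time. */
static int model_register_tracer(tracer_call fn)
{
	if (model_tracer != NULL)
		return -EBUSY;
	model_tracer = fn;
	return 0;
}

/* Only the tracer that registered may unregister. */
static int model_unregister_tracer(tracer_call fn)
{
	if (model_tracer == NULL)
		return -ENOMEDIUM;
	if (model_tracer != fn)
		return -ENXIO;
	model_tracer = NULL;
	return 0;
}

/* Forward the event if a tracer is present; otherwise it goes nowhere.
 * Returning -ENOMEDIUM in that case is an assumption of this sketch. */
static int model_trace_event(uint8_t id, void *data)
{
	if (model_tracer == NULL)
		return -ENOMEDIUM;
	return model_tracer(id, data);
}

/* Example tracer: just counts the events it is handed. */
static int events_seen = 0;

static int counting_tracer(uint8_t id, void *data)
{
	(void) id;
	(void) data;
	events_seen++;
	return 0;
}

/* Walk through the life cycle: drop, register, forward, unregister. */
static void model_demo(void)
{
	assert(model_trace_event(1, NULL) == -ENOMEDIUM); /* dropped */
	assert(model_register_tracer(counting_tracer) == 0);
	assert(model_register_tracer(counting_tracer) == -EBUSY);
	assert(model_trace_event(1, NULL) == 0);          /* forwarded */
	assert(events_seen == 1);
	assert(model_unregister_tracer(counting_tracer) == 0);
}
```

Callers of the dispatch function never need to know whether a tracer is loaded, which is the property the description relies on.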
D:
D: In addition to these basic services, this patch allows kernel modules
D: to allocate and trace their own custom events. Hence, a driver can
D: create its own set of events and log them as part of the kernel trace.
D: Many existing drivers that go to great lengths to write their own trace
D: driver and implement their own tracing mechanism should actually
D: be using this custom event creation interface. And again, whether the
D: trace driver is active or even present makes little difference for
D: the users of the kernel's tracing infrastructure.
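
To illustrate the custom event idea, here is a small user-space sketch of an event-type registry: each created event gets a fresh ID and a bounded, always-terminated type string, and can later be destroyed by ID. It is a simplified model rather than the patch's implementation: the model_* names, the fixed-size table, and the starting ID of 22 are assumptions made for the example (the patch keeps a circular list and its own next_event_id). Only the 20-byte type limit is taken from CUSTOM_EVENT_TYPE_STR_LEN.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TYPE_LEN 20        /* same value as CUSTOM_EVENT_TYPE_STR_LEN */
#define MAX_EVENTS 8       /* hypothetical table size for this sketch */
#define FIRST_CUSTOM_ID 22 /* assumption: first ID after built-in IDs 0..21 */

struct model_event {
	int id;              /* 0 marks a free slot */
	char type[TYPE_LEN]; /* bounded, always NUL-terminated */
};

static struct model_event model_events[MAX_EVENTS];
static int model_next_id = FIRST_CUSTOM_ID;

/* Find the slot holding the given ID; looking up 0 finds a free slot. */
static struct model_event *model_lookup(int id)
{
	int i;

	for (i = 0; i < MAX_EVENTS; i++)
		if (model_events[i].id == id)
			return &model_events[i];
	return NULL;
}

/* Create a new event type; returns its ID, or -1 if the table is full. */
static int model_create_event(const char *type)
{
	struct model_event *ev = model_lookup(0);
	size_t len;

	if (ev == NULL)
		return -1;

	/* Zero the buffer, then copy at most TYPE_LEN - 1 bytes so the
	 * string stays terminated however long the caller's string is. */
	memset(ev->type, 0, TYPE_LEN);
	if (type != NULL) {
		len = strlen(type);
		if (len > TYPE_LEN - 1)
			len = TYPE_LEN - 1;
		memcpy(ev->type, type, len);
	}

	ev->id = model_next_id++;
	return ev->id;
}

/* Destroy a previously created event type; unknown IDs are ignored. */
static void model_destroy_event(int id)
{
	struct model_event *ev = (id != 0) ? model_lookup(id) : NULL;

	if (ev != NULL)
		ev->id = 0;
}

/* A driver-style usage example. */
static void custom_demo(void)
{
	int rx = model_create_event("my_driver_rx");
	int lng = model_create_event("a type string well over twenty bytes");

	assert(rx == FIRST_CUSTOM_ID && lng == FIRST_CUSTOM_ID + 1);
	assert(strcmp(model_lookup(rx)->type, "my_driver_rx") == 0);
	assert(strlen(model_lookup(lng)->type) == TYPE_LEN - 1);

	model_destroy_event(rx);
	assert(model_lookup(rx) == NULL);
}
```

In the patch itself the equivalent entry points are trace_create_event(), trace_raw_event() / trace_std_formatted_event() and trace_destroy_event(), declared in include/linux/trace.h.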

Here are the files being added:
include/linux/trace.h | 467 ++++++++++++++++++++++++++++++
kernel/trace.c | 712 ++++++++++++++++++++++++++++++++++++++++++++++

Here are the files being modified:
include/linux/init_task.h | 1
include/linux/sched.h | 3
kernel/Makefile | 3
kernel/exit.c | 4
kernel/fork.c | 7
7 files changed, 1195 insertions, 2 deletions

The complete patch is available in one piece here:
http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.38-vanilla-020922-1.14.bz2

The userspace tools are available here:
http://www.opersys.com/LTT

diff -urpN linux-2.5.38/include/linux/init_task.h linux-2.5.38-ltt/include/linux/init_task.h
--- linux-2.5.38/include/linux/init_task.h Sun Sep 22 00:25:00 2002
+++ linux-2.5.38-ltt/include/linux/init_task.h Sun Sep 22 00:51:51 2002
@@ -98,6 +98,7 @@
.alloc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .trace_info = NULL, \
}


diff -urpN linux-2.5.38/include/linux/sched.h linux-2.5.38-ltt/include/linux/sched.h
--- linux-2.5.38/include/linux/sched.h Sun Sep 22 00:25:00 2002
+++ linux-2.5.38-ltt/include/linux/sched.h Sun Sep 22 00:51:51 2002
@@ -398,6 +398,9 @@ struct task_struct {
/* journalling filesystem info */
void *journal_info;
struct dentry *proc_dentry;
+
+/* Linux Trace Toolkit trace info */
+ void *trace_info;
};

extern void __put_task_struct(struct task_struct *tsk);
diff -urpN linux-2.5.38/include/linux/trace.h linux-2.5.38-ltt/include/linux/trace.h
--- linux-2.5.38/include/linux/trace.h Wed Dec 31 19:00:00 1969
+++ linux-2.5.38-ltt/include/linux/trace.h Sun Sep 22 00:51:51 2002
@@ -0,0 +1,467 @@
+/*
+ * linux/include/linux/trace.h
+ *
+ * Copyright (C) 1999-2002 Karim Yaghmour ([email protected])
+ *
+ * This contains the necessary definitions for tracing the
+ * system.
+ */
+
+#ifndef _LINUX_TRACE_H
+#define _LINUX_TRACE_H
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+
+/* Is kernel tracing enabled */
+#if defined(CONFIG_TRACE) || defined(CONFIG_TRACE_MODULE)
+
+/* Structure packing within the trace */
+#if LTT_UNPACKED_STRUCTS
+#define LTT_PACKED_STRUCT
+#else /* if LTT_UNPACKED_STRUCTS */
+#define LTT_PACKED_STRUCT __attribute__ ((packed))
+#endif /* if LTT_UNPACKED_STRUCTS */
+
+/* The prototype of the tracer call (EventID, *EventStruct) */
+typedef int (*tracer_call) (u8, void *);
+
+/* This structure contains all the information needed to be known
+ about the tracing module. */
+struct tracer {
+ tracer_call trace; /* The tracing routine itself */
+
+ int fetch_syscall_eip_use_depth; /* Use the given depth */
+ int fetch_syscall_eip_use_bounds; /* Find eip in bounds */
+ int syscall_eip_depth; /* Call depth at which eip is fetched */
+ void *syscall_lower_eip_bound; /* Lower eip bound */
+ void *syscall_upper_eip_bound; /* Higher eip bound */
+};
+
+struct ltt_info
+{
+ /* # event writes currently pending for a process */
+ atomic_t pending_write_count;
+};
+
+typedef struct ltt_info ltt_info_t;
+
+/* Maximal size a custom event can have */
+#define CUSTOM_EVENT_MAX_SIZE 8192
+
+/* String length limits for custom events creation */
+#define CUSTOM_EVENT_TYPE_STR_LEN 20
+#define CUSTOM_EVENT_DESC_STR_LEN 100
+#define CUSTOM_EVENT_FORM_STR_LEN 256
+#define CUSTOM_EVENT_FINAL_STR_LEN 200
+
+/* Type of custom event formats */
+#define CUSTOM_EVENT_FORMAT_TYPE_NONE 0
+#define CUSTOM_EVENT_FORMAT_TYPE_STR 1
+#define CUSTOM_EVENT_FORMAT_TYPE_HEX 2
+#define CUSTOM_EVENT_FORMAT_TYPE_XML 3
+#define CUSTOM_EVENT_FORMAT_TYPE_IBM 4
+
+/* Architecture types */
+#define TRACE_ARCH_TYPE_I386 1 /* i386 system */
+#define TRACE_ARCH_TYPE_PPC 2 /* PPC system */
+#define TRACE_ARCH_TYPE_SH 3 /* SH system */
+#define TRACE_ARCH_TYPE_S390 4 /* S/390 system */
+#define TRACE_ARCH_TYPE_MIPS 5 /* MIPS system */
+#define TRACE_ARCH_TYPE_ARM 6 /* ARM system */
+
+/* Standard definitions for variants */
+#define TRACE_ARCH_VARIANT_NONE 0 /* Main architecture implementation */
+
+/* Global trace flags */
+extern unsigned int syscall_entry_trace_active;
+extern unsigned int syscall_exit_trace_active;
+
+/* The functions to the tracer management code */
+int register_tracer
+ (tracer_call); /* The tracer function */
+int unregister_tracer
+ (tracer_call); /* The tracer function */
+int trace_set_config
+ (tracer_call, /* The tracer function */
+ int, /* Use depth to fetch eip */
+ int, /* Use bounds to fetch eip */
+ int, /* Depth to fetch eip */
+ void *, /* Lower bound eip address */
+ void *); /* Upper bound eip address */
+int trace_register_callback
+ (tracer_call, /* The callback to add */
+ u8); /* The event ID targeted */
+int trace_unregister_callback
+ (tracer_call, /* The callback to remove */
+ u8); /* The event ID targeted */
+int trace_get_config
+ (int *, /* Use depth to fetch eip */
+ int *, /* Use bounds to fetch eip */
+ int *, /* Depth to fetch eip */
+ void **, /* Lower bound eip address */
+ void **); /* Upper bound eip address */
+int trace_create_event
+ (char *, /* String describing event type */
+ char *, /* String to format standard event description */
+ int, /* Type of formatting used to log event data */
+ char *); /* Data specific to format */
+int trace_create_owned_event
+ (char *, /* String describing event type */
+ char *, /* String to format standard event description */
+ int, /* Type of formatting used to log event data */
+ char *, /* Data specific to format */
+ pid_t); /* PID of event's owner */
+void trace_destroy_event
+ (int); /* The event ID given by trace_create_event() */
+void trace_destroy_owners_events
+ (pid_t); /* The PID of the process whose events are to be deleted */
+void trace_reregister_custom_events
+ (void);
+int trace_std_formatted_event
+ (int, /* The event ID given by trace_create_event() */
+ ...); /* The parameters to be printed out in the event string */
+int trace_raw_event
+ (int, /* The event ID given by trace_create_event() */
+ int, /* The size of the raw data */
+ void *); /* Pointer to the raw event data */
+int trace_event
+ (u8, /* Event ID (as defined in this header file) */
+ void *); /* Structure describing the event */
+int trace_get_pending_write_count
+ (void);
+int alloc_trace_info
+ (struct task_struct * /* Process descriptor */ );
+void free_trace_info
+ (struct task_struct * /* Process descriptor */ );
+
+/* Generic function */
+static inline void TRACE_EVENT(u8 event_id, void* data)
+{
+ trace_event(event_id, data);
+}
+
+/* Traced events */
+#define TRACE_EV_START 0 /* This is to mark the trace's start */
+#define TRACE_EV_SYSCALL_ENTRY 1 /* Entry in a given system call */
+#define TRACE_EV_SYSCALL_EXIT 2 /* Exit from a given system call */
+#define TRACE_EV_TRAP_ENTRY 3 /* Entry in a trap */
+#define TRACE_EV_TRAP_EXIT 4 /* Exit from a trap */
+#define TRACE_EV_IRQ_ENTRY 5 /* Entry in an irq */
+#define TRACE_EV_IRQ_EXIT 6 /* Exit from an irq */
+#define TRACE_EV_SCHEDCHANGE 7 /* Scheduling change */
+#define TRACE_EV_KERNEL_TIMER 8 /* The kernel timer routine has been called */
+#define TRACE_EV_SOFT_IRQ 9 /* Hit key part of soft-irq management */
+#define TRACE_EV_PROCESS 10 /* Hit key part of process management */
+#define TRACE_EV_FILE_SYSTEM 11 /* Hit key part of file system */
+#define TRACE_EV_TIMER 12 /* Hit key part of timer management */
+#define TRACE_EV_MEMORY 13 /* Hit key part of memory management */
+#define TRACE_EV_SOCKET 14 /* Hit key part of socket communication */
+#define TRACE_EV_IPC 15 /* Hit key part of System V IPC */
+#define TRACE_EV_NETWORK 16 /* Hit key part of network communication */
+
+#define TRACE_EV_BUFFER_START 17 /* Mark the beginning of a trace buffer */
+#define TRACE_EV_BUFFER_END 18 /* Mark the ending of a trace buffer */
+#define TRACE_EV_NEW_EVENT 19 /* New event type */
+#define TRACE_EV_CUSTOM 20 /* Custom event */
+
+#define TRACE_EV_CHANGE_MASK 21 /* Change in event mask */
+
+/* Number of traced events */
+#define TRACE_EV_MAX TRACE_EV_CHANGE_MASK
+
+/* Structures and macros for events */
+/* TRACE_SYSCALL_ENTRY */
+typedef struct _trace_syscall_entry {
+ u8 syscall_id; /* Syscall entry number in entry.S */
+ u32 address; /* Address from which call was made */
+} LTT_PACKED_STRUCT trace_syscall_entry;
+
+/* TRACE_TRAP_ENTRY */
+#ifndef __s390__
+typedef struct _trace_trap_entry {
+ u16 trap_id; /* Trap number */
+ u32 address; /* Address where trap occurred */
+} LTT_PACKED_STRUCT trace_trap_entry;
+static inline void TRACE_TRAP_ENTRY(u16 trap_id, u32 address)
+#else
+typedef u64 trapid_t;
+typedef struct _trace_trap_entry {
+ trapid_t trap_id; /* Trap number */
+ u32 address; /* Address where trap occurred */
+} LTT_PACKED_STRUCT trace_trap_entry;
+static inline void TRACE_TRAP_ENTRY(trapid_t trap_id, u32 address)
+#endif
+{
+ trace_trap_entry trap_event;
+
+ trap_event.trap_id = trap_id;
+ trap_event.address = address;
+
+ trace_event(TRACE_EV_TRAP_ENTRY, &trap_event);
+}
+
+/* TRACE_TRAP_EXIT */
+static inline void TRACE_TRAP_EXIT(void)
+{
+ trace_event(TRACE_EV_TRAP_EXIT, NULL);
+}
+
+/* TRACE_IRQ_ENTRY */
+typedef struct _trace_irq_entry {
+ u8 irq_id; /* IRQ number */
+ u8 kernel; /* Are we executing kernel code */
+} LTT_PACKED_STRUCT trace_irq_entry;
+static inline void TRACE_IRQ_ENTRY(u8 irq_id, u8 in_kernel)
+{
+ trace_irq_entry irq_entry;
+
+ irq_entry.irq_id = irq_id;
+ irq_entry.kernel = in_kernel;
+
+ trace_event(TRACE_EV_IRQ_ENTRY, &irq_entry);
+}
+
+/* TRACE_IRQ_EXIT */
+static inline void TRACE_IRQ_EXIT(void)
+{
+ trace_event(TRACE_EV_IRQ_EXIT, NULL);
+}
+
+/* TRACE_SCHEDCHANGE */
+typedef struct _trace_schedchange {
+ u32 out; /* Outgoing process */
+ u32 in; /* Incoming process */
+ u32 out_state; /* Outgoing process' state */
+} LTT_PACKED_STRUCT trace_schedchange;
+static inline void TRACE_SCHEDCHANGE(task_t * task_out, task_t * task_in)
+{
+ trace_schedchange sched_event;
+
+ sched_event.out = (u32) task_out->pid;
+ sched_event.in = (u32) task_in;
+ sched_event.out_state = (u32) task_out->state;
+
+ trace_event(TRACE_EV_SCHEDCHANGE, &sched_event);
+}
+
+/* TRACE_SOFT_IRQ */
+#define TRACE_EV_SOFT_IRQ_BOTTOM_HALF 1 /* Conventional bottom-half */
+#define TRACE_EV_SOFT_IRQ_SOFT_IRQ 2 /* Real soft-irq */
+#define TRACE_EV_SOFT_IRQ_TASKLET_ACTION 3 /* Tasklet action */
+#define TRACE_EV_SOFT_IRQ_TASKLET_HI_ACTION 4 /* Tasklet hi-action */
+typedef struct _trace_soft_irq {
+ u8 event_sub_id; /* Soft-irq event Id */
+ u32 event_data; /* Data associated with event */
+} LTT_PACKED_STRUCT trace_soft_irq;
+static inline void TRACE_SOFT_IRQ(u8 ev_id, u32 data)
+{
+ trace_soft_irq soft_irq_event;
+
+ soft_irq_event.event_sub_id = ev_id;
+ soft_irq_event.event_data = data;
+
+ trace_event(TRACE_EV_SOFT_IRQ, &soft_irq_event);
+}
+
+/* TRACE_PROCESS */
+#define TRACE_EV_PROCESS_KTHREAD 1 /* Creation of a kernel thread */
+#define TRACE_EV_PROCESS_FORK 2 /* A fork or clone occurred */
+#define TRACE_EV_PROCESS_EXIT 3 /* An exit occurred */
+#define TRACE_EV_PROCESS_WAIT 4 /* A wait occurred */
+#define TRACE_EV_PROCESS_SIGNAL 5 /* A signal has been sent */
+#define TRACE_EV_PROCESS_WAKEUP 6 /* Wake up a process */
+typedef struct _trace_process {
+ u8 event_sub_id; /* Process event ID */
+ u32 event_data1; /* Data associated with event */
+ u32 event_data2;
+} LTT_PACKED_STRUCT trace_process;
+static inline void TRACE_PROCESS(u8 ev_id, u32 data1, u32 data2)
+{
+ trace_process proc_event;
+
+ proc_event.event_sub_id = ev_id;
+ proc_event.event_data1 = data1;
+ proc_event.event_data2 = data2;
+
+ trace_event(TRACE_EV_PROCESS, &proc_event);
+}
+
+/* TRACE_FILE_SYSTEM */
+#define TRACE_EV_FILE_SYSTEM_BUF_WAIT_START 1 /* Starting to wait for a data buffer */
+#define TRACE_EV_FILE_SYSTEM_BUF_WAIT_END 2 /* End to wait for a data buffer */
+#define TRACE_EV_FILE_SYSTEM_EXEC 3 /* An exec occurred */
+#define TRACE_EV_FILE_SYSTEM_OPEN 4 /* An open occurred */
+#define TRACE_EV_FILE_SYSTEM_CLOSE 5 /* A close occurred */
+#define TRACE_EV_FILE_SYSTEM_READ 6 /* A read occurred */
+#define TRACE_EV_FILE_SYSTEM_WRITE 7 /* A write occurred */
+#define TRACE_EV_FILE_SYSTEM_SEEK 8 /* A seek occurred */
+#define TRACE_EV_FILE_SYSTEM_IOCTL 9 /* An ioctl occurred */
+#define TRACE_EV_FILE_SYSTEM_SELECT 10 /* A select occurred */
+#define TRACE_EV_FILE_SYSTEM_POLL 11 /* A poll occurred */
+typedef struct _trace_file_system {
+ u8 event_sub_id; /* File system event ID */
+ u32 event_data1; /* Event data */
+ u32 event_data2; /* Event data 2 */
+ char *file_name; /* Name of file operated on */
+} LTT_PACKED_STRUCT trace_file_system;
+static inline void TRACE_FILE_SYSTEM(u8 ev_id, u32 data1, u32 data2, const unsigned char *file_name)
+{
+ trace_file_system fs_event;
+
+ fs_event.event_sub_id = ev_id;
+ fs_event.event_data1 = data1;
+ fs_event.event_data2 = data2;
+ fs_event.file_name = (char*) file_name;
+
+ trace_event(TRACE_EV_FILE_SYSTEM, &fs_event);
+}
+
+/* TRACE_TIMER */
+#define TRACE_EV_TIMER_EXPIRED 1 /* Timer expired */
+#define TRACE_EV_TIMER_SETITIMER 2 /* Setting itimer occurred */
+#define TRACE_EV_TIMER_SETTIMEOUT 3 /* Setting sched timeout occurred */
+typedef struct _trace_timer {
+ u8 event_sub_id; /* Timer event ID */
+ u8 event_sdata; /* Short data */
+ u32 event_data1; /* Data associated with event */
+ u32 event_data2;
+} LTT_PACKED_STRUCT trace_timer;
+static inline void TRACE_TIMER(u8 ev_id, u8 sdata, u32 data1, u32 data2)
+{
+ trace_timer timer_event;
+
+ timer_event.event_sub_id = ev_id;
+ timer_event.event_sdata = sdata;
+ timer_event.event_data1 = data1;
+ timer_event.event_data2 = data2;
+
+ trace_event(TRACE_EV_TIMER, &timer_event);
+}
+
+/* TRACE_MEMORY */
+#define TRACE_EV_MEMORY_PAGE_ALLOC 1 /* Allocating pages */
+#define TRACE_EV_MEMORY_PAGE_FREE 2 /* Freeing pages */
+#define TRACE_EV_MEMORY_SWAP_IN 3 /* Swapping pages in */
+#define TRACE_EV_MEMORY_SWAP_OUT 4 /* Swapping pages out */
+#define TRACE_EV_MEMORY_PAGE_WAIT_START 5 /* Start to wait for page */
+#define TRACE_EV_MEMORY_PAGE_WAIT_END 6 /* End to wait for page */
+typedef struct _trace_memory {
+ u8 event_sub_id; /* Memory event ID */
+ u32 event_data; /* Data associated with event */
+} LTT_PACKED_STRUCT trace_memory;
+static inline void TRACE_MEMORY(u8 ev_id, u32 data)
+{
+ trace_memory memory_event;
+
+ memory_event.event_sub_id = ev_id;
+ memory_event.event_data = data;
+
+ trace_event(TRACE_EV_MEMORY, &memory_event);
+}
+
+/* TRACE_SOCKET */
+#define TRACE_EV_SOCKET_CALL 1 /* A socket call occurred */
+#define TRACE_EV_SOCKET_CREATE 2 /* A socket has been created */
+#define TRACE_EV_SOCKET_SEND 3 /* Data was sent to a socket */
+#define TRACE_EV_SOCKET_RECEIVE 4 /* Data was read from a socket */
+typedef struct _trace_socket {
+ u8 event_sub_id; /* Socket event ID */
+ u32 event_data1; /* Data associated with event */
+ u32 event_data2; /* Data associated with event */
+} LTT_PACKED_STRUCT trace_socket;
+static inline void TRACE_SOCKET(u8 ev_id, u32 data1, u32 data2)
+{
+ trace_socket socket_event;
+
+ socket_event.event_sub_id = ev_id;
+ socket_event.event_data1 = data1;
+ socket_event.event_data2 = data2;
+
+ trace_event(TRACE_EV_SOCKET, &socket_event);
+}
+
+/* TRACE_IPC */
+#define TRACE_EV_IPC_CALL 1 /* A System V IPC call occurred */
+#define TRACE_EV_IPC_MSG_CREATE 2 /* A message queue has been created */
+#define TRACE_EV_IPC_SEM_CREATE 3 /* A semaphore was created */
+#define TRACE_EV_IPC_SHM_CREATE 4 /* A shared memory segment has been created */
+typedef struct _trace_ipc {
+ u8 event_sub_id; /* IPC event ID */
+ u32 event_data1; /* Data associated with event */
+ u32 event_data2; /* Data associated with event */
+} LTT_PACKED_STRUCT trace_ipc;
+static inline void TRACE_IPC(u8 ev_id, u32 data1, u32 data2)
+{
+ trace_ipc ipc_event;
+
+ ipc_event.event_sub_id = ev_id;
+ ipc_event.event_data1 = data1;
+ ipc_event.event_data2 = data2;
+
+ trace_event(TRACE_EV_IPC, &ipc_event);
+}
+
+/* TRACE_NETWORK */
+#define TRACE_EV_NETWORK_PACKET_IN 1 /* A packet came in */
+#define TRACE_EV_NETWORK_PACKET_OUT 2 /* A packet was sent */
+typedef struct _trace_network {
+ u8 event_sub_id; /* Network event ID */
+ u32 event_data; /* Event data */
+} LTT_PACKED_STRUCT trace_network;
+static inline void TRACE_NETWORK(u8 ev_id, u32 data)
+{
+ trace_network net_event;
+
+ net_event.event_sub_id = ev_id;
+ net_event.event_data = data;
+
+ trace_event(TRACE_EV_NETWORK, &net_event);
+}
+
+/* Custom declared events */
+/* ***WARNING*** These structures should never be used as is, use the provided custom
+ event creation and logging functions. */
+typedef struct _trace_new_event {
+ /* Basics */
+ u32 id; /* Custom event ID */
+ char type[CUSTOM_EVENT_TYPE_STR_LEN]; /* Event type description */
+ char desc[CUSTOM_EVENT_DESC_STR_LEN]; /* Detailed event description */
+
+ /* Custom formatting */
+ u32 format_type; /* Type of formatting */
+ char form[CUSTOM_EVENT_FORM_STR_LEN]; /* Data specific to format */
+} LTT_PACKED_STRUCT trace_new_event;
+typedef struct _trace_custom {
+ u32 id; /* Event ID */
+ u32 data_size; /* Size of data recorded by event */
+ void *data; /* Data recorded by event */
+} LTT_PACKED_STRUCT trace_custom;
+
+/* TRACE_CHANGE_MASK */
+typedef u64 trace_event_mask; /* The event mask type */
+typedef struct _trace_change_mask {
+ trace_event_mask mask; /* Event mask */
+} LTT_PACKED_STRUCT trace_change_mask;
+
+#else /* Kernel is configured without tracing */
+#define TRACE_EVENT(ID, DATA)
+#define TRACE_TRAP_ENTRY(ID, EIP)
+#define TRACE_TRAP_EXIT()
+#define TRACE_IRQ_ENTRY(ID, KERNEL)
+#define TRACE_IRQ_EXIT()
+#define TRACE_SCHEDCHANGE(OUT, IN)
+#define TRACE_SOFT_IRQ(ID, DATA)
+#define TRACE_PROCESS(ID, DATA1, DATA2)
+#define TRACE_FILE_SYSTEM(ID, DATA1, DATA2, FILE_NAME)
+#define TRACE_TIMER(ID, SDATA, DATA1, DATA2)
+#define TRACE_MEMORY(ID, DATA)
+#define TRACE_SOCKET(ID, DATA1, DATA2)
+#define TRACE_IPC(ID, DATA1, DATA2)
+#define TRACE_NETWORK(ID, DATA)
+#define alloc_trace_info(tsk) (0) /* if(0) */
+#define free_trace_info(tsk)
+#endif /* defined(CONFIG_TRACE) || defined(CONFIG_TRACE_MODULE) */
+
+#endif /* _LINUX_TRACE_H */
diff -urpN linux-2.5.38/kernel/Makefile linux-2.5.38-ltt/kernel/Makefile
--- linux-2.5.38/kernel/Makefile Sun Sep 22 00:25:03 2002
+++ linux-2.5.38-ltt/kernel/Makefile Sun Sep 22 00:51:51 2002
@@ -3,7 +3,7 @@
#

export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o \
- printk.o platform.o suspend.o dma.o
+ printk.o platform.o suspend.o dma.o trace.o

obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o time.o softirq.o resource.o \
@@ -17,6 +17,7 @@ obj-$(CONFIG_MODULES) += ksyms.o
obj-$(CONFIG_PM) += pm.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(subst m,y,$(CONFIG_TRACE)) += trace.o

ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -urpN linux-2.5.38/kernel/exit.c linux-2.5.38-ltt/kernel/exit.c
--- linux-2.5.38/kernel/exit.c Sun Sep 22 00:25:05 2002
+++ linux-2.5.38-ltt/kernel/exit.c Sun Sep 22 00:51:51 2002
@@ -20,6 +20,8 @@
#include <linux/binfmts.h>
#include <linux/ptrace.h>

+#include <linux/trace.h>
+
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/mmu_context.h>
@@ -620,6 +622,8 @@ NORET_TYPE void do_exit(long code)
fake_volatile:
acct_process(code);
__exit_mm(tsk);
+
+ free_trace_info(tsk);

sem_exit();
__exit_files(tsk);
diff -urpN linux-2.5.38/kernel/fork.c linux-2.5.38-ltt/kernel/fork.c
--- linux-2.5.38/kernel/fork.c Sun Sep 22 00:25:00 2002
+++ linux-2.5.38-ltt/kernel/fork.c Sun Sep 22 00:51:51 2002
@@ -28,6 +28,7 @@
#include <linux/security.h>
#include <linux/futex.h>
#include <linux/ptrace.h>
+#include <linux/trace.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -718,9 +719,11 @@ static struct task_struct *copy_process(
retval = -ENOMEM;
if (security_ops->task_alloc_security(p))
goto bad_fork_cleanup;
+ if (alloc_trace_info(p))
+ goto bad_fork_cleanup_security;
/* copy all the process information */
if (copy_semundo(clone_flags, p))
- goto bad_fork_cleanup_security;
+ goto bad_fork_cleanup_trace;
if (copy_files(clone_flags, p))
goto bad_fork_cleanup_semundo;
if (copy_fs(clone_flags, p))
@@ -865,6 +868,8 @@ bad_fork_cleanup_files:
exit_files(p); /* blocking */
bad_fork_cleanup_semundo:
exit_semundo(p);
+bad_fork_cleanup_trace:
+ free_trace_info(p);
bad_fork_cleanup_security:
security_ops->task_free_security(p);
bad_fork_cleanup:
diff -urpN linux-2.5.38/kernel/trace.c linux-2.5.38-ltt/kernel/trace.c
--- linux-2.5.38/kernel/trace.c Wed Dec 31 19:00:00 1969
+++ linux-2.5.38-ltt/kernel/trace.c Sun Sep 22 00:51:51 2002
@@ -0,0 +1,712 @@
+/*
+ * linux/kernel/trace.c
+ *
+ * (C) Copyright 1999, 2000, 2001, 2002 - Karim Yaghmour ([email protected])
+ *
+ * This code is distributed under the GPL license
+ *
+ * Tracing management
+ *
+ */
+
+#include <linux/init.h> /* For __init */
+#include <linux/trace.h> /* Tracing definitions */
+#include <linux/errno.h> /* Miscellaneous error codes */
+#include <linux/stddef.h> /* NULL */
+#include <linux/slab.h> /* kmalloc() */
+#include <linux/module.h> /* EXPORT_SYMBOL */
+#include <linux/sched.h> /* pid_t */
+
+/* Global variables */
+unsigned int syscall_entry_trace_active = 0;
+unsigned int syscall_exit_trace_active = 0;
+
+/* Local variables */
+static int tracer_registered = 0; /* Is there a tracer registered */
+struct tracer *tracer = NULL; /* The registered tracer */
+
+/* Registration lock. This lock avoids a race condition in case a tracer is
+removed while an event is being traced. */
+rwlock_t tracer_register_lock = RW_LOCK_UNLOCKED;
+
+/* Trace callback table entry */
+struct trace_callback_table_entry {
+ tracer_call callback;
+
+ struct trace_callback_table_entry *next;
+};
+
+/* Trace callback table */
+struct trace_callback_table_entry trace_callback_table[TRACE_EV_MAX];
+
+/* Custom event description */
+struct custom_event_desc {
+ trace_new_event event;
+
+ pid_t owner_pid;
+
+ struct custom_event_desc *next;
+ struct custom_event_desc *prev;
+};
+
+/* Next event ID to be used */
+int next_event_id;
+
+/* Circular list of custom events */
+struct custom_event_desc custom_events_head;
+struct custom_event_desc *custom_events;
+
+/* Circular list lock. This is a classic lock that provides atomic access
+to the circular list. */
+rwlock_t custom_list_lock = RW_LOCK_UNLOCKED;
+
+/**
+ * register_tracer: - Register the tracer to the kernel
+ * @pm_trace_function: tracing function being registered
+ *
+ * Returns:
+ * 0, all is OK
+ * -EBUSY, there already is a registered tracer
+ * -ENOMEM, couldn't allocate memory
+ */
+int register_tracer(tracer_call pm_trace_function)
+{
+ unsigned long l_flags;
+
+ if (tracer_registered == 1)
+ return -EBUSY;
+
+ /* Allocate memory for the tracer */
+ if ((tracer = (struct tracer *) kmalloc(sizeof(struct tracer), GFP_ATOMIC)) == NULL)
+ return -ENOMEM;
+
+ /* Has the init task been allocated trace info too? */
+ if (init_task.trace_info == NULL && alloc_trace_info(&init_task)) {
+ kfree(tracer);
+ tracer = NULL;
+ return -ENOMEM;
+ }
+ /* Safely register the new tracer */
+ write_lock_irqsave(&tracer_register_lock, l_flags);
+ tracer_registered = 1;
+ tracer->trace = pm_trace_function;
+ write_unlock_irqrestore(&tracer_register_lock, l_flags);
+
+ /* Initialize the tracer settings */
+ tracer->fetch_syscall_eip_use_bounds = 0;
+ tracer->fetch_syscall_eip_use_depth = 0;
+
+ return 0;
+}
+
+/**
+ * unregister_tracer: - Unregister the currently registered tracer
+ * @pm_trace_function: the tracer being unregistered
+ *
+ * Returns:
+ * 0, all is OK
+ * -ENOMEDIUM, there isn't a registered tracer
+ * -ENXIO, unregistering wrong tracer
+ */
+int unregister_tracer(tracer_call pm_trace_function)
+{
+ unsigned long l_flags;
+
+ if (tracer_registered == 0)
+ return -ENOMEDIUM;
+
+ write_lock_irqsave(&tracer_register_lock, l_flags);
+
+ /* Is it the tracer that was registered */
+ if (tracer->trace == pm_trace_function)
+ /* There isn't any tracer in here */
+ tracer_registered = 0;
+ else {
+ write_unlock_irqrestore(&tracer_register_lock, l_flags);
+ return -ENXIO;
+ }
+
+ /* Free the memory used by the tracing structure */
+ kfree(tracer);
+ tracer = NULL;
+
+ write_unlock_irqrestore(&tracer_register_lock, l_flags);
+ /* Release init's trace info only once the right tracer is gone */
+ if (init_task.trace_info != NULL)
+ free_trace_info(&init_task);
+
+ return 0;
+}
+
+/**
+ * trace_set_config: - Set the tracing configuration
+ * @pm_trace_function: the trace function.
+ * @pm_fetch_syscall_use_depth: Use depth to fetch eip
+ * @pm_fetch_syscall_use_bounds: Use bounds to fetch eip
+ * @pm_syscall_eip_depth: Depth to fetch eip
+ * @pm_syscall_lower_bound: Lower bound eip address
+ * @pm_syscall_upper_bound: Upper bound eip address
+ *
+ * Returns:
+ * 0, all is OK
+ * -ENOMEDIUM, there isn't a registered tracer
+ * -ENXIO, wrong tracer
+ * -EINVAL, invalid configuration
+ */
+int trace_set_config(tracer_call pm_trace_function,
+ int pm_fetch_syscall_use_depth,
+ int pm_fetch_syscall_use_bounds,
+ int pm_syscall_eip_depth,
+ void *pm_syscall_lower_bound,
+ void *pm_syscall_upper_bound)
+{
+ if (tracer_registered == 0)
+ return -ENOMEDIUM;
+
+ /* Is this the tracer that is already registered */
+ if (tracer->trace != pm_trace_function)
+ return -ENXIO;
+
+ /* Is this a valid configuration */
+ if ((pm_fetch_syscall_use_depth && pm_fetch_syscall_use_bounds)
+ || (pm_syscall_lower_bound > pm_syscall_upper_bound)
+ || (pm_syscall_eip_depth < 0))
+ return -EINVAL;
+
+ /* Set the configuration */
+ tracer->fetch_syscall_eip_use_depth = pm_fetch_syscall_use_depth;
+ tracer->fetch_syscall_eip_use_bounds = pm_fetch_syscall_use_bounds;
+ tracer->syscall_eip_depth = pm_syscall_eip_depth;
+ tracer->syscall_lower_eip_bound = pm_syscall_lower_bound;
+ tracer->syscall_upper_eip_bound = pm_syscall_upper_bound;
+
+ return 0;
+}
+
+/**
+ * trace_get_config: - Get the tracing configuration
+ * @pm_fetch_syscall_use_depth: Use depth to fetch eip
+ * @pm_fetch_syscall_use_bounds: Use bounds to fetch eip
+ * @pm_syscall_eip_depth: Depth to fetch eip
+ * @pm_syscall_lower_bound: Lower bound eip address
+ * @pm_syscall_upper_bound: Upper bound eip address
+ *
+ * Returns:
+ * 0, all is OK
+ * -ENOMEDIUM, there isn't a registered tracer
+ */
+int trace_get_config(int *pm_fetch_syscall_use_depth,
+ int *pm_fetch_syscall_use_bounds,
+ int *pm_syscall_eip_depth,
+ void **pm_syscall_lower_bound,
+ void **pm_syscall_upper_bound)
+{
+ if (tracer_registered == 0)
+ return -ENOMEDIUM;
+
+ /* Get the configuration */
+ *pm_fetch_syscall_use_depth = tracer->fetch_syscall_eip_use_depth;
+ *pm_fetch_syscall_use_bounds = tracer->fetch_syscall_eip_use_bounds;
+ *pm_syscall_eip_depth = tracer->syscall_eip_depth;
+ *pm_syscall_lower_bound = tracer->syscall_lower_eip_bound;
+ *pm_syscall_upper_bound = tracer->syscall_upper_eip_bound;
+
+ return 0;
+}
+
+/**
+ * trace_register_callback: - Register a callback function.
+ * @pm_trace_function: the callback function.
+ * @pm_event_id: the event ID to be monitored.
+ *
+ * The callback is invoked on the occurrence of the pm_event_id.
+ * Returns:
+ * 0, all is OK
+ * -ENOMEM, unable to allocate memory for callback
+ */
+int trace_register_callback(tracer_call pm_trace_function,
+ u8 pm_event_id)
+{
+ struct trace_callback_table_entry *p_tct_entry;
+
+ /* Search for an empty entry in the callback table */
+ for (p_tct_entry = &(trace_callback_table[pm_event_id - 1]);
+ p_tct_entry->next != NULL;
+ p_tct_entry = p_tct_entry->next);
+
+ /* Allocate a new callback */
+ if ((p_tct_entry->next = kmalloc(sizeof(struct trace_callback_table_entry), GFP_ATOMIC)) == NULL)
+ return -ENOMEM;
+
+ /* Setup the new callback */
+ p_tct_entry->next->callback = pm_trace_function;
+ p_tct_entry->next->next = NULL;
+
+ return 0;
+}
+
+/**
+ * trace_unregister_callback: - UnRegister a callback function.
+ * @pm_trace_function: the callback function.
+ * @pm_event_id: the event ID that had to be monitored.
+ *
+ * Returns:
+ * 0, all is OK
+ * -ENOMEDIUM, no such callback registered
+ */
+int trace_unregister_callback(tracer_call pm_trace_function,
+ u8 pm_event_id)
+{
+ struct trace_callback_table_entry *p_tct_entry;
+ struct trace_callback_table_entry *p_temp_entry;
+
+ /* Search for the callback in the callback table */
+ for (p_tct_entry = &(trace_callback_table[pm_event_id - 1]);
+ ((p_tct_entry->next != NULL) && (p_tct_entry->next->callback != pm_trace_function));
+ p_tct_entry = p_tct_entry->next);
+
+ /* Did we find anything */
+ if (p_tct_entry->next == NULL)
+ return -ENOMEDIUM;
+
+ /* Free the callback entry we found */
+ p_temp_entry = p_tct_entry->next->next;
+ kfree(p_tct_entry->next);
+ p_tct_entry->next = p_temp_entry;
+
+ return 0;
+}
+
+/**
+ * _trace_create_event: - Create a new traceable event type
+ * @pm_event_type: string describing event type
+ * @pm_event_desc: string used for standard formatting
+ * @pm_format_type: type of formatting used to log event data
+ * @pm_format_data: data specific to format
+ * @pm_owner_pid: PID of event's owner (0 if none)
+ *
+ * Returns:
+ * New Event ID if all is OK
+ * -ENOMEM, Unable to allocate new event
+ */
+int _trace_create_event(char *pm_event_type,
+ char *pm_event_desc,
+ int pm_format_type,
+ char *pm_format_data,
+ pid_t pm_owner_pid)
+{
+ trace_new_event *p_event;
+ struct custom_event_desc *p_new_event;
+
+ /* Create event */
+ if ((p_new_event = (struct custom_event_desc *) kmalloc(sizeof(struct custom_event_desc), GFP_ATOMIC)) == NULL)
+ return -ENOMEM;
+ p_event = &(p_new_event->event);
+
+ /* Initialize event properties */
+ p_event->type[0] = '\0';
+ p_event->desc[0] = '\0';
+ p_event->form[0] = '\0';
+
+ /* Set basic event properties */
+ if (pm_event_type != NULL)
+ strncpy(p_event->type, pm_event_type, CUSTOM_EVENT_TYPE_STR_LEN);
+ if (pm_event_desc != NULL)
+ strncpy(p_event->desc, pm_event_desc, CUSTOM_EVENT_DESC_STR_LEN);
+ if (pm_format_data != NULL)
+ strncpy(p_event->form, pm_format_data, CUSTOM_EVENT_FORM_STR_LEN);
+
+ /* Ensure that strings are bound */
+ p_event->type[CUSTOM_EVENT_TYPE_STR_LEN - 1] = '\0';
+ p_event->desc[CUSTOM_EVENT_DESC_STR_LEN - 1] = '\0';
+ p_event->form[CUSTOM_EVENT_FORM_STR_LEN - 1] = '\0';
+
+ /* Set format type */
+ p_event->format_type = pm_format_type;
+
+ /* Set event's owner */
+ p_new_event->owner_pid = pm_owner_pid;
+
+ /* Give the new event a unique event ID and insert it in the event
+ list; both happen under the list lock so that concurrent creators
+ cannot be handed the same ID */
+ write_lock(&custom_list_lock);
+ p_event->id = next_event_id;
+ next_event_id++;
+ p_new_event->next = custom_events;
+ p_new_event->prev = custom_events->prev;
+ custom_events->prev->next = p_new_event;
+ custom_events->prev = p_new_event;
+ write_unlock(&custom_list_lock);
+
+ /* Log the event creation event */
+ trace_event(TRACE_EV_NEW_EVENT, &(p_new_event->event));
+
+ return p_event->id;
+}
+int trace_create_event(char *pm_event_type,
+ char *pm_event_desc,
+ int pm_format_type,
+ char *pm_format_data)
+{
+ return _trace_create_event(pm_event_type, pm_event_desc, pm_format_type, pm_format_data, 0);
+}
+int trace_create_owned_event(char *pm_event_type,
+ char *pm_event_desc,
+ int pm_format_type,
+ char *pm_format_data,
+ pid_t pm_owner_pid)
+{
+ return _trace_create_event(pm_event_type, pm_event_desc, pm_format_type, pm_format_data, pm_owner_pid);
+}
+
+/**
+ * trace_destroy_event: - Destroy a created event type
+ * @pm_event_id: the Id returned by trace_create_event()
+ *
+ * No return values.
+ */
+void trace_destroy_event(int pm_event_id)
+{
+ struct custom_event_desc *p_event_desc;
+
+ write_lock(&custom_list_lock);
+
+ /* Find the event to destroy in the event description list */
+ for (p_event_desc = custom_events->next;
+ p_event_desc != custom_events;
+ p_event_desc = p_event_desc->next)
+ if (p_event_desc->event.id == pm_event_id)
+ break;
+
+ /* If we found something */
+ if (p_event_desc != custom_events) {
+ /* Remove the event from the list */
+ p_event_desc->next->prev = p_event_desc->prev;
+ p_event_desc->prev->next = p_event_desc->next;
+
+ /* Free the memory used by this event */
+ kfree(p_event_desc);
+ }
+ write_unlock(&custom_list_lock);
+}
+
+/**
+ * trace_destroy_owners_events: Destroy an owner's events
+ * @pm_owner_pid: the PID of the owner whose events are to be deleted.
+ *
+ * No return values.
+ */
+void trace_destroy_owners_events(pid_t pm_owner_pid)
+{
+ struct custom_event_desc *p_temp_event;
+ struct custom_event_desc *p_event_desc;
+
+ write_lock(&custom_list_lock);
+
+ /* Start at the first event in the list */
+ p_event_desc = custom_events->next;
+
+ /* Find all events belonging to the PID */
+ while (p_event_desc != custom_events) {
+ p_temp_event = p_event_desc->next;
+
+ /* Does this event belong to the same owner */
+ if (p_event_desc->owner_pid == pm_owner_pid) {
+ /* Remove the event from the list */
+ p_event_desc->next->prev = p_event_desc->prev;
+ p_event_desc->prev->next = p_event_desc->next;
+
+ /* Free the memory used by this event */
+ kfree(p_event_desc);
+ }
+ p_event_desc = p_temp_event;
+ }
+
+ write_unlock(&custom_list_lock);
+}
+
+/**
+ * trace_reregister_custom_events: - Relogs event creations.
+ *
+ * Relog the declarations of custom events. This is necessary to make
+ * sure that even though the event creation might not have taken place
+ * during a previous trace, that all custom events be part of all traces.
+ * Hence, if a custom event occurs during a new trace, we can be sure
+ * that its definition will also be part of the trace.
+ *
+ * No return values.
+ */
+void trace_reregister_custom_events(void)
+{
+ struct custom_event_desc *p_event_desc;
+
+ read_lock(&custom_list_lock);
+
+ /* Log an event creation for every description in the list */
+ for (p_event_desc = custom_events->next;
+ p_event_desc != custom_events;
+ p_event_desc = p_event_desc->next)
+ trace_event(TRACE_EV_NEW_EVENT, &(p_event_desc->event));
+
+ read_unlock(&custom_list_lock);
+}
+
+/**
+ * trace_std_formatted_event: - Trace a formatted event
+ * @pm_event_id: the event Id provided upon creation
+ * @...: printf-like data that will be used to fill the event string.
+ *
+ * Returns:
+ * Trace fct return code if OK.
+ * -ENOMEDIUM, there is no registered tracer or event doesn't exist.
+ */
+int trace_std_formatted_event(int pm_event_id,...)
+{
+ int l_string_size; /* Size of the string output by vsprintf() */
+ char l_string[CUSTOM_EVENT_FINAL_STR_LEN]; /* Final formatted string */
+ va_list l_var_arg_list; /* Variable argument list */
+ trace_custom l_custom;
+ struct custom_event_desc *p_event_desc;
+
+ read_lock(&custom_list_lock);
+
+ /* Find the event description matching this event */
+ for (p_event_desc = custom_events->next;
+ p_event_desc != custom_events;
+ p_event_desc = p_event_desc->next)
+ if (p_event_desc->event.id == pm_event_id)
+ break;
+
+ /* If we haven't found anything */
+ if (p_event_desc == custom_events) {
+ read_unlock(&custom_list_lock);
+
+ return -ENOMEDIUM;
+ }
+ /* Set custom event Id */
+ l_custom.id = pm_event_id;
+
+ /* Initialize variable argument list access */
+ va_start(l_var_arg_list, pm_event_id);
+
+ /* Print the description out to the temporary buffer */
+ l_string_size = vsprintf(l_string, p_event_desc->event.desc, l_var_arg_list);
+
+ read_unlock(&custom_list_lock);
+
+ /* Facilitate return to caller */
+ va_end(l_var_arg_list);
+
+ /* Set the size of the event */
+ l_custom.data_size = (u32) (l_string_size + 1);
+
+ /* Set the pointer to the event data */
+ l_custom.data = l_string;
+
+ /* Log the custom event */
+ return trace_event(TRACE_EV_CUSTOM, &l_custom);
+}
+
+/**
+ * trace_raw_event: - Trace a raw event
+ * @pm_event_id: the event Id provided upon creation
+ * @pm_event_size: the size of the data provided
+ * @pm_event_data: data buffer describing event
+ *
+ * Returns:
+ * Trace fct return code if OK.
+ * -ENOMEDIUM, there is no registered tracer or event doesn't exist.
+ */
+int trace_raw_event(int pm_event_id, int pm_event_size, void *pm_event_data)
+{
+ trace_custom l_custom;
+ struct custom_event_desc *p_event_desc;
+
+ read_lock(&custom_list_lock);
+
+ /* Find the event description matching this event */
+ for (p_event_desc = custom_events->next;
+ p_event_desc != custom_events;
+ p_event_desc = p_event_desc->next)
+ if (p_event_desc->event.id == pm_event_id)
+ break;
+
+ read_unlock(&custom_list_lock);
+
+ /* If we haven't found anything */
+ if (p_event_desc == custom_events)
+ return -ENOMEDIUM;
+
+ /* Set custom event Id */
+ l_custom.id = pm_event_id;
+
+ /* Set the data size */
+ if (pm_event_size <= CUSTOM_EVENT_MAX_SIZE)
+ l_custom.data_size = (u32) pm_event_size;
+ else
+ l_custom.data_size = (u32) CUSTOM_EVENT_MAX_SIZE;
+
+ /* Set the pointer to the event data */
+ l_custom.data = pm_event_data;
+
+ /* Log the custom event */
+ return trace_event(TRACE_EV_CUSTOM, &l_custom);
+}
+
+/**
+ * trace_event: - Trace an event
+ * @pm_event_id: the event's ID (check out trace.h)
+ * @pm_event_struct: the structure describing the event
+ *
+ * Returns:
+ * Trace fct return code if OK.
+ * -ENOMEDIUM, there is no registered tracer
+ * -ENOMEM, couldn't access ltt_info
+ */
+int trace_event(u8 pm_event_id,
+ void *pm_event_struct)
+{
+ int l_ret_value;
+ struct trace_callback_table_entry *p_tct_entry;
+ ltt_info_t *trace_info = (ltt_info_t *)current->trace_info;
+
+ if(trace_info != NULL)
+ atomic_inc(&trace_info->pending_write_count);
+
+ read_lock(&tracer_register_lock);
+
+ /* Is there a tracer registered */
+ if (tracer_registered != 1)
+ l_ret_value = -ENOMEDIUM;
+ else
+ /* Call the tracer */
+ l_ret_value = tracer->trace(pm_event_id, pm_event_struct);
+
+ read_unlock(&tracer_register_lock);
+
+ /* Is this a native event */
+ if (pm_event_id <= TRACE_EV_MAX) {
+ /* Are there any callbacks to call */
+ if (trace_callback_table[pm_event_id - 1].next != NULL) {
+ /* Call all the callbacks linked to this event */
+ for (p_tct_entry = trace_callback_table[pm_event_id - 1].next;
+ p_tct_entry != NULL;
+ p_tct_entry = p_tct_entry->next)
+ p_tct_entry->callback(pm_event_id, pm_event_struct);
+ }
+ }
+
+ if(trace_info != NULL)
+ atomic_dec(&trace_info->pending_write_count);
+
+ return l_ret_value;
+}
+
+/**
+ * alloc_trace_info: - Allocate and zero-initialize a new ltt_info struct.
+ * @p: pointer to task struct.
+ *
+ * Returns:
+ * 0, if allocation was successful.
+ * -ENOMEM, if memory allocation failed.
+ */
+int alloc_trace_info(struct task_struct *p)
+{
+ p->trace_info = kmalloc(sizeof(ltt_info_t), GFP_KERNEL);
+
+ if(p->trace_info == NULL)
+ return -ENOMEM;
+
+ memset(p->trace_info, 0x00, sizeof(ltt_info_t));
+
+ return 0;
+}
+
+/**
+ * free_trace_info: - Clean up the process trace_info.
+ * @p: pointer to task struct.
+ */
+void free_trace_info(struct task_struct *p)
+{
+ ltt_info_t * trace_info = p->trace_info;
+
+ if(trace_info != NULL) {
+ p->trace_info = NULL;
+ kfree(trace_info);
+ }
+}
+
+/**
+ * trace_get_pending_write_count: - Get number of threads with pending writes.
+ *
+ * Returns the number of current threads with trace event writes in
+ * progress.
+ */
+int trace_get_pending_write_count(void)
+{
+ struct task_struct *p = NULL;
+ int total_pending = 0;
+ ltt_info_t * trace_info;
+
+ read_lock(&tasklist_lock);
+ for_each_process(p) {
+ if(p->state != TASK_ZOMBIE && p->state != TASK_STOPPED) {
+ trace_info = (ltt_info_t *)p->trace_info;
+ if(trace_info != NULL)
+ total_pending += atomic_read(&trace_info->pending_write_count);
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ return total_pending;
+}
+
+/**
+ * trace_init: - Initialize trace facility
+ *
+ * Returns:
+ * 0, if everything went ok.
+ */
+static int __init trace_init(void)
+{
+ int i;
+
+ /* Initialize callback table */
+ for (i = 0; i < TRACE_EV_MAX; i++) {
+ trace_callback_table[i].callback = NULL;
+ trace_callback_table[i].next = NULL;
+ }
+
+ /* Initialize next event ID to be used */
+ next_event_id = TRACE_EV_MAX + 1;
+
+ /* Initialize custom events list */
+ custom_events = &custom_events_head;
+ custom_events->next = custom_events;
+ custom_events->prev = custom_events;
+
+ return 0;
+}
+
+module_init(trace_init);
+
+/* Export symbols so that they are visible from outside this file */
+EXPORT_SYMBOL(register_tracer);
+EXPORT_SYMBOL(unregister_tracer);
+EXPORT_SYMBOL(trace_set_config);
+EXPORT_SYMBOL(trace_get_config);
+EXPORT_SYMBOL(trace_register_callback);
+EXPORT_SYMBOL(trace_unregister_callback);
+EXPORT_SYMBOL(trace_create_event);
+EXPORT_SYMBOL(trace_create_owned_event);
+EXPORT_SYMBOL(trace_destroy_event);
+EXPORT_SYMBOL(trace_destroy_owners_events);
+EXPORT_SYMBOL(trace_reregister_custom_events);
+EXPORT_SYMBOL(trace_std_formatted_event);
+EXPORT_SYMBOL(trace_raw_event);
+EXPORT_SYMBOL(trace_event);
+EXPORT_SYMBOL(trace_get_pending_write_count);
+
+EXPORT_SYMBOL(syscall_entry_trace_active);
+EXPORT_SYMBOL(syscall_exit_trace_active);


2002-09-22 10:30:30

by Ingo Molnar

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> D: The core tracing infrastructure serves as the main rallying point for
> D: all the tracing activity in the kernel. (Tracing here isn't meant in
> D: the ptrace sense, but in the sense of recording key kernel events along
> D: with a time-stamp in order to reconstruct the system's behavior post-
> D: mortem.) Whether the trace driver (which buffers the data collected
> D: and provides it to the user-space trace daemon via a char dev) is loaded
> D: or not, the kernel sees a unique tracing function: trace_event().
> D: Basically, this provides a trace driver register/unregister service.
> D: When a trace driver registers, it is forwarded all the events generated
> D: by the kernel. If no trace driver is registered, then the events go
> D: nowhere.

my problem with this stuff is conceptual: it introduces a constant drag on
the kernel sourcecode, while 99% of development will not want to trace,
ever. When i do need tracing occasionally, then i take those 30 minutes to
write up a tracer from pre-existing tracing patches, tailored to specific
problems. Eg. for the scheduler i wrote a simple tracer, but the rate of
trace points that started to make sense for me from a development and
debugging POV also made kernel/sched.c butt-ugly and unmaintainable, so i
always kept the tracer separate and did the hacking in the untainted code.

also, the direction things are taking these days seems to be towards
hardware-assisted tracing. Ie. on the P4 we can recover a trace of EIPs
traversed by the CPU recently. Stuff like this is powerful and can
debug bugs that cannot be debugged via software. I've seen and debugged
dozens of subtle bugs that went away if a software-tracer was enabled, in
fact i debugged at least 3 scheduler bugs which triggered on the removal
of a specific trace point. Sw-tracing, and especially the kind of
intrusive stuff you are doing has its limitations and side-effects. It's
also something that comes from the closed-source world, there kernels must
have tracing APIs because otherwise debugging drivers and subsystems would
be much easier. It does have its uses, no doubt, but usually we apply
things to the kernel that have either a positive, or at worst, a neutral
impact on the kernel proper - kernel tracing clearly is not such a
feature.

so use the power of the GPL-ed kernel and keep your patches separate,
releasing them for specific stable kernel branches (or even development
kernels). If anything then i'm biased towards tracer code, eg. i wrote the
first versions of ktrace (source-unintrusive tracer) and iotrace
(source-intrusive tracer), and i for one do not want to have *any* trace
points in any of the code i hack on a daily basis. This stuff must stay
separate.

Ingo

2002-09-22 10:37:41

by Ingo Molnar

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


> [...] It's also something that comes from the closed-source world, there
> kernels must have tracing APIs because otherwise debugging drivers and
> subsystems would be much easier. [...]
^------harder

2002-09-22 17:21:36

by Roman Zippel

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Hi,

On Sun, 22 Sep 2002, Ingo Molnar wrote:

> Eg. for the scheduler i wrote a simple tracer, but the rate of
> trace points that started to make sense for me from a development and
> debugging POV also made kernel/sched.c butt-ugly and unmaintainable, so i
> always kept the tracer separate and did the hacking in the untainted code.
>
> also, the direction things are taking these days seems to be towards
> hardware-assisted tracing. Ie. on the P4 we can recover a trace of EIPs
> traversed by the CPU recently. Stuff like this is powerful and can
> debug bugs that cannot be debugged via software.

To summarize: You find tracing useful, but software tracing is only of
limited value in the areas you're working in.
What about other developers, who only want to develop a simple driver
without having to understand the whole kernel? Traces still work where
printk() or kgdb don't work. I think it's reasonable to ask a user to
enable tracing and reproduce a problem which you can't reproduce
yourself.

> It does have its uses, no doubt, but usually we apply
> things to the kernel that have either a positive, or at worst, a neutral
> impact on the kernel proper - kernel tracing clearly is not such a
> feature.

Last time I checked it has no impact on the kernel as long as it's not
enabled. Anyway, it would already be very useful to have at least the core
integrated. How many drivers currently define a "dprint"? Some even
implement their own tracing. While debug prints are mostly useful during
early development, they are usually completely useless when you have to
reproduce a problem.

> so use the power of the GPL-ed kernel and keep your patches separate,
> releasing them for specific stable kernel branches (or even development
> kernels).

While I agree that this is an acceptable approach for things like kgdb, I
think it would be very useful to have at least the tracing core in the kernel.

bye, Roman

2002-09-22 18:29:48

by Linus Torvalds

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, Roman Zippel wrote:
>
> To summarize: You find tracing useful, but software tracing is only of
> limited value in areas you're working at.
>
> What about other developers, which only want to develop a simple driver,
> without having to understand the whole kernel? Traces still work where
> printk() or kgdb don't work. I think it's reasonable to ask an user to
> enable tracing and reproduce the problem, which you can't reproduce
> yourself.

That makes adding source bloat ok? I've debugged some drivers with
dprintk() style tracing, and it often makes the code harder to follow,
even if it ends up being compiled away.

From what I've seen from the LTT thing, it's too heavy-weight to be good
for many things (taking SMP-global locks for trace events is _not_ a good
idea if the trace is for doing things like doing performance tracing,
where a tracer that adds synchronization fundamentally _changes_ what is
going on in ways that have nothing to do with timing).

I suspect we'll want to have some form of event tracing eventually, but
I'm personally pretty convinced that it needs to be a per-CPU thing, and
the core mechanism would need to be very lightweight. It's easier to build
up complexity on top of a lightweight interface than it is to make a
lightweight interface out of a heavy one.

Linus

2002-09-22 18:58:21

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


Hello Ingo,

Thanks for taking the time to look at this.

Ingo Molnar wrote:
> my problem with this stuff is conceptual: it introduces a constant drag on
> the kernel sourcecode, while 99% of development will not want to trace,
> ever.

It seems my description was misleading. So here's the skinny: LTT's main
purpose is to enable users and developers to observe the system's dynamics
in order to retrieve exact information regarding the behavior of the
entire system WITHOUT modifying the system's behavior or degrading the
system's performance. In turn, this can be used for identifying
synchronization and performance problems. In doing so, however, the services
implemented by LTT in the kernel happen to be quite useful to many other
kernel subsystems and device drivers since they too occasionally need
tracing.

Here are some actual practical cases:
- How do you debug process synchronization problems in user-space? You
can't use anything that calls on ptrace() since it modifies the
processes' behavior and you can't use printf's for anything the least
bit complicated. The only way you can do this is if you use a tracing
tool such as LTT that enables you to see which services were called,
what happened as a consequence of the processes' requests, and where
the synchronization failed.
- How do you measure the exact time processes spend in kernel space,
identify why they spend it there, which processes they had to wait
for, etc.?
- How do you measure the exact time it takes for an interrupt's
effects to propagate through the entire system? As a simple example, say
you want to follow the exact sequence of processes that run from the
moment you press a key on the keyboard until a character shows up in
the command terminal in X. LTT will show this quite easily.
- Take tools like oprofile and syscalltrack which need the same
information available through the trace points added by LTT. Instead
of diverting the system call table, as they currently do, they could
retrieve the information they need easily from LTT without using
clean interfaces and no table redirection.
- Say you have thousands of servers in an installation and one of them
has some sporadic problem. How are you going to debug this system?
Should the sysadmin be expected to download the kernel's source, patch
it for tracing and restart the system to find the problem? Rather,
wouldn't it be simpler if he could run the tracing in the background
for the time until the problem occurs and then look at the trace to
see what's the real problem before digging deeper?
- etc.

Do I think that the kernel should be instrumented in a way that it is
"a constant drag on the kernel sourcecode"? No. This is why the trace points
inserted really have more to do with the way a classic Unix kernel is
structured (system calls, process switching, forks, execs, ...) than
anything peculiar to Linux's source code. Hence, you could reimplement
the entire Linux source in an entirely different way and you would still
find those very same events taking place. Also, all these trace points
result in zero code if the kernel is compiled without tracing support.
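
The "zero code" point can be illustrated with a conditional macro. What follows is a minimal user-space sketch, not the patch's actual trace.h: DEMO_TRACE, DEMO_TRACE_EVENT, demo_trace_event() and demo_fork() are hypothetical names standing in for a kernel configuration option, a trace point, the tracing core and an instrumented code path.

```c
#include <stddef.h>

/* Counts how many events reached the (hypothetical) tracing core. */
int demo_events_logged;

/* Hypothetical stand-in for a core entry point such as trace_event(). */
int demo_trace_event(int id, const void *data)
{
    (void)id;
    (void)data;
    return ++demo_events_logged;
}

/*
 * DEMO_TRACE plays the role of a kernel config option (an assumption).
 * When it is not defined, the trace point expands to an empty statement
 * and the compiler emits no code for it.
 */
#ifdef DEMO_TRACE
#define DEMO_TRACE_EVENT(id, data) demo_trace_event((id), (data))
#else
#define DEMO_TRACE_EVENT(id, data) do { } while (0)
#endif

/* A code path carrying one trace point, e.g. process creation. */
int demo_fork(int new_pid)
{
    DEMO_TRACE_EVENT(1, &new_pid);
    return new_pid;
}
```

Building the same file with -DDEMO_TRACE turns the call site into a real call without touching demo_fork(); without it, demo_events_logged never changes.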

For adding additional trace points wherever you want, you can use
kernel probes to add them dynamically (kprobes already interfaces with
LTT and is slated to go in 2.5) or you can use the custom event API
available from LTT to create your own events and log them as
part of the trace.
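
As a rough illustration of the custom event interface, here is how a driver might call it. The three trace_* signatures follow the ones in the patch, but their bodies here are user-space stand-ins written only so the sketch compiles and runs outside the kernel; my_driver_*, DEMO_FORMAT_STR, the event ID 100 and the demo_* variables are all hypothetical.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * User-space stand-ins (assumptions) for three functions exported by the
 * patch. They track a single event and keep the last formatted record in
 * a buffer instead of handing it to a trace driver.
 */
static int demo_event_id = -1;        /* ID of the one event we track */
static const char *demo_event_desc;   /* its printf-style description */
char demo_last_record[128];           /* last "logged" record */

int trace_create_event(char *type, char *desc, int format_type,
                       char *format_data)
{
    (void)type;
    (void)format_type;
    (void)format_data;
    demo_event_desc = desc;
    demo_event_id = 100;              /* arbitrary ID for the demo */
    return demo_event_id;
}

int trace_std_formatted_event(int id, ...)
{
    va_list args;

    if (id != demo_event_id || demo_event_desc == NULL)
        return -1;                    /* the patch returns -ENOMEDIUM here */
    va_start(args, id);
    vsnprintf(demo_last_record, sizeof(demo_last_record),
              demo_event_desc, args);
    va_end(args);
    return 0;
}

void trace_destroy_event(int id)
{
    if (id == demo_event_id) {
        demo_event_desc = NULL;
        demo_event_id = -1;
    }
}

/* A hypothetical driver using the custom event interface. */
#define DEMO_FORMAT_STR 1             /* placeholder format type */
static int my_event_id;

int my_driver_init(void)
{
    my_event_id = trace_create_event("my_driver", "irq %d took %d us",
                                     DEMO_FORMAT_STR, NULL);
    return my_event_id < 0 ? my_event_id : 0;
}

int my_driver_irq(int irq, int usecs)
{
    return trace_std_formatted_event(my_event_id, irq, usecs);
}

void my_driver_exit(void)
{
    trace_destroy_event(my_event_id);
}
```

The driver creates its event once, logs against the returned ID on each interrupt, and destroys the event on exit. Whether a trace driver is loaded only changes the return value of the logging call, not the driver's code.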

In brief, no, LTT isn't a kernel debugging tool, but yes, its integration
into the kernel would certainly help subsystems that do need this sort
of service.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 19:10:47

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


Linus Torvalds wrote:
> On Sun, 22 Sep 2002, Roman Zippel wrote:
> > What about other developers, which only want to develop a simple driver,
> > without having to understand the whole kernel? Traces still work where
> > printk() or kgdb don't work. I think it's reasonable to ask an user to
> > enable tracing and reproduce the problem, which you can't reproduce
> > yourself.
>
> That makes adding source bloat ok? I've debugged some drivers with
> dprintk() style tracing, and it often makes the code harder to follow,
> even if it ends up being compiled away.

Source bloat is certainly not desirable, as I said in my reply to Ingo.
What is desirable, however, is to have a uniform tracing mechanism
replace the ad-hoc tracing mechanisms already implemented in many drivers
and subsystems.

> From what I've seen from the LTT thing, it's too heavy-weight to be good
> for many things (taking SMP-global locks for trace events is _not_ a good
> idea if the trace is for doing things like doing performance tracing,
> where a tracer that adds synchronization fundamentally _changes_ what is
> going on in ways that have nothing to do with timing).

Sure, but there are no locks anymore in the tracer with the addition of
the lockless code which is part of the set of patches I just sent. So yes,
this was a problem with LTT, but it isn't anymore.

The lockless scheme is pretty simple: instead of using a spinlock to
ensure atomic allocation of buffer space, the code does an allocate-and-test
routine where it tries to allocate space in the buffer and tests if it
succeeded in doing so. If so, then it goes on to write the data in the
event buffer, otherwise it tries again. In most cases, it does this loop
only once, and even in most of the worst cases only twice.
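
An allocate-and-test loop of this kind can be sketched in user space with C11 atomics. This is only an illustration of the general technique, not the code from the lockless patch: struct trace_buf, trace_reserve(), trace_write() and BUF_SIZE are hypothetical, and the sketch ignores buffer switching and timestamps.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096u                /* hypothetical buffer size */

struct trace_buf {
    _Atomic size_t head;              /* offset of the next free byte */
    unsigned char data[BUF_SIZE];
};

/*
 * Try to reserve len bytes. Returns the start offset of the reserved
 * region, or (size_t)-1 if the buffer cannot hold the event.
 */
size_t trace_reserve(struct trace_buf *b, size_t len)
{
    size_t old = atomic_load(&b->head);

    for (;;) {
        if (len > BUF_SIZE - old)
            return (size_t)-1;
        /*
         * Succeeds only if head is still 'old'. On failure 'old' is
         * refreshed with the current head and the loop tries again.
         */
        if (atomic_compare_exchange_weak(&b->head, &old, old + len))
            return old;
    }
}

/* Reserve space, then copy the event into the region we now own. */
int trace_write(struct trace_buf *b, const void *event, size_t len)
{
    size_t off = trace_reserve(b, len);

    if (off == (size_t)-1)
        return -1;
    memcpy(b->data + off, event, len);
    return 0;
}
```

The loop repeats only when another writer moved head between the load and the compare-and-exchange, which is why a single pass is the common case. Writers still contend on the one head variable, though, so the sketch removes the lock without removing the shared cache line.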

> I suspect we'll want to have some form of event tracing eventually, but
> I'm personally pretty convinced that it needs to be a per-CPU thing, and
> the core mechanism would need to be very lightweight. It's easier to build
> up complexity on top of a lightweight interface than it is to make a
> lightweight interface out of a heavy one.

I fully agree with the requirements you list. LTT is already lightweight
in terms of its performance impact on the system and it doesn't use any
form of locking anymore. The only remaining issue is the use of per-CPU
buffers and this is currently being worked on by the team at IBM that
had already developed the lockless scheme and will be ready shortly.
However, there clearly is no more lock contention.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 19:22:20

by Andi Kleen

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Linus Torvalds <[email protected]> writes:
>
> I suspect we'll want to have some form of event tracing eventually, but
> I'm personally pretty convinced that it needs to be a per-CPU thing, and
> the core mechanism would need to be very lightweight. It's easier to build
> up complexity on top of a lightweight interface than it is to make a
> lightweight interface out of a heavy one.

There is an old patch around from SGI that does exactly this. It is a
very lightweight binary value tracer that has per CPU buffers. It
traces using macros that you can easily add. It's called ktrace (not
to be confused with Ingo's ktrace). I've been porting it for some time
for my own tracing needs (adding tracing macros as needed but never submitting
them). If you're interested I can submit it for 2.5 (without any hooks, people
should just add them as needed and then remove them again)
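
To give an idea of what a macro-based binary value tracer with per-CPU buffers can look like, here is a small user-space sketch. It is not the SGI ktrace code: DEMO_TRACE, struct demo_ring, demo_rings, DEMO_NR_CPUS and DEMO_RING_SIZE are all hypothetical, and the CPU number is passed in explicitly because the sketch has no way of asking which CPU it runs on.

```c
#include <stdint.h>

#define DEMO_NR_CPUS   4              /* hypothetical CPU count */
#define DEMO_RING_SIZE 8              /* entries per CPU, power of two */

struct demo_entry {
    uint32_t tag;                     /* which trace point fired */
    uint64_t value;                   /* one binary value per event */
};

struct demo_ring {
    unsigned int next;                /* total entries written so far */
    struct demo_entry e[DEMO_RING_SIZE];
};

/* One ring per CPU: each CPU only ever writes to its own ring. */
struct demo_ring demo_rings[DEMO_NR_CPUS];

/*
 * Record (tag, value) in the ring of the given CPU, overwriting the
 * oldest entry once the ring is full. No lock is taken because no
 * other CPU touches this ring.
 */
#define DEMO_TRACE(cpu, t, v) do {                                      \
    struct demo_ring *r_ = &demo_rings[(cpu)];                          \
    struct demo_entry *e_ = &r_->e[r_->next++ & (DEMO_RING_SIZE - 1)];  \
    e_->tag = (t);                                                      \
    e_->value = (v);                                                    \
} while (0)
```

A trace point is then a single line such as DEMO_TRACE(cpu, 7, jiffies_value) that can be added and removed as needed. A real kernel version would also have to deal with interrupts arriving on the same CPU in the middle of a write, which this sketch does not.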

-Andi

2002-09-22 19:25:01

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


err...

Karim Yaghmour wrote:
> retrieve the information they need easily from LTT without using
^^^^^^^ => by
> clean interfaces and no table redirection.

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 19:27:41

by Ingo Molnar

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> Source bloat is certainly not desirable, as I said to my reply to Ingo.

(then how should i interpret 90% of the patches you sent to lkml today?)

> What is desirable, however, is to have a uniform tracing mechanism
> replace the ad-hoc tracing mechanisms already implemented in many
> drivers and subsystems.

exactly what is the problem with keeping intrusive debugging patches
separate, just like all the other ones are kept separate? It's not like
this came out of the blue, per-CPU trace buffers (and other tracers) were
done years ago for Linux.

> The lockless scheme is pretty simple, instead of using a spinlock to
> ensure atomic allocation of buffer space, the code does an
> allocate-and-test routine where it tries to allocate space in the buffer
> and tests if it succeeded in doing so. If so, then it goes on to write
> the data in the event buffer, otherwise it tries again. In most cases,
> it does this loop only once and in most worst cases twice.

(this is in essence a moving spinlock at the tail of the trace buffer -
same problem.)

Ingo

2002-09-22 21:15:50

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

There is no drag on the kernel. The concept that we are working on is
consistent with your recommendations below: place only an efficient
tracing infrastructure in the kernel and keep trace points as patches. This
adds no overhead to the kernel, allows your suggested patches to use a
standard efficient infrastructure, and reduces replicated work from one
specific problem to the next.

> my problem with this stuff is conceptual: it introduces a constant drag on
> the kernel sourcecode, while 99% of development will not want to trace,

If you care about performance you will want to trace. On two previous
kernels I have worked on I've heard this comment. Once the infrastructure
was in it was used and appreciated.


Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

----

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, Karim Yaghmour wrote:
>
> > D: The core tracing infrastructure serves as the main rallying point for
> > D: all the tracing activity in the kernel. (Tracing here isn't meant in
> > D: the ptrace sense, but in the sense of recording key kernel events along
> > D: with a time-stamp in order to reconstruct the system's behavior post-
> > D: mortem.) Whether the trace driver (which buffers the data collected
> > D: and provides it to the user-space trace daemon via a char dev) is loaded
> > D: or not, the kernel sees a unique tracing function: trace_event().
> > D: Basically, this provides a trace driver register/unregister service.
> > D: When a trace driver registers, it is forwarded all the events generated
> > D: by the kernel. If no trace driver is registered, then the events go
> > D: nowhere.
>
> my problem with this stuff is conceptual: it introduces a constant drag on
> the kernel sourcecode, while 99% of development will not want to trace,
> ever. When i do need tracing occasionally, then i take those 30 minutes to
> write up a tracer from pre-existing tracing patches, tailored to specific
> problems. Eg. for the scheduler i wrote a simple tracer, but the rate of
> trace points that started to make sense for me from a development and
> debugging POV also made kernel/sched.c butt-ugly and unmaintainable, so i
> always kept the tracer separate and did the hacking in the untainted code.
>
> also, the direction things are taking these days seems to be towards
> hardware-assisted tracing. Ie. on the P4 we can recover a trace of EIPs
> traversed by the CPU recently. Stuff like this is powerful and can
> debug bugs that cannot be debugged via software. I've seen and debugged
> dozens of subtle bugs that went away if a software-tracer was enabled, in
> fact i debugged at least 3 scheduler bugs which triggered on the removal
> of a specific trace point. Sw-tracing, and especially the kind of
> intrusive stuff you are doing has its limitations and side-effects. It's
> also something that comes from the closed-source world, there kernels must
> have tracing APIs because otherwise debugging drivers and subsystems would
> be much easier. It does have its uses, no doubt, but usually we apply
> things to the kernel that have either a positive, or at worst, a neutral
> impact on the kernel proper - kernel tracing clearly is not such a
> feature.
>
> so use the power of the GPL-ed kernel and keep your patches separate,
> releasing them for specific stable kernel branches (or even development
> kernels). If anything then i'm biased towards tracer code, eg. i wrote the
> first versions of ktrace (source-unintrusive tracer) and iotrace
> (source-intrusive tracer), and i for one do not want to have *any* trace
> points in any of the code i hack on a daily basis. This stuff must stay
> separate.
>
> Ingo

2002-09-22 21:25:50

by Ingo Molnar

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, bob wrote:

> There is no drag on the kernel. The concept that we are working on is
> consistent with your below recommendations. Only place in the kernel an
> efficient tracing infrastructure, keep trace points as patches. [...]

well, this is not the impression i got from the patches posted to lkml ...

> [...] This adds no overhead to kernel, allows your suggested patches to
> use a standard efficient infrastructure, reduces replicated work from
> specific problem to specific problem.

so why not keep the core parts as separate patches as well? If it does
nothing then i dont see why it should get into the kernel proper.

> > my problem with this stuff is conceptual: it introduces a constant drag on
> > the kernel sourcecode, while 99% of development will not want to trace,
>
> If you care about performance you will want to trace. On two previous
> kernels I have worked on I've heard this comment. Once the
> infrastructure was in it was used and appreciated.

(i think you have not read what i have written. I use tracing pretty
frequently, and no, i dont need tracing in the kernel, during development
i can apply patches to kernel trees just fine.)

Ingo

2002-09-22 21:24:23

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Linus Torvalds writes:
>
> From what I've seen from the LTT thing, it's too heavy-weight to be good

Not true anymore.

> I suspect we'll want to have some form of event tracing eventually, but
> I'm personally pretty convinced that it needs to be a per-CPU thing, and
> the core mechanism would need to be very lightweight. It's easier to build
> up complexity on top of a lightweight interface than it is to make a
> lightweight interface out of a heavy one.

We have removed locks (code now atomically reserves space in the trace
buffer), significantly reduced the cost of taking timestamps by using the
real-time clock, and are in the process of implementing per-CPU buffers.
As per previous email, the intent is to get only the core infrastructure
into the kernel and keep trace points as patches. Some of the work going
into LTT is modeled after the tracing infrastructure in K42, which is
extremely lightweight, lock-free, and designed for multiprocessors.


Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 22:01:38

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


Ingo Molnar wrote:
> On Sun, 22 Sep 2002, Karim Yaghmour wrote:
>
> > Source bloat is certainly not desirable, as I said in my reply to Ingo.
>
> (then how should i interpret 90% of the patches you sent to lkml today?)

Please refer to my other email where I explain why tracing is essential
to the day-to-day usage of any kernel. I don't think this is bloat and
the distributions which already include LTT certainly think it's quite
useful to their clients. In fact, most embedded distros actually make
the inclusion of LTT one of the main features with which they sell
Linux to their clients.

> > What is desirable, however, is to have a uniform tracing mechanism
> > replace the ad-hoc tracing mechanisms already implemented in many
> > drivers and subsystems.
>
> exactly what is the problem with keeping intrusive debugging patches
> separate, just like all the other ones are kept separate?

Again, this is not a kernel debugging patch. As you yourself have stated
elsewhere, instrumenting a kernel for it to yield useful information to
a kernel developer makes the code "butt-ugly" (your words). The trace
statements currently inserted by LTT are clearly useless for any kernel
debugging whatsoever. The trace statements inserted are only useful for
the day-to-day tracing needs of any Linux user.

> It's not like
> this came out of the blue, per-CPU trace buffers (and other tracers) were
> done years ago for Linux.

I don't remember claiming to have implemented the first tracer. However,
I have been working very hard in putting together a rock solid tracer
which includes the best ideas of all existing tracers and offers a wide
range of tools for _any_ user to use. The decision of the attendees of
the RAS BoF at the OLS to standardize on LTT clearly goes in this direction.

Again, please understand that LTT is not a kernel debugger. Any look at
the set of trace statements inserted by LTT will reveal their low value
for kernel developers. These trace statements are meant for providing
users with in-depth and complete understanding of the system's dynamics.

> > The lockless scheme is pretty simple, instead of using a spinlock to
> > ensure atomic allocation of buffer space, the code does an
> > allocate-and-test routine where it tries to allocate space in the buffer
> > and tests if it succeeded in doing so. If so, then it goes on to write
> > the data in the event buffer, otherwise it tries again. In most cases,
> > it does this loop only once and in most worst cases twice.
>
> (this is in essence a moving spinlock at the tail of the trace buffer -
> same problem.)

Hmm. No offense, but I think you ought to take a better look at the code.

Because events can occur at the interrupt level and on normal non-interrupt
path, any tracer that has to record a broad range of event types needs
to use spin_lock_irqsave(), which is what LTT's tracer used. Now, last
I checked, spin_lock_irqsave() calls local_irq_save() which, on an
i386 for example, is defined as follows:
#define local_irq_save(x) __asm__ __volatile__("pushfl ; popl %0 ; cli":"=g" (x): /* no input */ :"memory")

There's a cli() in there. No cli's in the lockless code. Among other
things, this makes the lockless code quite different from any usual
Linux spinlock.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 22:11:17

by Ingo Molnar

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> > (this is in essence a moving spinlock at the tail of the trace buffer -
> > same problem.)
>
> Hmm. No offense, but I think you ought to take a better look at the
> code.

i have, and i see stuff like this:

+ TRACE_PROCESS(TRACE_EV_PROCESS_WAKEUP, p->pid, p->state);

+static inline void TRACE_PROCESS(u8 ev_id, u32 data1, u32 data2)
+{
+ trace_process proc_event;
+
+ proc_event.event_sub_id = ev_id;
+ proc_event.event_data1 = data1;
+ proc_event.event_data2 = data2;
+
+ trace_event(TRACE_EV_PROCESS, &proc_event);
+}

where trace_event() is defined as:

+int trace_event(u8 pm_event_id,
+ void *pm_event_struct)
[...]
+ read_lock(&tracer_register_lock);

ie. it's using a global spinlock. (sure, it can be made lockless, as other
tracers have done it.)

Ingo

2002-09-22 22:32:53

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > There is no drag on the kernel. The concept that we are working on is
> > consistent with your below recommendations. Only place in the kernel an
> > efficient tracing infrastructure, keep trace points as patches. [...]
>
> well, this is not the impression i got from the patches posted to lkml ...

The intent is to split LTT, get the infrastructure into the kernel, have
the trace points as patches.

>
> > [...] This adds no overhead to kernel, allows your suggested patches to
> > use a standard efficient infrastructure, reduces replicated work from
> > specific problem to specific problem.
>
> so why not keep the core parts as separate patches as well? If it does
> nothing then i dont see why it should get into the kernel proper.

:-) It does do something. It provides a common infrastructure for anyone
wanting to use trace points. What I meant is that when not enabled it
doesn't cause any overhead.

As a performance tool it will be used not only by kernel developers but by
people writing device drivers, sub-systems, and apps. Having an accepted
infrastructure in the kernel allows a common vocabulary to be used across
kernel, devices, sub-systems, and applications. It allows sub-system
developers who know their system best to put in the events and developers
of other sub-systems or apps to use those events to understand what is
going on. If the infrastructure is in the kernel, users could dynamically
enable it and feed back performance results to the kernel developers.

In short this will provide a common way to discuss performance issues
across kernel, sub-system, and application space.

> > > my problem with this stuff is conceptual: it introduces a constant drag on
> > > the kernel sourcecode, while 99% of development will not want to trace,
> >
> > If you care about performance you will want to trace. On two previous
> > kernels I have worked on I've heard this comment. Once the
> > infrastructure was in it was used and appreciated.
>
> (i think you have not read what i have written. I use tracing pretty
> frequently, and no, i dont need tracing in the kernel, during development
> i can apply patches to kernel trees just fine.)

Good - I'm glad you find tracing useful - sorry if I reacted to the
statement that most of the time it's not needed. As above, it should be in
the kernel proper not as just a patch.

> > The lockless scheme is pretty simple, instead of using a spinlock to
> > ensure atomic allocation of buffer space, the code does an
> > allocate-and-test routine where it tries to allocate space in the buffer
>
> (this is in essence a moving spinlock at the tail of the trace buffer -
> same problem.)

No, we use lock-free atomic operations to reserve a place in the buffer to
write the data. What happens is you attempt to atomically move the current
index pointer forward. If you succeed then you have bought yourself that
many data words in the queue. In the unlikely event you happened to
collide with someone you perform the atomic operation again.
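
For illustration only, the reservation step described above could be
sketched in user-space C11 roughly as follows. This is not the LTT or K42
code: trace_buf, next_word, reserve_words() and log_event() are
hypothetical names, the buffer never wraps, and nothing here deals with a
reader draining the buffer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_WORDS 1024u               /* hypothetical buffer size, in 32-bit words */

static uint32_t trace_buf[BUF_WORDS];
static _Atomic uint32_t next_word;    /* index of the next free word */

/* Reserve nwords contiguous words. Returns the start index of the
 * reservation, or UINT32_MAX if the buffer cannot hold them. */
uint32_t reserve_words(uint32_t nwords)
{
    uint32_t old = atomic_load(&next_word);

    for (;;) {
        if (nwords > BUF_WORDS - old)
            return UINT32_MAX;
        /* If another writer moved the index first, the exchange fails,
         * old is refreshed with the current index, and we try again. */
        if (atomic_compare_exchange_weak(&next_word, &old, old + nwords))
            return old;
    }
}

/* Log one event: reserve space first, then fill in the reserved words.
 * No lock is held and interrupts are not touched while copying. */
int log_event(const uint32_t *data, uint32_t nwords)
{
    uint32_t start;

    assert(data != NULL);
    start = reserve_words(nwords);
    if (start == UINT32_MAX)
        return -1;
    memcpy(&trace_buf[start], data, nwords * sizeof(uint32_t));
    return 0;
}
```

Every writer still updates the single next_word variable, so this sketch
removes the lock but not the sharing of that one memory location.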


Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 22:33:11

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


Ingo Molnar wrote:
> +int trace_event(u8 pm_event_id,
> + void *pm_event_struct)
> [...]
> + read_lock(&tracer_register_lock);
>
> ie. it's using a global spinlock. (sure, it can be made lockless, as other
> tracers have done it.)

It is, but this is separate from the trace driver. This global
spinlock is only used to avoid a race condition in the registration/
unregistration of the tracing function with the trace infrastructure.
The only case where the lock is taken in write mode is when a
tracer is being registered or unregistered (register_tracer() and
unregister_tracer()). Since tracing itself is NOT registration/
unregistration intensive, there is no contention over this lock.

Any trace infrastructure that allows dynamic registration of tracers
needs this sort of lock in order to make sure that the function pointer
it has for the tracer is actually valid when it calls it. Of course if
the tracer itself was directly called from the inline trace statements,
then this would be a different story, but then the tracer has to be
in there all the time (which is exactly what happens with most, if
not all, the tracers already included in the kernel).

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 22:36:39

by Ingo Molnar

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


a number of suggestions to make the tracer truly lightweight:

- remove the 'event registration' and callback stuff. It just introduces
unnecessary runtime overhead. Use an include file as a registry of
events instead. This will simplify things greatly. Why do you need a
table of callbacks registered to an event? Nothing in your patches
actually uses it ... Just use one tracing function that copies the
arguments into a per-CPU ringbuffer. It's really just a few lines.

- do not disable interrupts when writing events. I used this method in
a tracer and it works well. Just get an irq-safe index to the trace
ring-buffer and fill it in. [eg. on x86 incl can be used for this
purpose.]

- get rid of p->trace_info and the pending_write_count - it's completely
unnecessary.

- drivers/trace/tracer.c is a complex mess of strange coding style and
#ifdefs, it's not proper Linux kernel code.

it's possible to have lightweight tracing - this patch clearly is not
achieving that goal yet.

Ingo

2002-09-22 22:41:43

by Ingo Molnar

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, bob wrote:

> > (this is in essence a moving spinlock at the tail of the trace buffer -
> > same problem.)
>
> No, we use lock-free atomic operations to reserve a place in the buffer
> to write the data. What happens is you attempt to atomically move the
> current index pointer forward. If you succeed then you have bought
> yourself that many data words in the queue. In the unlikely event you
> happened to collide with someone you perform the atomic operation again.

you have not understood what i have written.

what you do has the same (bad) effect as a global spinlock, it in essence
has the same cache effect as a constantly moving spinlock at the 'end' of
the trace buffer. Cachelines bounce between CPUs. Only completely per-CPU
trace buffers solve this problem.
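
A minimal user-space sketch of the per-CPU layout being argued for here,
under stated assumptions: NR_CPUS, CACHE_LINE_SIZE, NR_ENTRIES, struct
cpu_trace and trace_on_cpu() are hypothetical names, a 64-byte cache line
is assumed, and the caller is assumed to pass the id of the CPU it is
running on. Each CPU gets its own index and ring, aligned so that two
CPUs never write to the same cache line.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS          4      /* hypothetical CPU count */
#define CACHE_LINE_SIZE  64     /* assumed cache line size, in bytes */
#define NR_ENTRIES       256    /* entries per CPU, must be a power of two */

struct trace_entry {
    uint32_t event;
    uint32_t data;
};

/* Aligning the first member aligns the whole struct, and sizeof() is
 * padded to a multiple of the alignment, so array elements never share
 * a cache line. */
struct cpu_trace {
    _Alignas(CACHE_LINE_SIZE) uint32_t head;
    struct trace_entry ring[NR_ENTRIES];
};

static struct cpu_trace per_cpu_trace[NR_CPUS];

/* Record an event in the ring owned by the given CPU. Only that CPU's
 * head and ring are touched. */
void trace_on_cpu(unsigned cpu, uint32_t event, uint32_t data)
{
    struct cpu_trace *ct;
    struct trace_entry *e;

    assert(cpu < NR_CPUS);
    ct = &per_cpu_trace[cpu];
    e = &ct->ring[ct->head++ & (NR_ENTRIES - 1)];
    e->event = event;
    e->data = data;
}
```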

Ingo

2002-09-22 22:49:42

by Ingo Molnar

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, bob wrote:

> [...] On a technical note: a cache-line ping-ponging is bad - a global
> spinlock is horrendous. They're different - the lock-less MP scheme gets
> rid of them both.

(on the contrary - a global spinlock is bad for exactly that reason,
because it causes a cacheline ping-pong. So if two CPUs are trying to
write trace events at once, you'll get the same effect as if they were
using a global spinlock.)

Ingo

2002-09-22 22:47:27

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > > (this is in essence a moving spinlock at the tail of the trace buffer -
> > > same problem.)
> >
> > No, we use lock-free atomic operations to reserve a place in the buffer
> > to write the data. What happens is you attempt to atomically move the
> > current index pointer forward. If you succeed then you have bought
> > yourself that many data words in the queue. In the unlikely event you
> > happened to collide with someone you perform the atomic operation again.
>
> you have not understood what i have written.
>
> what you do has the same (bad) effect as a global spinlock, it in essence
> has the same cache effect as a constantly moving spinlock at the 'end' of
> the trace buffer. Cachelines bounce between CPUs. Only completely per-CPU
> trace buffers solve this problem.

As per previous email, we are moving to a per-CPU scheme. On a technical
note: a cache-line ping-ponging is bad - a global spinlock is horrendous.
They're different - the lock-less MP scheme gets rid of them both.

> - do not disable interrupts when writing events. I used this method in
> a tracer and it works well. Just get an irq-safe index to the trace
> ring-buffer and fill it in. [eg. on x86 incl can be used for this
> purpose.]

The lock-less scheme does not disable interrupts - we've eliminated that.

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 22:45:47

by Ingo Molnar

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> Ingo Molnar wrote:
> > +int trace_event(u8 pm_event_id,
> > + void *pm_event_struct)
> > [...]
> > + read_lock(&tracer_register_lock);
> >
> > ie. it's using a global spinlock. (sure, it can be made lockless, as other
> > tracers have done it.)
>
> It is, but this is separate from the trace driver. [...]

it does not matter, it's called for every event.

> [...] This global spinlock is only used to avoid a race condition in the
> registration/ unregistration of the tracing function with the trace
> infrastructure.

(here you make the incorrect assumption that read-locking a rwlock is a
lightweight operation.)

Ingo

2002-09-22 22:58:17

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

> > [...] On a technical note: a cache-line ping-ponging is bad - a global
> > spinlock is horrendous. They're different - the lock-less MP scheme gets
> > rid of them both.
>
> (on the contrary - a global spinlock is bad for exactly that reason,
> because it causes a cacheline ping-pong. So if two CPUs are trying to
> write trace events at once, you'll get the same effect as if they were
> using a global spinlock.)
>
> Ingo

Just want to be clear that we are going to a per-CPU buffer scheme.

However, for sake of argument, the above is still not true. A global lock
has a different (worse) performance problem than the lock-free atomic
operation even given a global queue. The difference is 1) the Linux global
lock is very expensive and interacts with potential other processes, and 2)
you have to hold the lock for the entire duration of logging the event;
with the atomic operation you are finished once you've reserved your space.
If you didn't use the expensive Linux global lock and just a global lock,
you could be interrupted in the middle of holding the lock and performance
would fall off the map.

-bob


Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-22 23:06:17

by Ingo Molnar

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, bob wrote:

> However, for sake of argument, the above is still not true. A global
> lock has a different (worse) performance problem than the lock-free
> atomic operation even given a global queue. The difference is 1) the
> Linux global lock is very expensive [... and interacts with potential
> other processes, [...]

huh? what is 'the Linux global lock'?

> [...] and 2) you have to hold the lock for the entire duration of
> logging the event; with the atomic operation you are finished once
> you've reserved your space. [...]

you dont have to hold the lock for the duration of saving the event, the
lock could as well protect a 'current entry' index. (Not that those 2-3
cycles saving off the event into a single cacheline counts that much ...)

the tail-atomic method is precisely equivalent to a global spinlock. The
tail of a global event buffer acts precisely as a global spinlock: if one
CPU writes to it in a stream then it performs okay, if two CPUs trace in
parallel then it causes cachelines to bounce like crazy.

> [...] If you didn't use the expensive Linux global lock and just a
> global lock, you could be interrupted in the middle of holding the lock
> and performance would fall off the map.

again, what 'expensive Linux global lock' are you talking about?

Ingo

2002-09-22 23:19:17

by Ingo Molnar

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


this is what a trace point should do, at most:

--------------------->
task_t *tracer_task;

int curr_idx[NR_CPUS];
int curr_pending[NR_CPUS];

struct trace_event **trace_ring;

void trace(event, data1, data2, data3)
{
int cpu = smp_processor_id();
int idx, pending, *curr = curr_idx + cpu;
struct trace_event *t;
unsigned long flags;

if (!event_wanted(current, event, data1, data2, data3))
return;

local_irq_save(flags);

idx = ++curr_idx[cpu] & (NR_TRACE_ENTRIES - 1);
pending = ++curr_pending[cpu];

t = trace_ring[cpu] + idx;

t->event = event;
rdtscll(t->timestamp);
t->data1 = data1;
t->data2 = data2;
t->data3 = data3;

if (pending == TRACE_LOW_WATERMARK && tracer_task)
wake_up_process(tracer_task);

local_irq_restore(flags);
}

this should cover most of what's needed. The event_wanted() filter
function should be made as fast as possible. Note that the irq-disabled
section is not strictly needed but nice and also makes it work on the
preemptible kernel. (It's not a big issue at all to run these few
instructions with irqs disabled.)

[there are also other details like putting curr_idx and curr_pending
into the per-cpu area and similar stuff.]

Ingo

2002-09-22 23:27:17

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


Thanks for the recommendations, we will certainly direct the development
to address these issues.

Ingo Molnar wrote:
> - remove the 'event registration' and callback stuff. It just introduces
> unnecessary runtime overhead. Use an include file as a registry of
> events instead. This will simplify things greatly.

OK, basically then all the trace points call the trace driver directly.

> Why do you need a
> table of callbacks registered to an event? Nothing in your patches
> actually uses it ...

True, nothing in the patches actually uses it at this point. This was
added with the mindset of letting other tools than LTT use the trace
points already provided by LTT.

> Just use one tracing function that copies the
> arguments into a per-CPU ringbuffer. It's really just a few lines.

Sure, the writing of data itself is trivial. The reason you find the
driver to be rather full is because of its need to do a couple of
extra operations:
- Get timestamp and use delta since beginning of buffer to reduce
trace size. (i.e. because of the rate at which traces are filled, it's
essential to be able to cut down on the data written as much as possible).
- Filter events according to event mask.
- Copy extra data in case of some events (e.g. filenames). (We're working on
ways to simplify this).
- Synchronize with trace daemon to save trace data. (A single per-CPU
circular buffer may be useful when doing kernel development, but user
tracing often requires N buffers).
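
To make the first item concrete, here is a small sketch of storing a
timestamp as a delta from the start of the buffer. It is an illustration
under assumptions, not the LTT record format: struct trace_buf_hdr,
struct trace_rec, stamp_event() and event_time() are hypothetical names,
and a 64-bit clock with a 32-bit delta field is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Written once, when a buffer is started. */
struct trace_buf_hdr {
    uint64_t start_time;      /* full 64-bit timestamp */
};

/* Written per event: only the low-cost 32-bit delta is stored. */
struct trace_rec {
    uint8_t  event_id;
    uint32_t time_delta;      /* now - start_time */
};

/* Fill in rec for an event occurring at time now. Returns -1 if the
 * delta no longer fits in 32 bits, in which case the caller would have
 * to start a new buffer with a fresh start_time. */
int stamp_event(const struct trace_buf_hdr *hdr, struct trace_rec *rec,
                uint8_t event_id, uint64_t now)
{
    uint64_t delta;

    assert(now >= hdr->start_time);
    delta = now - hdr->start_time;
    if (delta > UINT32_MAX)
        return -1;
    rec->event_id = event_id;
    rec->time_delta = (uint32_t)delta;
    return 0;
}

/* Recover the absolute time when the trace is read back. */
uint64_t event_time(const struct trace_buf_hdr *hdr,
                    const struct trace_rec *rec)
{
    return hdr->start_time + rec->time_delta;
}
```

With these assumed field widths each event carries 4 bytes of time
information instead of 8, at the price of one extra addition when the
trace is decoded.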

In addition, because this data is available from user-space, you need
to be able to deal with many buffers. For example, you don't want some
random user to know everything that's happening on the entire system
for obvious security reasons. So the tracer will need to be able to
have per-user and per-process buffers.

The writing of the data itself is not a problem, the real problem is
having a flexible lightweight tracer that can be used in a variety
of different situations.

> - do not disable interrupts when writing events. I used this method in
> a tracer and it works well. Just get an irq-safe index to the trace
> ring-buffer and fill it in. [eg. on x86 incl can be used for this
> purpose.]

Done.

> - get rid of p->trace_info and the pending_write_count - it's completely
> unnecessary.

But then how do we keep track of whether processes have pointers to the
trace buffer or not? We need to be able to allocate/free trace buffers
in runtime. That's what the pending_write_count is for. A buffer can't
be freed if someone still has pending writes. Alternatives are welcomed.
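
As a rough illustration of the idea behind a pending-write count (not the
code from the patch), the sketch below uses C11 atomics in user space.
struct trace_buffer, write_begin(), write_end() and buffer_try_free() are
hypothetical names, and the sketch only decides whether freeing is safe;
it does not free anything.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct trace_buffer {
    atomic_int  pending_writes;   /* writers currently inside the buffer */
    atomic_bool closed;           /* set once no new writers are accepted */
};

/* Called before writing. Returns false if the buffer is being retired,
 * in which case the caller must not touch it. */
bool write_begin(struct trace_buffer *b)
{
    atomic_fetch_add(&b->pending_writes, 1);
    if (atomic_load(&b->closed)) {
        atomic_fetch_sub(&b->pending_writes, 1);
        return false;
    }
    return true;
}

/* Called after the event has been written. */
void write_end(struct trace_buffer *b)
{
    int prev = atomic_fetch_sub(&b->pending_writes, 1);

    assert(prev > 0);
    (void)prev;
}

/* Stop accepting writers, then report whether the buffer can be freed
 * now. If it returns false, the caller retries later. */
bool buffer_try_free(struct trace_buffer *b)
{
    atomic_store(&b->closed, true);
    return atomic_load(&b->pending_writes) == 0;
}
```

The writer increments before checking closed and the freeing side sets
closed before reading the count, so with the default sequentially
consistent atomics a writer that got in is always seen by the check.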

Also, though this hasn't been implemented yet, users may desire to trace a
certain set of processes and trace_info could include a flag to this end.

> - drivers/trace/tracer.c is a complex mess of strange coding style and
> #ifdefs, it's not proper Linux kernel code.

We'll fix that.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-22 23:45:12

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > However, for sake of argument, the above is still not true. A global
> > lock has a different (worse) performance problem than the lock-free
> > atomic operation even given a global queue. The difference is 1) the
> > Linux global lock is very expensive [... and interacts with potential
> > other processes, [...]
>
> huh? what is 'the Linux global lock'?

sorry - LTT just uses a global lock - but to do so it must disable
interrupts. This is not a cheap operation. With lockless code you do not
need to disable interrupts (or grab a lock) -> many fewer cycles.

>
> > [...] and 2) you have to hold the lock for the entire duration of
> > logging the event; with the atomic operation you are finished once
> > you've reserved your space. [...]
>
> you dont have to hold the lock for the duration of saving the event, the
> lock could as well protect a 'current entry' index. (Not that those 2-3
> cycles saving off the event into a single cacheline counts that much ...)
>
> the tail-atomic method is precisely equivalent to a global spinlock. The
> tail of a global event buffer acts precisely as a global spinlock: if one
> CPU writes to it in a stream then it performs okay, if two CPUs trace in
> parallel then it causes cachelines to bounce like crazy.

If 2 cpus ping-pong back and forth there will be significant cache cost -
true, but the cost of having to acquire the lock (which also ping-pongs)
and disabling the interrupts, adds even more. The additional cache line
ping pong for the lock (latency probably won't be hidden in fetching the
trace buffer data) plus the disabling interrupts still more than doubles
the cost.

-bob

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-23 00:02:32

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Yes this is simple code - similar to the model we use in K42. Still,
couple of things about the below.

1) the !event_wanted can be done outside the function, in a macro so that
the only cost if tracing is disabled is a hot cache hit on a mask (not
function call) - that helps with your comment:
> The event_wanted() filter function should be made as fast as possible.

2) If you use the lockless scheme you do not need to disable interrupts.
In K42 we manage to do the entire log operation in 21 instructions and
about as many cycles (couple more for getting time). We do this from user
space as well, disabling interrupts precludes this model (may or may not be
a problem). I was really leaning hard away from even the cost of making a
system call and disabling interrupts. Do people on the kernel dev team
feel this is an acceptable cost? Is migration prevented when interrupts
are disabled? This is something for us to consider.

3) All trace events should not have to have the same number of data words
logged - though I think that's just a packaging/interface issue the code
below would just be placed behind macros which correctly package up the
right number of arguments.


Ingo Molnar writes:
>
> this is what a trace point should do, at most:
>
> --------------------->
> task_t *tracer_task;
>
> int curr_idx[NR_CPUS];
> int curr_pending[NR_CPUS];
>
> struct trace_event **trace_ring;
>
> void trace(event, data1, data2, data3)
> {
> int cpu = smp_processor_id();
> int idx, pending, *curr = curr_idx + cpu;
> struct trace_event *t;
> unsigned long flags;
>
> if (!event_wanted(current, event, data1, data2, data3))
> return;
>
> local_irq_save(flags);
>
> idx = ++curr_idx[cpu] & (NR_TRACE_ENTRIES - 1);
> pending = ++curr_pending[cpu];
>
> t = trace_ring[cpu] + idx;
>
> t->event = event;
> rdtscll(t->timestamp);
> t->data1 = data1;
> t->data2 = data2;
> t->data3 = data3;
>
> if (pending == TRACE_LOW_WATERMARK && tracer_task)
> wake_up_process(tracer_task);
>
> local_irq_restore(flags);
> }
>
> this should cover most of what's needed. The event_wanted() filter
> function should be made as fast as possible. Note that the irq-disabled
> section is not strictly needed but nice and also makes it work on the
> preemptible kernel. (It's not a big issue at all to run these few
> instructions with irqs disabled.)
>
> [there are also other details like putting curr_idx and curr_pending
> into the per-cpu area and similar stuff.]
>
> Ingo

2002-09-23 07:14:19

by Ingo Molnar

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, bob wrote:

> Yes this is simple code - similar to the model we use in K42. Still,
> couple of things about the below.
>
> 1) the !event_wanted can be done outside the function, in a macro so
> that the only cost if tracing is disabled is a hot cache hit on a mask
> (not function call) - that helps with your comment:
> > The event_wanted() filter function should be made as fast as possible.

yes. It's a cost to be considered, but the main issue these days is the
icache cost of inlining. So generally we are leaning towards the
least-impact inlining cost.
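
The mask test discussed above might look roughly like the sketch below.
The names are assumptions made for the example (trace_mask, trace_slow(),
TRACE() and the event ids are not from the patch), and events are limited
to 32 so that one word holds the mask. The test is expanded at every
trace point, which is exactly the inlining cost being weighed here.

```c
#include <assert.h>
#include <stdint.h>

enum { EV_SYSCALL, EV_IRQ, EV_SCHED };   /* hypothetical event ids */

uint32_t trace_mask;         /* bit n set => event n is traced */
unsigned long trace_calls;   /* counts calls that reached the slow path */

/* Out-of-line part: only reached when the event is enabled. A real
 * tracer would write the event to its buffer here. */
void trace_slow(unsigned event, uint32_t data)
{
    assert(event < 32);
    (void)data;
    trace_calls++;
}

/* Inline part: one load and one test when the event is disabled,
 * no function call. */
#define TRACE(event, data)                              \
    do {                                                \
        if (trace_mask & (UINT32_C(1) << (event)))      \
            trace_slow((event), (data));                \
    } while (0)
```

Moving the test into trace_slow() instead would shrink every call site to
a plain function call, at the cost of making that call even when the
event is disabled.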

> 2) If you use the lockless scheme you do not need to disable interrupts.
> In K42 we manage to do the entire log operation in 21 instructions and
> about as many cycles (couple more for getting time). We do this from
> user space as well, disabling interrupts precludes this model (may or
> may not be a problem). I was really leaning hard away from even the
> cost of making a system call and disabling interrupts. Do people on the
> kernel dev team feel this is an acceptable cost? Is migration prevented
> when interrupts are disabled? This is something for us to consider.

the trace() function runs purely in kernel-space, so doing a cli/sti is
not a performance problem - if it can be avoided it saves a few cycles,
but it does not have any global costs. But i dont think reliable tracing
can be done without disabling interrupts - how do you guarantee that there
will be no trace 'holes' due to interruption at the wrong instruction?

> 3) All trace events should not have to have the same number of data
> words logged - though I think that's just a packaging/interface issue
> the code below would just be placed behind macros which correctly
> package up the right number of arguments.

yes, agreed, this can be solved by having some sort of RLA, tightly packed
trace buffer. Trace buffer usage is definitely one of the more important
points.

Ingo


2002-09-23 07:27:54

by Ingo Molnar

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


On Sun, 22 Sep 2002, Karim Yaghmour wrote:

> > - remove the 'event registration' and callback stuff. It just introduces
> > unnecessary runtime overhead. Use an include file as a registry of
> > events instead. This will simplify things greatly.
>
> OK, basically then all the trace points call the trace driver directly.

yes. And in fact i'd suggest to not make it a driver but create a new
kernel/trace.c file - if it's a central mechanism then it should live in a
central place.

> > Why do you need a
> > table of callbacks registered to an event? Nothing in your patches
> > actually uses it ...
>
> True, nothing in the patches actually uses it at this point. This was
> added with the mindset of letting other tools than LTT use the trace
> points already provided by LTT.

okay. The thing is that generic callbacks and data hooks in the task
structure are an invitation for various types of abuses - security and GPL
type abuses. People do get very nervous when seeing such stuff - eg. read
back Christoph Hellwig's comment from a few weeks ago. It's a red flag for
many people. Provide a clean and concentrated set of APIs, no callbacks,
no unnecessary hooks. I can see the technical reasons why you have added
it - it's in theory an extensible interface, but generally we tend to add
such stuff when it's needed - if it's needed at all.

> > Just use one tracing function that copies the
> > arguments into a per-CPU ringbuffer. It's really just a few lines.
>
> Sure, the writing of data itself is trivial. The reason you find the
> driver to be rather full is because of its need to do a couple of
> extra operations:
> - Get timestamp and use delta since beginning of buffer to reduce
> trace size. (i.e. because of the rate at which traces are filled, it's
> essential to be able to cut down on the data written as much as possible).

yes - but even this one can also be solved by providing 2-3 macros that
each are hardcoded for one specific event length each - this should cover
about 90% of the events. Plus perhaps a more generic entry to handle the
longer/rarer event lengths, and the variable event length stuff.
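
One possible shape for such macros, as a sketch only: TRACE1/TRACE2/TRACE3
cover fixed lengths and trace_words() is the generic entry. All names,
the header-word layout (event id in the upper 24 bits, payload length in
the lower 8) and the single small ring are assumptions made for this
example.

```c
#include <assert.h>
#include <stdint.h>

#define RING_WORDS 64u                /* hypothetical size, power of two */

uint32_t ring[RING_WORDS];
unsigned ring_pos;

/* Generic entry: store a header word (event id and payload length)
 * followed by n data words, tightly packed. */
void trace_words(uint32_t event, const uint32_t *words, unsigned n)
{
    unsigned i;

    assert(event < (1u << 24) && n < RING_WORDS);
    ring[ring_pos++ & (RING_WORDS - 1)] = (event << 8) | n;
    for (i = 0; i < n; i++)
        ring[ring_pos++ & (RING_WORDS - 1)] = words[i];
}

/* Fixed-length front ends: the length is known at compile time, so the
 * caller never passes a count. */
#define TRACE1(ev, a) \
    do { const uint32_t w_[] = { (a) }; trace_words((ev), w_, 1); } while (0)
#define TRACE2(ev, a, b) \
    do { const uint32_t w_[] = { (a), (b) }; trace_words((ev), w_, 2); } while (0)
#define TRACE3(ev, a, b, c) \
    do { const uint32_t w_[] = { (a), (b), (c) }; trace_words((ev), w_, 3); } while (0)
```

A two-word event then occupies three words in the ring instead of a
fixed-size slot sized for the largest event.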

> - Filter events according to event mask.

yes - this is handled by the event_wanted() function.

> - Copy extra data in case of some events (e.g. filenames). (We're
> working on ways to simplify this).

are you sure you want to copy filenames? File descriptor and inode numbers
ought to be enough.

> - Synchronize with trace daemon to save trace data. (A single per-CPU
> circular buffer may be useful when doing kernel development, but user
> tracing often requires N buffers).
>
> In addition, because this data is available from user-space, you need to
> be able to deal with many buffers. For example, you don't want some
> random user to know everything that's happening on the entire system for
> obvious security reasons. So the tracer will need to be able to have
> per-user and per-process buffers.

in fact i have the feeling that you should not expose any of this to
ordinary users. Performance measurements are to be done by administrator
types - all this stuff has heavy memory allocation impact anyway.

in exactly which cases do you want to have multiple trace buffers? A
single (large enough if needed) buffer should be enough. This i think is
one of the core issues of your design.

Ingo

2002-09-23 13:54:19

by Robert Wisniewski

Subject: Re: [ltt-dev] Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

Ingo Molnar writes:
>
> On Sun, 22 Sep 2002, bob wrote:
>
> > Yes this is simple code - similar to the model we use in K42. Still,
> > couple of things about the below.
> >
> > 1) the !event_wanted can be done outside the function, in a macro so
> > that the only cost if tracing is disabled is a hot cache hit on a mask
> > (not function call) - that helps with your comment:
> > > The event_wanted() filter function should be made as fast as possible.
>
> yes. It's a cost to be considered, but the main issue these days is the
> icache cost of inlining. So generally we are leaning towards the
> least-impact inlining cost.

mmm - that seems a reasonable trade-off.
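
For illustration, the mask-in-a-macro idea from point 1 might look roughly like the sketch below. The names trace_event_mask, trace_event_slow() and TRACE_EVENT() are hypothetical and not taken from the patch; the slow path only counts calls so that the example stays short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical global mask: bit N set means event N is being traced. */
static uint32_t trace_event_mask;

/* Out-of-line slow path; a real tracer would write the event to a buffer. */
static unsigned long trace_calls;

static void trace_event_slow(unsigned int id, uint32_t arg)
{
	assert(id < 32);
	(void)arg;
	trace_calls++;
}

/* The mask test is expanded at each call site, so a disabled event costs
 * one load and one branch; the function call happens only when enabled. */
#define TRACE_EVENT(id, arg)                          \
	do {                                          \
		if (trace_event_mask & (1u << (id)))  \
			trace_event_slow((id), (arg)); \
	} while (0)
```

The trade-off discussed above is visible here: every call site carries the test, which is the icache cost of inlining, in exchange for skipping the function call when tracing is off.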

> > 2) If you use the lockless scheme you do not need to disable interrupts.
> > In K42 we manage to do the entire log operation in 21 instructions and
> > about as many cycles (couple more for getting time). We do this from
> > user space as well, disabling interrupts precludes this model (may or
> > may not be a problem). I was really leaning hard away from even the
> > cost of making a system call and disabling interrupts. Do people on the
> > kernel dev team feel this is an acceptable cost? Is migration prevented
> > when interrupts are disabled? This is something for us to consider.
>
> the trace() function runs purely in kernel-space, so doing a cli/sti is
> not a performance problem - if it can be avoided it saves a few cycles,
> but it does not have any global costs. But i dont think reliable tracing
> can be done without disabling interrupts - how do you guarantee that there
> will be no trace 'holes' due to interruption at the wrong instruction?

We do have a way of guaranteeing no 'holes' get created unless the process
is interrupted for a *very* long time or killed (which could happen) during
the logging of an event. The code is a little more complicated and does
require an atomic operation that may be more or less equivalent to the cli
cost. In K42, and other OSes I worked on, we wanted very efficient logging
from user space as well. I think there might be a place for understanding
libc, database, or jvm performance, for example, but if we really only do
log events in kernel space then the cli/sti approach is simpler and gives
roughly equivalent performance.
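
As a rough illustration of the general idea (not the K42 code, whose details are not given in this thread), a writer can reserve its slots with a single atomic add and then fill them in without disabling interrupts. All names below are hypothetical, and the part that detects a writer interrupted between reserving and filling, i.e. the 'holes' guarantee, is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define LOG_WORDS 4096u         /* hypothetical size; must be a power of two */

struct lockless_log {
	_Atomic uint32_t head;  /* total number of words reserved so far */
	uint32_t data[LOG_WORDS];
};

/* Reserve n words with one atomic fetch-and-add and return the start index.
 * Two writers can never be handed overlapping ranges. */
static uint32_t log_reserve(struct lockless_log *l, uint32_t n)
{
	assert(n > 0 && n <= LOG_WORDS);
	return atomic_fetch_add_explicit(&l->head, n, memory_order_relaxed);
}

/* Log an event with two data words into the reserved slots. */
static void log_event2(struct lockless_log *l, uint32_t id,
		       uint32_t a, uint32_t b)
{
	uint32_t i = log_reserve(l, 3);

	l->data[(i + 0) & (LOG_WORDS - 1)] = id;
	l->data[(i + 1) & (LOG_WORDS - 1)] = a;
	l->data[(i + 2) & (LOG_WORDS - 1)] = b;
}
```

The atomic add is the only synchronization point, which is why its cost is what gets compared against cli/sti.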

>
> > 3) All trace events should not have to have the same number of data
> > words logged - though I think that's just a packaging/interface issue
> > the code below would just be placed behind macros which correctly
> > package up the right number of arguments.
>
> yes, agreed, this can be solved by having some sort of RLA, tightly packed
> trace buffer. Trace buffer usage is definitely one of the more important
> points.

Yes! and we also have a scheme to allow such a packed buffer stream to be
randomly accessed on disk (useful if you have 100Ms or Gs of data).

-bob

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
[email protected]

2002-09-23 18:43:00

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


OK, I think we've agreed on most of the issues already. Just a couple of
details:

Ingo Molnar wrote:
> yes. And in fact i'd suggest to not make it a driver but create a new
> kernel/trace.c file - if it's a central mechanism then it should live in a
> central place.

OK, will do. Need to add a syscall for controlling tracing though (currently
done through device ioctl()).

[...]
Regarding callbacks:

Will be removed.

> are you sure you want to copy filenames? File descriptor and inode numbers
> ought to be enough.

We record the filename only once (i.e. upon exec or open). After that,
the fd is used.

> > - Synchronize with trace daemon to save trace data. (A single per-CPU
> > circular buffer may be useful when doing kernel development, but user
> > tracing often requires N buffers).
> >
> > In addition, because this data is available from user-space, you need to
> > be able to deal with many buffers. For example, you don't want some
> > random user to know everything that's happening on the entire system for
> > obvious security reasons. So the tracer will need to be able to have
> > per-user and per-process buffers.
>
> in fact i have the feeling that you should not expose any of this to
> ordinary users. Performance measurements are to be done by administrator
> types - all this stuff has heavy memory allocation impact anyway.

Sure, for performance measurements it's the admin, but per my earlier
descriptions:
- users who want to debug synchronization problems of their own tasks
shouldn't see the kernel's behavior.
- users who want to log custom events separate from the kernel events
don't want to see the kernel's behavior.

In any case, what the admin sees and what the users see of the tracing
facility will certainly be different (i.e. not the same level of
flexibility).

> in exactly which cases do you want to have multiple trace buffers? A
> single (large enough if needed) buffer should be enough. This i think is
> one of the core issues of your design.

OK, we'll revisit this issue.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-23 20:06:11

by Andreas Ferber

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Mon, Sep 23, 2002 at 11:12:12AM -0400, Karim Yaghmour wrote:
>
> Sure, for performance measurements it's the admin, but per my earlier
> descriptions:
> - users who want to debug synchronization problems of their own tasks
> shouldn't see the kernel's behavior.
> - users who want to log custom events separate from the kernel events
> don't want to see the kernel's behavior.

Fairly simple to achieve: provide some sort of userspace trace daemon
from which the users request the trace events they want to see
(communicating through standard IPC channels). The daemon provides a
unified event mask to the kernel (to prevent unnecessary overhead in
the kernel proper) and dispatches the events read from the kernel.
AFAICS LTT doesn't try to achieve realtime event monitoring, so
somewhat delaying the event propagation to the final receiver should
not be a problem (at least as long as it generally stays within a
reasonable timewindow, which should be no problem as long as the
system is not heavily overloaded, in which case in-kernel dispatching
would be nothing better).
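
A small sketch of the daemon side of this proposal, assuming each event carries the uid it belongs to and that event types fit in a 32-bit mask. The structures and function names are invented for illustration; the counter stands in for an actual IPC send.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct subscriber {
	uint32_t uid;           /* only events carrying this uid are delivered */
	uint32_t mask;          /* event types this subscriber asked for */
	unsigned long delivered;
};

struct event {
	uint32_t uid;
	uint32_t type;          /* 0..31 in this sketch */
};

/* The single mask handed to the kernel is the OR of all subscriber masks. */
static uint32_t unified_mask(const struct subscriber *s, size_t n)
{
	uint32_t m = 0;

	for (size_t i = 0; i < n; i++)
		m |= s[i].mask;
	return m;
}

/* Deliver one event read from the kernel to every matching subscriber. */
static void dispatch(struct subscriber *s, size_t n, const struct event *ev)
{
	assert(ev->type < 32);
	for (size_t i = 0; i < n; i++)
		if (s[i].uid == ev->uid && (s[i].mask & (1u << ev->type)))
			s[i].delivered++;   /* stand-in for an IPC send */
}
```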

Apart from taking complexity out of the kernel it also reduces the
tracing impact in case of event bursts because (provided the
ringbuffer is large enough) the (potentially timeconsuming in case of
many active tracers) dispatching of events is decoupled (in time) from
the event recording.

You will have to record uid/gid/pid/whatever criteria you might think
of with the event, somewhat enlarging (by a few bytes) a single event
record (don't know how much of this data you are currently gathering),
but that is a minor tradeoff IMHO.

Andreas
--
Andreas Ferber - dev/consulting GmbH - Bielefeld, FRG
---------------------------------------------------------
+49 521 1365800 - [email protected] - http://www.devcon.net

2002-09-23 23:22:12

by Karim Yaghmour

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure


Andreas Ferber wrote:
> Fairly simple to achieve: provide some sort of userspace trace daemon
> from which the users request the trace events they want to see
> (communicating through standard IPC channels). The daemon provides a
> unified event mask to the kernel (to prevent unnecessary overhead in
> the kernel proper) and dispatches the events read from the kernel.

This, though, is the rather simple scenario. What would you do with
events generated by a user-space application and for which the daemon
has no idea of the format? What if there are five such applications
running in parallel each asking for its own separate buffer? What
about the telco guys at the RAS BoF (and many others for that matter)
that want to have a flight recorder (i.e. you always keep one buffer
going ALL the time and you only check it when something bad happens)?
There really needs to be a way to handle multiple input and output
streams. Fortunately, nonetheless, our preliminary work on this issue
indicates that the overhead is minimal (i.e. add a "channel ID" to
each event; the tracer then uses the ID to put the event in the right
buffer; no filtering at this level).
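
To make the "channel ID" idea concrete, here is a hedged sketch: the tracer indexes a table of buffers by the ID carried with each event and does nothing else at this level. The channel count, buffer size and all names are assumptions, not the preliminary LTT work itself.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CHANNELS   4         /* hypothetical number of channels */
#define CHANNEL_WORDS 256       /* hypothetical per-channel size, in words */

struct channel {
	uint32_t pos;           /* next free word */
	uint32_t data[CHANNEL_WORDS];
};

static struct channel channels[NR_CHANNELS];

/* Route an event by channel ID alone; no filtering at this level.
 * Returns 0 on success, -1 for an unknown channel or a full buffer. */
static int trace_to_channel(uint8_t chan, uint32_t id, uint32_t arg)
{
	struct channel *c;

	if (chan >= NR_CHANNELS)
		return -1;
	c = &channels[chan];
	assert(c->pos <= CHANNEL_WORDS);
	if (c->pos + 2 > CHANNEL_WORDS)
		return -1;
	c->data[c->pos++] = id;
	c->data[c->pos++] = arg;
	return 0;
}
```

The per-event overhead in this sketch is the channel ID itself plus one bounds check and one table lookup.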

> You will have to record uid/gid/pid/whatever criteria you might think
> of with the event, somewhat enlarging (by a few bytes) a single event
> record (don't know how much of this data you are currently gathering),
> but that is a minor tradeoff IMHO.

Minimizing event size is pretty high on the list of priorities. A few
bytes more per event makes a huge difference when you know that there
are above 10,000 events per second for a middle to low speed machine.
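
To put a number on this: assuming, purely as an example, 8 extra bytes per event (say a 32-bit uid plus a 32-bit pid) at 10,000 events per second, the extra trace volume is 80,000 bytes per second. The helper below, which is only an illustration, scales that to a day.

```c
#include <assert.h>
#include <stdint.h>

#define SECONDS_PER_DAY 86400u

/* Extra trace volume per day caused by enlarging every event record. */
static uint64_t extra_bytes_per_day(uint64_t events_per_sec,
				    uint64_t extra_bytes)
{
	/* Guard against 64-bit overflow for absurd inputs. */
	assert(extra_bytes == 0 ||
	       events_per_sec <= UINT64_MAX / SECONDS_PER_DAY / extra_bytes);
	return events_per_sec * extra_bytes * SECONDS_PER_DAY;
}
```

With those example figures, extra_bytes_per_day(10000, 8) comes to 6,912,000,000 bytes, i.e. roughly 6.9 GB of additional trace data per day.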

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-09-24 00:58:34

by john slee

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Sun, Sep 22, 2002 at 09:27:29PM +0200, Andi Kleen wrote:
> There is an old patch around from SGI that does exactly this. It is a
> very lightweight binary value tracer that has per CPU buffers. It
> traces using macros that you can easily add. It's called ktrace (not
> to be confused with Ingo's ktrace). I've been porting it for some time

and different again from *bsd ktrace?

The Ravenous Bugblatter Beast of Traal hits! -- More --
You hear the wailing of the Banshee... -- More --
The Christmas Tree hits! -- More --
You die...

(guessing here, i've not seen ingo's ktrace)

j.

--
toyota power: http://indigoid.net/

2002-09-24 11:35:40

by Andi Kleen

Subject: Re: [PATCH] LTT for 2.5.38 1/9: Core infrastructure

On Tue, Sep 24, 2002 at 11:07:59AM +1000, john slee wrote:
> On Sun, Sep 22, 2002 at 09:27:29PM +0200, Andi Kleen wrote:
> > There is an old patch around from SGI that does exactly this. It is a
> > very lightweight binary value tracer that has per CPU buffers. It
> > traces using macros that you can easily add. It's called ktrace (not
> > to be confused with Ingo's ktrace). I've been porting it for some time
>
> and different again from *bsd ktrace?

Yes, of course.

-Andi