2004-11-08 18:58:13

by Jay Lan

Subject: [PATCH 2.6.9 0/2] new enhanced accounting data collection

In an earlier round of discussion, all participants favored a common
layer of accounting data collection.

This patchset is intended to offer a common data collection method for
various accounting packages, including BSD accounting, ELSA, CSA, and any
other acct packages that use a common layer of data collection.

This patchset consists of two parts: acct_io and acct_mm. The discussion
identified improved data collection for I/O and memory as useful on
larger systems.

acct_io:
collects per-process data on characters read/written, in bytes,
and the number of read/write syscalls made.
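
The counting rule used by acct_io can be sketched in user space as follows. This is only an illustration, not kernel code: the struct io_counters type and the account_read()/account_write() helpers are hypothetical names, and only the four field names (rchar, wchar, syscr, syscw) come from the patch. The point it shows is that bytes are added only when the call returns a positive count, while the syscall counter is bumped on every call, including failed ones.

```c
#include <stdint.h>

/* Hypothetical stand-in for the four counters the patch adds to task_struct. */
struct io_counters {
    uint64_t rchar;  /* bytes read */
    uint64_t wchar;  /* bytes written */
    uint64_t syscr;  /* number of read syscalls */
    uint64_t syscw;  /* number of write syscalls */
};

/* Account one read-style call: always count the call, count bytes only on success. */
void account_read(struct io_counters *c, long ret)
{
    if (ret > 0)
        c->rchar += (uint64_t)ret;
    c->syscr++;
}

/* Account one write-style call with the same rule. */
void account_write(struct io_counters *c, long ret)
{
    if (ret > 0)
        c->wchar += (uint64_t)ret;
    c->syscw++;
}
```

A sendfile-style transfer would call both helpers with the same return value, since the patch charges it as a read and a write.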

acct_mm:
collects per-process data on accumulated rss and vm usage and
on peak (high-water) usage.
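
The two kinds of memory data can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel implementation: struct mem_acct and both helper names are hypothetical, sizes are in pages, and system time is passed in as a plain tick count. The high-water helper keeps the largest value seen so far; the integral helper adds (elapsed system time) x (current usage) on each update, which is how an accumulated usage figure is built up.

```c
#include <stdint.h>

/* Hypothetical model of the per-process memory accounting state. */
struct mem_acct {
    unsigned long rss;          /* current resident set size, in pages */
    unsigned long total_vm;     /* current virtual memory size, in pages */
    unsigned long hiwater_rss;  /* peak rss seen so far */
    unsigned long hiwater_vm;   /* peak total_vm seen so far */
    uint64_t rss_integral;      /* sum of rss x system-time delta */
    uint64_t vm_integral;       /* sum of total_vm x system-time delta */
    unsigned long last_stime;   /* system time at the previous update */
};

/* Raise the high-water marks if current usage exceeds them. */
void mem_update_hiwater(struct mem_acct *m)
{
    if (m->hiwater_rss < m->rss)
        m->hiwater_rss = m->rss;
    if (m->hiwater_vm < m->total_vm)
        m->hiwater_vm = m->total_vm;
}

/* Accumulate usage weighted by the system time elapsed since the last update. */
void mem_update_integrals(struct mem_acct *m, unsigned long stime)
{
    unsigned long delta = stime - m->last_stime;

    m->last_stime = stime;
    m->rss_integral += (uint64_t)delta * m->rss;
    m->vm_integral += (uint64_t)delta * m->total_vm;
}
```

Dividing an integral by the total system time would give a time-weighted average usage, which is one way an accounting package might consume these fields.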

Andrew, this new version incorporates feedback from your prior comments.


Best Regards,

Jay Lan - Linux System Software
Silicon Graphics Inc., Mountain View, CA


2004-11-08 19:11:06

by Jay Lan

Subject: [PATCH 2.6.9 1/2] enhanced I/O accounting data patch

Index: linux/fs/read_write.c
===================================================================
--- linux.orig/fs/read_write.c 2004-10-18 14:54:37.000000000 -0700
+++ linux/fs/read_write.c 2004-11-03 16:38:10.270235494 -0800
@@ -216,8 +216,11 @@ ssize_t vfs_read(struct file *file, char
ret = file->f_op->read(file, buf, count, pos);
else
ret = do_sync_read(file, buf, count, pos);
- if (ret > 0)
+ if (ret > 0) {
dnotify_parent(file->f_dentry, DN_ACCESS);
+ current->rchar += ret;
+ }
+ current->syscr++;
}
}

@@ -260,8 +263,11 @@ ssize_t vfs_write(struct file *file, con
ret = file->f_op->write(file, buf, count, pos);
else
ret = do_sync_write(file, buf, count, pos);
- if (ret > 0)
+ if (ret > 0) {
dnotify_parent(file->f_dentry, DN_MODIFY);
+ current->wchar += ret;
+ }
+ current->syscw++;
}
}

@@ -540,6 +546,9 @@ sys_readv(unsigned long fd, const struct
fput_light(file, fput_needed);
}

+ if (ret > 0)
+ current->rchar += ret;
+ current->syscr++;
return ret;
}

@@ -558,6 +567,9 @@ sys_writev(unsigned long fd, const struc
fput_light(file, fput_needed);
}

+ if (ret > 0)
+ current->wchar += ret;
+ current->syscw++;
return ret;
}

@@ -636,6 +648,13 @@ static ssize_t do_sendfile(int out_fd, i

retval = in_file->f_op->sendfile(in_file, ppos, count, file_send_actor, out_file);

+ if (retval > 0) {
+ current->rchar += retval;
+ current->wchar += retval;
+ }
+ current->syscr++;
+ current->syscw++;
+
if (*ppos > max)
retval = -EOVERFLOW;

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h 2004-10-18 14:53:13.000000000 -0700
+++ linux/include/linux/sched.h 2004-11-03 15:52:01.803397172 -0800
@@ -580,6 +580,8 @@ struct task_struct {
* to a stack based synchronous wait) if its doing sync IO.
*/
wait_queue_t *io_wait;
+/* i/o counters(bytes read/written, #syscalls */
+ u64 rchar, wchar, syscr, syscw;
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next; /* could be shared with used_math */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c 2004-10-18 14:53:13.000000000 -0700
+++ linux/kernel/fork.c 2004-11-03 16:44:23.266042599 -0800
@@ -985,12 +985,21 @@ static task_t *copy_process(unsigned lon
clear_tsk_thread_flag(p, TIF_SIGPENDING);
init_sigpending(&p->pending);

- p->it_real_value = p->it_virt_value = p->it_prof_value = 0;
- p->it_real_incr = p->it_virt_incr = p->it_prof_incr = 0;
+ p->it_real_value = 0;
+ p->it_virt_value = 0;
+ p->it_prof_value = 0;
+ p->it_real_incr = 0;
+ p->it_virt_incr = 0;
+ p->it_prof_incr = 0;
init_timer(&p->real_timer);
p->real_timer.data = (unsigned long) p;

- p->utime = p->stime = 0;
+ p->utime = 0;
+ p->stime = 0;
+ p->rchar = 0; /* I/O counter: bytes read */
+ p->wchar = 0; /* I/O counter: bytes written */
+ p->syscr = 0; /* I/O counter: read syscalls */
+ p->syscw = 0; /* I/O counter: write syscalls */
p->lock_depth = -1; /* -1 = no lock */
do_posix_clock_monotonic_gettime(&p->start_time);
p->security = NULL;


Attachments:
acct_io (3.12 kB)

2004-11-08 19:15:19

by Jay Lan

Subject: [PATCH 2.6.9 2/2] enhanced Memory accounting data collection

Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c 2004-10-18 14:53:51.000000000 -0700
+++ linux/fs/exec.c 2004-11-03 18:10:34.126623587 -0800
@@ -46,6 +46,7 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/rmap.h>
+#include <linux/acct.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1161,6 +1162,8 @@ int do_execve(char * filename,

/* execve success */
security_bprm_free(bprm);
+ acct_update_integrals();
+ update_mem_hiwater();
kfree(bprm);
return retval;
}
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h 2004-11-03 15:52:01.803397172 -0800
+++ linux/include/linux/sched.h 2004-11-05 14:02:56.240526520 -0800
@@ -249,6 +249,9 @@ struct mm_struct {
struct kioctx *ioctx_list;

struct kioctx default_kioctx;
+
+ unsigned long hiwater_rss; /* High-water RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */
};

extern int mmlist_nr;
@@ -582,6 +585,11 @@ struct task_struct {
wait_queue_t *io_wait;
/* i/o counters(bytes read/written, #syscalls */
u64 rchar, wchar, syscr, syscw;
+#if defined(CONFIG_BSD_PROCESS_ACCT)
+ u64 acct_rss_mem1; /* accumulated rss usage */
+ u64 acct_vm_mem1; /* accumulated virtual memory usage */
+ clock_t acct_stimexpd; /* clock_t-converted stime since last update */
+#endif
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next; /* could be shared with used_math */
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c 2004-10-18 14:55:06.000000000 -0700
+++ linux/kernel/exit.c 2004-11-03 18:10:34.155920668 -0800
@@ -807,6 +807,8 @@ asmlinkage NORET_TYPE void do_exit(long
ptrace_notify((PTRACE_EVENT_EXIT << 8) | SIGTRAP);
}

+ acct_update_integrals();
+ update_mem_hiwater();
acct_process(code);
__exit_mm(tsk);

Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c 2004-11-03 16:44:23.266042599 -0800
+++ linux/kernel/fork.c 2004-11-03 18:14:53.035673250 -0800
@@ -38,6 +38,7 @@
#include <linux/audit.h>
#include <linux/profile.h>
#include <linux/rmap.h>
+#include <linux/acct.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -605,6 +606,9 @@ static int copy_mm(unsigned long clone_f
if (retval)
goto free_pt;

+ mm->hiwater_rss = mm->rss;
+ mm->hiwater_vm = mm->total_vm;
+
good_mm:
tsk->mm = mm;
tsk->active_mm = mm;
@@ -1000,6 +1004,8 @@ static task_t *copy_process(unsigned lon
p->wchar = 0; /* I/O counter: bytes written */
p->syscr = 0; /* I/O counter: read syscalls */
p->syscw = 0; /* I/O counter: write syscalls */
+ acct_clear_integrals(p);
+
p->lock_depth = -1; /* -1 = no lock */
do_posix_clock_monotonic_gettime(&p->start_time);
p->security = NULL;
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2004-10-18 14:54:07.000000000 -0700
+++ linux/mm/memory.c 2004-11-05 14:14:00.825358944 -0800
@@ -44,6 +44,7 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>
+#include <linux/acct.h>
#include <linux/module.h>
#include <linux/init.h>

@@ -605,6 +606,7 @@ void zap_page_range(struct vm_area_struc
tlb = tlb_gather_mmu(mm, 0);
unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
tlb_finish_mmu(tlb, address, end);
+ acct_update_integrals();
spin_unlock(&mm->page_table_lock);
}

@@ -1095,9 +1097,11 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, pte))) {
- if (PageReserved(old_page))
+ if (PageReserved(old_page)) {
++mm->rss;
- else
+ acct_update_integrals();
+ update_mem_hiwater();
+ } else
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1379,6 +1383,9 @@ static int do_swap_page(struct mm_struct
remove_exclusive_swap_page(page);

mm->rss++;
+ acct_update_integrals();
+ update_mem_hiwater();
+
pte = mk_pte(page, vma->vm_page_prot);
if (write_access && can_share_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -1444,6 +1451,8 @@ do_anonymous_page(struct mm_struct *mm,
goto out;
}
mm->rss++;
+ acct_update_integrals();
+ update_mem_hiwater();
entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
vma->vm_page_prot)),
vma);
@@ -1553,6 +1562,9 @@ retry:
if (pte_none(*page_table)) {
if (!PageReserved(new_page))
++mm->rss;
+ acct_update_integrals();
+ update_mem_hiwater();
+
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
@@ -1787,6 +1799,24 @@ struct page * vmalloc_to_page(void * vma

EXPORT_SYMBOL(vmalloc_to_page);

+/*
+ * update_mem_hiwater
+ * - update per process rss and vm high water data
+ */
+void update_mem_hiwater(void)
+{
+ struct task_struct *parent = current;
+
+ if (parent->mm) {
+ if (parent->mm->hiwater_rss < parent->mm->rss) {
+ parent->mm->hiwater_rss = parent->mm->rss;
+ }
+ if (parent->mm->hiwater_vm < parent->mm->total_vm) {
+ parent->mm->hiwater_vm = parent->mm->total_vm;
+ }
+ }
+}
+
#if !defined(CONFIG_ARCH_GATE_AREA)

#if defined(AT_SYSINFO_EHDR)
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c 2004-10-18 14:54:37.000000000 -0700
+++ linux/mm/mmap.c 2004-11-05 14:00:13.138780763 -0800
@@ -7,6 +7,7 @@
*/

#include <linux/slab.h>
+#include <linux/mm.h>
#include <linux/shm.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
@@ -20,6 +21,7 @@
#include <linux/hugetlb.h>
#include <linux/profile.h>
#include <linux/module.h>
+#include <linux/acct.h>
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
@@ -1016,6 +1018,8 @@ out:
down_write(&mm->mmap_sem);
}
__vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
+ acct_update_integrals();
+ update_mem_hiwater();
return addr;

unmap_and_free_vma:
@@ -1362,6 +1366,8 @@ int expand_stack(struct vm_area_struct *
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm += grow;
__vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file, grow);
+ acct_update_integrals();
+ update_mem_hiwater();
anon_vma_unlock(vma);
return 0;
}
@@ -1818,6 +1824,8 @@ out:
mm->locked_vm += len >> PAGE_SHIFT;
make_pages_present(addr, addr + len);
}
+ acct_update_integrals();
+ update_mem_hiwater();
return addr;
}

Index: linux/mm/mremap.c
===================================================================
--- linux.orig/mm/mremap.c 2004-10-18 14:54:31.000000000 -0700
+++ linux/mm/mremap.c 2004-11-03 18:10:34.194006874 -0800
@@ -16,6 +16,7 @@
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/security.h>
+#include <linux/acct.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -232,6 +233,9 @@ static unsigned long move_vma(struct vm_
new_addr + new_len);
}

+ acct_update_integrals();
+ update_mem_hiwater();
+
return new_addr;
}

@@ -368,6 +372,8 @@ unsigned long do_mremap(unsigned long ad
make_pages_present(addr + old_len,
addr + new_len);
}
+ acct_update_integrals();
+ update_mem_hiwater();
ret = addr;
goto out;
}
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2004-10-18 14:55:18.000000000 -0700
+++ linux/mm/rmap.c 2004-11-03 18:10:34.200842860 -0800
@@ -50,6 +50,7 @@
#include <linux/swapops.h>
#include <linux/slab.h>
#include <linux/init.h>
+#include <linux/acct.h>
#include <linux/rmap.h>
#include <linux/rcupdate.h>

@@ -581,6 +582,7 @@ static int try_to_unmap_one(struct page
}

mm->rss--;
+ acct_update_integrals();
page_remove_rmap(page);
page_cache_release(page);

@@ -680,6 +682,7 @@ static void try_to_unmap_cluster(unsigne

page_remove_rmap(page);
page_cache_release(page);
+ acct_update_integrals();
mm->rss--;
(*mapcount)--;
}
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c 2004-10-18 14:53:43.000000000 -0700
+++ linux/mm/swapfile.c 2004-11-03 18:10:34.208655415 -0800
@@ -24,6 +24,7 @@
#include <linux/module.h>
#include <linux/rmap.h>
#include <linux/security.h>
+#include <linux/acct.h>
#include <linux/backing-dev.h>

#include <asm/pgtable.h>
@@ -435,6 +436,8 @@ unuse_pte(struct vm_area_struct *vma, un
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_anon_rmap(page, vma, address);
swap_free(entry);
+ acct_update_integrals();
+ update_mem_hiwater();
}

/* vma->vm_mm->page_table_lock is held */
Index: linux/include/linux/acct.h
===================================================================
--- linux.orig/include/linux/acct.h 2004-10-18 14:53:43.000000000 -0700
+++ linux/include/linux/acct.h 2004-11-04 17:23:33.595596763 -0800
@@ -120,9 +120,13 @@ struct acct_v3
struct super_block;
extern void acct_auto_close(struct super_block *sb);
extern void acct_process(long exitcode);
+extern void acct_update_integrals(void);
+extern void acct_clear_integrals(struct task_struct *tsk);
#else
#define acct_auto_close(x) do { } while (0)
#define acct_process(x) do { } while (0)
+#define acct_update_integrals() do { } while (0)
+#define acct_clear_integrals(task) do { } while (0)
#endif

/*
Index: linux/kernel/acct.c
===================================================================
--- linux.orig/kernel/acct.c 2004-10-18 14:54:31.000000000 -0700
+++ linux/kernel/acct.c 2004-11-04 17:47:22.785444845 -0800
@@ -521,3 +521,35 @@ void acct_process(long exitcode)
do_acct_process(exitcode, file);
fput(file);
}
+
+
+/*
+ * acct_update_integrals
+ * - update mm integral fields in task_struct
+ */
+void acct_update_integrals(void)
+{
+ long delta;
+ struct task_struct *parent = current;
+
+ if (parent->mm) {
+ delta = parent->stime - parent->acct_stimexpd;
+ parent->acct_stimexpd = parent->stime;
+ parent->acct_rss_mem1 += delta * parent->mm->rss;
+ parent->acct_vm_mem1 += delta * parent->mm->total_vm;
+ }
+}
+
+
+/*
+ * acct_clear_integrals
+ * - clear the mm integral fields in task_struct
+ */
+void acct_clear_integrals(struct task_struct *tsk)
+{
+ if (tsk) {
+ tsk->acct_stimexpd = 0;
+ tsk->acct_rss_mem1 = 0;
+ tsk->acct_vm_mem1 = 0;
+ }
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2004-10-18 14:53:07.000000000 -0700
+++ linux/include/linux/mm.h 2004-11-05 14:11:24.004898462 -0800
@@ -782,6 +782,9 @@ static inline void vm_stat_unaccount(str
-vma_pages(vma));
}

+/* update per process rss and vm hiwater data */
+extern void update_mem_hiwater(void);
+
#ifndef CONFIG_DEBUG_PAGEALLOC
static inline void
kernel_map_pages(struct page *page, int numpages, int enable)


Attachments:
acct_mm (10.90 kB)

2004-11-09 13:42:56

by Guillaume Thouvenin

Subject: Re: [PATCH 2.6.9 0/2] new enhanced accounting data collection

On Mon, 2004-11-08 at 19:52, Jay Lan wrote:
> In an earlier round of discussion, all participants favored a common
> layer of accounting data collection.
>
> This is intended to offer common data collection method for various
> accounting packages including BSD accounting, ELSA, CSA, and any other
> acct packages that use a common layer of data collection.

I think this is great. As you already pointed out, we now need to
modify the end-of-process handling. Currently I use the BSD structure,
but this part of ELSA can be changed very easily.

Regards,
Guillaume

2004-11-10 01:31:03

by Jay Lan

Subject: Re: [Lse-tech] Re: [PATCH 2.6.9 0/2] new enhanced accounting data collection

I looked at the latest 2.6.10-rc1-mm4 and found that the eop handler
acct_process(code), which used to be invoked per process from do_exit(),
has been hijacked :) to become a per-group thing. I would still like
to have per-process eop handling. How about BSD and ELSA?

Thanks,
- jay


Guillaume Thouvenin wrote:
> On Mon, 2004-11-08 at 19:52, Jay Lan wrote:
>
>>In an earlier round of discussion, all participants favored a common
>>layer of accounting data collection.
>>
>>This is intended to offer common data collection method for various
>>accounting packages including BSD accounting, ELSA, CSA, and any other
>>acct packages that use a common layer of data collection.
>
>
> I found this great. Now I think, as you already pointed, we need to
> modify the end-of-process handling. Currently I use the BSD structure
> but this part of ELSA can be changed very easily.
>
> Regards,
> Guillaume

2004-11-10 08:50:21

by Guillaume Thouvenin

Subject: Re: [Lse-tech] Re: [PATCH 2.6.9 0/2] new enhanced accounting data collection

On Wed, 2004-11-10 at 02:28, Jay Lan wrote:
> I looked at the latest 2.6.10-rc1-mm4, and found the eop handler
> acct_process(code) used to be invoked per process from do_exit()
> has been hijacked :) to become a per group thing. I would still like
> to have a per process eop handling. How about BSD and ELSA?

ELSA uses a daemon called "jobd" that produces a file describing
the relationship between each process and its job. Combining this
file with the accounting information (currently BSD accounting)
allows ELSA to provide per-job accounting. Thus, I also need
per-process accounting. Maybe the per-process eop handling can be
done with CSA...
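
The combination described above, a process-to-job mapping joined with per-process records, can be sketched roughly as follows. Everything here is hypothetical: the struct layouts, the job_of() and sum_job() helpers, and the choice of rchar/wchar as the summed fields are assumptions made for illustration and do not reflect the actual file format written by jobd or the BSD accounting record.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process accounting record. */
struct proc_record {
    int pid;
    uint64_t rchar;  /* bytes read by this process */
    uint64_t wchar;  /* bytes written by this process */
};

/* Hypothetical entry of the process-to-job mapping. */
struct job_map {
    int pid;
    int job_id;
};

/* Per-job totals produced by the join. */
struct job_total {
    int job_id;
    uint64_t rchar;
    uint64_t wchar;
};

/* Return the job a pid belongs to, or -1 if the pid is not in the map. */
int job_of(const struct job_map *map, size_t nmap, int pid)
{
    for (size_t i = 0; i < nmap; i++)
        if (map[i].pid == pid)
            return map[i].job_id;
    return -1;
}

/* Sum the records of every process that the map assigns to job_id. */
struct job_total sum_job(const struct proc_record *recs, size_t nrecs,
                         const struct job_map *map, size_t nmap, int job_id)
{
    struct job_total t = { job_id, 0, 0 };

    for (size_t i = 0; i < nrecs; i++) {
        if (job_of(map, nmap, recs[i].pid) == job_id) {
            t.rchar += recs[i].rchar;
            t.wchar += recs[i].wchar;
        }
    }
    return t;
}
```

The sketch makes the dependency concrete: per-job totals can only be computed if a record exists for each process, which is why per-process end-of-process handling matters here.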

Guillaume