2002-07-26 15:07:05

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [RFC] Scalable statistics counters using kmalloc_percpu

Here is a Scalable statistics counter implementation which works on top
of the kmalloc_percpu dynamic allocator published by Dipankar.
This patch is against 2.5.27.

Description:
The following patch provides easy to use interfaces to replace
kernel counters where read accuracy is not that important and write is
frequent, for better cache characteristics. The foll patch also exports
the kernel counters to user apps via /proc.
Major benefits can be acheived when these statistics counters are used
inplace of atomic_t counters as you avoid expensive locked instructions.

This patch provides the following interfaces :

1. statctr_init(statctr_t *ctr, unsigned long initval,
struct proc_dir_entry *parent, const char *procname);
Allocates memory to the counter and initialises it;
2. statctr_cleanup(statctr_t *);
Cleans up allocated counter
3. statctr_inc(statctr_t *);
Increments the counter
4. statctr_dec(statctr_t *);
Decrements the counter
5. statctr_add
.
.

For more details visit
http://lse.sf.net/counters

Rik, You were interested in using this. Does this implementation suit
your needs?

Comments most welcome

Thanks,
Kiran



diff -ruN -X dontdiff linux-2.5.25/fs/proc/root.c statctr-2.5.25/fs/proc/root.c
--- linux-2.5.25/fs/proc/root.c Sat Jul 6 05:12:33 2002
+++ statctr-2.5.25/fs/proc/root.c Sun Jul 14 14:24:47 2002
@@ -19,6 +19,7 @@
#include <linux/smp_lock.h>

struct proc_dir_entry *proc_net, *proc_bus, *proc_root_fs, *proc_root_driver;
+struct proc_dir_entry *proc_stats;

#ifdef CONFIG_SYSCTL
struct proc_dir_entry *proc_sys_root;
@@ -77,6 +78,7 @@
proc_rtas_init();
#endif
proc_bus = proc_mkdir("bus", 0);
+ proc_stats = proc_mkdir("stats", 0);
}

static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry)

diff -ruN -X dontdiff linux-2.5.25/include/linux/statctr.h statctr-2.5.25/include/linux/statctr.h
--- linux-2.5.25/include/linux/statctr.h Thu Jan 1 05:30:00 1970
+++ statctr-2.5.25/include/linux/statctr.h Sun Jul 14 14:24:47 2002
@@ -0,0 +1,186 @@
+/*
+ * Scalable Statistics Counters.
+ *
+ * Visit http://lse.sourceforge.net/counters for detailed explanation of
+ * Scalable Statistic Counters
+ *
+ * Copyright (c) International Business Machines Corp., 2001
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Author: Ravikiran Thirumalai <[email protected]>
+ *
+ * include/linux/statctr.h
+ *
+ */
+
+#if !defined(_LINUX_STATCTR_H)
+#define _LINUX_STATCTR_H
+
+#if defined(__KERNEL__)
+
+#include <linux/config.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/proc_fs.h>
+
+#ifdef CONFIG_PROC_FS
+extern struct proc_dir_entry *proc_stats;
+#endif
+
+typedef struct {
+#ifdef CONFIG_SMP
+ unsigned long *ctr;
+#else
+ unsigned long ctr;
+#endif
+#ifdef CONFIG_PROC_FS
+ struct proc_dir_entry *base;
+ char *name;
+#endif /* CONFIG_PROC_FS */
+} statctr_t;
+
+/* prototypes */
+extern int statctr_init(statctr_t *, unsigned long,
+ struct proc_dir_entry *, const char *);
+extern void statctr_cleanup(statctr_t *);
+extern int statctr_ninit(statctr_t *, unsigned long, int);
+extern void statctr_ncleanup(statctr_t *, int);
+
+#ifdef CONFIG_SMP
+
+static inline int __statctr_init(statctr_t *stctr)
+{
+ stctr->ctr = kmalloc_percpu(sizeof(*(stctr->ctr)), GFP_ATOMIC);
+ if(!stctr->ctr)
+ return -1;
+ return 0;
+}
+
+static inline void __statctr_cleanup(statctr_t *stctr)
+{
+ kfree_percpu(stctr->ctr);
+}
+
+#else /* CONFIG_SMP */
+
+static inline int __statctr_init(statctr_t *stctr)
+{
+ return 0;
+}
+
+static inline void __statctr_cleanup(statctr_t *stctr) {}
+
+#endif /* CONFIG_SMP */
+
+/* inlines */
+#ifdef CONFIG_SMP
+/**
+ * statctr_inc - Increment the statistics counter by one.
+ * @stctr: Statistics counter
+ *
+ * Increments the counter by one. Internally only the per-cpu counter is
+ * incremented.
+ */
+
+static inline void statctr_inc(statctr_t *stctr)
+{
+ (*per_cpu_ptr(stctr->ctr, smp_processor_id()))++;
+}
+
+/**
+ * statctr_dec - Deccrement the statistics counter by one.
+ * @stctr: Statistics counter
+ *
+ * Decrements the counter by one. Internally only the per-cpu counter is
+ * incremented.
+ */
+
+static inline void statctr_dec(statctr_t *stctr)
+{
+ (*per_cpu_ptr(stctr->ctr, smp_processor_id()))--;
+}
+
+/**
+ * statctr_set - Set the statistics counter to value passed.
+ * @stctr: Statistcs counter
+ * @val: Value to be set..
+ *
+ * Sets the statistics counter. If statctr_read() is invoked after a counter
+ * is set, return value of statctr_read shud reflect the value set.
+ */
+
+static inline void statctr_set(statctr_t *stctr, unsigned long val)
+{
+ int i;
+
+ for (i=0; i < NR_CPUS; i++) {
+ *per_cpu_ptr(stctr->ctr, i) = 0;
+ }
+ *this_cpu_ptr(stctr->ctr) = val;
+}
+
+/**
+ * statctr_read - Returns the counter value.
+ * @stctr: Statistics counter
+ *
+ * Reads all of the other per-cpu versions of this counter, consolidates them
+ * and returns to the caller.
+ */
+
+static inline long statctr_read(statctr_t *stctr)
+{
+ int i;
+ unsigned long res = 0;
+ for( i=0; i < NR_CPUS; i++ )
+ res += *per_cpu_ptr(stctr->ctr, i);
+ return res;
+}
+
+/**
+ * statctr_add - Adds the passed val to the counter value.
+ * @stctr: Statistics counter
+ * @val: Addend
+ *
+ */
+
+static inline void statctr_add(statctr_t *stctr, unsigned long val)
+{
+ *per_cpu_ptr(stctr->ctr, smp_processor_id()) += val;
+}
+
+/**
+ * statctr_sub - Subtracts the passed val from the counter value.
+ * @stctr: Statistics counter
+ * @val: Subtrahend
+ *
+ */
+
+static inline void statctr_sub(statctr_t *stctr, unsigned long val)
+{
+ *per_cpu_ptr(stctr->ctr, smp_processor_id()) -= val;
+}
+#else /* CONFIG_SMP */
+#define statctr_inc(stctr) (((stctr)->ctr)++)
+#define statctr_dec(stctr) (((stctr)->ctr)--)
+#define statctr_read(stctr) ((stctr)->ctr)
+#define statctr_set(stctr,val) ((stctr)->ctr = (val))
+#define statctr_add(stctr,val) (((stctr)->ctr)+=(val))
+#define statctr_sub(stctr,val) (((stctr)->ctr)-=(val))
+#endif
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_STATCTR_H */
diff -ruN -X dontdiff linux-2.5.25/kernel/Makefile statctr-2.5.25/kernel/Makefile
--- linux-2.5.25/kernel/Makefile Sat Jul 6 05:12:18 2002
+++ statctr-2.5.25/kernel/Makefile Sun Jul 14 14:24:47 2002
@@ -10,12 +10,12 @@
O_TARGET := kernel.o

export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o \
- printk.o platform.o suspend.o
+ printk.o platform.o suspend.o statctr.o

obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o futex.o platform.o
+ signal.o sys.o kmod.o context.o futex.o platform.o statctr.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -ruN -X dontdiff linux-2.5.25/kernel/statctr.c statctr-2.5.25/kernel/statctr.c
--- linux-2.5.25/kernel/statctr.c Thu Jan 1 05:30:00 1970
+++ statctr-2.5.25/kernel/statctr.c Sun Jul 14 14:24:47 2002
@@ -0,0 +1,165 @@
+/*
+ * Scalable Statistics Counters.
+ *
+ * Visit http://lse.sourceforge.net/counters for detailed explanation of
+ * Scalable Statistic Counters
+ *
+ * Copyright (c) International Business Machines Corp., 2001
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Author: Ravikiran Thirumalai <[email protected]>
+ *
+ * kernel/statctr.c
+ *
+ */
+
+#include <linux/statctr.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+
+#ifdef CONFIG_PROC_FS
+static int proc_read_statctr(char *page, char **start,
+ off_t off, int count, int *eof, void *data)
+{
+ int len;
+ statctr_t *stctr = (statctr_t *) data;
+ len = sprintf(page, "%ld\n", statctr_read(stctr));
+ return len;
+}
+
+static int statctr_proc_init(statctr_t *stctr, struct proc_dir_entry *procbase,
+ const char *procname)
+{
+ struct proc_dir_entry *tmpbase, *tmp;
+
+ stctr->name = NULL;
+ stctr->base = NULL;
+ tmpbase = proc_stats;
+
+ if (procname != NULL) {
+ if(procbase != NULL)
+ tmpbase = procbase;
+ tmp = create_proc_read_entry( procname, 0444, tmpbase,
+ proc_read_statctr, stctr);
+ if (!tmp)
+ return -1;
+ stctr->name = kmalloc(strlen(procname) + 1, GFP_ATOMIC);
+ if(!stctr->name) {
+ remove_proc_entry(procname, tmpbase);
+ return -1;
+ }
+ memcpy(stctr->name, procname, strlen(procname)+1);
+ stctr->base = tmpbase;
+ }
+ return 0;
+}
+
+static void statctr_proc_remove(statctr_t *stctr)
+{
+ if(stctr->name) {
+ remove_proc_entry(stctr->name, stctr->base);
+ kfree(stctr->name);
+ }
+}
+
+#else /* CONFIG_PROC_FS */
+#define statctr_proc_init(ctr, base, name) ({do { } while (0); 0;})
+#define statctr_proc_remove(ctr) do { } while (0)
+#endif /* CONFIG_PROC_FS */
+
+/**
+ * statctr_init - Sets up the statistics counter
+ * @ctr: Pointer to statctr_t type; counter to be initialised
+ * @val: Initialization value.
+ * @procbase: The base /proc dir where the statctr entry is to be created -
+ * (return value of a proc_mkdir)
+ * /proc/stats will be considered as the base if
+ * NULL is passed to this parameter
+ * @procname: /proc entry name. No /proc stuff inits are done if the
+ * name is NULL, but statctr is nevertheless initialized
+ *
+ * Allocates memory to a statistics counter. Returns 0 on successful
+ * allocation and non zero otherwise. Makes a /proc entry based on the
+ * procbase and procname parameters.
+ */
+int statctr_init(statctr_t *stctr, unsigned long val,
+ struct proc_dir_entry *procbase, const char *procname)
+{
+ if(statctr_proc_init(stctr, procbase, procname))
+ return -1;
+ if(__statctr_init(stctr)) {
+ statctr_proc_remove(stctr);
+ return -1;
+ }
+ statctr_set(stctr, val);
+ return 0;
+}
+
+/**
+ * statctr_cleanup - Perform cleanup functions on a statctr counter
+ * @ctr: Pointer to statctr_t type;
+ */
+void statctr_cleanup(statctr_t *stctr)
+{
+ __statctr_cleanup(stctr);
+ statctr_proc_remove(stctr);
+}
+
+/**
+ * statctr_ninit - Inits a bunch of contiguos statctrs
+ * @ctr: Ptr to first ctr to be inited in the bunch
+ * @val: Val to be set
+ * @no : No of ctrs to be inited
+ *
+ * Inits a bunch of contigious statctrs. Useful when statctrs
+ * are used as arrays or in "packed" structures.
+ * Returns non zero if inits fail
+ */
+int statctr_ninit(statctr_t *stctr, unsigned long val, int no)
+{
+ int i;
+ for(i=0; i < no; i++) {
+ if(statctr_init(&stctr[i], val, NULL, NULL))
+ goto cleanup;
+ }
+ return 0;
+
+cleanup:
+ i--;
+ for(; i >= 0; i--) {
+ statctr_cleanup(&stctr[i]);
+ }
+ return -1;
+}
+
+/**
+ * statctr_ncleanup - cleans up a bunch of contiguos statctrs
+ * @ctr: Ptr to first ctr to be inited in the bunch
+ * @no : No of ctrs to be inited
+ *
+ */
+void statctr_ncleanup(statctr_t *stctr, int no)
+{
+ int i;
+ for(i=0; i < no; i++)
+ statctr_cleanup(&stctr[i]);
+}
+
+EXPORT_SYMBOL(statctr_init);
+EXPORT_SYMBOL(statctr_cleanup);
+EXPORT_SYMBOL(statctr_ninit);
+EXPORT_SYMBOL(statctr_ncleanup);
+


2002-07-26 15:25:18

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, 26 Jul 2002, Ravikiran G Thirumalai wrote:

> Rik, You were interested in using this. Does this implementation suit
> your needs?

>From a quick glance it looks like it will.

However, it might be more efficient to put the statistics
in one file in /proc with named fields, or have a way to
group them in one or multiple files.

Not sure about that, though ... really depends on how
expensive stat+open+read+close is compared to parsing a
file with multiple fields.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-26 15:47:58

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, Jul 26, 2002 at 12:27:40PM -0300, Rik van Riel wrote:
> On Fri, 26 Jul 2002, Ravikiran G Thirumalai wrote:
>
> > Rik, You were interested in using this. Does this implementation suit
> > your needs?
>
> >From a quick glance it looks like it will.
>
> However, it might be more efficient to put the statistics
> in one file in /proc with named fields, or have a way to
> group them in one or multiple files.
>
> Not sure about that, though ... really depends on how
> expensive stat+open+read+close is compared to parsing a
> file with multiple fields.

Hi Rik,

It seems that either way it might not have the scalability
required for system monitoring software that needs faster
access. One of the possibilities is to see if they can be
mapped to user space, but that requires significant chage
in the percpu allocator. Does this seem like a logical next
step for exploration to you ?

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-07-26 19:44:09

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, Jul 26, 2002 at 11:46:34AM -0700, Andrew Morton wrote:
> Oh dear. Most people only have two CPUs.
> Rusty, can we *please* fix this? Really soon?

I'll post the panic triggered by lowering NR_CPUS shortly. There's
an ugly showstopping i386 arch code issue here.


Cheers,
Bill

2002-07-26 19:47:04

by Robert Love

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, 2002-07-26 at 12:46, William Lee Irwin III wrote:
> On Fri, Jul 26, 2002 at 11:46:34AM -0700, Andrew Morton wrote:
> > Oh dear. Most people only have two CPUs.
> > Rusty, can we *please* fix this? Really soon?
>
> I'll post the panic triggered by lowering NR_CPUS shortly. There's
> an ugly showstopping i386 arch code issue here.

In current 2.5? I thought Andrew and I fixed all those issues and
pushed them to Linus...

The `configurable NR_CPUS' patch works fine for me. I always boot with
NR_CPUS=2.

Robert love

2002-07-26 18:45:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

Ravikiran G Thirumalai wrote:
>
> Here is a Scalable statistics counter implementation which works on top
> of the kmalloc_percpu dynamic allocator published by Dipankar.
> This patch is against 2.5.27.
>
> ...
> +static inline int __statctr_init(statctr_t *stctr)
> +{
> + stctr->ctr = kmalloc_percpu(sizeof(*(stctr->ctr)), GFP_ATOMIC);
> + if(!stctr->ctr)
> + return -1;
> + return 0;
> +}

Minor nit: please force the caller to pass in the gfp_flags when
designing an API like this. The fact that you were forced to use
GFP_ATOMIC here shows why...

> + for( i=0; i < NR_CPUS; i++ )
> + res += *per_cpu_ptr(stctr->ctr, i);
> + return res;
> +}

Oh dear. Most people only have two CPUs.

Rusty, can we *please* fix this? Really soon?


General comment: we need to clean up the kernel_stat stuff. We
cannot just make it per-cpu because it is 32k in size already. I
would suggest that we should break out the disk accounting and
make the rest of kernel_stat per CPU.

That would be a great application of your interface, and a good
way to get your interface merged ;) Is that something which you
have time to do?

Thanks.

2002-07-26 19:50:02

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, Jul 26, 2002 at 11:46:34AM -0700, Andrew Morton wrote:
>>> Oh dear. Most people only have two CPUs.
>>> Rusty, can we *please* fix this? Really soon?

On Fri, 2002-07-26 at 12:46, William Lee Irwin III wrote:
>> I'll post the panic triggered by lowering NR_CPUS shortly. There's
>> an ugly showstopping i386 arch code issue here.

On Fri, Jul 26, 2002 at 12:50:12PM -0700, Robert Love wrote:
> In current 2.5? I thought Andrew and I fixed all those issues and
> pushed them to Linus...
> The `configurable NR_CPUS' patch works fine for me. I always boot with
> NR_CPUS=2.

Sorry I didn't get a chance to test in time, these things are slow to
boot and my testing bandwidth is limited. Please hold off until the
issue is resolved. You *will* prevent me from booting. IO-APIC APIC ID
reassignment panics. I'll follow up after this when the thing comes up
into the kernel exhibiting the problem.


Cheers,
Bill

2002-07-26 20:14:45

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, Jul 26, 2002 at 11:46:34AM -0700, Andrew Morton wrote:
>>> Oh dear. Most people only have two CPUs.
>>> Rusty, can we *please* fix this? Really soon?

On Fri, 2002-07-26 at 12:46, William Lee Irwin III wrote:
>> I'll post the panic triggered by lowering NR_CPUS shortly. There's
>> an ugly showstopping i386 arch code issue here.

On Fri, Jul 26, 2002 at 12:50:12PM -0700, Robert Love wrote:
> In current 2.5? I thought Andrew and I fixed all those issues and
> pushed them to Linus...
> The `configurable NR_CPUS' patch works fine for me. I always boot with
> NR_CPUS=2.

No idea who it works for, it sure doesn't work here. Behold:


Script started on Fri Jul 26 12:55:23 2002
Linux version 2.5.28-akpm-1 (wli@megeira) (gcc version 2.95.4 20011002 (Debian prerelease)) #1 SMP Fri Jul 26 09:05:07 PDT 2002
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 0000000000100000 - 00000000e0000000 (usable)
BIOS-e820: 00000000fec00000 - 00000000fec09000 (reserved)
BIOS-e820: 00000000ffe80000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000400000000 (usable)
user-defined physical RAM map:
user: 0000000000000000 - 000000000009fc00 (usable)
user: 0000000000100000 - 00000000e0000000 (usable)
user: 00000000fec00000 - 00000000fec09000 (reserved)
user: 00000000ffe80000 - 0000000100000000 (reserved)
user: 0000000100000000 - 0000000400000000 (usable)
15488MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f6040
hm, page 000f6000 reserved twice.
hm, page 000f7000 reserved twice.
On node 0 totalpages: 4194304
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 3964928 pages.
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: IBM NUMA Product ID: SBB APIC at: 0xFEC08000
Found an OEM MPC table at 7009c8 - parsing it ...
Translation: record 0, type 1, quad 0, global 3, local 3
Translation: record 1, type 1, quad 0, global 1, local 1
Translation: record 2, type 1, quad 0, global 1, local 1
Translation: record 3, type 1, quad 0, global 1, local 1
Translation: record 4, type 1, quad 1, global 1, local 3
Translation: record 5, type 1, quad 1, global 1, local 1
Translation: record 6, type 1, quad 1, global 1, local 1
Translation: record 7, type 1, quad 1, global 1, local 1
Translation: record 8, type 1, quad 2, global 1, local 3
Translation: record 9, type 1, quad 2, global 1, local 1
Translation: record 10, type 1, quad 2, global 1, local 1
Translation: record 11, type 1, quad 2, global 1, local 1
Translation: record 12, type 1, quad 3, global 1, local 3
Translation: record 13, type 1, quad 3, global 1, local 1
Translation: record 14, type 1, quad 3, global 1, local 1
Translation: record 15, type 1, quad 3, global 1, local 1
Translation: record 16, type 3, quad 0, global 0, local 0
Translation: record 17, type 3, quad 0, global 1, local 1
Translation: record 18, type 3, quad 0, global 2, local 2
Translation: record 19, type 4, quad 0, global 12, local 18
Translation: record 20, type 3, quad 1, global 3, local 0
Translation: record 21, type 3, quad 1, global 4, local 1
Translation: record 22, type 3, quad 1, global 5, local 2
Translation: record 23, type 4, quad 1, global 13, local 18
Translation: record 24, type 3, quad 2, global 6, local 0
Translation: record 25, type 3, quad 2, global 7, local 1
Translation: record 26, type 3, quad 2, global 8, local 2
Translation: record 27, type 4, quad 2, global 14, local 18
Translation: record 28, type 3, quad 3, global 9, local 0
Translation: record 29, type 3, quad 3, global 10, local 1
Translation: record 30, type 3, quad 3, global 11, local 2
Translation: record 31, type 4, quad 3, global 15, local 18
Translation: record 32, type 2, quad 0, global 13, local 14
Translation: record 33, type 2, quad 0, global 14, local 13
Translation: record 34, type 2, quad 1, global 15, local 14
Translation: record 35, type 2, quad 1, global 16, local 13
Translation: record 36, type 2, quad 2, global 17, local 14
Translation: record 37, type 2, quad 2, global 18, local 13
Translation: record 38, type 2, quad 3, global 19, local 14
Translation: record 39, type 2, quad 3, global 20, local 13
Processor #0 6:10 APIC version 17 (quad 0, apic 1)
Processor #4 6:10 APIC version 17 (quad 0, apic 8)
Processor #1 6:10 APIC version 17 (quad 0, apic 2)
Processor #2 6:10 APIC version 17 (quad 0, apic 4)
Processor #0 6:10 APIC version 17 (quad 1, apic 17)
Processor #4 6:10 APIC version 17 (quad 1, apic 24)
Processor #1 6:10 APIC version 17 (quad 1, apic 18)
Processor #2 6:10 APIC version 17 (quad 1, apic 20)
Processor #0 6:10 APIC version 17 (quad 2, apic 33)
Processor #4 6:10 APIC version 17 (quad 2, apic 40)
Processor #1 6:10 APIC version 17 (quad 2, apic 34)
Processor #2 6:10 APIC version 17 (quad 2, apic 36)
Processor #0 6:10 APIC version 17 (quad 3, apic 49)
Processor #4 6:10 APIC version 17 (quad 3, apic 56)
Processor #1 6:10 APIC version 17 (quad 3, apic 50)
Processor #2 6:10 APIC version 17 (quad 3, apic 52)
Bus #0 is PCI (node 0)
Bus #1 is PCI (node 0)
Bus #2 is PCI (node 0)
Bus #12 is EISA (node 0)
Bus #3 is PCI (node 1)
Bus #4 is PCI (node 1)
Bus #5 is PCI (node 1)
Bus #13 is EISA (node 1)
Bus #6 is PCI (node 2)
Bus #7 is PCI (node 2)
Bus #8 is PCI (node 2)
Bus #14 is EISA (node 2)
Bus #9 is PCI (node 3)
Bus #10 is PCI (node 3)
Bus #11 is PCI (node 3)
Bus #15 is EISA (node 3)
I/O APIC #13 Version 17 at 0xFE800000.
I/O APIC #14 Version 17 at 0xFE801000.
I/O APIC #15 Version 17 at 0xFE840000.
I/O APIC #16 Version 17 at 0xFE841000.
I/O APIC #17 Version 17 at 0xFE880000.
I/O APIC #18 Version 17 at 0xFE881000.
I/O APIC #19 Version 17 at 0xFE8C0000.
I/O APIC #20 Version 17 at 0xFE8C1000.
Processors: 16
Kernel command line: root=/dev/sda2 console=ttyS0,38400n8 mem=16777216K
Initializing CPU#0
Loading GDT/IDT for CPU#0
Loaded per-cpu LDT/TSS for CPU#0
Cleaned up FPU and debug regs for CPU#0
Detected 700.274 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1380.35 BogoMIPS
Memory: 16069096k/16777216k available (1370k kernel code, 183444k reserved, 530k data, 260k init, 15335424k highmem)
Security Scaffold v1.0.0 initialized
Dentry-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
Remapping cross-quad port I/O for 4 quads
xquad_portio vaddr 0x00000000, len 00200000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU0: Intel 00/0a stepping 04
per-CPU timeslice cutoff: 5846.34 usecs.
task migration cache decay timeout: 6 msecs.
enabled ExtINT on CPU#0
Leaving ESR disabled.
Booting processor 1/2 eip 2000
Initializing CPU#1
Loading GDT/IDT for CPU#1
Loaded per-cpu LDT/TSS for CPU#1
Cleaned up FPU and debug regs for CPU#1
masked ExtINT on CPU#1
Leaving ESR disabled.
Calibrating delay loop... 1396.73 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
CPU1: Intel 00/0a stepping 04
Restoring NMI vector
Booting processor 2/4 eip 2000
Initializing CPU#2
Loading GDT/IDT for CPU#2
Loaded per-cpu LDT/TSS for CPU#2
Cleaned up FPU and debug regs for CPU#2
masked ExtINT on CPU#2
Leaving ESR disabled.
Calibrating delay loop... 1396.73 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
CPU2: Intel 00/0a stepping 00
Restoring NMI vector
Booting processor 3/8 eip 2000
Initializing CPU#3
Loading GDT/IDT for CPU#3
Loaded per-cpu LDT/TSS for CPU#3
Cleaned up FPU and debug regs for CPU#3
masked ExtINT on CPU#3
Leaving ESR disabled.
Calibrating delay loop... 1396.73 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
CPU3: Intel 00/0a stepping 04
Restoring NMI vector
Booting processor 4/17 eip 2000
Initializing CPU#4
Loading GDT/IDT for CPU#4
Loaded per-cpu LDT/TSS for CPU#4
Cleaned up FPU and debug regs for CPU#4
masked ExtINT on CPU#4
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU4: Intel 00/0a stepping 01
Restoring NMI vector
Booting processor 5/18 eip 2000
Initializing CPU#5
Loading GDT/IDT for CPU#5
Loaded per-cpu LDT/TSS for CPU#5
Cleaned up FPU and debug regs for CPU#5
masked ExtINT on CPU#5
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU5: Intel 00/0a stepping 01
Restoring NMI vector
Booting processor 6/20 eip 2000
Initializing CPU#6
Loading GDT/IDT for CPU#6
Loaded per-cpu LDT/TSS for CPU#6
Cleaned up FPU and debug regs for CPU#6
masked ExtINT on CPU#6
Leaving ESR disabled.
Calibrating delay loop... 1388.54 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU6: Intel 00/0a stepping 01
Restoring NMI vector
Booting processor 7/24 eip 2000
Initializing CPU#7
Loading GDT/IDT for CPU#7
Loaded per-cpu LDT/TSS for CPU#7
Cleaned up FPU and debug regs for CPU#7
masked ExtINT on CPU#7
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU7: Intel 00/0a stepping 01
Restoring NMI vector
Booting processor 8/33 eip 2000
Initializing CPU#8
Loading GDT/IDT for CPU#8
Loaded per-cpu LDT/TSS for CPU#8
Cleaned up FPU and debug regs for CPU#8
masked ExtINT on CPU#8
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU8: Intel 00/0a stepping 04
Restoring NMI vector
Booting processor 9/34 eip 2000
Initializing CPU#9
Loading GDT/IDT for CPU#9
Loaded per-cpu LDT/TSS for CPU#9
Cleaned up FPU and debug regs for CPU#9
masked ExtINT on CPU#9
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU9: Intel 00/0a stepping 04
Restoring NMI vector
Booting processor 10/36 eip 2000
Initializing CPU#10
Loading GDT/IDT for CPU#10
Loaded per-cpu LDT/TSS for CPU#10
Cleaned up FPU and debug regs for CPU#10
masked ExtINT on CPU#10
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU10: Intel 00/0a stepping 04
Restoring NMI vector
Booting processor 11/40 eip 2000
Initializing CPU#11
Loading GDT/IDT for CPU#11
Loaded per-cpu LDT/TSS for CPU#11
Cleaned up FPU and debug regs for CPU#11
masked ExtINT on CPU#11
Leaving ESR disabled.
Calibrating delay loop... 1388.54 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU serial number disabled.
CPU11: Intel 00/0a stepping 04
Restoring NMI vector
Booting processor 12/49 eip 2000
Initializing CPU#12
Loading GDT/IDT for CPU#12
Loaded per-cpu LDT/TSS for CPU#12
Cleaned up FPU and debug regs for CPU#12
masked ExtINT on CPU#12
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
CPU12: Intel 00/0a stepping 00
Restoring NMI vector
Booting processor 13/50 eip 2000
Initializing CPU#13
Loading GDT/IDT for CPU#13
Loaded per-cpu LDT/TSS for CPU#13
Cleaned up FPU and debug regs for CPU#13
masked ExtINT on CPU#13
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
CPU13: Intel 00/0a stepping 04
Restoring NMI vector
Booting processor 14/52 eip 2000
Initializing CPU#14
Loading GDT/IDT for CPU#14
Loaded per-cpu LDT/TSS for CPU#14
Cleaned up FPU and debug regs for CPU#14
masked ExtINT on CPU#14
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
CPU14: Intel 00/0a stepping 00
Restoring NMI vector
Booting processor 15/56 eip 2000
Initializing CPU#15
Loading GDT/IDT for CPU#15
Loaded per-cpu LDT/TSS for CPU#15
Cleaned up FPU and debug regs for CPU#15
masked ExtINT on CPU#15
Leaving ESR disabled.
Calibrating delay loop... 1392.64 BogoMIPS
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 2048K
CPU serial number disabled.
CPU15: Intel 00/0a stepping 00
Restoring NMI vector
Total of 16 processors activated (22274.04 BogoMIPS).
ENABLING IO-APIC IRQs
BIOS bug, IO-APIC#0 ID 0 is already used!...
... fixing up to 4. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 4 ... ok.
BIOS bug, IO-APIC#1 ID 0 is already used!...
... fixing up to 5. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 5 ... ok.
BIOS bug, IO-APIC#2 ID 0 is already used!...
... fixing up to 6. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 6 ... ok.
BIOS bug, IO-APIC#3 ID 0 is already used!...
... fixing up to 7. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 7 ... ok.
BIOS bug, IO-APIC#4 ID 0 is already used!...
... fixing up to 8. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 8 ... ok.
BIOS bug, IO-APIC#5 ID 0 is already used!...
... fixing up to 9. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 9 ... ok.
BIOS bug, IO-APIC#6 ID 0 is already used!...
... fixing up to 10. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 10 ... ok.
BIOS bug, IO-APIC#7 ID 0 is already used!...
... fixing up to 11. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 11 ... ok.
BIOS bug, IO-APIC#8 ID 0 is already used!...
... fixing up to 12. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 12 ... ok.
BIOS bug, IO-APIC#9 ID 0 is already used!...
... fixing up to 13. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 13 ... ok.
BIOS bug, IO-APIC#10 ID 0 is already used!...
... fixing up to 14. (tell your hw vendor)
...changing IO-APIC physical APIC ID to 14 ... ok.
BIOS bug, IO-APIC#11 ID 0 is already used!...
Kernel panic: Max APIC ID exceeded!

In idle task - not syncing

[detached]
$

Script done on Fri Jul 26 13:01:19 2002

2002-07-26 20:19:24

by Robert Love

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, 2002-07-26 at 13:15, William Lee Irwin III wrote:

> On Fri, Jul 26, 2002 at 12:50:12PM -0700, Robert Love wrote:
> > In current 2.5? I thought Andrew and I fixed all those issues and
> > pushed them to Linus...
> > The `configurable NR_CPUS' patch works fine for me. I always boot with
> > NR_CPUS=2.
>
> No idea who it works for, it sure doesn't work here. Behold:

Hmm, is your CPU-space sparse?

If that is the case, and the max APIC ID is set to NR_CPUS, and the
kernel expects a 1:1 between NR_CPU value and logical CPU #... boom.

Robert Love

2002-07-27 01:54:07

by Rusty Russell

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

> > + for( i=0; i < NR_CPUS; i++ )
> > + res += *per_cpu_ptr(stctr->ctr, i);
> > + return res;
> > +}
>
> Oh dear. Most people only have two CPUs.
>
> Rusty, can we *please* fix this? Really soon?

Linus just applied the hotplug cpu boot patch in bk, which gives
cpu_possible(i), for exactly this purpose.

> General comment: we need to clean up the kernel_stat stuff. We
> cannot just make it per-cpu because it is 32k in size already. I
> would suggest that we should break out the disk accounting and
> make the rest of kernel_stat per CPU.

kernel_stat is dynamically allocated???

Personally, I think that dynamically allocated per-cpu datastructures,
like dynamically-allocated brlocks, are something we might need
eventually, but look at what a certain driver did with the "make it
per-cpu" concept already. I don't want to rush in that direction.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-07-27 04:33:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

Rusty Russell wrote:
>
> > > + for( i=0; i < NR_CPUS; i++ )
> > > + res += *per_cpu_ptr(stctr->ctr, i);
> > > + return res;
> > > +}
> >
> > Oh dear. Most people only have two CPUs.
> >
> > Rusty, can we *please* fix this? Really soon?
>
> Linus just applied the hotplug cpu boot patch in bk, which gives
> cpu_possible(i), for exactly this purpose.

Good. And will it be possible to iterate across all CPUs
without having to iterate across NR_CPUS?

> > General comment: we need to clean up the kernel_stat stuff. We
> > cannot just make it per-cpu because it is 32k in size already. I
> > would suggest that we should break out the disk accounting and
> > make the rest of kernel_stat per CPU.
>
> kernel_stat is dynamically allocated???

No. It's jut a big lump of bss.

> Personally, I think that dynamically allocated per-cpu datastructures,
> like dynamically-allocated brlocks, are something we might need
> eventually, but look at what a certain driver did with the "make it
> per-cpu" concept already. I don't want to rush in that direction.

What driver is that?

And no, we need to do something about the NR_CPUS bloat Right Now.

In my build there is a quarter megabyte of per cpu data. And that
does not include the (currently small) .data.percpu * 32.

The is pretty much entirely wasted memory, and it will only get
worse. Making NR_CPUS compile-time configurable is a lame solution.
Wasting the memory is out of the question.

Dynamic allocation is the only thing left, yes?

-

2002-07-27 05:20:57

by Rusty Russell

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

In message <[email protected]> you write:
> Good. And will it be possible to iterate across all CPUs
> without having to iterate across NR_CPUS?

Hmm, define *all cpus* please? All online cpus? All possible CPUs?

Interface discussed with DaveM was: first_cpu(), next_cpu(cpu), which
covers online CPUs, and gives a nice interface for things like irq
balancers which want to snarf the next online cpu.

Like it?

> > Personally, I think that dynamically allocated per-cpu datastructures,
> > like dynamically-allocated brlocks, are something we might need
> > eventually, but look at what a certain driver did with the "make it
> > per-cpu" concept already. I don't want to rush in that direction.
>
> What driver is that?

NTFS.

> The is pretty much entirely wasted memory, and it will only get
> worse. Making NR_CPUS compile-time configurable is a lame solution.
> Wasting the memory is out of the question.
>
> Dynamic allocation is the only thing left, yes?

Um, no! Here is the plan:

1) change per-cpu data to only allocate data for cpus where
cpu_possible(i) (add cache coloring and NUMA allocation as desired).

2) Convert non-modular cases to use per-cpu data (once the interface
changes again, <SIGH>).

We'll end up using *less* memory than before. We're just doing it in
easy stages.

Feel better now?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-07-27 06:04:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

Rusty Russell wrote:
>
> In message <[email protected]> you write:
> > Good. And will it be possible to iterate across all CPUs
> > without having to iterate across NR_CPUS?
>
> Hmm, define *all cpus* please? All online cpus? All possible CPUs?

All possible.

> Interface discussed with DaveM was: first_cpu(), next_cpu(cpu), which
> covers online CPUs, and gives a nice interface for things like irq
> balancers which want to snarf the next online cpu.
>
> Like it?

for (cpu = first_cpu(); cpu != (what?); cpu = next_cpu(cpu))

that'll work.

> ...
>
> > The is pretty much entirely wasted memory, and it will only get
> > worse. Making NR_CPUS compile-time configurable is a lame solution.
> > Wasting the memory is out of the question.
> >
> > Dynamic allocation is the only thing left, yes?
>
> Um, no! Here is the plan:
>
> 1) change per-cpu data to only allocate data for cpus where
> cpu_possible(i) (add cache coloring and NUMA allocation as desired).
>
> 2) Convert non-modular cases to use per-cpu data (once the interface
> changes again, <SIGH>).
>
> We'll end up using *less* memory than before. We're just doing it in
> easy stages.
>
> Feel better now?

Yup.

-

2002-07-27 11:39:38

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

On 26 Jul 2002, Robert Love wrote:

> Hmm, is your CPU-space sparse?

If you're referring to APIC IDs, even on generic SMP boxes APIC IDs can be
sparse, its not a requirment that they are consecutive (e.g. i have a dual
rig here which has a BSP with an ID of 4

> If that is the case, and the max APIC ID is set to NR_CPUS, and the
> kernel expects a 1:1 between NR_CPU value and logical CPU #... boom.

the NR_CPU should only be used to satisfy the current running cpu count
no?

Cheers,
Zwane
--
function.linuxpower.ca

2002-07-27 12:00:39

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, 26 Jul 2002, William Lee Irwin III wrote:

> On Fri, Jul 26, 2002 at 12:50:12PM -0700, Robert Love wrote:
> > In current 2.5? I thought Andrew and I fixed all those issues and
> > pushed them to Linus...
> > The `configurable NR_CPUS' patch works fine for me. I always boot with
> > NR_CPUS=2.
>
> No idea who it works for, it sure doesn't work here. Behold:
> ...changing IO-APIC physical APIC ID to 14 ... ok.
> BIOS bug, IO-APIC#11 ID 0 is already used!...
> Kernel panic: Max APIC ID exceeded!
>
> In idle task - not syncing

hmm

Since you can only have 4 bits for your IOAPIC ID, you need to stuff them
all into the 4 bit address space, looking at the IDs there should be
plenty space for 8 IOAPICs in the 4 bit region. Another funny, is how come
it tries to reassign IDs for 12 IOAPICs? unless it picked up more from the
proprietory vendor section of mp tables seeing as it only picked up 8 at
boot, i think that code might need a once over.

Another strange check is the following;

if (phys_id_present_map & (1 << mp_ioapics[apic].mpc_apicid))

and earlier...

if (clustered_apic_mode)
/* We don't have a good way to do this yet - hack */
phys_id_present_map = (u_long) 0xf;

urgh...

Overrall i think arch/i386/kernel/io_apic.c needs a looking over.

Cheers,
Zwane
--
function.linuxpower.ca









2002-07-28 21:31:57

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

> Since you can only have 4 bits for your IOAPIC ID, you need to stuff them
> all into the 4 bit address space, looking at the IDs there should be
> plenty space for 8 IOAPICs in the 4 bit region.

Well, there's 16 cpus and 8 ioapics to theoretically go into that
space. The trouble is that it's not really one physical address
space globally, it's one per node, so the whole thing's a total hack.

> Another funny, is how come
> it tries to reassign IDs for 12 IOAPICs? unless it picked up more from the
> proprietory vendor section of mp tables seeing as it only picked up 8 at
> boot, i think that code might need a once over.

The strangest thing to me is that all the IDs are 0, when they
weren't to start with. Something's corrupting memory ... I'd
happily blame my own code, but allegedly this has happened without
CONFIG_MULTIQUAD too.

> if (clustered_apic_mode)
> /* We don't have a good way to do this yet - hack */
> phys_id_present_map = (u_long) 0xf;
>
> urgh...

Yeah, OK ... that's my fault. This is the physical space, and
we should really look up the physical ID of each device and put
into into a per-quad mapping. But that works just fine as a hack
without the NR_CPUs parameter, and it's only an exclusion thing
anyway, so even if a CPU doesn't come online it's correct enough.

The real question is why does this work if NR_CPUS is 32, but not
if we take it down somewhat? Likely one of the arrays is getting
overshot, but we need to do some debug to find out where & why.

> Overrall i think arch/i386/kernel/io_apic.c needs a looking over.

No denying that, but it takes a fearless man to stare into the
eyes of Intel's IOAPIC and survive ;-) I barely survived the last
time, and am not desperately keen to do it again ... <runs away>

M.


2002-07-29 10:53:27

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, Jul 26, 2002 at 11:46:34AM -0700, Andrew Morton wrote:
> Ravikiran G Thirumalai wrote:
> >
> > Here is a Scalable statistics counter implementation which works on top
> > of the kmalloc_percpu dynamic allocator published by Dipankar.
> > This patch is against 2.5.27.
> >
> > ...
> > +static inline int __statctr_init(statctr_t *stctr)
> > +{
> > + stctr->ctr = kmalloc_percpu(sizeof(*(stctr->ctr)), GFP_ATOMIC);
> > + if(!stctr->ctr)
> > + return -1;
> > + return 0;
> > +}
>
> Minor nit: please force the caller to pass in the gfp_flags when
> designing an API like this. The fact that you were forced to use
> GFP_ATOMIC here shows why...

Yep, I'll do that with the next release.

<snip>

> General comment: we need to clean up the kernel_stat stuff. We
> cannot just make it per-cpu because it is 32k in size already. I
> would suggest that we should break out the disk accounting and
> make the rest of kernel_stat per CPU.
>
> That would be a great application of your interface, and a good
> way to get your interface merged ;) Is that something which you
> have time to do?

Sure,... anything to get these interfaces merged :)

Are you looking at something on the lines as the diff below?
What about /proc/stat ? Is it a good idea to have separate /proc files
for disk stats and cpu usage stats? (It'll be good for statctrs that way,
applications monitoring disk_stats only don't cause statctr_reads on
cpu_usage stats then)

--- linux-2.5.29/include/linux/kernel_stat.h Sat Jul 27 08:28:36 2002
+++ statctr-2.5.29/include/linux/kernel_stat.h Mon Jul 29 14:21:18 2002
@@ -15,10 +15,13 @@
#define DK_MAX_MAJOR 16
#define DK_MAX_DISK 16

-struct kernel_stat {
- unsigned int per_cpu_user[NR_CPUS],
- per_cpu_nice[NR_CPUS],
- per_cpu_system[NR_CPUS];
+struct cpu_usage_stat {
+ statctr_t cpu_user;
+ statctr_t cpu_nice;
+ statctr_t cpu_system;
+};
+
+struct disk_stat {
unsigned int dk_drive[DK_MAX_MAJOR][DK_MAX_DISK];
unsigned int dk_drive_rio[DK_MAX_MAJOR][DK_MAX_DISK];
unsigned int dk_drive_wio[DK_MAX_MAJOR][DK_MAX_DISK];
@@ -26,13 +29,18 @@
unsigned int dk_drive_wblk[DK_MAX_MAJOR][DK_MAX_DISK];
unsigned int pgpgin, pgpgout;
unsigned int pswpin, pswpout;
- unsigned int pgalloc, pgfree;
- unsigned int pgactivate, pgdeactivate;
- unsigned int pgfault, pgmajfault;
- unsigned int pgscan, pgsteal;
- unsigned int pageoutrun, allocstall;
+ unsigned int pgalloc, pgfree;
+ unsigned int pgactivate, pgdeactivate;
+ unsigned int pgfault, pgmajfault;
+ unsigned int pgscan, pgsteal;
+ unsigned int pageoutrun, allocstall;
+};
+
+struct kernel_stat {
+ struct cpu_usage_stat cpu_stat;
+ struct disk_stat disk_stat;
#if !defined(CONFIG_ARCH_S390)
- unsigned int irqs[NR_CPUS][NR_IRQS];
+ statctr_t irqs[NR_IRQS];
#endif
};

@@ -44,8 +52,8 @@
* Maybe we need to smp-ify kernel_stat some day. It would be nice to do
* that without having to modify all the code that increments the stats.
*/
-#define KERNEL_STAT_INC(x) kstat.x++
-#define KERNEL_STAT_ADD(x, y) kstat.x += y
+#define DISK_STAT_INC(x) kstat.disk_stat.x++
+#define DISK_STAT_ADD(x, y) kstat.disk_sat.x += y

#if !defined(CONFIG_ARCH_S390)
/*
@@ -53,12 +61,7 @@
*/
static inline int kstat_irqs (int irq)
{
- int i, sum=0;
-
- for (i = 0 ; i < NR_CPUS ; i++)
- sum += kstat.irqs[i][irq];
-
- return sum;
+ return statctr_read(&kstat.irqs[irq]);
}
#endif

2002-07-29 12:23:16

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Sun, 28 Jul 2002, Martin J. Bligh wrote:

> The strangest thing to me is that all the IDs are 0, when they
> weren't to start with. Something's corrupting memory ... I'd
> happily blame my own code, but allegedly this has happened without
> CONFIG_MULTIQUAD too.

Do you know wether its just the IDs or the whole mp_ioapic[x] struct?

Zwane Mwaikambo

--
function.linuxpower.ca

2002-07-29 14:14:19

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Fri, Jul 26, 2002 at 12:27:40PM -0300, Rik van Riel wrote:
> On Fri, 26 Jul 2002, Ravikiran G Thirumalai wrote:
>
> > Rik, You were interested in using this. Does this implementation suit
> > your needs?
>
> >From a quick glance it looks like it will.
>
> However, it might be more efficient to put the statistics
> in one file in /proc with named fields, or have a way to
> group them in one or multiple files.
>

Ok, Here's what I have in mind;
Introduce a statctr_base_t type which represents a group of counters on
the same proc file, have statctr_base_init and statctr_base_cleanup
to create and destroy these datatypes. Each line in the /proc file
will be of the form
countername value

Here are the changed and new interfaces:
1. int statctr_init(statctr_t *statctr, unsigned long value,
statctr_base_t *base, const char *countername, int flags);
2. void statctr_cleanup(statctr_t *);
3. int statctr_base_init(statctr_base_t *base,
struct proc_dir_entry *procbase, const char *procname);
4. extern void statctr_base_cleanup(satctr_base_t *base);

Does this look ok?

> Not sure about that, though ... really depends on how
> expensive stat+open+read+close is compared to parsing a
> file with multiple fields.
>

Also, when you group counters into a single file, you inadvertently
end up reading counters you may not require at that time.
Reads to statctrs are not as cheap as cpu local writes. You'll have to
take that into account too i guess .....

Thanks,
Kiran

2002-07-29 14:52:37

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

>> The strangest thing to me is that all the IDs are 0, when they
>> weren't to start with. Something's corrupting memory ... I'd
>> happily blame my own code, but allegedly this has happened without
>> CONFIG_MULTIQUAD too.
>
> Do you know wether its just the IDs or the whole mp_ioapic[x] struct?

Probably the whole struct, but no ... I don't know for sure.
The debug is easy once we know 1 element that gets corrupted
though.

M.

2002-07-29 18:22:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

Ravikiran G Thirumalai wrote:
>
> ...
>
> > General comment: we need to clean up the kernel_stat stuff. We
> > cannot just make it per-cpu because it is 32k in size already. I
> > would suggest that we should break out the disk accounting and
> > make the rest of kernel_stat per CPU.
> >
> > That would be a great application of your interface, and a good
> > way to get your interface merged ;) Is that something which you
> > have time to do?
>
> Sure,... anything to get these interfaces merged :)

Well Rusty's point about just using the percpu API seemed
reasonable - that's basically equivalent to statically defining
the data. In what situation is dynamic allocation needed?

Well, in modules for one. Unless we work out a way to make the
percpu API work with storage which is defined within modules.

> Are you looking at something on the lines as the diff below?

Looks nice.

> What about /proc/stat ? Is it a good idea to have separate /proc files
> for disk stats and cpu usage stats? (It'll be good for statctrs that way,
> applications monitoring disk_stats only don't cause statctr_reads on
> cpu_usage stats then)

Stick with the existing format, I'd say. Numerous applications
would break otherwise.

-

2002-07-29 18:43:14

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Mon, Jul 29, 2002 at 11:23:06AM -0700, Andrew Morton wrote:
> Ravikiran G Thirumalai wrote:
> >
> > ...
> >
> > > General comment: we need to clean up the kernel_stat stuff. We
> > > cannot just make it per-cpu because it is 32k in size already. I
> > > would suggest that we should break out the disk accounting and
> > > make the rest of kernel_stat per CPU.
> > >
> > > That would be a great application of your interface, and a good
> > > way to get your interface merged ;) Is that something which you
> > > have time to do?
> >
> > Sure,... anything to get these interfaces merged :)
>
> Well Rusty's point about just using the percpu API seemed
> reasonable - that's basically equivalent to statically defining
> the data. In what situation is dynamic allocation needed?
>
> Well, in modules for one. Unless we work out a way to make the
> percpu API work with storage which is defined within modules.

Apart from modules, I can see two advantages -

1. static per_cpu data has to be allocated based on NR_CPUS because
of the current offset calculations. kmalloc_percpu() needs to allocate only
for cpu_possible() cpus.

2. If the statistics is a part of a dynamically allocated structure.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-07-30 11:21:52

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [RFC] Scalable statistics counters using kmalloc_percpu

On Mon, Jul 29, 2002 at 11:23:06AM -0700, Andrew Morton wrote:
> ...
> Well Rusty's point about just using the percpu API seemed
> reasonable - that's basically equivalent to statically defining
> the data. In what situation is dynamic allocation needed?
>

As Dipankar pointed out, counters part of dynamically allocated structures;
Like:

1. k_atm_dev_stats in atm_dev -- they are all atomic_t types right now.
we must see a major benefit with statctrs here.
I don't have access to atm infrastructure to test it though :(((
2. Atomic counters like dst->__refcnt; we might be able to hack
route cache so that the periodic garbage collector can tolerate
stale reads on the refcnt. A dst_entry might get gc'd late, nevertheless
it would boost performance on networking loads. (I still have to
investigate if we can do it at all...).

This is what I have in mind for now. Of course, if these primitives
are available to all, programmers will have more flexibility in
designing datastructures. We can then afford to have non atomic_t
global counters for debugging etc., anywhere in the kernel; modules/dynamic
structures...

Thanks,
Kiran