2006-03-08 01:57:31

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 0/4] net: percpufy frequently used vars on struct proto

Following patchset converts struct proto.memory_allocated to use batching
per-cpu counters, struct proto.sockets_allocated to use per-cpu counters and
changes the proto.inuse per-cpu variable to use alloc_percpu instead of the
NR_CPUS x cacheline size padding.

We observed 5% improvement in apache bench requests per second with this
patchset on a multi NIC 8 way IBM x460 box.

(This was posted earlier
http://marc.theaimsgroup.com/?l=linux-kernel&m=113830220408812&w=2 )

Can this go into -mm please?

Thanks,
Kiran


2006-03-08 01:58:54

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Add percpu_counter_mod_bh for using these counters safely from
both softirq and process context.

Signed-off by: Pravin B. Shelar <[email protected]>
Signed-off by: Ravikiran G Thirumalai <[email protected]>
Signed-off by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc5mm3/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/linux/percpu_counter.h 2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/linux/percpu_counter.h 2006-03-07 15:09:21.000000000 -0800
@@ -11,6 +11,7 @@
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/percpu.h>
+#include <linux/interrupt.h>

#ifdef CONFIG_SMP

@@ -110,4 +111,11 @@ static inline void percpu_counter_dec(st
percpu_counter_mod(fbc, -1);
}

+static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+ local_bh_disable();
+ percpu_counter_mod(fbc, amount);
+ local_bh_enable();
+}
+
#endif /* _LINUX_PERCPU_COUNTER_H */

2006-03-08 02:00:38

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated

Change struct proto->memory_allocated to a batching per-CPU counter
(percpu_counter) from an atomic_t. A batching counter is better than a
plain per-CPU counter as this field is read often.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h 2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h 2006-03-07 15:11:48.000000000 -0800
@@ -48,6 +48,7 @@
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/percpu_counter.h>

#include <linux/filter.h>

@@ -541,8 +542,9 @@ struct proto {

/* Memory pressure */
void (*enter_memory_pressure)(void);
- atomic_t *memory_allocated; /* Current allocated memory. */
+ struct percpu_counter *memory_allocated; /* Current allocated memory. */
atomic_t *sockets_allocated; /* Current number of sockets. */
+
/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h 2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h 2006-03-07 15:11:48.000000000 -0800
@@ -225,7 +225,7 @@ extern int sysctl_tcp_abc;
extern int sysctl_tcp_mtu_probing;
extern int sysctl_tcp_base_mss;

-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
extern atomic_t tcp_sockets_allocated;
extern int tcp_memory_pressure;

Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c 2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c 2006-03-07 15:11:48.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con

static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
{
- seq_printf(seq, "%-9s %4u %6d %6d %-3s %6u %-3s %-10s "
+ seq_printf(seq, "%-9s %4u %6d %6lu %-3s %6u %-3s %-10s "
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
- proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+ proto->memory_allocated != NULL ?
+ percpu_counter_read_positive(proto->memory_allocated) : -1,
proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
proto->max_header,
proto->slab == NULL ? "no" : "yes",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c 2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c 2006-03-07 15:11:48.000000000 -0800
@@ -196,11 +196,11 @@ EXPORT_SYMBOL(sk_stream_error);
void __sk_stream_mem_reclaim(struct sock *sk)
{
if (sk->sk_forward_alloc >= SK_STREAM_MEM_QUANTUM) {
- atomic_sub(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM,
- sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated,
+ -(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM));
sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
if (*sk->sk_prot->memory_pressure &&
- (atomic_read(sk->sk_prot->memory_allocated) <
+ (percpu_counter_read(sk->sk_prot->memory_allocated) <
sk->sk_prot->sysctl_mem[0]))
*sk->sk_prot->memory_pressure = 0;
}
@@ -213,23 +213,26 @@ int sk_stream_mem_schedule(struct sock *
int amt = sk_stream_pages(size);

sk->sk_forward_alloc += amt * SK_STREAM_MEM_QUANTUM;
- atomic_add(amt, sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated, amt);

/* Under limit. */
- if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) <
+ sk->sk_prot->sysctl_mem[0]) {
if (*sk->sk_prot->memory_pressure)
*sk->sk_prot->memory_pressure = 0;
return 1;
}

/* Over hard limit. */
- if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+ sk->sk_prot->sysctl_mem[2]) {
sk->sk_prot->enter_memory_pressure();
goto suppress_allocation;
}

/* Under pressure. */
- if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[1])
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+ sk->sk_prot->sysctl_mem[1])
sk->sk_prot->enter_memory_pressure();

if (kind) {
@@ -259,7 +262,7 @@ suppress_allocation:

/* Alas. Undo changes. */
sk->sk_forward_alloc -= amt * SK_STREAM_MEM_QUANTUM;
- atomic_sub(amt, sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated, -amt);
return 0;
}

Index: linux-2.6.16-rc5mm3/net/decnet/af_decnet.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/decnet/af_decnet.c 2006-03-07 15:07:43.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/decnet/af_decnet.c 2006-03-07 15:09:22.000000000 -0800
@@ -154,7 +154,7 @@ static const struct proto_ops dn_proto_o
static DEFINE_RWLOCK(dn_hash_lock);
static struct hlist_head dn_sk_hash[DN_SK_HASH_SIZE];
static struct hlist_head dn_wild_sk;
-static atomic_t decnet_memory_allocated;
+static struct percpu_counter decnet_memory_allocated;

static int __dn_setsockopt(struct socket *sock, int level, int optname, char __user *optval, int optlen, int flags);
static int __dn_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen, int flags);
@@ -2383,7 +2383,8 @@ static int __init decnet_init(void)
rc = proto_register(&dn_proto, 1);
if (rc != 0)
goto out;
-
+
+ percpu_counter_init(&decnet_memory_allocated);
dn_neigh_init();
dn_dev_init();
dn_route_init();
@@ -2424,6 +2425,7 @@ static void __exit decnet_exit(void)
proc_net_remove("decnet");

proto_unregister(&dn_proto);
+ percpu_counter_destroy(&decnet_memory_allocated);
}
module_exit(decnet_exit);
#endif
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c 2006-03-07 15:07:44.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c 2006-03-07 15:11:48.000000000 -0800
@@ -61,10 +61,10 @@ static int fold_prot_inuse(struct proto
static int sockstat_seq_show(struct seq_file *seq, void *v)
{
socket_seq_show(seq);
- seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %d\n",
+ seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
- atomic_read(&tcp_memory_allocated));
+ percpu_counter_read_positive(&tcp_memory_allocated));
seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
seq_printf(seq, "FRAG: inuse %d memory %d\n", ip_frag_nqueues,
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c 2006-03-07 15:07:44.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c 2006-03-07 15:11:48.000000000 -0800
@@ -283,7 +283,7 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
EXPORT_SYMBOL(sysctl_tcp_rmem);
EXPORT_SYMBOL(sysctl_tcp_wmem);

-atomic_t tcp_memory_allocated; /* Current allocated memory. */
+struct percpu_counter tcp_memory_allocated; /* Current allocated memory. */
atomic_t tcp_sockets_allocated; /* Current number of TCP sockets. */

EXPORT_SYMBOL(tcp_memory_allocated);
@@ -1593,7 +1593,7 @@ adjudge_to_death:
sk_stream_mem_reclaim(sk);
if (atomic_read(sk->sk_prot->orphan_count) > sysctl_tcp_max_orphans ||
(sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+ percpu_counter_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
if (net_ratelimit())
printk(KERN_INFO "TCP: too many of orphaned "
"sockets\n");
@@ -2127,6 +2127,8 @@ void __init tcp_init(void)
"(established %d bind %d)\n",
tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);

+ percpu_counter_init(&tcp_memory_allocated);
+
tcp_register_congestion_control(&tcp_reno);
}

Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_input.c 2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c 2006-03-07 15:09:22.000000000 -0800
@@ -333,7 +333,7 @@ static void tcp_clamp_window(struct sock
if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
!tcp_memory_pressure &&
- atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+ percpu_counter_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
sysctl_tcp_rmem[2]);
}
@@ -3555,7 +3555,7 @@ static int tcp_should_expand_sndbuf(stru
return 0;

/* If we are under soft global TCP memory pressure, do not expand. */
- if (atomic_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+ if (percpu_counter_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
return 0;

/* If we filled the congestion window, do not expand. */
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_timer.c 2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c 2006-03-07 15:09:22.000000000 -0800
@@ -80,7 +80,7 @@ static int tcp_out_of_resources(struct s

if (orphans >= sysctl_tcp_max_orphans ||
(sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+ percpu_counter_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
if (net_ratelimit())
printk(KERN_INFO "Out of socket memory\n");

2006-03-08 02:01:47

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated

Change the atomic_t sockets_allocated member of struct proto to a
per-cpu counter.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h 2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h 2006-03-07 15:09:52.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
/* Memory pressure */
void (*enter_memory_pressure)(void);
struct percpu_counter *memory_allocated; /* Current allocated memory. */
- atomic_t *sockets_allocated; /* Current number of sockets. */
+ int *sockets_allocated; /* Current number of sockets (percpu counter). */

/*
* Pressure flag: try to collapse.
@@ -579,6 +579,24 @@ struct proto {
} stats[NR_CPUS];
};

+static inline int read_sockets_allocated(struct proto *prot)
+{
+ int total = 0;
+ int cpu;
+ for_each_cpu(cpu)
+ total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+ return total;
+}
+
+static inline void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+ (*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+ put_cpu();
+}
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
extern int proto_register(struct proto *prot, int alloc_slab);
extern void proto_unregister(struct proto *prot);

Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h 2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h 2006-03-07 15:09:52.000000000 -0800
@@ -226,7 +226,7 @@ extern int sysctl_tcp_mtu_probing;
extern int sysctl_tcp_base_mss;

extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
extern int tcp_memory_pressure;

/*
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c 2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c 2006-03-07 15:09:52.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock
newsk->sk_sleep = NULL;

if (newsk->sk_prot->sockets_allocated)
- atomic_inc(newsk->sk_prot->sockets_allocated);
+ inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
}
out:
return newsk;
@@ -1620,7 +1620,7 @@ static void proto_seq_printf(struct seq_
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
- proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+ proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
proto->memory_allocated != NULL ?
percpu_counter_read_positive(proto->memory_allocated) : -1,
proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c 2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c 2006-03-07 15:09:52.000000000 -0800
@@ -242,7 +242,7 @@ int sk_stream_mem_schedule(struct sock *
return 1;

if (!*sk->sk_prot->memory_pressure ||
- sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+ sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
sk_stream_pages(sk->sk_wmem_queued +
atomic_read(&sk->sk_rmem_alloc) +
sk->sk_forward_alloc))
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c 2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c 2006-03-07 15:09:52.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
socket_seq_show(seq);
seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
- tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+ tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
percpu_counter_read_positive(&tcp_memory_allocated));
seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c 2006-03-07 15:09:22.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c 2006-03-07 15:09:52.000000000 -0800
@@ -284,7 +284,7 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
EXPORT_SYMBOL(sysctl_tcp_wmem);

struct percpu_counter tcp_memory_allocated; /* Current allocated memory. */
-atomic_t tcp_sockets_allocated; /* Current number of TCP sockets. */
+int *tcp_sockets_allocated; /* Current number of TCP sockets. */

EXPORT_SYMBOL(tcp_memory_allocated);
EXPORT_SYMBOL(tcp_sockets_allocated);
@@ -2055,6 +2055,12 @@ void __init tcp_init(void)
if (!tcp_hashinfo.bind_bucket_cachep)
panic("tcp_init: Cannot alloc tcp_bind_bucket cache.");

+ tcp_sockets_allocated = alloc_percpu(int);
+ if (!tcp_sockets_allocated)
+ panic("tcp_init: Cannot alloc tcp_sockets_allocated");
+
+ tcp_prot.sockets_allocated = tcp_sockets_allocated;
+
/* Size and allocate the main established and bind bucket
* hash tables.
*
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_ipv4.c 2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c 2006-03-07 15:09:52.000000000 -0800
@@ -1273,7 +1273,7 @@ static int tcp_v4_init_sock(struct sock
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];

- atomic_inc(&tcp_sockets_allocated);
+ inc_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1307,7 +1307,7 @@ int tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

- atomic_dec(&tcp_sockets_allocated);
+ dec_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1815,7 +1815,6 @@ struct proto tcp_prot = {
.unhash = tcp_unhash,
.get_port = tcp_v4_get_port,
.enter_memory_pressure = tcp_enter_memory_pressure,
- .sockets_allocated = &tcp_sockets_allocated,
.orphan_count = &tcp_orphan_count,
.memory_allocated = &tcp_memory_allocated,
.memory_pressure = &tcp_memory_pressure,
Index: linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv6/tcp_ipv6.c 2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c 2006-03-07 15:11:43.000000000 -0800
@@ -1375,7 +1375,7 @@ static int tcp_v6_init_sock(struct sock
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];

- atomic_inc(&tcp_sockets_allocated);
+ inc_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1573,7 +1573,6 @@ struct proto tcpv6_prot = {
.unhash = tcp_unhash,
.get_port = tcp_v6_get_port,
.enter_memory_pressure = tcp_enter_memory_pressure,
- .sockets_allocated = &tcp_sockets_allocated,
.memory_allocated = &tcp_memory_allocated,
.memory_pressure = &tcp_memory_pressure,
.orphan_count = &tcp_orphan_count,
@@ -1605,6 +1604,7 @@ static struct inet_protosw tcpv6_protosw

void __init tcpv6_init(void)
{
+ tcpv6_prot.sockets_allocated = tcp_sockets_allocated;
/* register inet6 protocol */
if (inet6_add_protocol(&tcpv6_protocol, IPPROTO_TCP) < 0)
printk(KERN_ERR "tcpv6_init: Could not register protocol\n");

2006-03-08 02:03:15

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [patch 4/4] net: percpufy frequently used vars -- proto.inuse

This patch uses alloc_percpu to allocate per-CPU memory for the proto->inuse
field. The inuse field is currently per-CPU as in NR_CPUS * cacheline size --
a big bloat on arches with large cachelines. Also marks some frequently
used protos read mostly.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc2/include/net/sock.h
===================================================================
--- linux-2.6.16-rc2.orig/include/net/sock.h 2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/include/net/sock.h 2006-02-09 15:01:52.000000000 -0800
@@ -573,10 +573,7 @@ struct proto {
#ifdef SOCK_REFCNT_DEBUG
atomic_t socks;
#endif
- struct {
- int inuse;
- u8 __pad[SMP_CACHE_BYTES - sizeof(int)];
- } stats[NR_CPUS];
+ int *inuse;
};

static inline int read_sockets_allocated(struct proto *prot)
@@ -628,12 +625,12 @@ static inline void sk_refcnt_debug_relea
/* Called with local bh disabled */
static __inline__ void sock_prot_inc_use(struct proto *prot)
{
- prot->stats[smp_processor_id()].inuse++;
+ (*per_cpu_ptr(prot->inuse, smp_processor_id())) += 1;
}

static __inline__ void sock_prot_dec_use(struct proto *prot)
{
- prot->stats[smp_processor_id()].inuse--;
+ (*per_cpu_ptr(prot->inuse, smp_processor_id())) -= 1;
}

/* With per-bucket locks this operation is not-atomic, so that
Index: linux-2.6.16-rc2/net/core/sock.c
===================================================================
--- linux-2.6.16-rc2.orig/net/core/sock.c 2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/net/core/sock.c 2006-02-09 15:01:52.000000000 -0800
@@ -1508,12 +1508,21 @@ int proto_register(struct proto *prot, i
}
}

+ prot->inuse = alloc_percpu(int);
+ if (prot->inuse == NULL) {
+ if (alloc_slab)
+ goto out_free_timewait_sock_slab_name_cache;
+ else
+ goto out;
+ }
write_lock(&proto_list_lock);
list_add(&prot->node, &proto_list);
write_unlock(&proto_list_lock);
rc = 0;
out:
return rc;
+out_free_timewait_sock_slab_name_cache:
+ kmem_cache_destroy(prot->twsk_prot->twsk_slab);
out_free_timewait_sock_slab_name:
kfree(timewait_sock_slab_name);
out_free_request_sock_slab:
@@ -1537,6 +1546,11 @@ void proto_unregister(struct proto *prot
list_del(&prot->node);
write_unlock(&proto_list_lock);

+ if (prot->inuse != NULL) {
+ free_percpu(prot->inuse);
+ prot->inuse = NULL;
+ }
+
if (prot->slab != NULL) {
kmem_cache_destroy(prot->slab);
prot->slab = NULL;
Index: linux-2.6.16-rc2/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/proc.c 2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/proc.c 2006-02-09 15:01:52.000000000 -0800
@@ -50,7 +50,7 @@ static int fold_prot_inuse(struct proto
int cpu;

for_each_cpu(cpu)
- res += proto->stats[cpu].inuse;
+ res += (*per_cpu_ptr(proto->inuse, cpu));

return res;
}
Index: linux-2.6.16-rc2/net/ipv4/raw.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/raw.c 2006-02-07 23:14:04.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/raw.c 2006-02-09 15:01:52.000000000 -0800
@@ -718,7 +718,7 @@ static int raw_ioctl(struct sock *sk, in
}
}

-struct proto raw_prot = {
+struct proto raw_prot __read_mostly = {
.name = "RAW",
.owner = THIS_MODULE,
.close = raw_close,
Index: linux-2.6.16-rc2/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/tcp_ipv4.c 2006-02-09 15:01:47.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/tcp_ipv4.c 2006-02-09 15:01:52.000000000 -0800
@@ -1794,7 +1794,7 @@ void tcp4_proc_exit(void)
}
#endif /* CONFIG_PROC_FS */

-struct proto tcp_prot = {
+struct proto tcp_prot __read_mostly = {
.name = "TCP",
.owner = THIS_MODULE,
.close = tcp_close,
Index: linux-2.6.16-rc2/net/ipv4/udp.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv4/udp.c 2006-02-07 23:14:04.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv4/udp.c 2006-02-09 15:01:52.000000000 -0800
@@ -1340,7 +1340,7 @@ unsigned int udp_poll(struct file *file,

}

-struct proto udp_prot = {
+struct proto udp_prot __read_mostly = {
.name = "UDP",
.owner = THIS_MODULE,
.close = udp_close,
Index: linux-2.6.16-rc2/net/ipv6/proc.c
===================================================================
--- linux-2.6.16-rc2.orig/net/ipv6/proc.c 2006-02-07 23:19:10.000000000 -0800
+++ linux-2.6.16-rc2/net/ipv6/proc.c 2006-02-09 15:01:52.000000000 -0800
@@ -39,7 +39,7 @@ static int fold_prot_inuse(struct proto
int cpu;

for_each_cpu(cpu)
- res += proto->stats[cpu].inuse;
+ res += (*per_cpu_ptr(proto->inuse, cpu));

return res;
}

2006-03-08 02:15:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Ravikiran G Thirumalai <[email protected]> wrote:
>
> +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
> +{
> + local_bh_disable();
> + percpu_counter_mod(fbc, amount);
> + local_bh_enable();
> +}
> +

percpu_counter_mod() does preempt_disable(), which is redundant in this
context. So just do fbc->count += amount; here.


2006-03-08 02:16:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> - if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
> + if (percpu_counter_read(sk->sk_prot->memory_allocated) <
> + sk->sk_prot->sysctl_mem[0]) {

Bear in mind that percpu_counter_read[_positive] can be inaccurate on large
CPU counts.

It might be worth running percpu_counter_sum() to get the exact count if we
think we're about to cause something to fail.

2006-03-08 02:18:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> --- linux-2.6.16-rc5mm3.orig/include/net/sock.h 2006-03-07 15:09:22.000000000 -0800
> +++ linux-2.6.16-rc5mm3/include/net/sock.h 2006-03-07 15:09:52.000000000 -0800
> @@ -543,7 +543,7 @@ struct proto {
> /* Memory pressure */
> void (*enter_memory_pressure)(void);
> struct percpu_counter *memory_allocated; /* Current allocated memory. */
> - atomic_t *sockets_allocated; /* Current number of sockets. */
> + int *sockets_allocated; /* Current number of sockets (percpu counter). */
>
> /*
> * Pressure flag: try to collapse.
> @@ -579,6 +579,24 @@ struct proto {
> } stats[NR_CPUS];
> };
>
> +static inline int read_sockets_allocated(struct proto *prot)
> +{
> + int total = 0;
> + int cpu;
> + for_each_cpu(cpu)
> + total += *per_cpu_ptr(prot->sockets_allocated, cpu);
> + return total;
> +}

This is likely too big to be inlined, plus sock.h doesn't include enough
headers to reliably compile this code.

> +static inline void mod_sockets_allocated(int *sockets_allocated, int count)
> +{
> + (*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
> + put_cpu();
> +}
> +

Ditto.

2006-03-08 03:07:24

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated

On Tue, Mar 07, 2006 at 06:14:22PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > - if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
> > + if (percpu_counter_read(sk->sk_prot->memory_allocated) <
> > + sk->sk_prot->sysctl_mem[0]) {
>
> Bear in mind that percpu_counter_read[_positive] can be inaccurate on large
> CPU counts.
>
> It might be worth running percpu_counter_sum() to get the exact count if we
> think we're about to cause something to fail.

The problem is percpu_counter_sum has to read all the cpus cachelines. If
we have to use percpu_counter_sum everywhere, then might as well use plain
per-cpu counters instead of batching ones no?

sysctl_mem[0] is about 196K and on a 16 cpu box variance is 512 bytes, which
is OK with just percpu_counter_read I hope. Maybe, on very large cpu counts,
we should just change the FBC_BATCH so that variance does not go quadratic.
Something like 32. So that variance is 32 * NR_CPUS in that case, instead
of (NR_CPUS * NR_CPUS * 2) currently. Comments?

2006-03-08 03:24:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Tue, Mar 07, 2006 at 06:14:22PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[email protected]> wrote:
> > >
> > > - if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
> > > + if (percpu_counter_read(sk->sk_prot->memory_allocated) <
> > > + sk->sk_prot->sysctl_mem[0]) {
> >
> > Bear in mind that percpu_counter_read[_positive] can be inaccurate on large
> > CPU counts.
> >
> > It might be worth running percpu_counter_sum() to get the exact count if we
> > think we're about to cause something to fail.
>
> The problem is percpu_counter_sum has to read all the cpus cachelines. If
> we have to use percpu_counter_sum everywhere, then might as well use plain
> per-cpu counters instead of batching ones no?

I didn't say "use it everywhere" ;)

Just in places like this:

if (percpu_counter_read(something) > something_else)
make_an_application_fail();

in that case it's worth running percpu_counter_sum(). And bear in mind
that once we've done that, the following percpu_counter_read()s become
accurate, so we won't run the expensive percpu_counter_sum() again
for a while. Unless we're really close to or over the limit, in which case
blowing a few cycles is relatively unimportant.

All that should be captured in library code (per_cpu_counter_exceeds(ptr,
threshold), for example) rather than open-coded everywhere.

> sysctl_mem[0] is about 196K and on a 16 cpu box variance is 512 bytes, which
> is OK with just percpu_counter_read I hope.

You mean a 16 CPU box with NR_CPUS=16 as well...

> Maybe, on very large cpu counts,
> we should just change the FBC_BATCH so that variance does not go quadratic.
> Something like 32. So that variance is 32 * NR_CPUS in that case, instead
> of (NR_CPUS * NR_CPUS * 2) currently. Comments?

Sure, we need to make that happen. But it got all mixed up with the
spinlock removal and it does need quite some thought and testing and
documentation to help developers to choose the right settings and
appropriate selection of defaults, etc.

2006-03-08 20:26:30

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Tue, Mar 07, 2006 at 06:13:01PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
> > +{
> > + local_bh_disable();
> > + percpu_counter_mod(fbc, amount);
> > + local_bh_enable();
> > +}
> > +
>
> percpu_counter_mod() does preempt_disable(), which is redundant in this
> context. So just do fbc->count += amount; here.

OK. Here is the revised patch.


Add percpu_counter_mod_bh for using these counters safely from
both softirq and process context.

Signed-off by: Pravin B. Shelar <[email protected]>
Signed-off by: Ravikiran G Thirumalai <[email protected]>
Signed-off by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc5mm3/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/linux/percpu_counter.h 2006-03-07 15:08:00.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/linux/percpu_counter.h 2006-03-08 10:53:13.000000000 -0800
@@ -11,6 +11,7 @@
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/percpu.h>
+#include <linux/interrupt.h>

#ifdef CONFIG_SMP

@@ -40,6 +41,7 @@ static inline void percpu_counter_destro

void percpu_counter_mod(struct percpu_counter *fbc, long amount);
long percpu_counter_sum(struct percpu_counter *fbc);
+void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount);

static inline long percpu_counter_read(struct percpu_counter *fbc)
{
@@ -98,6 +100,12 @@ static inline long percpu_counter_sum(st
return percpu_counter_read_positive(fbc);
}

+static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+ local_bh_disable();
+ fbc->count += amount;
+ local_bh_enable();
+}
#endif /* CONFIG_SMP */

static inline void percpu_counter_inc(struct percpu_counter *fbc)
@@ -110,4 +118,5 @@ static inline void percpu_counter_dec(st
percpu_counter_mod(fbc, -1);
}

+
#endif /* _LINUX_PERCPU_COUNTER_H */
Index: linux-2.6.16-rc5mm3/mm/swap.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/mm/swap.c 2006-03-07 15:08:01.000000000 -0800
+++ linux-2.6.16-rc5mm3/mm/swap.c 2006-03-08 12:11:19.000000000 -0800
@@ -521,11 +521,11 @@ static int cpu_swap_callback(struct noti
#endif /* CONFIG_SMP */

#ifdef CONFIG_SMP
-void percpu_counter_mod(struct percpu_counter *fbc, long amount)
+static void __percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
long count;
long *pcount;
- int cpu = get_cpu();
+ int cpu = smp_processor_id();

pcount = per_cpu_ptr(fbc->counters, cpu);
count = *pcount + amount;
@@ -537,9 +537,21 @@ void percpu_counter_mod(struct percpu_co
} else {
*pcount = count;
}
- put_cpu();
}
-EXPORT_SYMBOL(percpu_counter_mod);
+
+void percpu_counter_mod(struct percpu_counter *fbc, long amount)
+{
+ preempt_disable();
+ __percpu_counter_mod(fbc, amount);
+ preempt_enable();
+}
+
+void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
+{
+ local_bh_disable();
+ __percpu_counter_mod(fbc, amount);
+ local_bh_enable();
+}

/*
* Add up all the per-cpu counts, return the result. This is a more accurate
@@ -559,6 +571,9 @@ long percpu_counter_sum(struct percpu_co
spin_unlock(&fbc->lock);
return ret < 0 ? 0 : ret;
}
+
+EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(percpu_counter_mod_bh);
EXPORT_SYMBOL(percpu_counter_sum);
#endif

2006-03-08 20:42:04

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Wed, Mar 08, 2006 at 12:26:56PM -0800, Ravikiran G Thirumalai wrote:
> +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
> +{
> + local_bh_disable();
> + fbc->count += amount;
> + local_bh_enable();
> +}

Please use local_t instead, then you don't have to do the local_bh_disable()
and enable() and the whole thing collapses down into 1 instruction on x86.

-ben

2006-03-08 20:53:29

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 2/4] net: percpufy frequently used vars -- struct proto.memory_allocated

On Tue, Mar 07, 2006 at 07:22:34PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > The problem is percpu_counter_sum has to read all the cpus cachelines. If
> > we have to use percpu_counter_sum everywhere, then might as well use plain
> > per-cpu counters instead of batching ones no?
>
> I didn't say "use it everywhere" ;)

:) No you did not. Sorry. When the counter shows large varience it will affect
all places where it is read and it wasn't clear to me where to use
percpu_counter_sum vs percpu_counter_read in the context of the patch.
But yes, using percpu_counter_sum at the critical points makes things better.
Thanks for pointing it out. Here is an attempt on those lines.

>
> > Maybe, on very large cpu counts,
> > we should just change the FBC_BATCH so that variance does not go quadratic.
> > Something like 32. So that variance is 32 * NR_CPUS in that case, instead
> > of (NR_CPUS * NR_CPUS * 2) currently. Comments?
>
> Sure, we need to make that happen. But it got all mixed up with the
> spinlock removal and it does need quite some thought and testing and
> documentation to help developers to choose the right settings and
> appropriate selection of defaults, etc.

I was thinking on the lines of something like the following as a simple fix
to the quadratic variance. But yes, it needs some experimentation before hand..

#if NR_CPUS >= 16
#define FBC_BATCH (32)
#else
#define FBC_BATCH (NR_CPUS*4)
#endif

Thanks,
Kiran


Change struct proto->memory_allocated to a batching per-CPU counter
(percpu_counter) from an atomic_t. A batching counter is better than a
plain per-CPU counter as this field is read often.

Changes since last post: Use the more accurate percpu_counter_sum() at
critical places.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h 2006-03-08 11:03:19.000000000 -0800
@@ -48,6 +48,7 @@
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/percpu_counter.h>

#include <linux/filter.h>

@@ -541,8 +542,9 @@ struct proto {

/* Memory pressure */
void (*enter_memory_pressure)(void);
- atomic_t *memory_allocated; /* Current allocated memory. */
+ struct percpu_counter *memory_allocated; /* Current allocated memory. */
atomic_t *sockets_allocated; /* Current number of sockets. */
+
/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h 2006-03-08 11:03:19.000000000 -0800
@@ -225,7 +225,7 @@ extern int sysctl_tcp_abc;
extern int sysctl_tcp_mtu_probing;
extern int sysctl_tcp_base_mss;

-extern atomic_t tcp_memory_allocated;
+extern struct percpu_counter tcp_memory_allocated;
extern atomic_t tcp_sockets_allocated;
extern int tcp_memory_pressure;

Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c 2006-03-08 11:03:19.000000000 -0800
@@ -1616,12 +1616,13 @@ static char proto_method_implemented(con

static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
{
- seq_printf(seq, "%-9s %4u %6d %6d %-3s %6u %-3s %-10s "
+ seq_printf(seq, "%-9s %4u %6d %6lu %-3s %6u %-3s %-10s "
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
- proto->memory_allocated != NULL ? atomic_read(proto->memory_allocated) : -1,
+ proto->memory_allocated != NULL ?
+ percpu_counter_read_positive(proto->memory_allocated) : -1,
proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
proto->max_header,
proto->slab == NULL ? "no" : "yes",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c 2006-03-08 11:08:28.000000000 -0800
@@ -196,11 +196,11 @@ EXPORT_SYMBOL(sk_stream_error);
void __sk_stream_mem_reclaim(struct sock *sk)
{
if (sk->sk_forward_alloc >= SK_STREAM_MEM_QUANTUM) {
- atomic_sub(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM,
- sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated,
+ -(sk->sk_forward_alloc / SK_STREAM_MEM_QUANTUM));
sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
if (*sk->sk_prot->memory_pressure &&
- (atomic_read(sk->sk_prot->memory_allocated) <
+ (percpu_counter_read(sk->sk_prot->memory_allocated) <
sk->sk_prot->sysctl_mem[0]))
*sk->sk_prot->memory_pressure = 0;
}
@@ -213,23 +213,26 @@ int sk_stream_mem_schedule(struct sock *
int amt = sk_stream_pages(size);

sk->sk_forward_alloc += amt * SK_STREAM_MEM_QUANTUM;
- atomic_add(amt, sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated, amt);

/* Under limit. */
- if (atomic_read(sk->sk_prot->memory_allocated) < sk->sk_prot->sysctl_mem[0]) {
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) <
+ sk->sk_prot->sysctl_mem[0]) {
if (*sk->sk_prot->memory_pressure)
*sk->sk_prot->memory_pressure = 0;
return 1;
}

/* Over hard limit. */
- if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
+ if (percpu_counter_sum(sk->sk_prot->memory_allocated) >
+ sk->sk_prot->sysctl_mem[2]) {
sk->sk_prot->enter_memory_pressure();
goto suppress_allocation;
}

/* Under pressure. */
- if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[1])
+ if (percpu_counter_read(sk->sk_prot->memory_allocated) >
+ sk->sk_prot->sysctl_mem[1])
sk->sk_prot->enter_memory_pressure();

if (kind) {
@@ -259,7 +262,7 @@ suppress_allocation:

/* Alas. Undo changes. */
sk->sk_forward_alloc -= amt * SK_STREAM_MEM_QUANTUM;
- atomic_sub(amt, sk->sk_prot->memory_allocated);
+ percpu_counter_mod_bh(sk->sk_prot->memory_allocated, -amt);
return 0;
}

Index: linux-2.6.16-rc5mm3/net/decnet/af_decnet.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/decnet/af_decnet.c 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/decnet/af_decnet.c 2006-03-08 11:03:19.000000000 -0800
@@ -154,7 +154,7 @@ static const struct proto_ops dn_proto_o
static DEFINE_RWLOCK(dn_hash_lock);
static struct hlist_head dn_sk_hash[DN_SK_HASH_SIZE];
static struct hlist_head dn_wild_sk;
-static atomic_t decnet_memory_allocated;
+static struct percpu_counter decnet_memory_allocated;

static int __dn_setsockopt(struct socket *sock, int level, int optname, char __user *optval, int optlen, int flags);
static int __dn_getsockopt(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen, int flags);
@@ -2383,7 +2383,8 @@ static int __init decnet_init(void)
rc = proto_register(&dn_proto, 1);
if (rc != 0)
goto out;
-
+
+ percpu_counter_init(&decnet_memory_allocated);
dn_neigh_init();
dn_dev_init();
dn_route_init();
@@ -2424,6 +2425,7 @@ static void __exit decnet_exit(void)
proc_net_remove("decnet");

proto_unregister(&dn_proto);
+ percpu_counter_destroy(&decnet_memory_allocated);
}
module_exit(decnet_exit);
#endif
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c 2006-03-08 11:03:19.000000000 -0800
@@ -61,10 +61,10 @@ static int fold_prot_inuse(struct proto
static int sockstat_seq_show(struct seq_file *seq, void *v)
{
socket_seq_show(seq);
- seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %d\n",
+ seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
- atomic_read(&tcp_memory_allocated));
+ percpu_counter_read_positive(&tcp_memory_allocated));
seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
seq_printf(seq, "FRAG: inuse %d memory %d\n", ip_frag_nqueues,
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c 2006-03-08 11:16:33.000000000 -0800
@@ -283,7 +283,7 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
EXPORT_SYMBOL(sysctl_tcp_rmem);
EXPORT_SYMBOL(sysctl_tcp_wmem);

-atomic_t tcp_memory_allocated; /* Current allocated memory. */
+struct percpu_counter tcp_memory_allocated; /* Current allocated memory. */
atomic_t tcp_sockets_allocated; /* Current number of TCP sockets. */

EXPORT_SYMBOL(tcp_memory_allocated);
@@ -1593,7 +1593,7 @@ adjudge_to_death:
sk_stream_mem_reclaim(sk);
if (atomic_read(sk->sk_prot->orphan_count) > sysctl_tcp_max_orphans ||
(sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+ percpu_counter_sum(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
if (net_ratelimit())
printk(KERN_INFO "TCP: too many of orphaned "
"sockets\n");
@@ -2127,6 +2127,8 @@ void __init tcp_init(void)
"(established %d bind %d)\n",
tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);

+ percpu_counter_init(&tcp_memory_allocated);
+
tcp_register_congestion_control(&tcp_reno);
}

Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_input.c 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_input.c 2006-03-08 11:03:19.000000000 -0800
@@ -333,7 +333,7 @@ static void tcp_clamp_window(struct sock
if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
!tcp_memory_pressure &&
- atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+ percpu_counter_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
sysctl_tcp_rmem[2]);
}
@@ -3555,7 +3555,7 @@ static int tcp_should_expand_sndbuf(stru
return 0;

/* If we are under soft global TCP memory pressure, do not expand. */
- if (atomic_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+ if (percpu_counter_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
return 0;

/* If we filled the congestion window, do not expand. */
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_timer.c 2006-03-08 10:36:07.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_timer.c 2006-03-08 11:39:05.000000000 -0800
@@ -80,7 +80,7 @@ static int tcp_out_of_resources(struct s

if (orphans >= sysctl_tcp_max_orphans ||
(sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
- atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
+ percpu_counter_sum(&tcp_memory_allocated) > sysctl_tcp_mem[2])) {
if (net_ratelimit())
printk(KERN_INFO "Out of socket memory\n");

2006-03-08 20:55:43

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 3/4] net: percpufy frequently used vars -- proto.sockets_allocated

On Tue, Mar 07, 2006 at 06:16:02PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > +static inline int read_sockets_allocated(struct proto *prot)
> > +{
> > + int total = 0;
> > + int cpu;
> > + for_each_cpu(cpu)
> > + total += *per_cpu_ptr(prot->sockets_allocated, cpu);
> > + return total;
> > +}
>
> This is likely too big to be inlined, plus sock.h doesn't include enough
> headers to reliably compile this code.
>
> > +static inline void mod_sockets_allocated(int *sockets_allocated, int count)
> > +{
> > + (*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
> > + put_cpu();
> > +}
> > +
>
> Ditto.

OK, here is a revised patch.


Change the atomic_t sockets_allocated member of struct proto to a
per-cpu counter.

Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.16-rc5mm3/include/net/sock.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/sock.h 2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/sock.h 2006-03-08 12:06:52.000000000 -0800
@@ -543,7 +543,7 @@ struct proto {
/* Memory pressure */
void (*enter_memory_pressure)(void);
struct percpu_counter *memory_allocated; /* Current allocated memory. */
- atomic_t *sockets_allocated; /* Current number of sockets. */
+ int *sockets_allocated; /* Current number of sockets (percpu counter). */

/*
* Pressure flag: try to collapse.
@@ -579,6 +579,12 @@ struct proto {
} stats[NR_CPUS];
};

+extern int read_sockets_allocated(struct proto *prot);
+extern void mod_sockets_allocated(int *sockets_allocated, int count);
+
+#define inc_sockets_allocated(c) mod_sockets_allocated(c, 1)
+#define dec_sockets_allocated(c) mod_sockets_allocated(c, -1)
+
extern int proto_register(struct proto *prot, int alloc_slab);
extern void proto_unregister(struct proto *prot);

Index: linux-2.6.16-rc5mm3/include/net/tcp.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/net/tcp.h 2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/net/tcp.h 2006-03-08 11:39:40.000000000 -0800
@@ -226,7 +226,7 @@ extern int sysctl_tcp_mtu_probing;
extern int sysctl_tcp_base_mss;

extern struct percpu_counter tcp_memory_allocated;
-extern atomic_t tcp_sockets_allocated;
+extern int *tcp_sockets_allocated;
extern int tcp_memory_pressure;

/*
Index: linux-2.6.16-rc5mm3/net/core/sock.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/sock.c 2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/sock.c 2006-03-08 12:09:19.000000000 -0800
@@ -771,7 +771,7 @@ struct sock *sk_clone(const struct sock
newsk->sk_sleep = NULL;

if (newsk->sk_prot->sockets_allocated)
- atomic_inc(newsk->sk_prot->sockets_allocated);
+ inc_sockets_allocated(newsk->sk_prot->sockets_allocated);
}
out:
return newsk;
@@ -1451,6 +1451,25 @@ void sk_common_release(struct sock *sk)

EXPORT_SYMBOL(sk_common_release);

+int read_sockets_allocated(struct proto *prot)
+{
+ int total = 0;
+ int cpu;
+ for_each_cpu(cpu)
+ total += *per_cpu_ptr(prot->sockets_allocated, cpu);
+ return total;
+}
+
+EXPORT_SYMBOL(read_sockets_allocated);
+
+void mod_sockets_allocated(int *sockets_allocated, int count)
+{
+ (*per_cpu_ptr(sockets_allocated, get_cpu())) += count;
+ put_cpu();
+}
+
+EXPORT_SYMBOL(mod_sockets_allocated);
+
static DEFINE_RWLOCK(proto_list_lock);
static LIST_HEAD(proto_list);

@@ -1620,7 +1639,7 @@ static void proto_seq_printf(struct seq_
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
- proto->sockets_allocated != NULL ? atomic_read(proto->sockets_allocated) : -1,
+ proto->sockets_allocated != NULL ? read_sockets_allocated(proto) : -1,
proto->memory_allocated != NULL ?
percpu_counter_read_positive(proto->memory_allocated) : -1,
proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
Index: linux-2.6.16-rc5mm3/net/core/stream.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/core/stream.c 2006-03-08 11:08:28.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/core/stream.c 2006-03-08 11:39:40.000000000 -0800
@@ -242,7 +242,7 @@ int sk_stream_mem_schedule(struct sock *
return 1;

if (!*sk->sk_prot->memory_pressure ||
- sk->sk_prot->sysctl_mem[2] > atomic_read(sk->sk_prot->sockets_allocated) *
+ sk->sk_prot->sysctl_mem[2] > read_sockets_allocated(sk->sk_prot) *
sk_stream_pages(sk->sk_wmem_queued +
atomic_read(&sk->sk_rmem_alloc) +
sk->sk_forward_alloc))
Index: linux-2.6.16-rc5mm3/net/ipv4/proc.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/proc.c 2006-03-08 11:03:19.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/proc.c 2006-03-08 11:39:40.000000000 -0800
@@ -63,7 +63,7 @@ static int sockstat_seq_show(struct seq_
socket_seq_show(seq);
seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %lu\n",
fold_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
- tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
+ tcp_death_row.tw_count, read_sockets_allocated(&tcp_prot),
percpu_counter_read_positive(&tcp_memory_allocated));
seq_printf(seq, "UDP: inuse %d\n", fold_prot_inuse(&udp_prot));
seq_printf(seq, "RAW: inuse %d\n", fold_prot_inuse(&raw_prot));
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp.c 2006-03-08 11:16:33.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp.c 2006-03-08 11:39:40.000000000 -0800
@@ -284,7 +284,7 @@ EXPORT_SYMBOL(sysctl_tcp_rmem);
EXPORT_SYMBOL(sysctl_tcp_wmem);

struct percpu_counter tcp_memory_allocated; /* Current allocated memory. */
-atomic_t tcp_sockets_allocated; /* Current number of TCP sockets. */
+int *tcp_sockets_allocated; /* Current number of TCP sockets. */

EXPORT_SYMBOL(tcp_memory_allocated);
EXPORT_SYMBOL(tcp_sockets_allocated);
@@ -2055,6 +2055,12 @@ void __init tcp_init(void)
if (!tcp_hashinfo.bind_bucket_cachep)
panic("tcp_init: Cannot alloc tcp_bind_bucket cache.");

+ tcp_sockets_allocated = alloc_percpu(int);
+ if (!tcp_sockets_allocated)
+ panic("tcp_init: Cannot alloc tcp_sockets_allocated");
+
+ tcp_prot.sockets_allocated = tcp_sockets_allocated;
+
/* Size and allocate the main established and bind bucket
* hash tables.
*
Index: linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv4/tcp_ipv4.c 2006-03-08 10:36:03.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv4/tcp_ipv4.c 2006-03-08 11:39:40.000000000 -0800
@@ -1273,7 +1273,7 @@ static int tcp_v4_init_sock(struct sock
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];

- atomic_inc(&tcp_sockets_allocated);
+ inc_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1307,7 +1307,7 @@ int tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

- atomic_dec(&tcp_sockets_allocated);
+ dec_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1815,7 +1815,6 @@ struct proto tcp_prot = {
.unhash = tcp_unhash,
.get_port = tcp_v4_get_port,
.enter_memory_pressure = tcp_enter_memory_pressure,
- .sockets_allocated = &tcp_sockets_allocated,
.orphan_count = &tcp_orphan_count,
.memory_allocated = &tcp_memory_allocated,
.memory_pressure = &tcp_memory_pressure,
Index: linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.16-rc5mm3.orig/net/ipv6/tcp_ipv6.c 2006-03-08 10:36:03.000000000 -0800
+++ linux-2.6.16-rc5mm3/net/ipv6/tcp_ipv6.c 2006-03-08 11:39:40.000000000 -0800
@@ -1375,7 +1375,7 @@ static int tcp_v6_init_sock(struct sock
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];

- atomic_inc(&tcp_sockets_allocated);
+ inc_sockets_allocated(tcp_sockets_allocated);

return 0;
}
@@ -1573,7 +1573,6 @@ struct proto tcpv6_prot = {
.unhash = tcp_unhash,
.get_port = tcp_v6_get_port,
.enter_memory_pressure = tcp_enter_memory_pressure,
- .sockets_allocated = &tcp_sockets_allocated,
.memory_allocated = &tcp_memory_allocated,
.memory_pressure = &tcp_memory_pressure,
.orphan_count = &tcp_orphan_count,
@@ -1605,6 +1604,7 @@ static struct inet_protosw tcpv6_protosw

void __init tcpv6_init(void)
{
+ tcpv6_prot.sockets_allocated = tcp_sockets_allocated;
/* register inet6 protocol */
if (inet6_add_protocol(&tcpv6_protocol, IPPROTO_TCP) < 0)
printk(KERN_ERR "tcpv6_init: Could not register protocol\n");

2006-03-08 21:06:47

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Wed, Mar 08, 2006 at 03:36:42PM -0500, Benjamin LaHaise wrote:
> On Wed, Mar 08, 2006 at 12:26:56PM -0800, Ravikiran G Thirumalai wrote:
> > +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long amount)
> > +{
> > + local_bh_disable();
> > + fbc->count += amount;
> > + local_bh_enable();
> > +}
>
> Please use local_t instead, then you don't have to do the local_bh_disable()
> and enable() and the whole thing collapses down into 1 instruction on x86.

But on non x86, local_bh_disable() is gonna be cheaper than a cli/atomic op no?
(Even if they were switched over to do local_irq_save() and
local_irq_restore() from atomic_t's that is).

And if we use local_t, we will add the overhead for the non bh
percpu_counter_mod for non x86 arches.

Thanks,
Kiran

2006-03-08 21:22:55

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote:
> But on non x86, local_bh_disable() is gonna be cheaper than a cli/atomic op no?
> (Even if they were switched over to do local_irq_save() and
> local_irq_restore() from atomic_t's that is).

It's still more expensive than local_t.

> And if we use local_t, we will add the overhead for the non bh
> percpu_counter_mod for non x86 arches.

Last time I checked, all the major architectures had efficient local_t
implementations. Most of the RISC CPUs are able to do a load / store
conditional implementation that is the same cost (since memory barriers
tend to be explicite on powerpc). So why not use it?

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-08 22:24:50

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Wed, Mar 08, 2006 at 04:17:33PM -0500, Benjamin LaHaise wrote:
> On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote:
>
> Last time I checked, all the major architectures had efficient local_t
> implementations. Most of the RISC CPUs are able to do a load / store
> conditional implementation that is the same cost (since memory barriers
> tend to be explicite on powerpc). So why not use it?

Then, for the batched percpu_counters, we could gain by using local_t only for
the UP case. But we will have to have a new local_long_t implementation
for that. Do you think just one use case of local_long_t warrants for a new
set of apis?

Kiran

2006-03-08 22:47:08

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Wed, Mar 08, 2006 at 02:25:28PM -0800, Ravikiran G Thirumalai wrote:
> Then, for the batched percpu_counters, we could gain by using local_t only for
> the UP case. But we will have to have a new local_long_t implementation
> for that. Do you think just one use case of local_long_t warrants for a new
> set of apis?

I think it may make more sense to simply convert local_t into a long, given
that most of the users will be things like stats counters.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-03-08 23:08:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Wed, Mar 08, 2006 at 04:17:33PM -0500, Benjamin LaHaise wrote:
> > On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote:
> >
> > Last time I checked, all the major architectures had efficient local_t
> > implementations. Most of the RISC CPUs are able to do a load / store
> > conditional implementation that is the same cost (since memory barriers
> > tend to be explicite on powerpc). So why not use it?
>
> Then, for the batched percpu_counters, we could gain by using local_t only for
> the UP case. But we will have to have a new local_long_t implementation
> for that. Do you think just one use case of local_long_t warrants for a new
> set of apis?
>

local_t maps onto 32-bit values on 32-bit machines and onto 64-bit values
on 64-bit machines. unsigned longs. I don't quite trust the signedness
handling across all archs.

<looks>

Yes, alpha (for example) went and made its local_t's signed, which is wrong
and dangerous.

ia64 is signed.

mips is signed.

parisc is signed.

s390 is signed.

sparc64 is signed.

x86_64 is signed 32-bit!

All other architectures use unsigned long. A fiasco.

Once decrapify-asm-generic-localh.patch is merged I think all architectures
can and should use asm-generic/local.h.

Until decrapify-asm-generic-localh.patch has been merged and the downstream
arch consolidation has happened and the signedness problems have been
carefully reviewed and fixed I wouldn't go within a mile of local_t.

Once all that is sorted out then yes, it makes sense to convert per-cpu
counters to local_t. Note that local_t is unsigned, and percpu_counter
needs to treat it as signed.

We should also move the out-of-line percpu_counter implementation over to
lib/something.c (in obj-y).

But none of that has anything to do with these patches.

2006-03-08 23:14:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Andrew Morton <[email protected]> wrote:
>
> Once decrapify-asm-generic-localh.patch is merged I think all architectures
> can and should use asm-generic/local.h.

err, no. Because that's just atomic_long_t, and that's a locked instruction.

We need to review and fix up those architectures which have implemented the
optimised versions.

2006-03-08 23:45:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Benjamin LaHaise <[email protected]> wrote:
>
> On Wed, Mar 08, 2006 at 02:25:28PM -0800, Ravikiran G Thirumalai wrote:
> > Then, for the batched percpu_counters, we could gain by using local_t only for
> > the UP case. But we will have to have a new local_long_t implementation
> > for that. Do you think just one use case of local_long_t warrants for a new
> > set of apis?
>
> I think it may make more sense to simply convert local_t into a long, given
> that most of the users will be things like stats counters.
>

Yes, I agree that making local_t signed would be better. It's consistent
with atomic_t, atomic64_t and atomic_long_t and it's a bit more flexible.

Perhaps. A lot of applications would just be upcounters for statistics,
where unsigned is desired. But I think the consistency argument wins out.

2006-03-09 00:17:26

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Wed, Mar 08, 2006 at 03:43:21PM -0800, Andrew Morton wrote:
> Benjamin LaHaise <[email protected]> wrote:
> >
> > I think it may make more sense to simply convert local_t into a long, given
> > that most of the users will be things like stats counters.
> >
>
> Yes, I agree that making local_t signed would be better. It's consistent
> with atomic_t, atomic64_t and atomic_long_t and it's a bit more flexible.
>
> Perhaps. A lot of applications would just be upcounters for statistics,
> where unsigned is desired. But I think the consistency argument wins out.

It already is... for most of the arches except x86_64.
And on -mm, the asm-generic version uses atomic_long_t for local_t (signed
long) which seems right.

Although, I wonder why we use:

#define local_read(l) ((unsigned long)atomic_long_read(&(l)->a))

It would return a huge value if the local counter was even -1 no?

2006-03-09 00:35:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Ravikiran G Thirumalai <[email protected]> wrote:
>
> On Wed, Mar 08, 2006 at 03:43:21PM -0800, Andrew Morton wrote:
> > Benjamin LaHaise <[email protected]> wrote:
> > >
> > > I think it may make more sense to simply convert local_t into a long, given
> > > that most of the users will be things like stats counters.
> > >
> >
> > Yes, I agree that making local_t signed would be better. It's consistent
> > with atomic_t, atomic64_t and atomic_long_t and it's a bit more flexible.
> >
> > Perhaps. A lot of applications would just be upcounters for statistics,
> > where unsigned is desired. But I think the consistency argument wins out.
>
> It already is... for most of the arches except x86_64.

x86 uses unsigned long.

> And on -mm, the asm-generic version uses atomic_long_t for local_t (signed
> long) which seems right.

No, it uses unsigned long. The only place where signedness matters is
local_read(), and there it is typecast to ulong.

> Although, I wonder why we use:
>
> #define local_read(l) ((unsigned long)atomic_long_read(&(l)->a))
>
> It would return a huge value if the local counter was even -1 no?

It's casting a signed long to an unsigned long. That does the right thing.
Yes, it'll convert -1 to 0xffffffff[ffffffff].

2006-03-09 02:21:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Andrew Morton <[email protected]> writes:
>
> x86_64 is signed 32-bit!

I'll change it. You want signed 64bit?

-Andi

2006-03-09 02:34:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Andi Kleen <[email protected]> wrote:
>
> Andrew Morton <[email protected]> writes:
> >
> > x86_64 is signed 32-bit!
>
> I'll change it. You want signed 64bit?
>

Well it's all random at present. Since the API is defined as unsigned I
guess it's be best to make it unsigned for now. Later, when someone gets
down to making it signed and reviewing all the users they can flip x86_64
to signed along with the rest of the architectures.

2006-03-09 08:06:14

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Wed, Mar 08, 2006 at 04:32:58PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > On Wed, Mar 08, 2006 at 03:43:21PM -0800, Andrew Morton wrote:
> > > Benjamin LaHaise <[email protected]> wrote:
> > > >
> > > > I think it may make more sense to simply convert local_t into a long, given
> > > > that most of the users will be things like stats counters.
> > > >
> > >
> > > Yes, I agree that making local_t signed would be better. It's consistent
> > > with atomic_t, atomic64_t and atomic_long_t and it's a bit more flexible.
> > >
> > > Perhaps. A lot of applications would just be upcounters for statistics,
> > > where unsigned is desired. But I think the consistency argument wins out.
> >
> > It already is... for most of the arches except x86_64.
>
> x86 uses unsigned long.

Here's a patch making x86_64 local_t to 64 bits like other 64 bit arches.
This keeps local_t unsigned long. (We can change it to signed value
along with other arches later in one go I guess)

Thanks,
Kiran


Change x86_64 local_t to 64 bits like all other arches.

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.16-rc5mm3/include/asm-x86_64/local.h
===================================================================
--- linux-2.6.16-rc5mm3.orig/include/asm-x86_64/local.h 2006-03-08 16:51:31.000000000 -0800
+++ linux-2.6.16-rc5mm3/include/asm-x86_64/local.h 2006-03-08 21:56:01.000000000 -0800
@@ -5,18 +5,18 @@

typedef struct
{
- volatile unsigned int counter;
+ volatile long counter;
} local_t;

#define LOCAL_INIT(i) { (i) }

-#define local_read(v) ((v)->counter)
+#define local_read(v) ((unsigned long)(v)->counter)
#define local_set(v,i) (((v)->counter) = (i))

static __inline__ void local_inc(local_t *v)
{
__asm__ __volatile__(
- "incl %0"
+ "incq %0"
:"=m" (v->counter)
:"m" (v->counter));
}
@@ -24,7 +24,7 @@ static __inline__ void local_inc(local_t
static __inline__ void local_dec(local_t *v)
{
__asm__ __volatile__(
- "decl %0"
+ "decq %0"
:"=m" (v->counter)
:"m" (v->counter));
}
@@ -32,7 +32,7 @@ static __inline__ void local_dec(local_t
static __inline__ void local_add(unsigned int i, local_t *v)
{
__asm__ __volatile__(
- "addl %1,%0"
+ "addq %1,%0"
:"=m" (v->counter)
:"ir" (i), "m" (v->counter));
}
@@ -40,7 +40,7 @@ static __inline__ void local_add(unsigne
static __inline__ void local_sub(unsigned int i, local_t *v)
{
__asm__ __volatile__(
- "subl %1,%0"
+ "subq %1,%0"
:"=m" (v->counter)
:"ir" (i), "m" (v->counter));
}
@@ -71,4 +71,4 @@ static __inline__ void local_sub(unsigne
#define __cpu_local_add(i, v) cpu_local_add((i), (v))
#define __cpu_local_sub(i, v) cpu_local_sub((i), (v))

-#endif /* _ARCH_I386_LOCAL_H */
+#endif /* _ARCH_X8664_LOCAL_H */

2006-03-09 08:15:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Ravikiran G Thirumalai wrote:

> Here's a patch making x86_64 local_t to 64 bits like other 64 bit arches.
> This keeps local_t unsigned long. (We can change it to signed value
> along with other arches later in one go I guess)
>

Why not just keep naming and structure of interfaces consistent with
atomic_t?

That would be signed and 32-bit. You then also have a local64_t.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-09 08:22:13

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Thu, Mar 09, 2006 at 07:14:26PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
>
> >Here's a patch making x86_64 local_t to 64 bits like other 64 bit arches.
> >This keeps local_t unsigned long. (We can change it to signed value
> >along with other arches later in one go I guess)
> >
>
> Why not just keep naming and structure of interfaces consistent with
> atomic_t?
>
> That would be signed and 32-bit. You then also have a local64_t.

No, local_t is supposed to be 64-bits on 64bits arches and 32 bit on 32 bit
arches. x86_64 was the only exception, so this patch fixes that.

2006-03-09 08:45:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

Ravikiran G Thirumalai wrote:
> On Thu, Mar 09, 2006 at 07:14:26PM +1100, Nick Piggin wrote:
>
>>Ravikiran G Thirumalai wrote:
>>
>>
>>>Here's a patch making x86_64 local_t to 64 bits like other 64 bit arches.
>>>This keeps local_t unsigned long. (We can change it to signed value
>>>along with other arches later in one go I guess)
>>>
>>
>>Why not just keep naming and structure of interfaces consistent with
>>atomic_t?
>>
>>That would be signed and 32-bit. You then also have a local64_t.
>
>
> No, local_t is supposed to be 64-bits on 64bits arches and 32 bit on 32 bit
> arches. x86_64 was the only exception, so this patch fixes that.
>
>

Right. If it wasn't I wouldn't have proposed the change.

Considering that local_t has been broken so that basically nobody
is using it, now is a great time to rethink the types before it
gets fixed and people start using it.

And modelling the type on the atomic types would make the most
sense because everyone already knows them.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-09 11:47:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Thursday 09 March 2006 09:06, Ravikiran G Thirumalai wrote:
> On Wed, Mar 08, 2006 at 04:32:58PM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <[email protected]> wrote:
> > >
> > > On Wed, Mar 08, 2006 at 03:43:21PM -0800, Andrew Morton wrote:
> > > > Benjamin LaHaise <[email protected]> wrote:
> > > > >
> > > > > I think it may make more sense to simply convert local_t into a long, given
> > > > > that most of the users will be things like stats counters.
> > > > >
> > > >
> > > > Yes, I agree that making local_t signed would be better. It's consistent
> > > > with atomic_t, atomic64_t and atomic_long_t and it's a bit more flexible.
> > > >
> > > > Perhaps. A lot of applications would just be upcounters for statistics,
> > > > where unsigned is desired. But I think the consistency argument wins out.
> > >
> > > It already is... for most of the arches except x86_64.
> >
> > x86 uses unsigned long.
>
> Here's a patch making x86_64 local_t to 64 bits like other 64 bit arches.
> This keeps local_t unsigned long. (We can change it to signed value
> along with other arches later in one go I guess)

I already did that change in my tree
(except it's currently unsigned long as Andrew Indicated)

-Andi


>

2006-03-09 19:00:10

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

On Thu, Mar 09, 2006 at 07:41:08PM +1100, Nick Piggin wrote:
> Considering that local_t has been broken so that basically nobody
> is using it, now is a great time to rethink the types before it
> gets fixed and people start using it.

I'm starting to get more concerned as the per-cpu changes that keep adding
more overhead is getting out of hand.

> And modelling the type on the atomic types would make the most
> sense because everyone already knows them.

Except that the usage models are different; local_t is most likely to be
used for cpu local statistics or for sequences, where making them signed
is a bit backwards.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.