2007-01-16 10:28:56

by Peter Zijlstra

Subject: [PATCH 9/9] net: vm deadlock avoidance core

In order to provide robust networked storage there must be a guarantee
of progress. That is, the storage device must never stall because of (physical)
OOM, since the device itself might be needed to get out of it (reclaim).

This means that the device must always find enough memory to build/send packets
over the network _and_ receive (level 7) ACKs for those packets.

The network stack has a huge capacity for buffering packets while waiting for
user-space to read them, but a practical limit is imposed to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network storage device that is waiting for an ACK to free a page)?

Memory pressure adds to that: what if there is simply no memory left to
receive packets in?

This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise that the socket will never block on receive. Along with a memory
reserve that will service a limited number of packets, this can guarantee a
limited service to these critical sockets.

When we make sure that packets allocated from the reserve only service
critical sockets, we will not lose the memory and can guarantee progress.

The reserve is calculated to exceed the IP fragment caches and match the route
cache.

(Note on the name SOCK_VMIO: the basic problem is a circular dependency between
the network and virtual memory subsystems which needs to be broken. This does
make VM network IO - and only VM network IO - special; it does not generalize.)
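
For illustration only (this sketch is not part of the patch, and the nbd_*
function names are made up): a kernel consumer such as a network block device
would mark its data socket as critical and size the TX reserve roughly like so,
using the helpers introduced below; error handling is elided.

/* Hypothetical consumer of the new hooks - illustrative only. */
static int nbd_mark_sock_critical(struct socket *sock)
{
	/* reserve TX pages for our worst-case in-flight requests */
	sk_adjust_memalloc(0, TX_RESERVE_PAGES);

	/* SOCK_VMIO: never blocks on receive, may use the reserve */
	if (!sk_set_vmio(sock->sk))
		return -EALREADY;	/* was already marked critical */
	return 0;
}

static void nbd_unmark_sock_critical(struct socket *sock)
{
	sk_clear_vmio(sock->sk);
	sk_adjust_memalloc(0, -TX_RESERVE_PAGES);
}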

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/skbuff.h | 13 +++-
include/net/sock.h | 42 ++++++++++++++-
net/core/dev.c | 40 +++++++++++++-
net/core/skbuff.c | 50 ++++++++++++++++--
net/core/sock.c | 121 +++++++++++++++++++++++++++++++++++++++++++++
net/core/stream.c | 5 +
net/ipv4/ip_fragment.c | 1
net/ipv4/ipmr.c | 4 +
net/ipv4/route.c | 15 +++++
net/ipv4/sysctl_net_ipv4.c | 14 ++++-
net/ipv4/tcp_ipv4.c | 27 +++++++++-
net/ipv6/reassembly.c | 1
net/ipv6/route.c | 15 +++++
net/ipv6/sysctl_net_ipv6.c | 6 +-
net/ipv6/tcp_ipv6.c | 27 +++++++++-
net/netfilter/core.c | 5 +
security/selinux/avc.c | 2
17 files changed, 361 insertions(+), 27 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h 2007-01-12 12:21:14.000000000 +0100
@@ -284,7 +284,8 @@ struct sk_buff {
nfctinfo:3;
__u8 pkt_type:3,
fclone:2,
- ipvs_property:1;
+ ipvs_property:1,
+ emergency:1;
__be16 protocol;

void (*destructor)(struct sk_buff *skb);
@@ -329,10 +330,13 @@ struct sk_buff {

#include <asm/system.h>

+#define SKB_ALLOC_FCLONE 0x01
+#define SKB_ALLOC_RX 0x02
+
extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int flags, int node);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
@@ -342,7 +346,7 @@ static inline struct sk_buff *alloc_skb(
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, -1);
+ return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, -1);
}

extern struct sk_buff *alloc_skb_from_cache(struct kmem_cache *cp,
@@ -1103,7 +1107,8 @@ static inline void __skb_queue_purge(str
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb =
+ __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, -1);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/include/net/sock.h 2007-01-12 13:17:45.000000000 +0100
@@ -392,6 +392,7 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_VMIO, /* the VM depends on us - make sure we're serviced */
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -414,6 +415,40 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}

+static inline int sk_has_vmio(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_SKB 3
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Guestimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+ (4 * MAX_FRAGMENTS * MAX_PAGES_PER_SKB)
+
+extern atomic_t vmio_socks;
+extern atomic_t emergency_rx_skbs;
+
+static inline int sk_vmio_socks(void)
+{
+ return atomic_read(&vmio_socks);
+}
+
+extern int sk_emergency_skb_get(void);
+
+static inline void sk_emergency_skb_put(void)
+{
+ return atomic_dec(&emergency_rx_skbs);
+}
+
+extern void sk_adjust_memalloc(int socks, int tx_reserve_pages);
+extern void ipfrag_reserve_memory(int ipfrag_reserve);
+extern void iprt_reserve_memory(int rt_reserve);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
@@ -695,7 +730,8 @@ static inline struct inode *SOCK_INODE(s
}

extern void __sk_stream_mem_reclaim(struct sock *sk);
-extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
+extern int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb,
+ int size, int kind);

#define SK_STREAM_MEM_QUANTUM ((int)PAGE_SIZE)

@@ -722,13 +758,13 @@ static inline void sk_stream_writequeue_
static inline int sk_stream_rmem_schedule(struct sock *sk, struct sk_buff *skb)
{
return (int)skb->truesize <= sk->sk_forward_alloc ||
- sk_stream_mem_schedule(sk, skb->truesize, 1);
+ sk_stream_mem_schedule(sk, skb, skb->truesize, 1);
}

static inline int sk_stream_wmem_schedule(struct sock *sk, int size)
{
return size <= sk->sk_forward_alloc ||
- sk_stream_mem_schedule(sk, size, 0);
+ sk_stream_mem_schedule(sk, NULL, size, 0);
}

/* Used by processes to "lock" a socket state, so that
Index: linux-2.6-git/net/core/dev.c
===================================================================
--- linux-2.6-git.orig/net/core/dev.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/dev.c 2007-01-12 12:21:55.000000000 +0100
@@ -1767,10 +1767,23 @@ int netif_receive_skb(struct sk_buff *sk
struct net_device *orig_dev;
int ret = NET_RX_DROP;
__be16 type;
+ unsigned long pflags = current->flags;
+
+ /* Emergency skb are special, they should
+ * - be delivered to SOCK_VMIO sockets only
+ * - stay away from userspace
+ * - have bounded memory usage
+ *
+ * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
+ * This saves us from propagating the allocation context down to all
+ * allocation sites.
+ */
+ if (unlikely(skb->emergency))
+ current->flags |= PF_MEMALLOC;

/* if we've gotten here through NAPI, check netpoll */
if (skb->dev->poll && netpoll_rx(skb))
- return NET_RX_DROP;
+ goto out;

if (!skb->tstamp.off_sec)
net_timestamp(skb);
@@ -1781,7 +1794,7 @@ int netif_receive_skb(struct sk_buff *sk
orig_dev = skb_bond(skb);

if (!orig_dev)
- return NET_RX_DROP;
+ goto out;

__get_cpu_var(netdev_rx_stat).total++;

@@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
goto ncls;
}
#endif
+ if (unlikely(skb->emergency))
+ goto skip_taps;

list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
@@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
}
}

+skip_taps:
#ifdef CONFIG_NET_CLS_ACT
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk

if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
kfree_skb(skb);
- goto out;
+ goto unlock;
}

skb->tc_verd = 0;
ncls:
#endif

+ if (unlikely(skb->emergency))
+ switch(skb->protocol) {
+ case __constant_htons(ETH_P_ARP):
+ case __constant_htons(ETH_P_IP):
+ case __constant_htons(ETH_P_IPV6):
+ break;
+
+ default:
+ goto drop;
+ }
+
if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
- goto out;
+ goto unlock;

type = skb->protocol;
list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
@@ -1842,6 +1869,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
* me how you were going to use this. :-)
@@ -1849,8 +1877,10 @@ ncls:
ret = NET_RX_DROP;
}

-out:
+unlock:
rcu_read_unlock();
+out:
+ current->flags = pflags;
return ret;
}

Index: linux-2.6-git/net/core/skbuff.c
===================================================================
--- linux-2.6-git.orig/net/core/skbuff.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/skbuff.c 2007-01-12 13:29:51.000000000 +0100
@@ -142,28 +142,34 @@ EXPORT_SYMBOL(skb_truesize_bug);
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ int emergency = 0;

- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ size = SKB_DATA_ALIGN(size);
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;
+ if (flags & SKB_ALLOC_RX)
+ gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;

+retry_alloc:
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
if (!skb)
- goto out;
+ goto noskb;

/* Get the DATA. Size must match skb_add_mtu(). */
- size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
if (!data)
goto nodata;

memset(skb, 0, offsetof(struct sk_buff, truesize));
+ skb->emergency = emergency;
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -180,7 +186,7 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;

- if (fclone) {
+ if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff *child = skb + 1;
atomic_t *fclone_ref = (atomic_t *) (child + 1);

@@ -188,12 +194,29 @@ struct sk_buff *__alloc_skb(unsigned int
atomic_set(fclone_ref, 1);

child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->emergency = skb->emergency;
}
out:
return skb;
+
nodata:
kmem_cache_free(cache, skb);
skb = NULL;
+noskb:
+ /* Attempt emergency allocation when RX skb. */
+ if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
+ goto out;
+
+ if (!emergency) {
+ if (sk_emergency_skb_get()) {
+ gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+ gfp_mask |= __GFP_EMERGENCY;
+ emergency = 1;
+ goto retry_alloc;
+ }
+ } else
+ sk_emergency_skb_put();
+
goto out;
}

@@ -271,7 +294,7 @@ struct sk_buff *__netdev_alloc_skb(struc
int node = dev->class_dev.dev ? dev_to_node(dev->class_dev.dev) : -1;
struct sk_buff *skb;

- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -320,6 +343,8 @@ static void skb_release_data(struct sk_b
skb_drop_fraglist(skb);

kfree(skb->head);
+ if (unlikely(skb->emergency))
+ sk_emergency_skb_put();
}
}

@@ -440,6 +465,9 @@ struct sk_buff *skb_clone(struct sk_buff
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ if (unlikely(skb->emergency))
+ gfp_mask |= __GFP_EMERGENCY;
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
@@ -474,6 +502,7 @@ struct sk_buff *skb_clone(struct sk_buff
#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
C(ipvs_property);
#endif
+ C(emergency);
C(protocol);
n->destructor = NULL;
C(mark);
@@ -689,12 +718,19 @@ int pskb_expand_head(struct sk_buff *skb
u8 *data;
int size = nhead + (skb->end - skb->head) + ntail;
long off;
+ int emergency = 0;

if (skb_shared(skb))
BUG();

size = SKB_DATA_ALIGN(size);

+ if (unlikely(skb->emergency) && sk_emergency_skb_get()) {
+ gfp_mask |= __GFP_EMERGENCY;
+ emergency = 1;
+ } else
+ gfp_mask |= __GFP_NOMEMALLOC;
+
data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;
@@ -727,6 +763,8 @@ int pskb_expand_head(struct sk_buff *skb
return 0;

nodata:
+ if (unlikely(emergency))
+ sk_emergency_skb_put();
return -ENOMEM;
}

Index: linux-2.6-git/net/core/sock.c
===================================================================
--- linux-2.6-git.orig/net/core/sock.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/sock.c 2007-01-12 12:21:14.000000000 +0100
@@ -196,6 +196,120 @@ __u32 sysctl_rmem_default __read_mostly
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);

+static DEFINE_SPINLOCK(memalloc_lock);
+static int rx_net_reserve;
+
+atomic_t vmio_socks;
+atomic_t emergency_rx_skbs;
+
+static int ipfrag_threshold;
+
+#define ipfrag_mtu() (1500) /* XXX: should be smallest mtu system wide */
+#define ipfrag_skbs() (ipfrag_threshold / ipfrag_mtu())
+#define ipfrag_pages() (ipfrag_threshold / (ipfrag_mtu() * (PAGE_SIZE / ipfrag_mtu())))
+
+static int iprt_pages;
+
+/*
+ * is there room for another emergency skb.
+ */
+int sk_emergency_skb_get(void)
+{
+ int nr = atomic_add_return(1, &emergency_rx_skbs);
+ int thresh = (3 * ipfrag_skbs()) / 2;
+ if (nr < thresh)
+ return 1;
+
+ atomic_dec(&emergency_rx_skbs);
+ return 0;
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_VMIO sockets
+ * @tx_resserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ * @tx_reserve_pages is an upper-bound of memory used for TX hence
+ * we need not account the pages like we do for RX pages.
+ */
+void sk_adjust_memalloc(int socks, int tx_reserve_pages)
+{
+ unsigned long flags;
+ int reserve = tx_reserve_pages;
+ int nr_socks;
+
+ spin_lock_irqsave(&memalloc_lock, flags);
+ nr_socks = atomic_add_return(socks, &vmio_socks);
+ BUG_ON(nr_socks < 0);
+
+ if (nr_socks) {
+ int rx_pages = 2 * ipfrag_pages() + iprt_pages;
+ reserve += rx_pages - rx_net_reserve;
+ rx_net_reserve = rx_pages;
+ } else {
+ reserve -= rx_net_reserve;
+ rx_net_reserve = 0;
+ }
+
+ if (reserve)
+ adjust_memalloc_reserve(reserve);
+ spin_unlock_irqrestore(&memalloc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/*
+ * tiny helper function to track the total ipfragment memory
+ * needed because of modular ipv6
+ */
+void ipfrag_reserve_memory(int frags)
+{
+ ipfrag_threshold += frags;
+ sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(ipfrag_reserve_memory);
+
+void iprt_reserve_memory(int pages)
+{
+ iprt_pages += pages;
+ sk_adjust_memalloc(0, 0);
+}
+EXPORT_SYMBOL_GPL(iprt_reserve_memory);
+
+/**
+ * sk_set_vmio - sets %SOCK_VMIO
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ * accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+ int set = sock_flag(sk, SOCK_VMIO);
+ if (!set) {
+ sk_adjust_memalloc(1, 0);
+ sock_set_flag(sk, SOCK_VMIO);
+ sk->sk_allocation |= __GFP_EMERGENCY;
+ }
+ return !set;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+ int set = sock_flag(sk, SOCK_VMIO);
+ if (set) {
+ sk_adjust_memalloc(-1, 0);
+ sock_reset_flag(sk, SOCK_VMIO);
+ sk->sk_allocation &= ~__GFP_EMERGENCY;
+ }
+ return set;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;
@@ -239,6 +353,12 @@ int sock_queue_rcv_skb(struct sock *sk,
int err = 0;
int skb_len;

+ if (unlikely(skb->emergency)) {
+ if (!sk_has_vmio(sk)) {
+ err = -ENOMEM;
+ goto out;
+ }
+ } else
/* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
number of warnings when compiling with -W --ANK
*/
@@ -868,6 +988,7 @@ void sk_free(struct sock *sk)
struct sk_filter *filter;
struct module *owner = sk->sk_prot_creator->owner;

+ sk_clear_vmio(sk);
if (sk->sk_destruct)
sk->sk_destruct(sk);

Index: linux-2.6-git/net/ipv4/ipmr.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ipmr.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv4/ipmr.c 2007-01-12 12:21:14.000000000 +0100
@@ -1340,6 +1340,9 @@ int ip_mr_input(struct sk_buff *skb)
struct mfc_cache *cache;
int local = ((struct rtable*)skb->dst)->rt_flags&RTCF_LOCAL;

+ if (unlikely(skb->emergency))
+ goto drop;
+
/* Packet is looped back after forward, it should not be
forwarded second time, but still can be delivered locally.
*/
@@ -1411,6 +1414,7 @@ int ip_mr_input(struct sk_buff *skb)
dont_forward:
if (local)
return ip_local_deliver(skb);
+drop:
kfree_skb(skb);
return 0;
}
Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:21:14.000000000 +0100
@@ -18,6 +18,7 @@
#include <net/route.h>
#include <net/tcp.h>
#include <net/cipso_ipv4.h>
+#include <net/sock.h>

/* From af_inet.c */
extern int sysctl_ip_nonlocal_bind;
@@ -186,6 +187,17 @@ static int strategy_allowed_congestion_c

}

+int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int old_thresh = *(int *)table->data;
+ ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+ ipfrag_reserve_memory(*(int *)table->data - old_thresh);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(proc_dointvec_fragment);
+
ctl_table ipv4_table[] = {
{
.ctl_name = NET_IPV4_TCP_TIMESTAMPS,
@@ -291,7 +303,7 @@ ctl_table ipv4_table[] = {
.data = &sysctl_ipfrag_high_thresh,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec
+ .proc_handler = &proc_dointvec_fragment
},
{
.ctl_name = NET_IPV4_IPFRAG_LOW_THRESH,
Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
@@ -1604,6 +1604,22 @@ csum_err:
goto discard;
}

+static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+ if (unlikely(skb->emergency)) {
+ BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
+ if (!(pflags & PF_MEMALLOC))
+ current->flags |= PF_MEMALLOC;
+ }
+
+ ret = tcp_v4_do_rcv(sk, skb);
+
+ current->flags = pflags;
+ return ret;
+}
+
/*
* From tcp_input.c
*/
@@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;

+ if (unlikely(skb->emergency)) {
+ if (!sk_has_vmio(sk))
+ goto discard_and_relse;
+ /*
+ decrease window size..
+ tcp_enter_quickack_mode(sk);
+ */
+ }
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
@@ -2429,7 +2454,7 @@ struct proto tcp_prot = {
.getsockopt = tcp_getsockopt,
.sendmsg = tcp_sendmsg,
.recvmsg = tcp_recvmsg,
- .backlog_rcv = tcp_v4_do_rcv,
+ .backlog_rcv = tcp_v4_backlog_rcv,
.hash = tcp_v4_hash,
.unhash = tcp_unhash,
.get_port = tcp_v4_get_port,
Index: linux-2.6-git/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/sysctl_net_ipv6.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/sysctl_net_ipv6.c 2007-01-12 12:21:14.000000000 +0100
@@ -15,6 +15,10 @@

#ifdef CONFIG_SYSCTL

+extern int proc_dointvec_fragment(ctl_table *table, int write,
+ struct file *filp, void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
static ctl_table ipv6_table[] = {
{
.ctl_name = NET_IPV6_ROUTE,
@@ -44,7 +48,7 @@ static ctl_table ipv6_table[] = {
.data = &sysctl_ip6frag_high_thresh,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec
+ .proc_handler = &proc_dointvec_fragment
},
{
.ctl_name = NET_IPV6_IP6FRAG_LOW_THRESH,
Index: linux-2.6-git/net/netfilter/core.c
===================================================================
--- linux-2.6-git.orig/net/netfilter/core.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/netfilter/core.c 2007-01-12 12:21:14.000000000 +0100
@@ -181,6 +181,11 @@ next_hook:
kfree_skb(*pskb);
ret = -EPERM;
} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
+ if (unlikely((*pskb)->emergency)) {
+ printk(KERN_ERR "nf_hook: NF_QUEUE encountered for "
+ "emergency skb - skipping rule.\n");
+ goto next_hook;
+ }
NFDEBUG("nf_hook: Verdict = QUEUE.\n");
if (!nf_queue(*pskb, elem, pf, hook, indev, outdev, okfn,
verdict >> NF_VERDICT_BITS))
Index: linux-2.6-git/security/selinux/avc.c
===================================================================
--- linux-2.6-git.orig/security/selinux/avc.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/security/selinux/avc.c 2007-01-12 12:21:14.000000000 +0100
@@ -332,7 +332,7 @@ static struct avc_node *avc_alloc_node(v
{
struct avc_node *node;

- node = kmem_cache_alloc(avc_node_cachep, GFP_ATOMIC);
+ node = kmem_cache_alloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
if (!node)
goto out;

Index: linux-2.6-git/net/ipv4/ip_fragment.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/ip_fragment.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/ipv4/ip_fragment.c 2007-01-12 12:21:14.000000000 +0100
@@ -743,6 +743,7 @@ void ipfrag_init(void)
ipfrag_secret_timer.function = ipfrag_secret_rebuild;
ipfrag_secret_timer.expires = jiffies + sysctl_ipfrag_secret_interval;
add_timer(&ipfrag_secret_timer);
+ ipfrag_reserve_memory(sysctl_ipfrag_high_thresh);
}

EXPORT_SYMBOL(ip_defrag);
Index: linux-2.6-git/net/ipv6/reassembly.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/reassembly.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/reassembly.c 2007-01-12 12:21:14.000000000 +0100
@@ -772,4 +772,5 @@ void __init ipv6_frag_init(void)
ip6_frag_secret_timer.function = ip6_frag_secret_rebuild;
ip6_frag_secret_timer.expires = jiffies + sysctl_ip6frag_secret_interval;
add_timer(&ip6_frag_secret_timer);
+ ipfrag_reserve_memory(sysctl_ip6frag_high_thresh);
}
Index: linux-2.6-git/net/ipv4/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv4/route.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv4/route.c 2007-01-12 12:21:14.000000000 +0100
@@ -2884,6 +2884,17 @@ static int ipv4_sysctl_rtcache_flush_str
return 0;
}

+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int old = *(int *)table->data;
+ ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+ *(int *)table->data - old));
+ return ret;
+}
+
ctl_table ipv4_route_table[] = {
{
.ctl_name = NET_IPV4_ROUTE_FLUSH,
@@ -2926,7 +2937,7 @@ ctl_table ipv4_route_table[] = {
.data = &ip_rt_max_size,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_dointvec_rt_size,
},
{
/* Deprecated. Use gc_min_interval_ms */
@@ -3153,6 +3164,8 @@ int __init ip_rt_init(void)

ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ipv4_dst_ops.kmem_cachep,
+ ip_rt_max_size));

devinet_init();
ip_fib_init();
Index: linux-2.6-git/net/ipv6/route.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/route.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/route.c 2007-01-12 12:21:14.000000000 +0100
@@ -2356,6 +2356,17 @@ int ipv6_sysctl_rtcache_flush(ctl_table
return -EINVAL;
}

+static int proc_dointvec_rt_size(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+ int old = *(int *)table->data;
+ ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+ *(int *)table->data - old));
+ return ret;
+}
+
ctl_table ipv6_route_table[] = {
{
.ctl_name = NET_IPV6_ROUTE_FLUSH,
@@ -2379,7 +2390,7 @@ ctl_table ipv6_route_table[] = {
.data = &ip6_rt_max_size,
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_dointvec_rt_size,
},
{
.ctl_name = NET_IPV6_ROUTE_GC_MIN_INTERVAL,
@@ -2464,6 +2475,8 @@ void __init ip6_route_init(void)

proc_net_fops_create("rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
#endif
+ iprt_reserve_memory(kmem_cache_objs_to_pages(ip6_dst_ops.kmem_cachep,
+ ip6_rt_max_size));
#ifdef CONFIG_XFRM
xfrm6_init();
#endif
Index: linux-2.6-git/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6-git.orig/net/ipv6/tcp_ipv6.c 2007-01-12 12:20:08.000000000 +0100
+++ linux-2.6-git/net/ipv6/tcp_ipv6.c 2007-01-12 12:21:14.000000000 +0100
@@ -1678,6 +1678,22 @@ ipv6_pktoptions:
return 0;
}

+static int tcp_v6_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+ if (unlikely(skb->emergency)) {
+ BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
+ if (!(pflags & PF_MEMALLOC))
+ current->flags |= PF_MEMALLOC;
+ }
+
+ ret = tcp_v6_do_rcv(sk, skb);
+
+ current->flags = pflags;
+ return ret;
+}
+
static int tcp_v6_rcv(struct sk_buff **pskb)
{
struct sk_buff *skb = *pskb;
@@ -1723,6 +1739,15 @@ static int tcp_v6_rcv(struct sk_buff **p
if (!sk)
goto no_tcp_socket;

+ if (unlikely(skb->emergency)) {
+ if (!sk_has_vmio(sk))
+ goto discard_and_relse;
+ /*
+ decrease window size..
+ tcp_enter_quickack_mode(sk);
+ */
+ }
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
@@ -2127,7 +2152,7 @@ struct proto tcpv6_prot = {
.getsockopt = tcp_getsockopt,
.sendmsg = tcp_sendmsg,
.recvmsg = tcp_recvmsg,
- .backlog_rcv = tcp_v6_do_rcv,
+ .backlog_rcv = tcp_v6_backlog_rcv,
.hash = tcp_v6_hash,
.unhash = tcp_unhash,
.get_port = tcp_v6_get_port,
Index: linux-2.6-git/net/core/stream.c
===================================================================
--- linux-2.6-git.orig/net/core/stream.c 2007-01-12 12:20:07.000000000 +0100
+++ linux-2.6-git/net/core/stream.c 2007-01-12 13:17:08.000000000 +0100
@@ -207,7 +207,7 @@ void __sk_stream_mem_reclaim(struct sock

EXPORT_SYMBOL(__sk_stream_mem_reclaim);

-int sk_stream_mem_schedule(struct sock *sk, int size, int kind)
+int sk_stream_mem_schedule(struct sock *sk, struct sk_buff *skb, int size, int kind)
{
int amt = sk_stream_pages(size);

@@ -224,7 +224,8 @@ int sk_stream_mem_schedule(struct sock *
/* Over hard limit. */
if (atomic_read(sk->sk_prot->memory_allocated) > sk->sk_prot->sysctl_mem[2]) {
sk->sk_prot->enter_memory_pressure();
- goto suppress_allocation;
+ if (likely(!skb || !skb->emergency))
+ goto suppress_allocation;
}

/* Under pressure. */

--


2007-01-16 13:25:44

by Evgeniy Polyakov

Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Tue, Jan 16, 2007 at 10:46:06AM +0100, Peter Zijlstra ([email protected]) wrote:
> In order to provide robust networked storage there must be a guarantee
> of progress. That is, the storage device must never stall because of (physical)
> OOM, because the device itself might be needed to get out of it (reclaim).


> /* Used by processes to "lock" a socket state, so that
> Index: linux-2.6-git/net/core/dev.c
> ===================================================================
> --- linux-2.6-git.orig/net/core/dev.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/core/dev.c 2007-01-12 12:21:55.000000000 +0100
> @@ -1767,10 +1767,23 @@ int netif_receive_skb(struct sk_buff *sk
> struct net_device *orig_dev;
> int ret = NET_RX_DROP;
> __be16 type;
> + unsigned long pflags = current->flags;
> +
> + /* Emergency skb are special, they should
> + * - be delivered to SOCK_VMIO sockets only
> + * - stay away from userspace
> + * - have bounded memory usage
> + *
> + * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
> + * This saves us from propagating the allocation context down to all
> + * allocation sites.
> + */
> + if (unlikely(skb->emergency))
> + current->flags |= PF_MEMALLOC;

Access to 'current' in netif_receive_skb()???
Why do you want to work with, for example keventd?

> /* if we've gotten here through NAPI, check netpoll */
> if (skb->dev->poll && netpoll_rx(skb))
> - return NET_RX_DROP;
> + goto out;
>
> if (!skb->tstamp.off_sec)
> net_timestamp(skb);
> @@ -1781,7 +1794,7 @@ int netif_receive_skb(struct sk_buff *sk
> orig_dev = skb_bond(skb);
>
> if (!orig_dev)
> - return NET_RX_DROP;
> + goto out;
>
> __get_cpu_var(netdev_rx_stat).total++;
>
> @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> goto ncls;
> }
> #endif
> + if (unlikely(skb->emergency))
> + goto skip_taps;
>
> list_for_each_entry_rcu(ptype, &ptype_all, list) {
> if (!ptype->dev || ptype->dev == skb->dev) {
> @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> }
> }
>
> +skip_taps:

It is still a 'tap'.

> #ifdef CONFIG_NET_CLS_ACT
> if (pt_prev) {
> ret = deliver_skb(skb, pt_prev, orig_dev);
> @@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk
>
> if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> kfree_skb(skb);
> - goto out;
> + goto unlock;
> }
>
> skb->tc_verd = 0;
> ncls:
> #endif
>
> + if (unlikely(skb->emergency))
> + switch(skb->protocol) {
> + case __constant_htons(ETH_P_ARP):
> + case __constant_htons(ETH_P_IP):
> + case __constant_htons(ETH_P_IPV6):
> + break;

Poor vlans and appletalk.

> + default:
> + goto drop;
> + }
> +
> if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
> - goto out;
> + goto unlock;
>
> type = skb->protocol;
> list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
> @@ -1842,6 +1869,7 @@ ncls:
> if (pt_prev) {
> ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> } else {
> +drop:
> kfree_skb(skb);
> /* Jamal, now you will not able to escape explaining
> * me how you were going to use this. :-)
> @@ -1849,8 +1877,10 @@ ncls:
> ret = NET_RX_DROP;
> }
>
> -out:
> +unlock:
> rcu_read_unlock();
> +out:
> + current->flags = pflags;
> return ret;
> }
>
> Index: linux-2.6-git/net/core/skbuff.c
> ===================================================================
> --- linux-2.6-git.orig/net/core/skbuff.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/core/skbuff.c 2007-01-12 13:29:51.000000000 +0100
> @@ -142,28 +142,34 @@ EXPORT_SYMBOL(skb_truesize_bug);
> * %GFP_ATOMIC.
> */
> struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> - int fclone, int node)
> + int flags, int node)
> {
> struct kmem_cache *cache;
> struct skb_shared_info *shinfo;
> struct sk_buff *skb;
> u8 *data;
> + int emergency = 0;
>
> - cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
> + size = SKB_DATA_ALIGN(size);
> + cache = (flags & SKB_ALLOC_FCLONE)
> + ? skbuff_fclone_cache : skbuff_head_cache;
> + if (flags & SKB_ALLOC_RX)
> + gfp_mask |= __GFP_NOMEMALLOC|__GFP_NOWARN;
>
> +retry_alloc:
> /* Get the HEAD */
> skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> if (!skb)
> - goto out;
> + goto noskb;
>
> /* Get the DATA. Size must match skb_add_mtu(). */
> - size = SKB_DATA_ALIGN(size);
> data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
> gfp_mask, node);
> if (!data)
> goto nodata;
>
> memset(skb, 0, offsetof(struct sk_buff, truesize));
> + skb->emergency = emergency;
> skb->truesize = size + sizeof(struct sk_buff);
> atomic_set(&skb->users, 1);
> skb->head = data;
> @@ -180,7 +186,7 @@ struct sk_buff *__alloc_skb(unsigned int
> shinfo->ip6_frag_id = 0;
> shinfo->frag_list = NULL;
>
> - if (fclone) {
> + if (flags & SKB_ALLOC_FCLONE) {
> struct sk_buff *child = skb + 1;
> atomic_t *fclone_ref = (atomic_t *) (child + 1);
>
> @@ -188,12 +194,29 @@ struct sk_buff *__alloc_skb(unsigned int
> atomic_set(fclone_ref, 1);
>
> child->fclone = SKB_FCLONE_UNAVAILABLE;
> + child->emergency = skb->emergency;
> }
> out:
> return skb;
> +
> nodata:
> kmem_cache_free(cache, skb);
> skb = NULL;
> +noskb:
> + /* Attempt emergency allocation when RX skb. */
> + if (likely(!(flags & SKB_ALLOC_RX) || !sk_vmio_socks()))
> + goto out;
> +
> + if (!emergency) {
> + if (sk_emergency_skb_get()) {
> + gfp_mask &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
> + gfp_mask |= __GFP_EMERGENCY;
> + emergency = 1;
> + goto retry_alloc;
> + }
> + } else
> + sk_emergency_skb_put();
> +
> goto out;
> }
>
> @@ -271,7 +294,7 @@ struct sk_buff *__netdev_alloc_skb(struc
> int node = dev->class_dev.dev ? dev_to_node(dev->class_dev.dev) : -1;
> struct sk_buff *skb;
>
> - skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
> + skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX, node);
> if (likely(skb)) {
> skb_reserve(skb, NET_SKB_PAD);
> skb->dev = dev;
> @@ -320,6 +343,8 @@ static void skb_release_data(struct sk_b
> skb_drop_fraglist(skb);
>
> kfree(skb->head);
> + if (unlikely(skb->emergency))
> + sk_emergency_skb_put();
> }
> }
>
> @@ -440,6 +465,9 @@ struct sk_buff *skb_clone(struct sk_buff
> n->fclone = SKB_FCLONE_CLONE;
> atomic_inc(fclone_ref);
> } else {
> + if (unlikely(skb->emergency))
> + gfp_mask |= __GFP_EMERGENCY;
> +
> n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> if (!n)
> return NULL;
> @@ -474,6 +502,7 @@ struct sk_buff *skb_clone(struct sk_buff
> #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
> C(ipvs_property);
> #endif
> + C(emergency);
> C(protocol);
> n->destructor = NULL;
> C(mark);
> @@ -689,12 +718,19 @@ int pskb_expand_head(struct sk_buff *skb
> u8 *data;
> int size = nhead + (skb->end - skb->head) + ntail;
> long off;
> + int emergency = 0;
>
> if (skb_shared(skb))
> BUG();
>
> size = SKB_DATA_ALIGN(size);
>
> + if (unlikely(skb->emergency) && sk_emergency_skb_get()) {
> + gfp_mask |= __GFP_EMERGENCY;
> + emergency = 1;
> + } else
> + gfp_mask |= __GFP_NOMEMALLOC;
> +
> data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> if (!data)
> goto nodata;
> @@ -727,6 +763,8 @@ int pskb_expand_head(struct sk_buff *skb
> return 0;
>
> nodata:
> + if (unlikely(emergency))
> + sk_emergency_skb_put();
> return -ENOMEM;
> }
>
> Index: linux-2.6-git/net/core/sock.c
> ===================================================================
> --- linux-2.6-git.orig/net/core/sock.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/core/sock.c 2007-01-12 12:21:14.000000000 +0100
> @@ -196,6 +196,120 @@ __u32 sysctl_rmem_default __read_mostly
> /* Maximal space eaten by iovec or ancilliary data plus some space */
> int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
>
> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int rx_net_reserve;
> +
> +atomic_t vmio_socks;
> +atomic_t emergency_rx_skbs;
> +
> +static int ipfrag_threshold;
> +
> +#define ipfrag_mtu() (1500) /* XXX: should be smallest mtu system wide */
> +#define ipfrag_skbs() (ipfrag_threshold / ipfrag_mtu())
> +#define ipfrag_pages() (ipfrag_threshold / (ipfrag_mtu() * (PAGE_SIZE / ipfrag_mtu())))
> +
> +static int iprt_pages;
> +
> +/*
> + * is there room for another emergency skb.
> + */
> +int sk_emergency_skb_get(void)
> +{
> + int nr = atomic_add_return(1, &emergency_rx_skbs);
> + int thresh = (3 * ipfrag_skbs()) / 2;
> + if (nr < thresh)
> + return 1;
> +
> + atomic_dec(&emergency_rx_skbs);
> + return 0;
> +}
> +
> +/**
> + * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
> + * @socks: number of new %SOCK_VMIO sockets
> + * @tx_resserve_pages: number of pages to (un)reserve for TX
> + *
> + * This function adjusts the memalloc reserve based on system demand.
> + * The RX reserve is a limit, and only added once, not for each socket.
> + *
> + * NOTE:
> + * @tx_reserve_pages is an upper-bound of memory used for TX hence
> + * we need not account the pages like we do for RX pages.
> + */
> +void sk_adjust_memalloc(int socks, int tx_reserve_pages)
> +{
> + unsigned long flags;
> + int reserve = tx_reserve_pages;
> + int nr_socks;
> +
> + spin_lock_irqsave(&memalloc_lock, flags);
> + nr_socks = atomic_add_return(socks, &vmio_socks);
> + BUG_ON(nr_socks < 0);
> +
> + if (nr_socks) {
> + int rx_pages = 2 * ipfrag_pages() + iprt_pages;
> + reserve += rx_pages - rx_net_reserve;
> + rx_net_reserve = rx_pages;
> + } else {
> + reserve -= rx_net_reserve;
> + rx_net_reserve = 0;
> + }
> +
> + if (reserve)
> + adjust_memalloc_reserve(reserve);
> + spin_unlock_irqrestore(&memalloc_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
> +
> +/*
> + * tiny helper function to track the total ipfragment memory
> + * needed because of modular ipv6
> + */
> +void ipfrag_reserve_memory(int frags)
> +{
> + ipfrag_threshold += frags;
> + sk_adjust_memalloc(0, 0);
> +}
> +EXPORT_SYMBOL_GPL(ipfrag_reserve_memory);
> +
> +void iprt_reserve_memory(int pages)
> +{
> + iprt_pages += pages;
> + sk_adjust_memalloc(0, 0);
> +}
> +EXPORT_SYMBOL_GPL(iprt_reserve_memory);
> +
> +/**
> + * sk_set_vmio - sets %SOCK_VMIO
> + * @sk: socket to set it on
> + *
> + * Set %SOCK_VMIO on a socket and increase the memalloc reserve
> + * accordingly.
> + */
> +int sk_set_vmio(struct sock *sk)
> +{
> + int set = sock_flag(sk, SOCK_VMIO);
> + if (!set) {
> + sk_adjust_memalloc(1, 0);
> + sock_set_flag(sk, SOCK_VMIO);
> + sk->sk_allocation |= __GFP_EMERGENCY;
> + }
> + return !set;
> +}
> +EXPORT_SYMBOL_GPL(sk_set_vmio);
> +
> +int sk_clear_vmio(struct sock *sk)
> +{
> + int set = sock_flag(sk, SOCK_VMIO);
> + if (set) {
> + sk_adjust_memalloc(-1, 0);
> + sock_reset_flag(sk, SOCK_VMIO);
> + sk->sk_allocation &= ~__GFP_EMERGENCY;
> + }
> + return set;
> +}
> +EXPORT_SYMBOL_GPL(sk_clear_vmio);
> +
> static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
> {
> struct timeval tv;
> @@ -239,6 +353,12 @@ int sock_queue_rcv_skb(struct sock *sk,
> int err = 0;
> int skb_len;
>
> + if (unlikely(skb->emergency)) {
> + if (!sk_has_vmio(sk)) {
> + err = -ENOMEM;
> + goto out;
> + }
> + } else
> /* Cast skb->rcvbuf to unsigned... It's pointless, but reduces
> number of warnings when compiling with -W --ANK
> */
> @@ -868,6 +988,7 @@ void sk_free(struct sock *sk)
> struct sk_filter *filter;
> struct module *owner = sk->sk_prot_creator->owner;
>
> + sk_clear_vmio(sk);
> if (sk->sk_destruct)
> sk->sk_destruct(sk);
>
> Index: linux-2.6-git/net/ipv4/ipmr.c
> ===================================================================
> --- linux-2.6-git.orig/net/ipv4/ipmr.c 2007-01-12 12:20:08.000000000 +0100
> +++ linux-2.6-git/net/ipv4/ipmr.c 2007-01-12 12:21:14.000000000 +0100
> @@ -1340,6 +1340,9 @@ int ip_mr_input(struct sk_buff *skb)
> struct mfc_cache *cache;
> int local = ((struct rtable*)skb->dst)->rt_flags&RTCF_LOCAL;
>
> + if (unlikely(skb->emergency))
> + goto drop;
> +
> /* Packet is looped back after forward, it should not be
> forwarded second time, but still can be delivered locally.
> */
> @@ -1411,6 +1414,7 @@ int ip_mr_input(struct sk_buff *skb)
> dont_forward:
> if (local)
> return ip_local_deliver(skb);
> +drop:
> kfree_skb(skb);
> return 0;
> }
> Index: linux-2.6-git/net/ipv4/sysctl_net_ipv4.c
> ===================================================================
> --- linux-2.6-git.orig/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:20:08.000000000 +0100
> +++ linux-2.6-git/net/ipv4/sysctl_net_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> @@ -18,6 +18,7 @@
> #include <net/route.h>
> #include <net/tcp.h>
> #include <net/cipso_ipv4.h>
> +#include <net/sock.h>
>
> /* From af_inet.c */
> extern int sysctl_ip_nonlocal_bind;
> @@ -186,6 +187,17 @@ static int strategy_allowed_congestion_c
>
> }
>
> +int proc_dointvec_fragment(ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int ret;
> + int old_thresh = *(int *)table->data;
> + ret = proc_dointvec(table,write,filp,buffer,lenp,ppos);
> + ipfrag_reserve_memory(*(int *)table->data - old_thresh);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(proc_dointvec_fragment);
> +
> ctl_table ipv4_table[] = {
> {
> .ctl_name = NET_IPV4_TCP_TIMESTAMPS,
> @@ -291,7 +303,7 @@ ctl_table ipv4_table[] = {
> .data = &sysctl_ipfrag_high_thresh,
> .maxlen = sizeof(int),
> .mode = 0644,
> - .proc_handler = &proc_dointvec
> + .proc_handler = &proc_dointvec_fragment
> },
> {
> .ctl_name = NET_IPV4_IPFRAG_LOW_THRESH,
> Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
> ===================================================================
> --- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
> +++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> @@ -1604,6 +1604,22 @@ csum_err:
> goto discard;
> }
>
> +static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> +{
> + int ret;
> + unsigned long pflags = current->flags;
> + if (unlikely(skb->emergency)) {
> + BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
> + if (!(pflags & PF_MEMALLOC))
> + current->flags |= PF_MEMALLOC;
> + }
> +
> + ret = tcp_v4_do_rcv(sk, skb);
> +
> + current->flags = pflags;
> + return ret;

Why don't you want to just setup PF_MEMALLOC for the socket and all
related processes?

> +}
> +
> /*
> * From tcp_input.c
> */
> @@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
> if (!sk)
> goto no_tcp_socket;
>
> + if (unlikely(skb->emergency)) {
> + if (!sk_has_vmio(sk))
> + goto discard_and_relse;
> + /*
> + decrease window size..
> + tcp_enter_quickack_mode(sk);
> + */

How does this decrease window size?
Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
or just directly send an ack, which in turn requires allocation, which
can be bound to this received frame processing...


--
Evgeniy Polyakov

2007-01-16 13:50:28

by Peter Zijlstra

Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Tue, 2007-01-16 at 16:25 +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 16, 2007 at 10:46:06AM +0100, Peter Zijlstra ([email protected]) wrote:

> > @@ -1767,10 +1767,23 @@ int netif_receive_skb(struct sk_buff *sk
> > struct net_device *orig_dev;
> > int ret = NET_RX_DROP;
> > __be16 type;
> > + unsigned long pflags = current->flags;
> > +
> > + /* Emergency skb are special, they should
> > + * - be delivered to SOCK_VMIO sockets only
> > + * - stay away from userspace
> > + * - have bounded memory usage
> > + *
> > + * Use PF_MEMALLOC as a poor mans memory pool - the grouping kind.
> > + * This saves us from propagating the allocation context down to all
> > + * allocation sites.
> > + */
> > + if (unlikely(skb->emergency))
> > + current->flags |= PF_MEMALLOC;
>
> Access to 'current' in netif_receive_skb()???
> Why do you want to work with, for example keventd?

Can this run in keventd?

I thought this was softirq context and thus this would either run in a
borrowed context or in ksoftirqd. See patch 3/9.

> > @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> > goto ncls;
> > }
> > #endif
> > + if (unlikely(skb->emergency))
> > + goto skip_taps;
> >
> > list_for_each_entry_rcu(ptype, &ptype_all, list) {
> > if (!ptype->dev || ptype->dev == skb->dev) {
> > @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> > }
> > }
> >
> > +skip_taps:
>
> It is still a 'tap'.

Not sure what you are saying, I thought this should stop delivery of
skbs to taps?

> > #ifdef CONFIG_NET_CLS_ACT
> > if (pt_prev) {
> > ret = deliver_skb(skb, pt_prev, orig_dev);
> > @@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk
> >
> > if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> > kfree_skb(skb);
> > - goto out;
> > + goto unlock;
> > }
> >
> > skb->tc_verd = 0;
> > ncls:
> > #endif
> >
> > + if (unlikely(skb->emergency))
> > + switch(skb->protocol) {
> > + case __constant_htons(ETH_P_ARP):
> > + case __constant_htons(ETH_P_IP):
> > + case __constant_htons(ETH_P_IPV6):
> > + break;
>
> Poor vlans and appletalk.

Yeah, and all those others too, maybe some day.

> > Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
> > ===================================================================
> > --- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
> > +++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> > @@ -1604,6 +1604,22 @@ csum_err:
> > goto discard;
> > }
> >
> > +static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> > +{
> > + int ret;
> > + unsigned long pflags = current->flags;
> > + if (unlikely(skb->emergency)) {
> > + BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
> > + if (!(pflags & PF_MEMALLOC))
> > + current->flags |= PF_MEMALLOC;
> > + }
> > +
> > + ret = tcp_v4_do_rcv(sk, skb);
> > +
> > + current->flags = pflags;
> > + return ret;
>
> Why don't you want to just setup PF_MEMALLOC for the socket and all
> related processes?

I'm not understanding what you're saying here.

I want to grant the processing of skb->emergency packets access to the
memory reserves.

How would I set PF_MEMALLOC on a socket, it's a process flag? And which
related processes?

> > +}
> > +
> > /*
> > * From tcp_input.c
> > */
> > @@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > if (!sk)
> > goto no_tcp_socket;
> >
> > + if (unlikely(skb->emergency)) {
> > + if (!sk_has_vmio(sk))
> > + goto discard_and_relse;
> > + /*
> > + decrease window size..
> > + tcp_enter_quickack_mode(sk);
> > + */
>
> How does this decrease window size?
> Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> or just directly send an ack, which in turn requires allocation, which
> can be bound to this received frame processing...

It doesn't; I thought it might be a good idea to do that, but never
got around to actually figuring out how to do it.

2007-01-16 15:34:20

by Evgeniy Polyakov

Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra ([email protected]) wrote:
> > > + if (unlikely(skb->emergency))
> > > + current->flags |= PF_MEMALLOC;
> >
> > Access to 'current' in netif_receive_skb()???
> > Why do you want to work with, for example keventd?
>
> Can this run in keventd?

Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
(or dedicated kernel thread, I do not recall).

> I thought this was softirq context and thus this would either run in a
> borrowed context or in ksoftirqd. See patch 3/9.

And how are you going to access 'current' in softirq?

netif_receive_skb() can also be called from a lot of other places,
including keventd and/or a different context - it is permitted to call it
from anywhere to process a packet.

I meant that you break the rule accessing 'current' in that context.

> > > @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> > > goto ncls;
> > > }
> > > #endif
> > > + if (unlikely(skb->emergency))
> > > + goto skip_taps;
> > >
> > > list_for_each_entry_rcu(ptype, &ptype_all, list) {
> > > if (!ptype->dev || ptype->dev == skb->dev) {
> > > @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> > > }
> > > }
> > >
> > > +skip_taps:
> >
> > It is still a 'tap'.
>
> Not sure what you are saying, I thought this should stop delivery of
> skbs to taps?

The ingress filter can do whatever it wants with the skb at that point; likely
you want to skip that hunk too.

> > > #ifdef CONFIG_NET_CLS_ACT
> > > if (pt_prev) {
> > > ret = deliver_skb(skb, pt_prev, orig_dev);
> > > @@ -1819,15 +1835,26 @@ int netif_receive_skb(struct sk_buff *sk
> > >
> > > if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
> > > kfree_skb(skb);
> > > - goto out;
> > > + goto unlock;
> > > }
> > >
> > > skb->tc_verd = 0;
> > > ncls:
> > > #endif
> > >
> > > + if (unlikely(skb->emergency))
> > > + switch(skb->protocol) {
> > > + case __constant_htons(ETH_P_ARP):
> > > + case __constant_htons(ETH_P_IP):
> > > + case __constant_htons(ETH_P_IPV6):
> > > + break;
> >
> > Poor vlans and appletalk.
>
> Yeah and all those other too, maybe some day.
>
> > > Index: linux-2.6-git/net/ipv4/tcp_ipv4.c
> > > ===================================================================
> > > --- linux-2.6-git.orig/net/ipv4/tcp_ipv4.c 2007-01-12 12:20:07.000000000 +0100
> > > +++ linux-2.6-git/net/ipv4/tcp_ipv4.c 2007-01-12 12:21:14.000000000 +0100
> > > @@ -1604,6 +1604,22 @@ csum_err:
> > > goto discard;
> > > }
> > >
> > > +static int tcp_v4_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> > > +{
> > > + int ret;
> > > + unsigned long pflags = current->flags;
> > > + if (unlikely(skb->emergency)) {
> > > + BUG_ON(!sk_has_vmio(sk)); /* we dropped those before queueing */
> > > + if (!(pflags & PF_MEMALLOC))
> > > + current->flags |= PF_MEMALLOC;
> > > + }
> > > +
> > > + ret = tcp_v4_do_rcv(sk, skb);
> > > +
> > > + current->flags = pflags;
> > > + return ret;
> >
> > Why don't you want to just setup PF_MEMALLOC for the socket and all
> > related processes?
>
> I'm not understanding what you're saying here.
>
> I want grant the processing of skb->emergency packets access to the
> memory reserves.
>
> How would I set PF_MEMALLOC on a socket, its a process flag? And which
> related processes?

You use a special flag for sockets to mark them as capable of
'reserve-eating'; too many flags are a bit confusing.

I meant that you can just mark the process which created such a socket as
PF_MEMALLOC, and clone that flag on forks and other related calls, without
all those checks for 'current' in different places.

> > > +}
> > > +
> > > /*
> > > * From tcp_input.c
> > > */
> > > @@ -1654,6 +1670,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > if (!sk)
> > > goto no_tcp_socket;
> > >
> > > + if (unlikely(skb->emergency)) {
> > > + if (!sk_has_vmio(sk))
> > > + goto discard_and_relse;
> > > + /*
> > > + decrease window size..
> > > + tcp_enter_quickack_mode(sk);
> > > + */
> >
> > How does this decrease window size?
> > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > or just directly send an ack, which in turn requires allocation, which
> > can be bound to this received frame processing...
>
> It doesn't, I thought that it might be a good idea doing that, but never
> got around to actually figuring out how to do it.

tcp_send_ack()?

--
Evgeniy Polyakov

2007-01-16 16:11:51

by Peter Zijlstra

Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Tue, 2007-01-16 at 18:33 +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra ([email protected]) wrote:
> > > > + if (unlikely(skb->emergency))
> > > > + current->flags |= PF_MEMALLOC;
> > >
> > > Access to 'current' in netif_receive_skb()???
> > > Why do you want to work with, for example keventd?
> >
> > Can this run in keventd?
>
> Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
> (or dedicated kernel thread, I do not recall).
>
> > I thought this was softirq context and thus this would either run in a
> > borrowed context or in ksoftirqd. See patch 3/9.
>
> And how are you going to access 'current' in softirq?
>
> netif_receive_skb() can also be called from a lot of other places
> including keventd and/or different context - it is permitted to call it
> everywhere to process packet.
>
> I meant that you break the rule accessing 'current' in that context.

Yeah, I know, but as long as we're not actually in hard irq context
current does point to the task_struct in charge of current execution and
as long as we restore whatever was in the flags field before we started
poking, nothing can go wrong.

So, yes this is unconventional, but it does work as expected.

As for breaking, 3/9 makes it legal.
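
Schematically the pattern is just this (the function name below is made up;
the body mirrors the netif_receive_skb() and tcp_*_backlog_rcv() hunks in
this patch):

/* Sketch of the flag save/restore being discussed - illustrative only. */
static int handle_one_skb(struct sk_buff *skb)
{
	unsigned long pflags = current->flags;	/* whichever task we borrowed */
	int ret;

	if (unlikely(skb->emergency))
		current->flags |= PF_MEMALLOC;	/* allow access to the reserve */

	ret = do_receive_processing(skb);	/* placeholder for the real work */

	current->flags = pflags;		/* restore before returning */
	return ret;
}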

> > > > @@ -1798,6 +1811,8 @@ int netif_receive_skb(struct sk_buff *sk
> > > > goto ncls;
> > > > }
> > > > #endif
> > > > + if (unlikely(skb->emergency))
> > > > + goto skip_taps;
> > > >
> > > > list_for_each_entry_rcu(ptype, &ptype_all, list) {
> > > > if (!ptype->dev || ptype->dev == skb->dev) {
> > > > @@ -1807,6 +1822,7 @@ int netif_receive_skb(struct sk_buff *sk
> > > > }
> > > > }
> > > >
> > > > +skip_taps:
> > >
> > > It is still a 'tap'.
> >
> > Not sure what you are saying, I thought this should stop delivery of
> > skbs to taps?
>
> Ingres filter can do whatever it wants with skb at that point, likely
> you want to skip that hunk too.

Will look into the ingress filters, thanks for the pointer.

> > > Why don't you want to just setup PF_MEMALLOC for the socket and all
> > > related processes?
> >
> > I'm not understanding what you're saying here.
> >
> > I want grant the processing of skb->emergency packets access to the
> > memory reserves.
> >
> > How would I set PF_MEMALLOC on a socket, its a process flag? And which
> > related processes?
>
> You use special flag for sockets to mark them as capable of
> 'reserve-eating', too many flags are a bit confusing.

Right, and I use PF_MEMALLOC to implement that reserve-eating. There
must be a link between SOCK_VMIO and all allocations associated with
that socket.
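
Concretely, that link is made in two places by this patch; restated here as a
shorthand sketch (sk_mark_critical() is not a real function, it just collapses
what sk_set_vmio() does, with the sk_adjust_memalloc() accounting omitted).
The packet-processing half is the PF_MEMALLOC bracketing shown earlier.

/* Shorthand for the socket-side half of the link - illustrative only. */
static void sk_mark_critical(struct sock *sk)
{
	sock_set_flag(sk, SOCK_VMIO);		/* emergency skbs may be delivered */
	sk->sk_allocation |= __GFP_EMERGENCY;	/* socket-side allocations may
						 * dip into the reserve */
}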

> I meant that you can just mark process which created such socket as
> PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> all that checks for 'current' in different places.

Ah, that's the wrong level to think at here; these processes never reach
user-space - nor should these sockets.

Also, I only want the processing of the actual network packet to be able
to eat the reserves, not any other thing that might happen in that
context.

And since network processing is mostly done in softirq context I must
mark these sections like I did.

> > > > + /*
> > > > + decrease window size..
> > > > + tcp_enter_quickack_mode(sk);
> > > > + */
> > >
> > > How does this decrease window size?
> > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > or just directly send an ack, which in turn requires allocation, which
> > > can be bound to this received frame processing...
> >
> > It doesn't, I thought that it might be a good idea doing that, but never
> > got around to actually figuring out how to do it.
>
> tcp_send_ack()?
>

does that shrink the window automagically?

2007-01-17 04:54:54

by Evgeniy Polyakov

Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Tue, Jan 16, 2007 at 05:08:15PM +0100, Peter Zijlstra ([email protected]) wrote:
> On Tue, 2007-01-16 at 18:33 +0300, Evgeniy Polyakov wrote:
> > On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra ([email protected]) wrote:
> > > > > + if (unlikely(skb->emergency))
> > > > > + current->flags |= PF_MEMALLOC;
> > > >
> > > > Access to 'current' in netif_receive_skb()???
> > > > Why do you want to work with, for example keventd?
> > >
> > > Can this run in keventd?
> >
> > Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
> > (or dedicated kernel thread, I do not recall).
> >
> > > I thought this was softirq context and thus this would either run in a
> > > borrowed context or in ksoftirqd. See patch 3/9.
> >
> > And how are you going to access 'current' in softirq?
> >
> > netif_receive_skb() can also be called from a lot of other places
> > including keventd and/or different context - it is permitted to call it
> > everywhere to process packet.
> >
> > I meant that you break the rule accessing 'current' in that context.
>
> Yeah, I know, but as long as we're not actually in hard irq context
> current does point to the task_struct in charge of current execution and
> as long as we restore whatever was in the flags field before we started
> poking, nothing can go wrong.
>
> So, yes this is unconventional, but it does work as expected.
>
> As for breaking, 3/9 makes it legal.

You operate on 'current' in different contexts without any locks, which
looks racy and is not even allowed. What will 'current' be in the
netif_rx() case, which schedules the softirq from hard irq context -
ksoftirqd? And why do you want to set its flags?

> > I meant that you can just mark process which created such socket as
> > PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> > all that checks for 'current' in different places.
>
> Ah, thats the wrong level to think here, these processes never reach
> user-space - nor should these sockets.

You limit this just to sending an ack?
What about the 'level-7' ack you described in the introduction?

> Also, I only want the processing of the actual network packet to be able
> to eat the reserves, not any other thing that might happen in that
> context.
>
> And since network processing is mostly done in softirq context I must
> mark these sections like I did.

You artificially limit the system to just adding a reserve to generate one ack.
For that purpose you do not need to have all those flags - just reserve
some data in the network core and use it when the system is in OOM (or reclaim)
for critical data paths.

> > > > > + /*
> > > > > + decrease window size..
> > > > > + tcp_enter_quickack_mode(sk);
> > > > > + */
> > > >
> > > > How does this decrease window size?
> > > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > > or just directly send an ack, which in turn requires allocation, which
> > > > can be bound to this received frame processing...
> > >
> > > It doesn't, I thought that it might be a good idea doing that, but never
> > > got around to actually figuring out how to do it.
> >
> > tcp_send_ack()?
> >
>
> does that shrink the window automagically?

Yes, it updates the window, but having an ack generated in that place is
actually very wrong. At that point the system has not processed the incoming
packet yet, so it cannot generate a correct ACK for the received frame at
all. And it seems that the only purpose of the whole patchset is to
generate that poor ack - reserve 2007 ack packets (MAX_TCP_HEADER)
at system startup and reuse them when you are under memory pressure.

--
Evgeniy Polyakov

2007-01-17 09:10:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Wed, 2007-01-17 at 07:54 +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 16, 2007 at 05:08:15PM +0100, Peter Zijlstra ([email protected]) wrote:
> > On Tue, 2007-01-16 at 18:33 +0300, Evgeniy Polyakov wrote:
> > > On Tue, Jan 16, 2007 at 02:47:54PM +0100, Peter Zijlstra ([email protected]) wrote:
> > > > > > + if (unlikely(skb->emergency))
> > > > > > + current->flags |= PF_MEMALLOC;
> > > > >
> > > > > Access to 'current' in netif_receive_skb()???
> > > > > Why do you want to work with, for example keventd?
> > > >
> > > > Can this run in keventd?
> > >
> > > Initial netchannel implementation by Kelly Daly (IBM) worked in keventd
> > > (or dedicated kernel thread, I do not recall).
> > >
> > > > I thought this was softirq context and thus this would either run in a
> > > > borrowed context or in ksoftirqd. See patch 3/9.
> > >
> > > And how are you going to access 'current' in softirq?
> > >
> > > netif_receive_skb() can also be called from a lot of other places
> > > including keventd and/or different context - it is permitted to call it
> > > everywhere to process packet.
> > >
> > > I meant that you break the rule accessing 'current' in that context.
> >
> > Yeah, I know, but as long as we're not actually in hard irq context
> > current does point to the task_struct in charge of current execution and
> > as long as we restore whatever was in the flags field before we started
> > poking, nothing can go wrong.
> >
> > So, yes this is unconventional, but it does work as expected.
> >
> > As for breaking, 3/9 makes it legal.
>
> You operate with 'current' in different contexts without any locks which
> looks racy and even is not allowed. What will be 'current' for
> netif_rx() case, which schedules softirq from hard irq context -
> ksoftirqd, why do you want to set its flags?

I don't touch current in hardirq context, do I (if I did, that is indeed
a mistake)?

In all other contexts, current is valid.

> > > I meant that you can just mark process which created such socket as
> > > PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> > > all that checks for 'current' in different places.
> >
> > Ah, thats the wrong level to think here, these processes never reach
> > user-space - nor should these sockets.
>
> You limit this just to send an ack?
> What about 'level-7' ack as you described in introduction?

Take NFS, it does full data traffic in kernel.

> > Also, I only want the processing of the actual network packet to be able
> > to eat the reserves, not any other thing that might happen in that
> > context.
> >
> > And since network processing is mostly done in softirq context I must
> > mark these sections like I did.
>
> You artificially limit system to just add a reserve to generate one ack.
> For that purpose you do not need to have all those flags - just reseve
> some data in network core and use it when system is in OOM (or reclaim)
> for critical data pathes.

How would that end up being different? I would have to replace all
allocations done in the full network processing path.

This seems a much less invasive method; all the (allocation) code can
stay the way it is and use the normal allocation functions.

> > > > > > + /*
> > > > > > + decrease window size..
> > > > > > + tcp_enter_quickack_mode(sk);
> > > > > > + */
> > > > >
> > > > > How does this decrease window size?
> > > > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > > > or just directly send an ack, which in turn requires allocation, which
> > > > > can be bound to this received frame processing...
> > > >
> > > > It doesn't, I thought that it might be a good idea doing that, but never
> > > > got around to actually figuring out how to do it.
> > >
> > > tcp_send_ack()?
> > >
> >
> > does that shrink the window automagically?
>
> Yes, it updates window, but having ack generated in that place is
> actually very wrong. In that place system has not processed incoming
> packet yet, so it can not generate correct ACK for received frame at
> all. And it seems that the only purpose of the whole patchset is to
> generate that poor ack - reseve 2007 ack packets (MAX_TCP_HEADER)
> in system startup and reuse them when you are under memory pressure.

Right, I suspected something like that; hence I wanted to just shrink
the window. Anyway, this is not a very important issue.

2007-01-18 10:42:14

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Wed, Jan 17, 2007 at 10:07:28AM +0100, Peter Zijlstra ([email protected]) wrote:
> > You operate with 'current' in different contexts without any locks which
> > looks racy and even is not allowed. What will be 'current' for
> > netif_rx() case, which schedules softirq from hard irq context -
> > ksoftirqd, why do you want to set its flags?
>
> I don't touch current in hardirq context, do I (if I did, that is indeed
> a mistake)?
>
> In all other contexts, current is valid.

Well, if you think that setting the PF_MEMALLOC flag for keventd and
ksoftirqd is valid, then probably yes...

> > > > I meant that you can just mark process which created such socket as
> > > > PF_MEMALLOC, and clone that flag on forks and other relatest calls without
> > > > all that checks for 'current' in different places.
> > >
> > > Ah, thats the wrong level to think here, these processes never reach
> > > user-space - nor should these sockets.
> >
> > You limit this just to send an ack?
> > What about 'level-7' ack as you described in introduction?
>
> Take NFS, it does full data traffic in kernel.

The NFS case is exactly the situation where you only need to generate an ACK.

> > > Also, I only want the processing of the actual network packet to be able
> > > to eat the reserves, not any other thing that might happen in that
> > > context.
> > >
> > > And since network processing is mostly done in softirq context I must
> > > mark these sections like I did.
> >
> > You artificially limit system to just add a reserve to generate one ack.
> > For that purpose you do not need to have all those flags - just reseve
> > some data in network core and use it when system is in OOM (or reclaim)
> > for critical data pathes.
>
> How would that end up being different, I would have to replace all
> allocations done in the full network processing path.
>
> This seems a much less invasive method, all the (allocation) code can
> stay the way it is and use the normal allocation functions.

An ACK is only generated in one place in TCP.

And actually we are starting to talk about a different approach - having a
separate allocator for the network, which will be turned on at OOM (reclaim
or at any other time). If you do not mind, I would like to refresh the
discussion about the network tree allocator, which utilizes its own pool of
pages, performs self-defragmentation of the memory, and is very SMP
friendly in that it is per-cpu like slab and never frees
objects on different CPUs, so they always stay in the same cache.
Among other goodies it allows full sending/receiving zero-copy.

Here is a link:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=nta

> > > > > > > + /*
> > > > > > > + decrease window size..
> > > > > > > + tcp_enter_quickack_mode(sk);
> > > > > > > + */
> > > > > >
> > > > > > How does this decrease window size?
> > > > > > Maybe ack scheduling would be better handled by inet_csk_schedule_ack()
> > > > > > or just directly send an ack, which in turn requires allocation, which
> > > > > > can be bound to this received frame processing...
> > > > >
> > > > > It doesn't, I thought that it might be a good idea doing that, but never
> > > > > got around to actually figuring out how to do it.
> > > >
> > > > tcp_send_ack()?
> > > >
> > >
> > > does that shrink the window automagically?
> >
> > Yes, it updates window, but having ack generated in that place is
> > actually very wrong. In that place system has not processed incoming
> > packet yet, so it can not generate correct ACK for received frame at
> > all. And it seems that the only purpose of the whole patchset is to
> > generate that poor ack - reseve 2007 ack packets (MAX_TCP_HEADER)
> > in system startup and reuse them when you are under memory pressure.
>
> Right, I suspected something like that; hence I wanted to just shrink
> the window. Anyway, this is not a very important issue.

tcp_enter_quickack_mode() does not update the window; it allows an ack to be
sent immediately after the packet has been processed. The window can be
changed in any way the TCP state machine and congestion control want.

--
Evgeniy Polyakov

2007-01-18 12:20:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 9/9] net: vm deadlock avoidance core

On Thu, 2007-01-18 at 13:41 +0300, Evgeniy Polyakov wrote:

> > > What about 'level-7' ack as you described in introduction?
> >
> > Take NFS, it does full data traffic in kernel.
>
> NFS case is exactly the situation, when you only need to generate an ACK.

No it is not, it needs the full RPC response.

> > > You artificially limit system to just add a reserve to generate one ack.
> > > For that purpose you do not need to have all those flags - just reseve
> > > some data in network core and use it when system is in OOM (or reclaim)
> > > for critical data pathes.
> >
> > How would that end up being different, I would have to replace all
> > allocations done in the full network processing path.
> >
> > This seems a much less invasive method, all the (allocation) code can
> > stay the way it is and use the normal allocation functions.

> And acutally we are starting to talk about different approach - having
> separated allocator for network, which will be turned on on OOM (reclaim
> or at any other time).

I think we might be; I'm more talking about requirements on the
allocator, while you seem to talk about implementations.

Replacing the allocator, or splitting it in two based on a condition are
all fine as long as they observe the requirements.

The requirement I add is that there is a reserve nobody touches unless
given express permission.

You could implement this by modifying each reachable allocator call site,
sticking a branch in and using an alternate allocator when the normal
route fails and we do have permission; much like:

foo = kmalloc(size, gfp_mask);
+ if (!foo && special)
+ foo = my_alloc(size);

And earlier versions of this work did something like that. But it
litters the code quite badly and it's quite easy to miss spots. There can
be quite a few allocations in processing network data.

Hence my work on integrating this into the regular memory allocators.

FYI; 'special' evaluates to something like:
!(gfp_mask & __GFP_NOMEMALLOC) &&
((gfp_mask & __GFP_EMERGENCY) ||
(!in_irq() && (current->flags & PF_MEMALLOC)))
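
In code form, that could be wrapped in a helper along these lines (a sketch
only; __GFP_EMERGENCY is introduced by this patch series and is not a
mainline flag):

	#include <linux/gfp.h>
	#include <linux/hardirq.h>
	#include <linux/sched.h>

	/* Sketch: may this allocation dip into the emergency reserve? */
	static inline int gfp_emergency_ok(gfp_t gfp_mask)
	{
		if (gfp_mask & __GFP_NOMEMALLOC)
			return 0;
		if (gfp_mask & __GFP_EMERGENCY)
			return 1;
		return !in_irq() && (current->flags & PF_MEMALLOC);
	}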


> If you do not mind, I would likw to refresh a
> discussion about network tree allocator,

> which utilizes own pool of
> pages,

very high order pages, no?

This means that you have to either allocate at boot time and cannot
resize/add pools, which means you waste all that memory if the network
load never comes near using the reserved amount.

Or, you get into all the same trouble the hugepages folks are trying so
very hard to solve.

> performs self-defragmentation of the memeory,

Does it move memory about?

All it does is try to avoid fragmentation by policy - a problem
impossible to solve in general, but one where you can achieve good results
given the practical limitations on program behaviour.

Does your policy work for the given workload? We'll see.

Also, at what level? Each level has both internal and external
fragmentation. I can argue that having large immovable objects in memory
adds to the fragmentation issues at the page-allocator level.

> is very SMP
> friendly in that regard that it is per-cpu like slab and never free
> objects on different CPUs, so they always stay in the same cache.

This makes it very hard to guarantee a reserve limit. (Not impossible,
just more difficult)

> Among other goodies it allows to have full sending/receiving zero-copy.

That won't ever work unless you have page-aligned objects; otherwise you
cannot map them into user-space. That seems to be at odds with your
tight-packing/reduced-internal-fragmentation goals.

Zero-copy entails mapping the page the hardware writes the packet in
into user-space, right?

Since it's impossible to predict to whom the next packet is addressed,
the packets must be written (by hardware) to different pages.


2007-01-18 13:59:00

by Evgeniy Polyakov

[permalink] [raw]
Subject: Possible ways of dealing with OOM conditions.

On Thu, Jan 18, 2007 at 01:18:44PM +0100, Peter Zijlstra ([email protected]) wrote:
> > > How would that end up being different, I would have to replace all
> > > allocations done in the full network processing path.
> > >
> > > This seems a much less invasive method, all the (allocation) code can
> > > stay the way it is and use the normal allocation functions.
>
> > And acutally we are starting to talk about different approach - having
> > separated allocator for network, which will be turned on on OOM (reclaim
> > or at any other time).
>
> I think we might be, I'm more talking about requirements on the
> allocator, while you seem to talk about implementations.
>
> Replacing the allocator, or splitting it in two based on a condition are
> all fine as long as they observe the requirements.
>
> The requirement I add is that there is a reserve nobody touches unless
> given express permission.
>
> You could implement this by modifying each reachable allocator call site
> and stick a branch in and use an alternate allocator when the normal
> route fails and we do have permission; much like:
>
> foo = kmalloc(size, gfp_mask);
> + if (!foo && special)
> + foo = my_alloc(size)

Network is special in this regard, since it only has one allocation path
(actually it has one cache for skbs, and the usual kmalloc, but they are
called from only two functions).

So it would become
ptr = network_alloc();
and network_alloc() would be the usual kmalloc or a call into its own
allocator in case of deadlock.
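
A minimal sketch of that single entry point (nta_alloc() is a stand-in name
for the dedicated pool, not an existing function):

	#include <linux/slab.h>

	extern void *nta_alloc(size_t size);	/* hypothetical reserve-pool allocator */

	static void *network_alloc(size_t size, gfp_t gfp_mask)
	{
		void *ptr = kmalloc(size, gfp_mask);

		/* normal path failed - fall back to the dedicated network pool */
		if (!ptr)
			ptr = nta_alloc(size);

		return ptr;
	}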

> And earlier versions of this work did something like that. But it
> litters the code quite badly and its quite easy to miss spots. There can
> be quite a few allocations in processing network data.
>
> Hence my work on integrating this into the regular memory allocators.
>
> FYI; 'special' evaluates to something like:
> !(gfp_mask & __GFP_NOMEMALLOC) &&
> ((gfp_mask & __GFP_EMERGENCY) ||
> (!in_irq() && (current->flags & PF_MEMALLOC)))
>
>
> > If you do not mind, I would likw to refresh a
> > discussion about network tree allocator,
>
> > which utilizes own pool of
> > pages,
>
> very high order pages, no?
>
> This means that you have to either allocate at boot time and cannot
> resize/add pools; which means you waste all that memory if the network
> load never comes near using the reserved amount.
>
> Or, you get into all the same trouble the hugepages folks are trying so
> very hard to solve.

It is configurable - by default it takes a pool of 32k pages for jumbo-frame
allocations (e1000 requires such allocations for 9k frames,
unfortunately); without jumbo-frame support it works with a pool of 0-order
pages, which grows dynamically when needed.

> > performs self-defragmentation of the memeory,
>
> Does it move memory about?

It works within a page, not across pages - when neighbouring regions are freed,
they are combined into a single one of bigger size. It could be
extended to move pages around to combine them into bigger ones too,
but the network stack requires high-order allocations only in extremely rare
cases of broken design (Intel folks, sorry, but your hardware sucks in
that regard - a jumbo frame of 9k should not require 16k of mem plus
network overhead).
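
To illustrate the in-page coalescing idea (the data structures here are made
up for the example; they are not NTA's actual ones):

	#include <linux/types.h>

	struct nta_region {
		unsigned int		off;	/* offset inside the page */
		unsigned int		size;
		bool			free;
		struct nta_region	*next;	/* region starting at off + size, or NULL */
	};

	/*
	 * Sketch: freeing a region merges it with an already-free right-hand
	 * neighbour, so a page drifts back towards one large free chunk.
	 */
	static void nta_free_region(struct nta_region *r)
	{
		r->free = true;

		if (r->next && r->next->free) {
			r->size += r->next->size;
			r->next  = r->next->next;
		}
	}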

NTA also does not align buffers to a power of two - the extremely significant
win of that approach can be seen on the project's homepage, with graphs of
failed allocations and the state of the memory for different allocation
sizes. The power-of-two overhead of SLAB is extremely high.

> All it does is try to avoid fragmentation by policy - a problem
> impossible to solve in general; but can achieve good results in view of
> practical limitations on program behaviour.
>
> Does your policy work for the given workload? we'll see.
>
> Also, on what level, each level has both internal and external
> fragmentation. I can argue that having large immovable objects in memory
> adds to the fragmentation issues on the page-allocator level.

NTA works with pages, not with contiguous memory; it reduces
fragmentation inside pages, which cannot be solved in SLAB, where
objects from the same page can live in different caches and thus can _never_
be combined. Thus, the only solution for SLAB is copying, which is not a
good one for big sizes and is just wrong for big pages.
It is not about page moving and VM tricks, which are generally described
as fragmentation avoidance techniques, but about how the fragmentation
problem is solved within one page.

> > is very SMP
> > friendly in that regard that it is per-cpu like slab and never free
> > objects on different CPUs, so they always stay in the same cache.
>
> This makes it very hard to guarantee a reserve limit. (Not impossible,
> just more difficult)

The whole pool of pages becomes a reserve, since no one (and mainly the VFS)
can consume that reserve.

> > Among other goodies it allows to have full sending/receiving zero-copy.
>
> That won't ever work unless you have page aligned objects, otherwise you
> cannot map them into user-space. Which seems to be at odds with your
> tight packing/reduce internal fragmentation goals.
>
> Zero-copy entails mapping the page the hardware writes the packet in
> into user-space, right?
>
> Since its impossible to predict to whoem the next packet is addressed
> the packets must be written (by hardware) to different pages.

Yes, receiving zero-copy without appropriate hardware assist is
impossible, so it is either the absence of such a facility at all, or special
overhead, which forces objects to lie in different pages. With hardware assist
it would be possible to select a flow in advance, so data would be packed
in the same page.

Sending zero-copy from userspace memory does not suffer from any such
problem.

--
Evgeniy Polyakov

2007-01-18 15:12:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

On Thu, 2007-01-18 at 16:58 +0300, Evgeniy Polyakov wrote:

> Network is special in this regard, since it only has one allocation path
> (actually it has one cache for skb, and usual kmalloc, but they are
> called from only two functions).
>
> So it would become
> ptr = network_alloc();
> and network_alloc() would be usual kmalloc or call for own allocator in
> case of deadlock.

There is more to networking than skbs only; what about the route cache?
There are quite a lot of allocs in this fib_* stuff, IGMP, etc...

> > very high order pages, no?
> >
> > This means that you have to either allocate at boot time and cannot
> > resize/add pools; which means you waste all that memory if the network
> > load never comes near using the reserved amount.
> >
> > Or, you get into all the same trouble the hugepages folks are trying so
> > very hard to solve.
>
> It is configurable - by default it takes pool of 32k pages for allocations for
> jumbo-frames (e1000 requires such allocations for 9k frames
> unfortunately), without jumbo-frame support it works with pool of 0-order
> pages, which grows dynamically when needed.

With 0-order pages you can only fit two 1500-byte packets in there; you
could perhaps stick some small skb heads in there as well, but why
bother - the waste isn't _that_ high.

Esp if you would make a slab for 1500-MTU packets (5*1638 < 2*4096; and
1638 should be enough, right?)

It would make sense to pack related objects into a page so you could
free them all together.

> > > performs self-defragmentation of the memeory,
> >
> > Does it move memory about?
>
> It works in a page, not as pages - when neighbour regions are freed,
> they are combined into single one with bigger size

Yeah, that is not defragmentation; defragmentation is moving active
regions about to create contiguous free space. What you do is free-space
coalescence.

> but network stack requires high-order allocations in extremely rare
> cases of broken design (Intel folks, sorry, but your hardware sucks in
> that regard - jumbo frame of 9k should not require 16k of mem plu
> network overhead).

Well, if you have such hardware it's not rare at all. But yeah, that
sucks.

> NTA also does not align buffers to the power of two - extremely significant
> win of that approach can be found on project's homepage with graps of
> failed allocations and state of the mem for different sizes of
> allocaions. Power-of-two overhead of SLAB is extremely high.

Sure you can pack the page a little better(*), but I thought the main
advantage was a speed increase.

(*) memory is generally cheaper than engineering efforts, esp on this
scale. The only advantage in the manual packing is that (with the fancy
hardware stream engine mentioned below) you could ensure they are
grouped together (then again, the hardware stream engine would, together
with a SG-DMA engine, take care of that).

>
> > All it does is try to avoid fragmentation by policy - a problem
> > impossible to solve in general; but can achieve good results in view of
> > practical limitations on program behaviour.
> >
> > Does your policy work for the given workload? we'll see.
> >
> > Also, on what level, each level has both internal and external
> > fragmentation. I can argue that having large immovable objects in memory
> > adds to the fragmentation issues on the page-allocator level.
>
> NTA works with pages, not with contiguous memory, it reduces
> fragmentation inside pages, which can not be solved in SLAB, where
> objects from the same page can live in different caches and thus _never_
> can be combined. Thus, the only soultuin for SLAB is copy, which is not a
> good one for big sizes and is just wrong for big pages.

By allocating, and never returning the page to the page-allocator, you've
increased fragmentation at the page-allocator level significantly.
It will prevent a super page from ever forming around that page.

> It is not about page moving and VM tricks, which are generally described
> as fragmentation avoidance technique, but about how fragmentation
> problem is solved in one page.

Short of defragmentation (moving active regions about), fragmentation is an
unsolved problem. For any heuristic there is a pattern that will defeat
it.

Luckily program allocation behaviour is usually very regular (or
decomposable into well-behaved groups).

> > > is very SMP
> > > friendly in that regard that it is per-cpu like slab and never free
> > > objects on different CPUs, so they always stay in the same cache.
> >
> > This makes it very hard to guarantee a reserve limit. (Not impossible,
> > just more difficult)
>
> The whole pool of pages becomes reserve, since no one (and mainly VFS)
> can consume that reserve.

Ah, but there you violate my requirement: any network allocation can
claim the last bit of memory. The whole idea was that the reserve is
explicitly managed.

It not only needs protection from other users but also from itself.

> > > Among other goodies it allows to have full sending/receiving zero-copy.
> >
> > That won't ever work unless you have page aligned objects, otherwise you
> > cannot map them into user-space. Which seems to be at odds with your
> > tight packing/reduce internal fragmentation goals.
> >
> > Zero-copy entails mapping the page the hardware writes the packet in
> > into user-space, right?
> >
> > Since its impossible to predict to whoem the next packet is addressed
> > the packets must be written (by hardware) to different pages.
>
> Yes, receiving zero-copy without appropriate hardware assist is
> impossible, so either absence of such facility at all, or special overhead,
> which forces object to lie in different pages. With hardware assist it
> would be possible to select a flow in advance, so data would be packet
> in the same page.

I was not aware that hardware could order the packets in such a fashion.
Yes, if it can do that it becomes doable.

> Sending zero-copy from userspace memory does not suffer with any such
> problem.

True, that is properly ordered. But for that I'm not sure how NTA (you
really should change that name, there is no Tree anymore) helps here.

2007-01-18 15:52:43

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

On Thu, Jan 18, 2007 at 04:10:52PM +0100, Peter Zijlstra ([email protected]) wrote:
> On Thu, 2007-01-18 at 16:58 +0300, Evgeniy Polyakov wrote:
>
> > Network is special in this regard, since it only has one allocation path
> > (actually it has one cache for skb, and usual kmalloc, but they are
> > called from only two functions).
> >
> > So it would become
> > ptr = network_alloc();
> > and network_alloc() would be usual kmalloc or call for own allocator in
> > case of deadlock.
>
> There is more to networking that skbs only, what about route cache,
> there is quite a lot of allocs in this fib_* stuff, IGMP etc...

skbs are the most extensively used path.
Actually the same applies to routes - dst_entries and rtables are
allocated through their own wrappers.

> > > very high order pages, no?
> > >
> > > This means that you have to either allocate at boot time and cannot
> > > resize/add pools; which means you waste all that memory if the network
> > > load never comes near using the reserved amount.
> > >
> > > Or, you get into all the same trouble the hugepages folks are trying so
> > > very hard to solve.
> >
> > It is configurable - by default it takes pool of 32k pages for allocations for
> > jumbo-frames (e1000 requires such allocations for 9k frames
> > unfortunately), without jumbo-frame support it works with pool of 0-order
> > pages, which grows dynamically when needed.
>
> With 0-order pages, you can only fit 2 1500 byte packets in there, you
> could perhaps stick some small skb heads in there as well, but why
> bother, the waste isn't _that_ high.
>
> Esp if you would make a slab for 1500 mtu packets (5*1638 < 2*4096; and
> 1638 should be enough, right?)
>
> It would make sense to pack related objects into a page so you could
> free all together.

With power-of-two allocation SLAB wastes 500 bytes for each 1500-MTU
packet (roughly); that is actually one ACK packet - and I hear it from
a person who develops a system aimed at guaranteeing ACK
allocation in OOM :)
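
The rough arithmetic behind that number (sizes here are illustrative, not
exact): a 1500-byte frame plus link-layer headers is a bit over 1.5k, and
kmalloc rounds that up to the next power-of-two size class:

	#include <stdio.h>

	/* round a request up to the next power-of-two size class, as kmalloc does */
	static size_t size_class(size_t size)
	{
		size_t c = 32;

		while (c < size)
			c <<= 1;
		return c;
	}

	int main(void)
	{
		size_t need = 1538;			/* 1500-byte MTU plus ethernet header, roughly */
		size_t got  = size_class(need);		/* 2048 */

		printf("need %zu, class %zu, wasted %zu\n", need, got, got - need);
		return 0;
	}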

SLAB overhead is _very_ expensive for the network - what if a jumbo frame is
used? It becomes incredible in that case, although modern NICs allow
scatter-gather, which is aimed at fixing the problem.

Cache misses for small-packet flows, due to the fact that the same data
is allocated, freed and accessed on different CPUs, will become an
issue soon - not right now, since two-to-four-core CPUs are not yet
very popular and the price of a cache miss is not _that_ high.

> > > > performs self-defragmentation of the memeory,
> > >
> > > Does it move memory about?
> >
> > It works in a page, not as pages - when neighbour regions are freed,
> > they are combined into single one with bigger size
>
> Yeah, that is not defragmentation, defragmentation is moving active
> regions about to create contiguous free space. What you do is free space
> coalescence.

That is the wrong definition just because no one has developed a different system.
Defragmentation is a result of a broken system.

The existing design _does_not_ allow the situation where a whole page
belongs to the same cache after it was actively used; the same
applies to the situation where several pages, which form a contiguous
region, are used by different users, so people start developing VM tricks
to move pages around so they would be placed near each other in address space.

Do not fix the result, fix the reason.

> > but network stack requires high-order allocations in extremely rare
> > cases of broken design (Intel folks, sorry, but your hardware sucks in
> > that regard - jumbo frame of 9k should not require 16k of mem plu
> > network overhead).
>
> Well, if you have such hardware its not rare at all, But yeah that
> sucks.

They do a good job developing different approaches to work around that
hardware 'feature', but this is still a wrong situation.

> > NTA also does not align buffers to the power of two - extremely significant
> > win of that approach can be found on project's homepage with graps of
> > failed allocations and state of the mem for different sizes of
> > allocaions. Power-of-two overhead of SLAB is extremely high.
>
> Sure you can pack the page a little better(*), but I thought the main
> advantage was a speed increase.
>
> (*) memory is generally cheaper than engineering efforts, esp on this
> scale. The only advantage in the manual packing is that (with the fancy
> hardware stream engine mentioned below) you could ensure they are
> grouped together (then again, the hardware stream engine would, together
> with a SG-DMA engine, take care of that).

The extensive way of doing things.
That is wrong.

> > > All it does is try to avoid fragmentation by policy - a problem
> > > impossible to solve in general; but can achieve good results in view of
> > > practical limitations on program behaviour.
> > >
> > > Does your policy work for the given workload? we'll see.
> > >
> > > Also, on what level, each level has both internal and external
> > > fragmentation. I can argue that having large immovable objects in memory
> > > adds to the fragmentation issues on the page-allocator level.
> >
> > NTA works with pages, not with contiguous memory, it reduces
> > fragmentation inside pages, which can not be solved in SLAB, where
> > objects from the same page can live in different caches and thus _never_
> > can be combined. Thus, the only soultuin for SLAB is copy, which is not a
> > good one for big sizes and is just wrong for big pages.
>
> By allocating, and never returning the page to the page-allocator you've
> increased the fragmentation on the page-allocator level significantly.
> It will avoid a super page ever forming around that page.

Not at all - SLAB fragmentation is so high that stealing pages from its
highly fragmented pool does not result in any loss or win for SLAB
users. And it is possible to allocate at boot time.

The NTA cache grows in _very_ rare cases, and it can be preallocated at
startup.

> > It is not about page moving and VM tricks, which are generally described
> > as fragmentation avoidance technique, but about how fragmentation
> > problem is solved in one page.
>
> Short of defragmentation (move active regions about) fragmentation is an
> unsolved problem. For any heuristic there is a pattern that will defeat
> it.
>
> Luckily program allocation behaviour is usually very regular (or
> decomposable in well behaved groups).

We are talking about different approaches here.
Per-page defragmentation by playing games with memory management is one
approach. Run-time defragmentation by grouping neighbouring regions is
another one.

The main issue is the fact that, with the second one, the requirement for the
first one becomes MUCH smaller, since when an application, no matter how
strange its allocation pattern is, frees an object, it will be grouped with
its neighbours. In SLAB that will almost never happen, hence the situation
with memory tricks.

> > > > is very SMP
> > > > friendly in that regard that it is per-cpu like slab and never free
> > > > objects on different CPUs, so they always stay in the same cache.
> > >
> > > This makes it very hard to guarantee a reserve limit. (Not impossible,
> > > just more difficult)
> >
> > The whole pool of pages becomes reserve, since no one (and mainly VFS)
> > can consume that reserve.
>
> Ah, but there you violate my requirement, any network allocation can
> claim the last bit of memory. The whole idea was that the reserve is
> explicitly managed.
>
> It not only needs protection from other users but also from itself.

Specifying some users as good and others as bad generally tends toward very
bad behaviour. Your approach only covers some users; mine does not
differentiate between users, but prevents the system from such a situation at all.

> > > > Among other goodies it allows to have full sending/receiving zero-copy.
> > >
> > > That won't ever work unless you have page aligned objects, otherwise you
> > > cannot map them into user-space. Which seems to be at odds with your
> > > tight packing/reduce internal fragmentation goals.
> > >
> > > Zero-copy entails mapping the page the hardware writes the packet in
> > > into user-space, right?
> > >
> > > Since its impossible to predict to whoem the next packet is addressed
> > > the packets must be written (by hardware) to different pages.
> >
> > Yes, receiving zero-copy without appropriate hardware assist is
> > impossible, so either absence of such facility at all, or special overhead,
> > which forces object to lie in different pages. With hardware assist it
> > would be possible to select a flow in advance, so data would be packet
> > in the same page.
>
> I was not aware that hardware could order the packets in such a fashion.
> Yes, if it can do that it becomes doable.

Not the hardware, but the allocator, which can provide data with special
requirements like alignment and offset for a given flow id.
The hardware just provides the needed info and does the DMA transfer into the specified area.

You can find more on receive zero-copy with _emulation_ of such
hardware (MMIO copy of the header), for example, here:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=recv_zero_copy

where the system performed reception of 1500-MTU-sized frames directly
into the VFS cache.

> > Sending zero-copy from userspace memory does not suffer with any such
> > problem.
>
> True, that is properly ordered. But for that I'm not sure how NTA (you
> really should change that name, there is no Tree anymore) helps here.

Because the user has access to the memory which will be used directly by the
hardware, it should not care about preallocation, although there are
problems with notification about completion of the operation, which can be
postponed in case fancy egress filters are used.

--
Evgeniy Polyakov

2007-01-18 17:33:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

On Thu, 2007-01-18 at 18:50 +0300, Evgeniy Polyakov wrote:
> On Thu, Jan 18, 2007 at 04:10:52PM +0100, Peter Zijlstra ([email protected]) wrote:
> > On Thu, 2007-01-18 at 16:58 +0300, Evgeniy Polyakov wrote:
> >
> > > Network is special in this regard, since it only has one allocation path
> > > (actually it has one cache for skb, and usual kmalloc, but they are
> > > called from only two functions).
> > >
> > > So it would become
> > > ptr = network_alloc();
> > > and network_alloc() would be usual kmalloc or call for own allocator in
> > > case of deadlock.
> >
> > There is more to networking that skbs only, what about route cache,
> > there is quite a lot of allocs in this fib_* stuff, IGMP etc...
>
> skbs are the most extensively used path.
> Actually the same is applied to route - dst_entries and rtable are
> allocated through own wrappers.

Still, edit all places and perhaps forget one and make sure all new code
doesn't forget about it, or pick a solution that covers everything.

> With power-of-two allocation SLAB wastes 500 bytes for each 1500 MTU
> packet (roughly), it is actaly one ACK packet - and I hear it from
> person who develops a system, which is aimed to guarantee ACK
> allocation in OOM :)

I need full data traffic during OOM, not just a single ACK.

> SLAB overhead is _very_ expensive for network - what if jumbo frame is
> used? It becomes incredible in that case, although modern NICs allows
> scatter-gather, which is aimed to fix the problem.

Jumbo frames are fine if the hardware can do SG-DMA..

> Cache misses for small packet flow due to the fact, that the same data
> is allocated and freed and accessed on different CPUs will become an
> issue soon, not right now, since two-four core CPUs are not yet to be
> very popular and price for the cache miss is not _that_ high.

SGI does networking too, right?

> > > > > performs self-defragmentation of the memeory,
> > > >
> > > > Does it move memory about?
> > >
> > > It works in a page, not as pages - when neighbour regions are freed,
> > > they are combined into single one with bigger size
> >
> > Yeah, that is not defragmentation, defragmentation is moving active
> > regions about to create contiguous free space. What you do is free space
> > coalescence.
>
> That is wrong definition just because no one developed different system.
> Defragmentation is a result of broken system.
>
> Existing design _does_not_ allow to have the situation when whole page
> belongs to the same cache after it was actively used, the same is
> applied to the situation when several pages, which create contiguous
> region, are used by different users, so people start develop VM tricks
> to move pages around so they would be placed near in address space.
>
> Do not fix the result, fix the reason.

*plonk* 30+yrs of research ignored.

> > > The whole pool of pages becomes reserve, since no one (and mainly VFS)
> > > can consume that reserve.
> >
> > Ah, but there you violate my requirement, any network allocation can
> > claim the last bit of memory. The whole idea was that the reserve is
> > explicitly managed.
> >
> > It not only needs protection from other users but also from itself.
>
> Specifying some users as good and others as bad generally tends to very
> bad behaviour. Your appwoach only covers some users, mine does not
> differentiate between users,

The kernel is special, right? It has priority over whatever user-land
does.

> but prevents system from such situation at all.

I'm not seeing that; with your approach nobody stops the kernel from
filling up the memory with user-space network traffic.

Swapping is not some random user process, it's a fundamental kernel task;
if this fails the machine is history.

2007-01-18 18:35:39

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

On Thu, Jan 18, 2007 at 06:31:53PM +0100, Peter Zijlstra ([email protected]) wrote:

> > skbs are the most extensively used path.
> > Actually the same is applied to route - dst_entries and rtable are
> > allocated through own wrappers.
>
> Still, edit all places and perhaps forget one and make sure all new code
> doesn't forget about it, or pick a solution that covers everything.

There is _one_ place for allocation of any kind of object.
The skb path has two places.

> > With power-of-two allocation SLAB wastes 500 bytes for each 1500 MTU
> > packet (roughly), it is actaly one ACK packet - and I hear it from
> > person who develops a system, which is aimed to guarantee ACK
> > allocation in OOM :)
>
> I need full data traffic during OOM, not just a single ACK.

But your code exactly limits the codepath to several allocations, which must
be ACKs. You do not have enough reserve to support the whole traffic.
So the right solution, IMO, is to _prevent_ such a situation, which means
that allocation is not allowed to depend on external conditions like the
VFS.

Actually my above sentences were about the case that, only by having a
different allocator, it is possible to dramatically change the memory usage
model, which suffers greatly from power-of-two allocations. The OOM
condition is one of the results which has big SLAB overhead among other
roots. Actually all paths which work with kmem_cache are safe against
it, since the kernel cache packs objects, but those who use raw kmalloc have
problems.

> > SLAB overhead is _very_ expensive for network - what if jumbo frame is
> > used? It becomes incredible in that case, although modern NICs allows
> > scatter-gather, which is aimed to fix the problem.
>
> Jumbo frames are fine if the hardware can do SG-DMA..

Notice the word _IF_ in your sentence. e1000 for example cannot (or it can,
but the driver is not developed for such a scenario).

> > Cache misses for small packet flow due to the fact, that the same data
> > is allocated and freed and accessed on different CPUs will become an
> > issue soon, not right now, since two-four core CPUs are not yet to be
> > very popular and price for the cache miss is not _that_ high.
>
> SGI does networking too, right?

Yep, Christoph Lameter developed his own allocator too.

I agree with you that if that price is too high already, then it is an
additional sign to look into the network tree allocator (yep, the name is bad)
again.

> > That is wrong definition just because no one developed different system.
> > Defragmentation is a result of broken system.
> >
> > Existing design _does_not_ allow to have the situation when whole page
> > belongs to the same cache after it was actively used, the same is
> > applied to the situation when several pages, which create contiguous
> > region, are used by different users, so people start develop VM tricks
> > to move pages around so they would be placed near in address space.
> >
> > Do not fix the result, fix the reason.
>
> *plonk* 30+yrs of research ignored.

30 years to develop the SLAB allocator? In what universe is that?

> > > > The whole pool of pages becomes reserve, since no one (and mainly VFS)
> > > > can consume that reserve.
> > >
> > > Ah, but there you violate my requirement, any network allocation can
> > > claim the last bit of memory. The whole idea was that the reserve is
> > > explicitly managed.
> > >
> > > It not only needs protection from other users but also from itself.
> >
> > Specifying some users as good and others as bad generally tends to very
> > bad behaviour. Your appwoach only covers some users, mine does not
> > differentiate between users,
>
> The kernel is special, right? It has priority over whatever user-land
> does.

The kernel only does ACK generation and allocation for userspace.
The kernel does not know which users are potentially good or bad, and
if you export this socket option to userspace, everyone will
think that his application is good enough to use the reserve.

So, for the kernel-only side you just need to preallocate a pool of packets
and use them when the system is in OOM (reclaim). For the long term,
a new approach to memory allocation should be developed, and there are
different works in that direction - NTA is one of them and not the only
one; for the best results it must be combined with vm-tricks
defragmentation too.

> > but prevents system from such situation at all.
>
> I'm not seeing that, with your approach nobody stops the kernel from
> filling up the memory with user-space network traffic.
>
> swapping is not some random user process, its a fundamental kernel task,
> if this fails the machine is history.

You completely miss the point. The main goals are to
1. reduce fragmentation and/or enable self-defragmentation (which is
done in NTA); this also reduces memory usage.
2. perform correct recovery steps in OOM - reduce memory usage, use a
different allocator and/or reserve (which is the case where NTA can be
used).
3. not allow an OOM condition - unfortunately it is not always possible,
but having separated allocation allows one not to depend on external
conditions such as VFS memory usage; thus this approach reduces the
conditions under which a memory deadlock related to the network path can happen.

Let me briefly describe your approach and possible drawbacks in it.
You start reserving some memory when the system is under memory pressure.
When the system is in real trouble, you start using that reserve for special
tasks, mainly for the network path, to allocate packets and process them in
order to get some memory swapping committed.

So, the problems I see here are the following:
1. It is possible that when you are starting to create a reserve, there
will not be enough memory at all. So the solution is to reserve in
advance.
2. You differentiate by hand between critical and non-critical
allocations by specifying some kernel users as potentially allowed to
allocate from the reserve. This does not prevent the NVIDIA module from
allocating from that reserve too, does it? And you artificially limit the
system to process only tiny bits of what it must do, thus potentially
leaving out paths which must use the reserve too.

So, the solution is to have a reserve in advance, and manage it using a
special path when the system is in OOM. So you will have a network memory
reserve, which will be used when the system is in trouble. It is very
similar to what you had.

But the whole reserve may never be used at all, so it should be usable,
but not by those who can create an OOM condition; thus it should be
exported to, for example, the network only, and when the system is in trouble,
the network would still be functional (although only the critical paths).

An even further development of such an idea is to prevent such an OOM condition
at all - by starting swapping early (but wisely) and reducing memory
usage.

The network tree allocator covers exactly the above cases.
The advertisement is over.

--
Evgeniy Polyakov

2007-01-19 12:55:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.


> Let me briefly describe your approach and possible drawbacks in it.
> You start reserving some memory when systems is under memory pressure.
> when system is in real trouble, you start using that reserve for special
> tasks mainly for network path to allocate packets and process them in
> order to get committed some memory swapping.
>
> So, the problems I see here, are following:
> 1. it is possible that when you are starting to create a reserve, there
> will not be enough memeory at all. So the solution is to reserve in
> advance.

Swap is usually enabled at startup, but sure, if you want you can mess
this up.

> 2. You differentiate by hand between critical and non-critical
> allocations by specifying some kernel users as potentially possible to
> allocate from reserve.

True, all sockets that are needed for swap, no-one else.

> This does not prevent from NVIDIA module to
> allocate from that reserve too, does it?

All users of the NVidiot crap deserve all the pain they get.
If it breaks they get to keep both pieces.

> And you artificially limit
> system to process only tiny bits of what it must do, thus potentially
> leaking pathes which must use reserve too.

How so? I cover pretty much every allocation needed to process an skb by
setting PF_MEMALLOC - the only drawback there is that the reserve might
not actually be large enough because it covers more allocations than
were considered. (That's one of the TODO items: validate the reserve
functions' parameters.)

> So, solution is to have a reserve in advance, and manage it using
> special path when system is in OOM. So you will have network memory
> reserve, which will be used when system is in trouble. It is very
> similar to what you had.
>
> But the whole reserve can never be used at all, so it should be used,
> but not by those who can create OOM condition, thus it should be
> exported to, for example, network only, and when system is in trouble,
> network would be still functional (although only critical pathes).

But the network can create OOM conditions for itself just fine.

Consider the remote storage disappearing for a while (it got rebooted,
someone tripped over the wire etc..). Now the rest of the network
traffic keeps coming and will queue up - because user-space is stalled,
waiting for more memory - and we run out of memory.

There must be a point where we start dropping packets that are not
critical to the survival of the machine.

> Even further development of such idea is to prevent such OOM condition
> at all - by starting swapping early (but wisely) and reduce memory
> usage.

These just postpone execution but will not avoid it.


2007-01-19 17:55:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

On Thu, 18 Jan 2007, Peter Zijlstra wrote:

>
> > Cache misses for small packet flow due to the fact, that the same data
> > is allocated and freed and accessed on different CPUs will become an
> > issue soon, not right now, since two-four core CPUs are not yet to be
> > very popular and price for the cache miss is not _that_ high.
>
> SGI does networking too, right?

Sslab deals with those issues the right way. We have per-processor
queues that attempt to keep the cache-hot state. A special shared queue
exists between neighboring processors to facilitate the exchange of objects
between them.

2007-01-19 22:57:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra ([email protected]) wrote:
> > 2. You differentiate by hand between critical and non-critical
> > allocations by specifying some kernel users as potentially possible to
> > allocate from reserve.
>
> True, all sockets that are needed for swap, no-one else.
>
> > This does not prevent from NVIDIA module to
> > allocate from that reserve too, does it?
>
> All users of the NVidiot crap deserve all the pain they get.
> If it breaks they get to keep both pieces.

I meant that pretty much anyone can be such a user, who can just add a bit
into their own gfp_flags which are used for allocation.

> > And you artificially limit
> > system to process only tiny bits of what it must do, thus potentially
> > leaking pathes which must use reserve too.
>
> How so? I cover pretty much every allocation needed to process an skb by
> setting PF_MEMALLOC - the only drawback there is that the reserve might
> not actually be large enough because it covers more allocations that
> were considered. (thats one of the TODO items, validate the reserve
> functions parameters)

You only covered ipv4/v6 and arp, maybe some route updates.
But it is very possible that some allocations are missed, like
multicast/broadcast. Selecting only special paths out of all
possible network allocations tends to create a situation where something
is missed or cross-dependent on other paths.

> > So, solution is to have a reserve in advance, and manage it using
> > special path when system is in OOM. So you will have network memory
> > reserve, which will be used when system is in trouble. It is very
> > similar to what you had.
> >
> > But the whole reserve can never be used at all, so it should be used,
> > but not by those who can create OOM condition, thus it should be
> > exported to, for example, network only, and when system is in trouble,
> > network would be still functional (although only critical pathes).
>
> But the network can create OOM conditions for itself just fine.
>
> Consider the remote storage disappearing for a while (it got rebooted,
> someone tripped over the wire etc..). Now the rest of the network
> traffic keeps coming and will queue up - because user-space is stalled,
> waiting for more memory - and we run out of memory.

Hmm... Neither UDP nor TCP actually works that way.

> There must be a point where we start dropping packets that are not
> critical to the survival of the machine.

You still can drop them; the main point is that network allocations do
not depend on other allocations.

> > Even further development of such idea is to prevent such OOM condition
> > at all - by starting swapping early (but wisely) and reduce memory
> > usage.
>
> These just postpone execution but will not avoid it.

No. If the system allows such a condition, then
something is broken. It must be prevented, instead of creating special
hacks to recover from it.

--
Evgeniy Polyakov

2007-01-20 22:58:50

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

Evgeniy Polyakov wrote:
> On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra ([email protected]) wrote:

>>> Even further development of such idea is to prevent such OOM condition
>>> at all - by starting swapping early (but wisely) and reduce memory
>>> usage.
>> These just postpone execution but will not avoid it.
>
> No. If system allows to have such a condition, then
> something is broken. It must be prevented, instead of creating special
> hacks to recover from it.

Evgeniy, you may want to learn something about the VM before
stating that reality should not occur.

Due to the way everything in the kernel works, you cannot
prevent the memory allocator from allocating everything and
running out, except maybe by setting aside reserves to deal
with special subsystems.

As for your "swapping early and reduce memory usage", that is
just not possible in a system where a memory writeout may need
one or more memory allocations to succeed and other I/O paths
(eg. file writes) can take memory from the same pools.

With something like iscsi it may be _necessary_ for file writes
and swap to take memory from the same pools, because they can
share the same block device.

Please get out of your fantasy world and accept the constraints
the VM has to operate under. Maybe then you and Peter can agree
on something.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.

2007-01-21 01:47:16

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

On Sat, Jan 20, 2007 at 05:36:03PM -0500, Rik van Riel ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra
> >([email protected]) wrote:
>
> >>>Even further development of such idea is to prevent such OOM condition
> >>>at all - by starting swapping early (but wisely) and reduce memory
> >>>usage.
> >>These just postpone execution but will not avoid it.
> >
> >No. If system allows to have such a condition, then
> >something is broken. It must be prevented, instead of creating special
> >hacks to recover from it.
>
> Evgeniy, you may want to learn something about the VM before
> stating that reality should not occur.

I.e., I should start believing that OOM cannot be prevented, bugs cannot
be fixed and things cannot be changed, just because that is how things are
right now? That is why I'm not subscribed to lkml :)

> Due to the way everything in the kernel works, you cannot
> prevent the memory allocator from allocating everything and
> running out, except maybe by setting aside reserves to deal
> with special subsystems.
>
> As for your "swapping early and reduce memory usage", that is
> just not possible in a system where a memory writeout may need
> one or more memory allocations to succeed and other I/O paths
> (eg. file writes) can take memory from the same pools.

When a system starts swapping only when it cannot allocate a new page,
then it is a broken system. I bet you get warm clothing way before your
hands are frostbitten, and you do not have a liter of alcohol in your
pocket for such an emergency. And to get warm clothing you still need to
go across a cold street into the shop, but you will do it before the weather
becomes arctic.

> With something like iscsi it may be _necessary_ for file writes
> and swap to take memory from the same pools, because they can
> share the same block device.

Of course swapping can require additional allocations; when it happens
over the network that is quite obvious.

The main problem is that if the system has been put into a state where
its life depends on the last possible allocation, then it is broken.

There is a light connected to a car's fuel tank which starts blinking
when the amount of fuel drops below a predefined level. The car does
not just stop suddenly and start drawing fuel from a reserve (well,
eventually it stops, but it warns about the problem long before it
dies).

> Please get out of your fantasy world and accept the constraints
> the VM has to operate under. Maybe then you and Peter can agree
> on something.

I cannot accept a situation where the problem is not fixed and instead
a recovery path is added. There must be both ways of dealing with it -
emergency, force majeure recovery and preventive steps.

What we are talking about (apart from pointing out obvious things and
sending each other back to school), at least as I see it, is ways of
dealing with a possible OOM condition. If OOM has happened, then there
must be a recovery path, but OOM must also be prevented, and ways to
do that were described too.

> --
> Politics is the struggle between those who want to make their country
> the best in the world, and those who believe it already is. Each group
> calls the other unpatriotic.

--
Evgeniy Polyakov

2007-01-21 02:15:04

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

> On Sat, Jan 20, 2007 at 05:36:03PM -0500, Rik van Riel ([email protected]) wrote:
> > Due to the way everything in the kernel works, you cannot
> > prevent the memory allocator from allocating everything and
> > running out, except maybe by setting aside reserves to deal
> > with special subsystems.

As for the technical side you describe, this is exactly what I
proposed - a special dedicated pool which does not depend on the main
system allocator, so that if the latter is empty, the former still
_can_ work, although it is possible that it will run empty too.

Separation.
It removes the avalanche effect where one problem produces several
different ones.

I do not say that any particular allocator is the best for dealing
with such a situation; I just pointed out that the critical paths were
separated in NTA, so they do not depend on each other's failure.

Actually that kind of separation was introduced long ago with memory
pools; this is a continuation of it which adds a lot of additional,
extremely useful features.
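
For reference, a minimal sketch of that existing mempool separation
(the object size, reserve count and names below are illustrative
assumptions, not anything from NTA or from Peter's patch): the pool
pre-allocates a fixed number of objects, and mempool_alloc() falls
back to them when the regular allocator fails.

#include <linux/mempool.h>
#include <linux/slab.h>

#define PKT_RESERVE	64	/* objects guaranteed even under OOM */
#define PKT_SIZE	256	/* illustrative object size in bytes */

static mempool_t *pkt_pool;

static int __init pkt_pool_init(void)
{
	/*
	 * mempool_kmalloc/mempool_kfree are the stock kmalloc-backed
	 * helpers; the pool keeps PKT_RESERVE pre-allocated objects
	 * around so mempool_alloc() still succeeds when kmalloc() fails.
	 */
	pkt_pool = mempool_create(PKT_RESERVE, mempool_kmalloc,
				  mempool_kfree,
				  (void *)(unsigned long)PKT_SIZE);
	return pkt_pool ? 0 : -ENOMEM;
}

static void *pkt_alloc(gfp_t gfp)
{
	/* May sleep for __GFP_WAIT masks until an object is returned
	 * with mempool_free(), but does not fail while the reserve holds. */
	return mempool_alloc(pkt_pool, gfp);
}

The catch, as noted above, is that such a pool can still run dry if
the objects taken from it are not guaranteed to come back.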

NTA, used for network allocations, is that pool, since in real life
packets cannot be allocated in advance without memory overhead. For
simple cases like generating only ACKs it is possible, which is what I
suggested first, but the long-term solution is a special allocator.
I selected NTA for this task because it has _additional_ features like
self-defragmentation, which is a very useful property for networking,
but if only the OOM recovery condition is of concern, then of course
any other allocator could be used.

--
Evgeniy Polyakov

2007-01-21 16:31:15

by Rik van Riel

[permalink] [raw]
Subject: Re: Possible ways of dealing with OOM conditions.

Evgeniy Polyakov wrote:
> On Sat, Jan 20, 2007 at 05:36:03PM -0500, Rik van Riel ([email protected]) wrote:
>> Evgeniy Polyakov wrote:
>>> On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra
>>> ([email protected]) wrote:
>>>>> Even further development of such idea is to prevent such OOM condition
>>>>> at all - by starting swapping early (but wisely) and reduce memory
>>>>> usage.
>>>> These just postpone execution but will not avoid it.
>>> No. If system allows to have such a condition, then
>>> something is broken. It must be prevented, instead of creating special
>>> hacks to recover from it.
>> Evgeniy, you may want to learn something about the VM before
>> stating that reality should not occur.
>
> I.e. I should start believing that OOM can not be prevented, bugs can
> not be fixed and things can not be changed just because it happens right
> now? That is why I'm not subscribed to lkml :)

The reasons for this are often not inside the VM itself,
but are due to the constraints imposed on the VM.

For example, with many of the journaled filesystems there
is no way to know in advance how much IO needs to be done
to complete a writeout of one dirty page (and consequently,
how much memory needs to be allocated to complete this one
writeout).

Parts of the VM could be changed to reduce the pressure
somewhat, e.g. limiting the number of IOs in flight, but
that would probably have performance consequences that may
not be acceptable to Andrew and Linus, so it would never get
merged.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.