Hi,
The latest version of the VM deadlock prevention work.
The basic premise is that network sockets serving the VM need undisturbed
functionality in the face of severe memory shortage.
This patch-set provides a framework to accomplish this.
The core of the VM deadlock avoidance framework.
In order to provide robust networked block devices there must be a guarantee
of progress. That is, the block device must never stall under OOM, because
the device itself might be needed to get out of OOM (reclaim pageout).
This means that the device queue must always be unpluggable, which in turn means
that it must always find enough memory to build/send packets over the network
_and_ receive ACKs for those packets.
The network stack has a huge capacity for buffering packets while waiting for
user-space to read them. There is a practical limit imposed to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network block device waiting for an ACK to free a page)?
Memory pressure compounds this: what if there is simply no memory left to
receive packets in?
This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise that the socket will never block on receive. Along with a memory
reserve that will service a limited number of packets, this can guarantee full
service to these critical sockets.
By making sure that packets allocated from the reserve only service critical
sockets, we will not lose the memory and can guarantee progress.
Since memory is tight and the reserve modest, we do not want to lose memory to
fragmentation effects. Hence a very simple allocator is used to guarantee that
the memory used for each packet is returned to the page allocator.
Converted protocols:
IPv4 & IPv6:
- icmp
- udp
- tcp
IPv4:
- igmp
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>
---
include/linux/gfp.h | 3 -
include/linux/mmzone.h | 1
include/linux/skbuff.h | 13 ++++--
include/net/sock.h | 37 +++++++++++++++++
mm/page_alloc.c | 41 ++++++++++++++++++-
net/core/skbuff.c | 103 ++++++++++++++++++++++++++++++++++++++++++-------
net/core/sock.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++
net/ipv4/af_inet.c | 3 +
net/ipv4/icmp.c | 3 +
net/ipv4/igmp.c | 3 +
net/ipv4/tcp_ipv4.c | 3 +
net/ipv4/udp.c | 8 +++
net/ipv6/af_inet6.c | 3 +
net/ipv6/icmp.c | 3 +
net/ipv6/tcp_ipv6.c | 3 +
net/ipv6/udp.c | 3 +
16 files changed, 305 insertions(+), 22 deletions(-)
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_EMERG ((__force gfp_t)0x40000u) /* Use emergency reserves */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_EMERG)
/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int bytes);
#include <linux/topology.h>
/* Returns the number of the current Node. */
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -282,7 +282,8 @@ struct sk_buff {
nfctinfo:3;
__u8 pkt_type:3,
fclone:2,
- ipvs_property:1;
+ ipvs_property:1,
+ emerg:1;
__be16 protocol;
void (*destructor)(struct sk_buff *skb);
@@ -327,10 +328,13 @@ struct sk_buff {
#include <asm/system.h>
+#define SKB_ALLOC_FCLONE 0x01
+#define SKB_ALLOC_RX 0x02
+
extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone);
+ gfp_t priority, int flags);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
@@ -340,7 +344,7 @@ static inline struct sk_buff *alloc_skb(
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1);
+ return __alloc_skb(size, priority, SKB_ALLOC_FCLONE);
}
extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
@@ -1101,7 +1105,8 @@ static inline void __skb_queue_purge(str
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb =
+ __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -391,6 +391,7 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_VMIO, /* promise to never block on receive */
};
static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,42 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}
+static inline int sk_is_vmio(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_VMIO);
+}
+
+#define MAX_PAGES_PER_PACKET 2
+#define MAX_FRAGMENTS ((65536 + 1500 - 1) / 1500)
+/*
+ * Set an upper limit on the number of pages used for RX skbs.
+ */
+#define RX_RESERVE_PAGES (64 * MAX_PAGES_PER_PACKET)
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ */
+#define TX_RESERVE_PAGES \
+ (4 * MAX_FRAGMENTS * MAX_PAGES_PER_PACKET)
+
+extern atomic_t vmio_socks;
+extern atomic_t emerg_rx_pages_used;
+
+static inline int emerg_rx_pages_try_inc(void)
+{
+ return atomic_read(&vmio_socks) &&
+ atomic_add_unless(&emerg_rx_pages_used, 1, RX_RESERVE_PAGES);
+}
+
+static inline void emerg_rx_pages_dec(void)
+{
+ atomic_dec(&emerg_rx_pages_used);
+}
+
+extern int sk_adjust_memalloc(int socks, int request_queues);
+extern int sk_set_vmio(struct sock *sk);
+extern int sk_clear_vmio(struct sock *sk);
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
+int var_free_kbytes;
unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
@@ -970,8 +971,8 @@ restart:
/* This allocation should allow future memory freeing. */
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
+ if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt()) || (gfp_mask & __GFP_EMERG)) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
@@ -2196,7 +2197,8 @@ static void setup_per_zone_lowmem_reserv
*/
void setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned pages_min = (min_free_kbytes + var_free_kbytes)
+ >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -2248,6 +2250,39 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
}
+/**
+ * adjust_memalloc_reserve - adjust the memalloc reserve
+ * @pages: number of pages to add
+ *
+ * It adds a number of pages to the memalloc reserve; if
+ * the number was positive it kicks kswapd into action to
+ * satisfy the higher watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+ int kbytes;
+ int err = 0;
+
+ kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
+ if (kbytes < 0) {
+ err = -EINVAL;
+ goto out;
+ }
+ var_free_kbytes = kbytes;
+ setup_per_zone_pages_min();
+ if (pages > 0) {
+ struct zone *zone;
+ for_each_zone(zone)
+ wakeup_kswapd(zone, 0);
+ }
+ printk(KERN_DEBUG "Emergency reserve: %d\n", var_free_kbytes);
+
+out:
+ return err;
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
/*
* Initialise min_free_kbytes.
*
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -139,28 +139,30 @@ EXPORT_SYMBOL(skb_truesize_bug);
* Buffers may only be allocated from interrupts using a @gfp_mask of
* %GFP_ATOMIC.
*/
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone)
+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int flags)
{
kmem_cache_t *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ size = SKB_DATA_ALIGN(size);
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;
/* Get the HEAD */
skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);
if (!skb)
- goto out;
+ goto noskb;
/* Get the DATA. Size must match skb_add_mtu(). */
- size = SKB_DATA_ALIGN(size);
data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;
+allocated:
memset(skb, 0, offsetof(struct sk_buff, truesize));
+ skb->emerg = !cache;
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -177,7 +179,7 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;
- if (fclone) {
+ if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff *child = skb + 1;
atomic_t *fclone_ref = (atomic_t *) (child + 1);
@@ -185,13 +187,48 @@ struct sk_buff *__alloc_skb(unsigned int
atomic_set(fclone_ref, 1);
child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->emerg = skb->emerg;
}
out:
return skb;
+
nodata:
kmem_cache_free(cache, skb);
skb = NULL;
- goto out;
+noskb:
+ /* Attempt emergency allocation when RX skb. */
+ if (!(flags & SKB_ALLOC_RX))
+ goto out;
+
+ if (size + sizeof(struct skb_shared_info) > PAGE_SIZE)
+ goto out;
+
+ if (!emerg_rx_pages_try_inc())
+ goto out;
+
+ skb = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
+ if (!skb) {
+ WARN_ON(1); /* we were promised memory but didn't get it? */
+ goto dec_out;
+ }
+
+ if (!emerg_rx_pages_try_inc())
+ goto skb_free_out;
+
+ data = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
+ if (!data) {
+ WARN_ON(1); /* we were promised memory but didn't get it? */
+ emerg_rx_pages_dec();
+skb_free_out:
+ free_page((unsigned long)skb);
+ skb = NULL;
+dec_out:
+ emerg_rx_pages_dec();
+ goto out;
+ }
+
+ cache = NULL;
+ goto allocated;
}
/**
@@ -267,7 +304,7 @@ struct sk_buff *__netdev_alloc_skb(struc
{
struct sk_buff *skb;
- skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -315,10 +352,20 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);
- kfree(skb->head);
+ if (skb->emerg) {
+ free_page((unsigned long)skb->head);
+ emerg_rx_pages_dec();
+ } else
+ kfree(skb->head);
}
}
+static void emerg_free_skb(struct kmem_cache *cache, void *objp)
+{
+ free_page((unsigned long)objp);
+ emerg_rx_pages_dec();
+}
+
/*
* Free an skbuff by memory without cleaning the state.
*/
@@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
{
struct sk_buff *other;
atomic_t *fclone_ref;
+ void (*free_skb)(struct kmem_cache *, void *);
skb_release_data(skb);
+
+ free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
+
switch (skb->fclone) {
case SKB_FCLONE_UNAVAILABLE:
- kmem_cache_free(skbuff_head_cache, skb);
+ free_skb(skbuff_head_cache, skb);
break;
case SKB_FCLONE_ORIG:
fclone_ref = (atomic_t *) (skb + 2);
if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, skb);
+ free_skb(skbuff_fclone_cache, skb);
break;
case SKB_FCLONE_CLONE:
@@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
skb->fclone = SKB_FCLONE_UNAVAILABLE;
if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, other);
+ free_skb(skbuff_fclone_cache, other);
break;
};
}
@@ -435,6 +486,17 @@ struct sk_buff *skb_clone(struct sk_buff
atomic_t *fclone_ref = (atomic_t *) (n + 1);
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
+ } else if (skb->emerg) {
+ if (!emerg_rx_pages_try_inc())
+ return NULL;
+
+ n = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
+ if (!n) {
+ WARN_ON(1);
+ emerg_rx_pages_dec();
+ return NULL;
+ }
+ n->fclone = SKB_FCLONE_UNAVAILABLE;
} else {
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
@@ -470,6 +532,7 @@ struct sk_buff *skb_clone(struct sk_buff
#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
C(ipvs_property);
#endif
+ C(emerg);
C(protocol);
n->destructor = NULL;
#ifdef CONFIG_NETFILTER
@@ -690,7 +753,21 @@ int pskb_expand_head(struct sk_buff *skb
size = SKB_DATA_ALIGN(size);
- data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ if (skb->emerg) {
+ if (size + sizeof(struct skb_shared_info) > PAGE_SIZE)
+ goto nodata;
+
+ if (!emerg_rx_pages_try_inc())
+ goto nodata;
+
+ data = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
+ if (!data) {
+ WARN_ON(1);
+ emerg_rx_pages_dec();
+ goto nodata;
+ }
+ } else
+ data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;
Index: linux-2.6/net/ipv4/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/icmp.c
+++ linux-2.6/net/ipv4/icmp.c
@@ -938,6 +938,9 @@ int icmp_rcv(struct sk_buff *skb)
goto error;
}
+ if (unlikely(skb->emerg))
+ goto drop;
+
if (!pskb_pull(skb, sizeof(struct icmphdr)))
goto error;
Index: linux-2.6/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_ipv4.c
+++ linux-2.6/net/ipv4/tcp_ipv4.c
@@ -1093,6 +1093,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;
+ if (unlikely(skb->emerg && !sk_is_vmio(sk)))
+ goto discard_and_relse;
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -1136,7 +1136,12 @@ int udp_rcv(struct sk_buff *skb)
sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);
if (sk != NULL) {
- int ret = udp_queue_rcv_skb(sk, skb);
+ int ret;
+
+ if (unlikely(skb->emerg && !sk_is_vmio(sk)))
+ goto drop_noncritical;
+
+ ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);
/* a return value > 0 means to resubmit the input, but
@@ -1147,6 +1152,7 @@ int udp_rcv(struct sk_buff *skb)
return 0;
}
+drop_noncritical:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto drop;
nf_reset(skb);
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c
+++ linux-2.6/net/ipv4/af_inet.c
@@ -132,6 +132,9 @@ void inet_sock_destruct(struct sock *sk)
{
struct inet_sock *inet = inet_sk(sk);
+ if (sk_is_vmio(sk))
+ sk_clear_vmio(sk);
+
__skb_queue_purge(&sk->sk_receive_queue);
__skb_queue_purge(&sk->sk_error_queue);
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -111,6 +111,7 @@
#include <linux/poll.h>
#include <linux/tcp.h>
#include <linux/init.h>
+#include <linux/blkdev.h>
#include <asm/uaccess.h>
#include <asm/system.h>
@@ -195,6 +196,102 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_reserve;
+static unsigned int vmio_request_queues;
+
+atomic_t vmio_socks;
+atomic_t emerg_rx_pages_used;
+EXPORT_SYMBOL_GPL(vmio_socks);
+EXPORT_SYMBOL_GPL(emerg_rx_pages_used);
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @nr_socks: number of new %SOCK_VMIO sockets
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * For each %SOCK_VMIO socket this device will reserve enough
+ * to send a few large packets (64k) at a time: %TX_RESERVE_PAGES.
+ *
+ * Assumption:
+ * - each %SOCK_VMIO socket will have a %request_queue associated.
+ *
+ * NOTE:
+ * %TX_RESERVE_PAGES is an upper-bound of memory used for TX hence
+ * we need not account the pages like we do for %RX_RESERVE_PAGES.
+ *
+ * On top of this comes a one time charge of:
+ * %RX_RESERVE_PAGES pages -
+ * number of pages alloted for emergency skb service to critical
+ * sockets.
+ */
+int sk_adjust_memalloc(int socks, int request_queues)
+{
+ unsigned long flags;
+ int reserve;
+ int nr_socks;
+ int err;
+
+ spin_lock_irqsave(&memalloc_lock, flags);
+
+ atomic_add(socks, &vmio_socks);
+ nr_socks = atomic_read(&vmio_socks);
+ BUG_ON(socks < 0);
+ vmio_request_queues += request_queues;
+
+ reserve = vmio_request_queues * TX_RESERVE_PAGES + /* outbound */
+ (!!socks) * RX_RESERVE_PAGES; /* inbound */
+
+ err = adjust_memalloc_reserve(reserve - memalloc_reserve);
+ if (err) {
+ printk(KERN_WARNING
+ "Unable to change reserve to: %d pages, error: %d\n",
+ reserve, err);
+ goto unlock;
+ }
+ memalloc_reserve = reserve;
+
+unlock:
+ spin_unlock_irqrestore(&memalloc_lock, flags);
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/**
+ * sk_set_vmio - sets %SOCK_VMIO
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_VMIO on a socket and increase the memalloc reserve
+ * accordingly.
+ */
+int sk_set_vmio(struct sock *sk)
+{
+ int err = 0;
+
+ if (!sock_flag(sk, SOCK_VMIO) &&
+ !(err = sk_adjust_memalloc(1, 0))) {
+ sock_set_flag(sk, SOCK_VMIO);
+ sk->sk_allocation |= __GFP_EMERG;
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_set_vmio);
+
+int sk_clear_vmio(struct sock *sk)
+{
+ int err = 0;
+
+ if (sock_flag(sk, SOCK_VMIO) &&
+ !(err = sk_adjust_memalloc(-1, 0))) {
+ sock_reset_flag(sk, SOCK_VMIO);
+ sk->sk_allocation &= ~__GFP_EMERG;
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_clear_vmio);
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;
Index: linux-2.6/net/ipv6/af_inet6.c
===================================================================
--- linux-2.6.orig/net/ipv6/af_inet6.c
+++ linux-2.6/net/ipv6/af_inet6.c
@@ -368,6 +368,9 @@ int inet6_destroy_sock(struct sock *sk)
struct sk_buff *skb;
struct ipv6_txoptions *opt;
+ if (sk_is_vmio(sk))
+ sk_clear_vmio(sk);
+
/* Release rx options */
if ((skb = xchg(&np->pktoptions, NULL)) != NULL)
Index: linux-2.6/net/ipv6/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv6/icmp.c
+++ linux-2.6/net/ipv6/icmp.c
@@ -599,6 +599,9 @@ static int icmpv6_rcv(struct sk_buff **p
ICMP6_INC_STATS_BH(idev, ICMP6_MIB_INMSGS);
+ if (unlikely(skb->emerg))
+ goto discard_it;
+
saddr = &skb->nh.ipv6h->saddr;
daddr = &skb->nh.ipv6h->daddr;
Index: linux-2.6/net/ipv6/tcp_ipv6.c
===================================================================
--- linux-2.6.orig/net/ipv6/tcp_ipv6.c
+++ linux-2.6/net/ipv6/tcp_ipv6.c
@@ -1216,6 +1216,9 @@ static int tcp_v6_rcv(struct sk_buff **p
if (!sk)
goto no_tcp_socket;
+ if (unlikely(skb->emerg && !sk_is_vmio(sk)))
+ goto discard_and_relse;
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
Index: linux-2.6/net/ipv6/udp.c
===================================================================
--- linux-2.6.orig/net/ipv6/udp.c
+++ linux-2.6/net/ipv6/udp.c
@@ -499,6 +499,9 @@ static int udpv6_rcv(struct sk_buff **ps
sk = udp_v6_lookup(saddr, uh->source, daddr, uh->dest, dev->ifindex);
if (sk == NULL) {
+ if (unlikely(skb->emerg))
+ goto discard;
+
if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard;
Index: linux-2.6/net/ipv4/igmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/igmp.c
+++ linux-2.6/net/ipv4/igmp.c
@@ -927,6 +927,9 @@ int igmp_rcv(struct sk_buff *skb)
return 0;
}
+ if (unlikely(skb->emerg))
+ goto drop;
+
if (!pskb_may_pull(skb, sizeof(struct igmphdr)))
goto drop;
Provide a block queue init function that allows setting an elevator,
and a function to pin the current elevator.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>
---
block/elevator.c | 5 +++++
block/ll_rw_blk.c | 12 ++++++++++--
include/linux/blkdev.h | 9 +++++++++
3 files changed, 24 insertions(+), 2 deletions(-)
Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -1899,6 +1899,14 @@ EXPORT_SYMBOL(blk_init_queue);
request_queue_t *
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
+ return blk_init_queue_node_elv(rfn, lock, node_id, NULL);
+}
+EXPORT_SYMBOL(blk_init_queue_node);
+
+request_queue_t *
+blk_init_queue_node_elv(request_fn_proc *rfn, spinlock_t *lock, int node_id,
+ char *elv_name)
+{
request_queue_t *q = blk_alloc_queue_node(GFP_KERNEL, node_id);
if (!q)
@@ -1939,7 +1947,7 @@ blk_init_queue_node(request_fn_proc *rfn
/*
* all done
*/
- if (!elevator_init(q, NULL)) {
+ if (!elevator_init(q, elv_name)) {
blk_queue_congestion_threshold(q);
return q;
}
@@ -1947,7 +1955,7 @@ blk_init_queue_node(request_fn_proc *rfn
blk_put_queue(q);
return NULL;
}
-EXPORT_SYMBOL(blk_init_queue_node);
+EXPORT_SYMBOL(blk_init_queue_node_elv);
int blk_get_queue(request_queue_t *q)
{
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h
+++ linux-2.6/include/linux/blkdev.h
@@ -444,6 +444,12 @@ struct request_queue
#define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */
#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */
#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_ELVPINNED 9 /* pin the current elevator */
+
+static inline void blk_queue_pin_elevator(struct request_queue *q)
+{
+ set_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags);
+}
enum {
/*
@@ -696,6 +702,9 @@ static inline void elv_dispatch_add_tail
/*
* Access functions for manipulating queue properties
*/
+extern request_queue_t *blk_init_queue_node_elv(request_fn_proc *rfn,
+ spinlock_t *lock, int node_id,
+ char *elv_name);
extern request_queue_t *blk_init_queue_node(request_fn_proc *rfn,
spinlock_t *lock, int node_id);
extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -861,6 +861,11 @@ ssize_t elv_iosched_store(request_queue_
size_t len;
struct elevator_type *e;
+ if (test_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags)) {
+ printk(KERN_NOTICE "elevator: cannot switch elevator, pinned\n");
+ return count;
+ }
+
elevator_name[sizeof(elevator_name) - 1] = '\0';
strncpy(elevator_name, name, sizeof(elevator_name) - 1);
len = strlen(elevator_name);
Provide a proper a_ops->swapfile() implementation for NFS. This will
set the NFS socket to SOCK_VMIO and put the socket reconnection under
PF_MEMALLOC (I hope this is enough, otherwise more work needs to be done).
Signed-off-by: Peter Zijlstra <[email protected]>
---
fs/nfs/file.c | 21 ++++++++++++++++++++-
include/linux/sunrpc/xprt.h | 4 +++-
net/sunrpc/xprtsock.c | 16 ++++++++++++++++
3 files changed, 39 insertions(+), 2 deletions(-)
Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -27,6 +27,7 @@
#include <linux/slab.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
+#include <net/sock.h>
#include <asm/uaccess.h>
#include <asm/system.h>
@@ -317,7 +318,25 @@ static int nfs_release_page(struct page
static int nfs_swapfile(struct address_space *mapping, int enable)
{
- return 0;
+ int err = -EINVAL;
+ struct rpc_clnt *client = NFS_CLIENT(mapping->host);
+ struct sock *sk = client->cl_xprt->inet;
+
+ if (enable) {
+ client->cl_xprt->swapper = 1;
+ /*
+ * keep one extra sock reference so the reserve won't dip
+ * when the socket gets reconnected.
+ */
+ sk_adjust_memalloc(1, 1);
+ err = sk_set_vmio(sk);
+ } else if (client->cl_xprt->swapper) {
+ client->cl_xprt->swapper = 0;
+ sk_adjust_memalloc(-1, -1);
+ err = sk_clear_vmio(sk);
+ }
+
+ return err;
}
const struct address_space_operations nfs_file_aops = {
Index: linux-2.6/net/sunrpc/xprtsock.c
===================================================================
--- linux-2.6.orig/net/sunrpc/xprtsock.c
+++ linux-2.6/net/sunrpc/xprtsock.c
@@ -1014,6 +1014,7 @@ static void xs_udp_connect_worker(void *
{
struct rpc_xprt *xprt = (struct rpc_xprt *) args;
struct socket *sock = xprt->sock;
+ unsigned long pflags = current->flags;
int err, status = -EIO;
if (xprt->shutdown || xprt->addr.sin_port == 0)
@@ -1021,6 +1022,9 @@ static void xs_udp_connect_worker(void *
dprintk("RPC: xs_udp_connect_worker for xprt %p\n", xprt);
+ if (xprt->swapper)
+ current->flags |= PF_MEMALLOC;
+
/* Start by resetting any existing state */
xs_close(xprt);
@@ -1054,6 +1058,9 @@ static void xs_udp_connect_worker(void *
xprt->sock = sock;
xprt->inet = sk;
+ if (xprt->swapper)
+ sk_set_vmio(sk);
+
write_unlock_bh(&sk->sk_callback_lock);
}
xs_udp_do_set_buffer_size(xprt);
@@ -1061,6 +1068,7 @@ static void xs_udp_connect_worker(void *
out:
xprt_wake_pending_tasks(xprt, status);
xprt_clear_connecting(xprt);
+ current->flags = pflags;
}
/*
@@ -1097,11 +1105,15 @@ static void xs_tcp_connect_worker(void *
{
struct rpc_xprt *xprt = (struct rpc_xprt *)args;
struct socket *sock = xprt->sock;
+ unsigned long pflags = current->flags;
int err, status = -EIO;
if (xprt->shutdown || xprt->addr.sin_port == 0)
goto out;
+ if (xprt->swapper)
+ current->flags |= PF_MEMALLOC;
+
dprintk("RPC: xs_tcp_connect_worker for xprt %p\n", xprt);
if (!xprt->sock) {
@@ -1170,10 +1182,14 @@ static void xs_tcp_connect_worker(void *
break;
}
}
+
+ if (xprt->swapper)
+ sk_set_vmio(xprt->inet);
out:
xprt_wake_pending_tasks(xprt, status);
out_clear:
xprt_clear_connecting(xprt);
+ current->flags = pflags;
}
/**
Index: linux-2.6/include/linux/sunrpc/xprt.h
===================================================================
--- linux-2.6.orig/include/linux/sunrpc/xprt.h
+++ linux-2.6/include/linux/sunrpc/xprt.h
@@ -147,7 +147,9 @@ struct rpc_xprt {
unsigned int max_reqs; /* total slots */
unsigned long state; /* transport state */
unsigned char shutdown : 1, /* being shut down */
- resvport : 1; /* use a reserved port */
+ resvport : 1, /* use a reserved port */
+ swapper : 1; /* we're swapping over this
+ transport */
/*
* XID
Use sk_set_vmio() on the nbd socket.
Limit each request to one page, so that the request throttling also limits the
number of in-flight pages, and force the IO scheduler to NOOP, as anything else
doesn't make sense anyway.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>
---
drivers/block/nbd.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
Index: linux-2.6/drivers/block/nbd.c
===================================================================
--- linux-2.6.orig/drivers/block/nbd.c
+++ linux-2.6/drivers/block/nbd.c
@@ -135,7 +135,6 @@ static int sock_xmit(struct socket *sock
spin_unlock_irqrestore(¤t->sighand->siglock, flags);
do {
- sock->sk->sk_allocation = GFP_NOIO;
iov.iov_base = buf;
iov.iov_len = size;
msg.msg_name = NULL;
@@ -361,8 +360,16 @@ static void nbd_do_it(struct nbd_device
BUG_ON(lo->magic != LO_MAGIC);
+ sk_adjust_memalloc(0, 1);
+ if (sk_set_vmio(lo->sock->sk))
+ printk(KERN_WARNING
+ "failed to set SOCK_VMIO on NBD socket\n");
+
while ((req = nbd_read_stat(lo)) != NULL)
nbd_end_request(req);
+
+ sk_adjust_memalloc(0, -1);
+
return;
}
@@ -525,6 +533,7 @@ static int nbd_ioctl(struct inode *inode
if (S_ISSOCK(inode->i_mode)) {
lo->file = file;
lo->sock = SOCKET_I(inode);
+ lo->sock->sk->sk_allocation = GFP_NOIO;
error = 0;
} else {
fput(file);
@@ -628,11 +637,16 @@ static int __init nbd_init(void)
* every gendisk to have its very own request_queue struct.
* These structs are big so we dynamically allocate them.
*/
- disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
+ disk->queue = blk_init_queue_node_elv(do_nbd_request,
+ &nbd_lock, -1, "noop");
if (!disk->queue) {
put_disk(disk);
goto out;
}
+ blk_queue_pin_elevator(disk->queue);
+ blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
+ blk_queue_max_hw_segments(disk->queue, 1);
+ blk_queue_max_phys_segments(disk->queue, 1);
}
if (register_blkdev(NBD_MAJOR, "nbd")) {
On Fri, 25 Aug 2006, Peter Zijlstra wrote:
> The basic premises is that network sockets serving the VM need undisturbed
> functionality in the face of severe memory shortage.
>
> This patch-set provides the framework to provide this.
Hmmm.. Is it not possible to avoid the memory pools by
guaranteeing that a certain number of page is easily reclaimable?
On Fri, 2006-08-25 at 08:51 -0700, Christoph Lameter wrote:
> On Fri, 25 Aug 2006, Peter Zijlstra wrote:
>
> > The basic premises is that network sockets serving the VM need undisturbed
> > functionality in the face of severe memory shortage.
> >
> > This patch-set provides the framework to provide this.
>
> Hmmm.. Is it not possible to avoid the memory pools by
> guaranteeing that a certain number of page is easily reclaimable?
We're not actually using mempools, but the memalloc reserve. Purely easily
reclaimable memory is not enough however, since packet receive happens
from IRQ context, and we cannot unmap pages in IRQ context.
Christoph Lameter wrote:
> On Fri, 25 Aug 2006, Peter Zijlstra wrote:
>
>> The basic premises is that network sockets serving the VM need undisturbed
>> functionality in the face of severe memory shortage.
>>
>> This patch-set provides the framework to provide this.
>
> Hmmm.. Is it not possible to avoid the memory pools by
> guaranteeing that a certain number of page is easily reclaimable?
No.
You need to guarantee that the memory is not gobbled up by
another subsystem, but remains available for use by *this*
subsystem. Otherwise you could still deadlock.
--
What is important? What you want to be true, or what is true?
Grumble... If your patches are targeting NFS, could you please at the
very least Cc [email protected] and/or myself.
On Fri, 2006-08-25 at 17:40 +0200, Peter Zijlstra wrote:
> Provide a proper a_ops->swapfile() implementation for NFS. This will
> set the NFS socket to SOCK_VMIO and put the socket reconnection under
> PF_MEMALLOC (I hope this is enough, otherwise more work needs to be done).
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> fs/nfs/file.c | 21 ++++++++++++++++++++-
> include/linux/sunrpc/xprt.h | 4 +++-
> net/sunrpc/xprtsock.c | 16 ++++++++++++++++
> 3 files changed, 39 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/fs/nfs/file.c
> ===================================================================
> --- linux-2.6.orig/fs/nfs/file.c
> +++ linux-2.6/fs/nfs/file.c
> @@ -27,6 +27,7 @@
> #include <linux/slab.h>
> #include <linux/pagemap.h>
> #include <linux/smp_lock.h>
> +#include <net/sock.h>
>
> #include <asm/uaccess.h>
> #include <asm/system.h>
> @@ -317,7 +318,25 @@ static int nfs_release_page(struct page
>
> static int nfs_swapfile(struct address_space *mapping, int enable)
> {
> - return 0;
> + int err = -EINVAL;
> + struct rpc_clnt *client = NFS_CLIENT(mapping->host);
> + struct sock *sk = client->cl_xprt->inet;
> +
> + if (enable) {
> + client->cl_xprt->swapper = 1;
> + /*
> + * keep one extra sock reference so the reserve won't dip
> + * when the socket gets reconnected.
> + */
> + sk_adjust_memalloc(1, 1);
> + err = sk_set_vmio(sk);
> + } else if (client->cl_xprt->swapper) {
> + client->cl_xprt->swapper = 0;
> + sk_adjust_memalloc(-1, -1);
> + err = sk_clear_vmio(sk);
> + }
> +
> + return err;
> }
This all belongs in net/sunrpc/xprtsock.c. The NFS code has no business
screwing around with the internals of the sunrpc transport.
> const struct address_space_operations nfs_file_aops = {
> Index: linux-2.6/net/sunrpc/xprtsock.c
> ===================================================================
> --- linux-2.6.orig/net/sunrpc/xprtsock.c
> +++ linux-2.6/net/sunrpc/xprtsock.c
> @@ -1014,6 +1014,7 @@ static void xs_udp_connect_worker(void *
> {
> struct rpc_xprt *xprt = (struct rpc_xprt *) args;
> struct socket *sock = xprt->sock;
> + unsigned long pflags = current->flags;
> int err, status = -EIO;
>
> if (xprt->shutdown || xprt->addr.sin_port == 0)
> @@ -1021,6 +1022,9 @@ static void xs_udp_connect_worker(void *
>
> dprintk("RPC: xs_udp_connect_worker for xprt %p\n", xprt);
>
> + if (xprt->swapper)
> + current->flags |= PF_MEMALLOC;
> +
> /* Start by resetting any existing state */
> xs_close(xprt);
>
> @@ -1054,6 +1058,9 @@ static void xs_udp_connect_worker(void *
> xprt->sock = sock;
> xprt->inet = sk;
>
> + if (xprt->swapper)
> + sk_set_vmio(sk);
> +
> write_unlock_bh(&sk->sk_callback_lock);
> }
> xs_udp_do_set_buffer_size(xprt);
> @@ -1061,6 +1068,7 @@ static void xs_udp_connect_worker(void *
> out:
> xprt_wake_pending_tasks(xprt, status);
> xprt_clear_connecting(xprt);
> + current->flags = pflags;
> }
>
> /*
> @@ -1097,11 +1105,15 @@ static void xs_tcp_connect_worker(void *
> {
> struct rpc_xprt *xprt = (struct rpc_xprt *)args;
> struct socket *sock = xprt->sock;
> + unsigned long pflags = current->flags;
> int err, status = -EIO;
>
> if (xprt->shutdown || xprt->addr.sin_port == 0)
> goto out;
>
> + if (xprt->swapper)
> + current->flags |= PF_MEMALLOC;
> +
> dprintk("RPC: xs_tcp_connect_worker for xprt %p\n", xprt);
>
> if (!xprt->sock) {
> @@ -1170,10 +1182,14 @@ static void xs_tcp_connect_worker(void *
> break;
> }
> }
> +
> + if (xprt->swapper)
> + sk_set_vmio(xprt->inet);
> out:
> xprt_wake_pending_tasks(xprt, status);
> out_clear:
> xprt_clear_connecting(xprt);
> + current->flags = pflags;
> }
How does this guarantee that the socket reconnection won't fail?
Also, what about the case of rpc_malloc()? Can't that cause rpciod to
deadlock when you add NFS swap into the equation?
Cheers,
Trond
On Fri, 2006-08-25 at 16:14 -0400, Trond Myklebust wrote:
> Grumble... If your patches are targeting NFS, could you please at the
> very least Cc [email protected] and/or myself.
Sorry, will make sure you're on the CC list next round.
> On Fri, 2006-08-25 at 17:40 +0200, Peter Zijlstra wrote:
> > Provide a proper a_ops->swapfile() implementation for NFS. This will
> > set the NFS socket to SOCK_VMIO and put the socket reconnection under
> > PF_MEMALLOC (I hope this is enough, otherwise more work needs to be done).
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> > fs/nfs/file.c | 21 ++++++++++++++++++++-
> > include/linux/sunrpc/xprt.h | 4 +++-
> > net/sunrpc/xprtsock.c | 16 ++++++++++++++++
> > 3 files changed, 39 insertions(+), 2 deletions(-)
> >
> > Index: linux-2.6/fs/nfs/file.c
> > ===================================================================
> > --- linux-2.6.orig/fs/nfs/file.c
> > +++ linux-2.6/fs/nfs/file.c
> > @@ -27,6 +27,7 @@
> > #include <linux/slab.h>
> > #include <linux/pagemap.h>
> > #include <linux/smp_lock.h>
> > +#include <net/sock.h>
> >
> > #include <asm/uaccess.h>
> > #include <asm/system.h>
> > @@ -317,7 +318,25 @@ static int nfs_release_page(struct page
> >
> > static int nfs_swapfile(struct address_space *mapping, int enable)
> > {
> > - return 0;
> > + int err = -EINVAL;
> > + struct rpc_clnt *client = NFS_CLIENT(mapping->host);
> > + struct sock *sk = client->cl_xprt->inet;
> > +
> > + if (enable) {
> > + client->cl_xprt->swapper = 1;
> > + /*
> > + * keep one extra sock reference so the reserve won't dip
> > + * when the socket gets reconnected.
> > + */
> > + sk_adjust_memalloc(1, 1);
> > + err = sk_set_vmio(sk);
> > + } else if (client->cl_xprt->swapper) {
> > + client->cl_xprt->swapper = 0;
> > + sk_adjust_memalloc(-1, -1);
> > + err = sk_clear_vmio(sk);
> > + }
> > +
> > + return err;
> > }
>
> This all belongs in net/sunrpc/xprtsock.c. The NFS code has no business
> screwing around with the internals of the sunrpc transport.
Ok, I'll make a function there, and call that.
> > const struct address_space_operations nfs_file_aops = {
> > Index: linux-2.6/net/sunrpc/xprtsock.c
> > ===================================================================
> > --- linux-2.6.orig/net/sunrpc/xprtsock.c
> > +++ linux-2.6/net/sunrpc/xprtsock.c
> > @@ -1014,6 +1014,7 @@ static void xs_udp_connect_worker(void *
> > {
> > struct rpc_xprt *xprt = (struct rpc_xprt *) args;
> > struct socket *sock = xprt->sock;
> > + unsigned long pflags = current->flags;
> > int err, status = -EIO;
> >
> > if (xprt->shutdown || xprt->addr.sin_port == 0)
> > @@ -1021,6 +1022,9 @@ static void xs_udp_connect_worker(void *
> >
> > dprintk("RPC: xs_udp_connect_worker for xprt %p\n", xprt);
> >
> > + if (xprt->swapper)
> > + current->flags |= PF_MEMALLOC;
> > +
> > /* Start by resetting any existing state */
> > xs_close(xprt);
> >
> > @@ -1054,6 +1058,9 @@ static void xs_udp_connect_worker(void *
> > xprt->sock = sock;
> > xprt->inet = sk;
> >
> > + if (xprt->swapper)
> > + sk_set_vmio(sk);
> > +
> > write_unlock_bh(&sk->sk_callback_lock);
> > }
> > xs_udp_do_set_buffer_size(xprt);
> > @@ -1061,6 +1068,7 @@ static void xs_udp_connect_worker(void *
> > out:
> > xprt_wake_pending_tasks(xprt, status);
> > xprt_clear_connecting(xprt);
> > + current->flags = pflags;
> > }
> >
> > /*
> > @@ -1097,11 +1105,15 @@ static void xs_tcp_connect_worker(void *
> > {
> > struct rpc_xprt *xprt = (struct rpc_xprt *)args;
> > struct socket *sock = xprt->sock;
> > + unsigned long pflags = current->flags;
> > int err, status = -EIO;
> >
> > if (xprt->shutdown || xprt->addr.sin_port == 0)
> > goto out;
> >
> > + if (xprt->swapper)
> > + current->flags |= PF_MEMALLOC;
> > +
> > dprintk("RPC: xs_tcp_connect_worker for xprt %p\n", xprt);
> >
> > if (!xprt->sock) {
> > @@ -1170,10 +1182,14 @@ static void xs_tcp_connect_worker(void *
> > break;
> > }
> > }
> > +
> > + if (xprt->swapper)
> > + sk_set_vmio(xprt->inet);
> > out:
> > xprt_wake_pending_tasks(xprt, status);
> > out_clear:
> > xprt_clear_connecting(xprt);
> > + current->flags = pflags;
> > }
>
> How does this guarantee that the socket reconnection won't fail?
I was afraid this might not be enough, I really have to go through the
network code.
> Also, what about the case of rpc_malloc()? Can't that cause rpciod to
> deadlock when you add NFS swap into the equation?
I will have to plead ignorance for now, I'll look into this on monday.
On first glance it looks like rpc_malloc could OR in __GFP_EMERG for
RPC_TASK_SWAPPER.
On Fri, August 25, 2006 17:39, Peter Zijlstra said:
> @@ -282,7 +282,8 @@ struct sk_buff {
> nfctinfo:3;
> __u8 pkt_type:3,
> fclone:2,
> - ipvs_property:1;
> + ipvs_property:1,
> + emerg:1;
> __be16 protocol;
Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)
> @@ -391,6 +391,7 @@ enum sock_flags {
> SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
> SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
> SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
> + SOCK_VMIO, /* promise to never block on receive */
It might be used for IO related to the VM, but that doesn't tell _what_ it does.
It also does much more than just not blocking on receive, so overall, aren't
both the vmio name and the comment slightly misleading?
> +static inline int emerg_rx_pages_try_inc(void)
> +{
> + return atomic_read(&vmio_socks) &&
> + atomic_add_unless(&emerg_rx_pages_used, 1, RX_RESERVE_PAGES);
> +}
It looks cleaner to move that first check to the caller, as it is often
redundant and in the other cases makes it more clear what the caller is
really doing.
> @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
>
> static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
> int min_free_kbytes = 1024;
> +int var_free_kbytes;
Using var_free_pages makes the code slightly simpler, as all that needless
conversion isn't needed anymore. Perhaps the same is true for min_free_kbytes...
> +noskb:
> + /* Attempt emergency allocation when RX skb. */
> + if (!(flags & SKB_ALLOC_RX))
> + goto out;
So only incoming skb allocation is guaranteed? What about outgoing skbs?
What am I missing? Or can we sleep then, and increasing var_free_kbytes is
sufficient to guarantee it?
> +
> + if (size + sizeof(struct skb_shared_info) > PAGE_SIZE)
> + goto out;
> +
> + if (!emerg_rx_pages_try_inc())
> + goto out;
> +
> + skb = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> + if (!skb) {
> + WARN_ON(1); /* we were promised memory but didn't get it? */
> + goto dec_out;
> + }
> +
> + if (!emerg_rx_pages_try_inc())
> + goto skb_free_out;
> +
> + data = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> + if (!data) {
> + WARN_ON(1); /* we were promised memory but didn't get it? */
> + emerg_rx_pages_dec();
> +skb_free_out:
> + free_page((unsigned long)skb);
> + skb = NULL;
> +dec_out:
> + emerg_rx_pages_dec();
> + goto out;
> + }
> +
> + cache = NULL;
> + goto allocated;
> }
>
> /**
> @@ -267,7 +304,7 @@ struct sk_buff *__netdev_alloc_skb(struc
> {
> struct sk_buff *skb;
>
> - skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
> + skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
> if (likely(skb)) {
> skb_reserve(skb, NET_SKB_PAD);
> skb->dev = dev;
> @@ -315,10 +352,20 @@ static void skb_release_data(struct sk_b
> if (skb_shinfo(skb)->frag_list)
> skb_drop_fraglist(skb);
>
> - kfree(skb->head);
> + if (skb->emerg) {
> + free_page((unsigned long)skb->head);
> + emerg_rx_pages_dec();
> + } else
> + kfree(skb->head);
> }
> }
>
> +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
> +{
> + free_page((unsigned long)objp);
> + emerg_rx_pages_dec();
> +}
> +
> /*
> * Free an skbuff by memory without cleaning the state.
> */
> @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
> {
> struct sk_buff *other;
> atomic_t *fclone_ref;
> + void (*free_skb)(struct kmem_cache *, void *);
>
> skb_release_data(skb);
> +
> + free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
> +
> switch (skb->fclone) {
> case SKB_FCLONE_UNAVAILABLE:
> - kmem_cache_free(skbuff_head_cache, skb);
> + free_skb(skbuff_head_cache, skb);
> break;
>
> case SKB_FCLONE_ORIG:
> fclone_ref = (atomic_t *) (skb + 2);
> if (atomic_dec_and_test(fclone_ref))
> - kmem_cache_free(skbuff_fclone_cache, skb);
> + free_skb(skbuff_fclone_cache, skb);
> break;
>
> case SKB_FCLONE_CLONE:
> @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
> skb->fclone = SKB_FCLONE_UNAVAILABLE;
>
> if (atomic_dec_and_test(fclone_ref))
> - kmem_cache_free(skbuff_fclone_cache, other);
> + free_skb(skbuff_fclone_cache, other);
> break;
> };
> }
I don't have the original code in front of me, but isn't it possible to
add a "goto free" which has all the freeing in one place? That would get
rid of the function pointer stuff and emerg_free_skb.
> @@ -435,6 +486,17 @@ struct sk_buff *skb_clone(struct sk_buff
> atomic_t *fclone_ref = (atomic_t *) (n + 1);
> n->fclone = SKB_FCLONE_CLONE;
> atomic_inc(fclone_ref);
> + } else if (skb->emerg) {
> + if (!emerg_rx_pages_try_inc())
> + return NULL;
> +
> + n = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> + if (!n) {
> + WARN_ON(1);
> + emerg_rx_pages_dec();
> + return NULL;
> + }
> + n->fclone = SKB_FCLONE_UNAVAILABLE;
> } else {
> n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> if (!n)
> @@ -470,6 +532,7 @@ struct sk_buff *skb_clone(struct sk_buff
> #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
> C(ipvs_property);
> #endif
> + C(emerg);
> C(protocol);
> n->destructor = NULL;
> #ifdef CONFIG_NETFILTER
> @@ -690,7 +753,21 @@ int pskb_expand_head(struct sk_buff *skb
>
> size = SKB_DATA_ALIGN(size);
>
> - data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> + if (skb->emerg) {
> + if (size + sizeof(struct skb_shared_info) > PAGE_SIZE)
> + goto nodata;
> +
> + if (!emerg_rx_pages_try_inc())
> + goto nodata;
> +
> + data = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> + if (!data) {
> + WARN_ON(1);
> + emerg_rx_pages_dec();
> + goto nodata;
> + }
There seems to be some pattern occurring here, what about a new function?
Are these functions only called for incoming skbs? If not, calling
emerg_rx_pages_try_inc() is the wrong thing to do. A quick search says
they aren't. Add a RX check? Or is that fixed by SKB_FCLONE_UNAVAILABLE?
If so, why are skb_clone() and pskb_expand_head() modified at all?
(Probably ignorance on my end.)
> + } else
> + data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> if (!data)
> goto nodata;
>
> Index: linux-2.6/net/ipv4/icmp.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/icmp.c
> +++ linux-2.6/net/ipv4/icmp.c
> @@ -938,6 +938,9 @@ int icmp_rcv(struct sk_buff *skb)
> goto error;
> }
>
> + if (unlikely(skb->emerg))
> + goto drop;
> +
> if (!pskb_pull(skb, sizeof(struct icmphdr)))
> goto error;
>
> Index: linux-2.6/net/ipv4/tcp_ipv4.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/tcp_ipv4.c
> +++ linux-2.6/net/ipv4/tcp_ipv4.c
> @@ -1093,6 +1093,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
> if (!sk)
> goto no_tcp_socket;
>
> + if (unlikely(skb->emerg && !sk_is_vmio(sk)))
> + goto discard_and_relse;
> +
> process:
> if (sk->sk_state == TCP_TIME_WAIT)
> goto do_time_wait;
> Index: linux-2.6/net/ipv4/udp.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/udp.c
> +++ linux-2.6/net/ipv4/udp.c
> @@ -1136,7 +1136,12 @@ int udp_rcv(struct sk_buff *skb)
> sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);
>
> if (sk != NULL) {
> - int ret = udp_queue_rcv_skb(sk, skb);
> + int ret;
> +
> + if (unlikely(skb->emerg && !sk_is_vmio(sk)))
> + goto drop_noncritical;
> +
> + ret = udp_queue_rcv_skb(sk, skb);
> sock_put(sk);
>
> /* a return value > 0 means to resubmit the input, but
> @@ -1147,6 +1152,7 @@ int udp_rcv(struct sk_buff *skb)
> return 0;
> }
>
> +drop_noncritical:
> if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
> goto drop;
> nf_reset(skb);
> Index: linux-2.6/net/ipv4/af_inet.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/af_inet.c
> +++ linux-2.6/net/ipv4/af_inet.c
> @@ -132,6 +132,9 @@ void inet_sock_destruct(struct sock *sk)
> {
> struct inet_sock *inet = inet_sk(sk);
>
> + if (sk_is_vmio(sk))
> + sk_clear_vmio(sk);
> +
> __skb_queue_purge(&sk->sk_receive_queue);
> __skb_queue_purge(&sk->sk_error_queue);
>
> Index: linux-2.6/net/core/sock.c
> ===================================================================
> --- linux-2.6.orig/net/core/sock.c
> +++ linux-2.6/net/core/sock.c
> @@ -111,6 +111,7 @@
> #include <linux/poll.h>
> #include <linux/tcp.h>
> #include <linux/init.h>
> +#include <linux/blkdev.h>
>
> #include <asm/uaccess.h>
> #include <asm/system.h>
> @@ -195,6 +196,102 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
> /* Maximal space eaten by iovec or ancilliary data plus some space */
> int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
>
> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int memalloc_reserve;
> +static unsigned int vmio_request_queues;
> +
> +atomic_t vmio_socks;
> +atomic_t emerg_rx_pages_used;
> +EXPORT_SYMBOL_GPL(vmio_socks);
> +EXPORT_SYMBOL_GPL(emerg_rx_pages_used);
> +
> +/**
> + * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
> + * @nr_socks: number of new %SOCK_VMIO sockets
I don't see a parameter named nr_socks? request_queues isn't documented?
> + *
> + * This function adjusts the memalloc reserve based on system demand.
> + * For each %SOCK_VMIO socket this device will reserve enough
> + * to send a few large packets (64k) at a time: %TX_RESERVE_PAGES.
> + *
> + * Assumption:
> + * - each %SOCK_VMIO socket will have a %request_queue associated.
If this is assumed, then why is the request_queue parameter needed?
> + *
> + * NOTE:
> + * %TX_RESERVE_PAGES is an upper-bound of memory used for TX hence
> + * we need not account the pages like we do for %RX_RESERVE_PAGES.
> + *
> + * On top of this comes a one time charge of:
> + * %RX_RESERVE_PAGES pages -
> + * number of pages alloted for emergency skb service to critical
> + * sockets.
> + */
> +int sk_adjust_memalloc(int socks, int request_queues)
> +{
> + unsigned long flags;
> + int reserve;
> + int nr_socks;
> + int err;
> +
> + spin_lock_irqsave(&memalloc_lock, flags);
> +
> + atomic_add(socks, &vmio_socks);
> + nr_socks = atomic_read(&vmio_socks);
nr_socks = atomic_add_return(socks, &vmio_socks);
> + BUG_ON(socks < 0);
Shouldn't this be nr_socks < 0?
> + vmio_request_queues += request_queues;
> +
> + reserve = vmio_request_queues * TX_RESERVE_PAGES + /* outbound */
> + (!!socks) * RX_RESERVE_PAGES; /* inbound */
If the assumption that each VMIO socket will have a request_queue associated
is true, then we only need to check if vmio_request_queues !=0, right?
> +
> + err = adjust_memalloc_reserve(reserve - memalloc_reserve);
> + if (err) {
> + printk(KERN_WARNING
> + "Unable to change reserve to: %d pages, error: %d\n",
> + reserve, err);
> + goto unlock;
> + }
> + memalloc_reserve = reserve;
> +
> +unlock:
> + spin_unlock_irqrestore(&memalloc_lock, flags);
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
You can get rid of the memalloc_reserve and vmio_request_queues variables
if you want, they aren't really needed for anything. If using them reduces
the total code size I'd keep them though.
int sk_adjust_memalloc(int socks, int request_queues)
{
	unsigned long flags;
	int reserve;
	int nr_socks;
	int err;

	spin_lock_irqsave(&memalloc_lock, flags);

	nr_socks = atomic_add_return(socks, &vmio_socks);
	BUG_ON(nr_socks < 0);

	reserve = request_queues * TX_RESERVE_PAGES; /* outbound */

	/* Check if this were the first socks: */
	if (nr_socks - socks == 0)
		reserve += RX_RESERVE_PAGES; /* inbound */
	/* Or the last: */
	else if (nr_socks == 0)
		reserve -= RX_RESERVE_PAGES;

	err = adjust_memalloc_reserve(reserve);
	if (err) {
		printk(KERN_WARNING
			"Unable to adjust reserve by %d pages, error: %d\n",
			reserve, err);
		goto unlock;
	}

unlock:
	spin_unlock_irqrestore(&memalloc_lock, flags);
	return err;
}
> +
> +/**
> + * sk_set_vmio - sets %SOCK_VMIO
> + * @sk: socket to set it on
> + *
> + * Set %SOCK_VMIO on a socket and increase the memalloc reserve
> + * accordingly.
> + */
> +int sk_set_vmio(struct sock *sk)
> +{
> + int err = 0;
> +
> + if (!sock_flag(sk, SOCK_VMIO) &&
> + !(err = sk_adjust_memalloc(1, 0))) {
> + sock_set_flag(sk, SOCK_VMIO);
> + sk->sk_allocation |= __GFP_EMERG;
> + }
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(sk_set_vmio);
> +
> +int sk_clear_vmio(struct sock *sk)
> +{
> + int err = 0;
> +
> + if (sock_flag(sk, SOCK_VMIO) &&
> + !(err = sk_adjust_memalloc(-1, 0))) {
> + sock_reset_flag(sk, SOCK_VMIO);
> + sk->sk_allocation &= ~__GFP_EMERG;
> + }
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(sk_clear_vmio);
It seems wiser to always reset the flags, even if sk_adjust_memalloc fails.
This patch looks much better than the previous one, not much cruft left.
Greetings,
Indan
> /* Check if this were the first socks: */
> if (nr_socks - socks == 0)
> reserve += RX_RESERVE_PAGES;
Can of course be:
if (nr_socks == socks)
	reserve += RX_RESERVE_PAGES;
Grumble,
Indan
On Sat, 2006-08-26 at 04:37 +0200, Indan Zupancic wrote:
> On Fri, August 25, 2006 17:39, Peter Zijlstra said:
> > @@ -282,7 +282,8 @@ struct sk_buff {
> > nfctinfo:3;
> > __u8 pkt_type:3,
> > fclone:2,
> > - ipvs_property:1;
> > + ipvs_property:1,
> > + emerg:1;
> > __be16 protocol;
>
> Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)
hehe, me lazy, you gentoo ;-)
sed -i -e 's/emerg/emergency/g' -e 's/EMERG/EMERGENCY/g' *.patch
> > @@ -391,6 +391,7 @@ enum sock_flags {
> > SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
> > SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
> > SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
> > + SOCK_VMIO, /* promise to never block on receive */
>
> It might be used for IO related to the VM, but that doesn't tell _what_ it does.
> It also does much more than just not blocking on receive, so overall, aren't
> both the vmio name and the comment slightly misleading?
I'm so having trouble with this name; I had SOCK_NONBLOCKING for a
while, but that is a very bad name because nonblocking has this well
defined meaning when talking about sockets, and this is not that.
Hence I came up with the VMIO, because that is the only selecting
criteria for being special. - I'll fix up the comment.
> > +static inline int emerg_rx_pages_try_inc(void)
> > +{
> > + return atomic_read(&vmio_socks) &&
> > + atomic_add_unless(&emerg_rx_pages_used, 1, RX_RESERVE_PAGES);
> > +}
>
> It looks cleaner to move that first check to the caller, as it is often
> redundant and in the other cases makes it more clear what the caller is
> really doing.
Yes, very good suggestion indeed, what was I thinking?!
> > @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
> >
> > static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
> > int min_free_kbytes = 1024;
> > +int var_free_kbytes;
>
> Using var_free_pages makes the code slightly simpler, as all that needless
> conversion isn't needed anymore. Perhaps the same is true for min_free_kbytes...
't seems I'm a bit puzzled as to what you mean here.
>
> > +noskb:
> > + /* Attempt emergency allocation when RX skb. */
> > + if (!(flags & SKB_ALLOC_RX))
> > + goto out;
>
> So only incoming skb allocation is guaranteed? What about outgoing skbs?
> What am I missing? Or can we sleep then, and increasing var_free_kbytes is
> sufficient to guarantee it?
->sk_allocation |= __GFP_EMERGENCY - will take care of the outgoing
packets. Also, since one only sends a limited number of packets out and
then will wait for answers, we do not need to worry about fragmentation
issues that much in this case.
> > +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
> > +{
> > + free_page((unsigned long)objp);
> > + emerg_rx_pages_dec();
> > +}
> > +
> > /*
> > * Free an skbuff by memory without cleaning the state.
> > */
> > @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
> > {
> > struct sk_buff *other;
> > atomic_t *fclone_ref;
> > + void (*free_skb)(struct kmem_cache *, void *);
> >
> > skb_release_data(skb);
> > +
> > + free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
> > +
> > switch (skb->fclone) {
> > case SKB_FCLONE_UNAVAILABLE:
> > - kmem_cache_free(skbuff_head_cache, skb);
> > + free_skb(skbuff_head_cache, skb);
> > break;
> >
> > case SKB_FCLONE_ORIG:
> > fclone_ref = (atomic_t *) (skb + 2);
> > if (atomic_dec_and_test(fclone_ref))
> > - kmem_cache_free(skbuff_fclone_cache, skb);
> > + free_skb(skbuff_fclone_cache, skb);
> > break;
> >
> > case SKB_FCLONE_CLONE:
> > @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
> > skb->fclone = SKB_FCLONE_UNAVAILABLE;
> >
> > if (atomic_dec_and_test(fclone_ref))
> > - kmem_cache_free(skbuff_fclone_cache, other);
> > + free_skb(skbuff_fclone_cache, other);
> > break;
> > };
> > }
>
> I don't have the original code in front of me, but isn't it possible to
> add a "goto free" which has all the freeing in one place? That would get
> rid of the function pointer stuff and emerg_free_skb.
perhaps, yes, however I prefer this one, it allows access to the size.
> > @@ -435,6 +486,17 @@ struct sk_buff *skb_clone(struct sk_buff
> > atomic_t *fclone_ref = (atomic_t *) (n + 1);
> > n->fclone = SKB_FCLONE_CLONE;
> > atomic_inc(fclone_ref);
> > + } else if (skb->emerg) {
> > + if (!emerg_rx_pages_try_inc())
> > + return NULL;
> > +
> > + n = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> > + if (!n) {
> > + WARN_ON(1);
> > + emerg_rx_pages_dec();
> > + return NULL;
> > + }
> > + n->fclone = SKB_FCLONE_UNAVAILABLE;
> > } else {
> > n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> > if (!n)
> > @@ -470,6 +532,7 @@ struct sk_buff *skb_clone(struct sk_buff
> > #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
> > C(ipvs_property);
> > #endif
> > + C(emerg);
> > C(protocol);
> > n->destructor = NULL;
> > #ifdef CONFIG_NETFILTER
> > @@ -690,7 +753,21 @@ int pskb_expand_head(struct sk_buff *skb
> >
> > size = SKB_DATA_ALIGN(size);
> >
> > - data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> > + if (skb->emerg) {
> > + if (size + sizeof(struct skb_shared_info) > PAGE_SIZE)
> > + goto nodata;
> > +
> > + if (!emerg_rx_pages_try_inc())
> > + goto nodata;
> > +
> > + data = (void *)__get_free_page(gfp_mask | __GFP_EMERG);
> > + if (!data) {
> > + WARN_ON(1);
> > + emerg_rx_pages_dec();
> > + goto nodata;
> > + }
>
> There seems to be some pattern occurring here, what about a new function?
D'oh, thanks for pointing out.
> Are these functions only called for incoming skbs? If not, calling
> emerg_rx_pages_try_inc() is the wrong thing to do. A quick search says
> they aren't. Add a RX check? Or is that fixed by SKB_FCLONE_UNAVAILABLE?
> If so, why are skb_clone() and pskb_expand_head() modified at all?
> (Probably ignorance on my end.)
Well, no, however only incoming skbs can have skb->emergency set to
begin with.
> > +/**
> > + * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
> > + * @nr_socks: number of new %SOCK_VMIO sockets
>
> I don't see a parameter named nr_socks? request_queues isn't documented?
>
> > + *
> > + * This function adjusts the memalloc reserve based on system demand.
> > + * For each %SOCK_VMIO socket this device will reserve enough
> > + * to send a few large packets (64k) at a time: %TX_RESERVE_PAGES.
> > + *
> > + * Assumption:
> > + * - each %SOCK_VMIO socket will have a %request_queue associated.
>
> If this is assumed, then why is the request_queue parameter needed?
>
> > + *
> > + * NOTE:
> > + * %TX_RESERVE_PAGES is an upper-bound of memory used for TX hence
> > + * we need not account the pages like we do for %RX_RESERVE_PAGES.
> > + *
> > + * On top of this comes a one time charge of:
> > + * %RX_RESERVE_PAGES pages -
> > + * number of pages alloted for emergency skb service to critical
> > + * sockets.
> > + */
Gah, obsolete comment again.
> > +int sk_adjust_memalloc(int socks, int request_queues)
> > +{
> > + unsigned long flags;
> > + int reserve;
> > + int nr_socks;
> > + int err;
> > +
> > + spin_lock_irqsave(&memalloc_lock, flags);
> > +
> > + atomic_add(socks, &vmio_socks);
> > + nr_socks = atomic_read(&vmio_socks);
>
> nr_socks = atomic_add_return(socks, &vmio_socks);
Ah, I must've been blind, I even had a quick look for such a function.
> > + BUG_ON(socks < 0);
>
> Shouldn't this be nr_socks < 0?
Yes.
> > + vmio_request_queues += request_queues;
> > +
> > + reserve = vmio_request_queues * TX_RESERVE_PAGES + /* outbound */
> > + (!!socks) * RX_RESERVE_PAGES; /* inbound */
>
> If the assumption that each VMIO socket will have a request_queue associated
> is true, then we only need to check if vmio_request_queues !=0, right?
I had to break that assumption.
> > +
> > + err = adjust_memalloc_reserve(reserve - memalloc_reserve);
> > + if (err) {
> > + printk(KERN_WARNING
> > + "Unable to change reserve to: %d pages, error: %d\n",
> > + reserve, err);
> > + goto unlock;
> > + }
> > + memalloc_reserve = reserve;
> > +
> > +unlock:
> > + spin_unlock_irqrestore(&memalloc_lock, flags);
> > + return err;
> > +}
> > +EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
>
> You can get rid of the memalloc_reserve and vmio_request_queues variables
> if you want, they aren't really needed for anything. If using them reduces
> the total code size I'd keep them though.
I find my version easier to read, but that might just be the way my
brain works.
> > +int sk_clear_vmio(struct sock *sk)
> > +{
> > + int err = 0;
> > +
> > + if (sock_flag(sk, SOCK_VMIO) &&
> > + !(err = sk_adjust_memalloc(-1, 0))) {
> > + sock_reset_flag(sk, SOCK_VMIO);
> > + sk->sk_allocation &= ~__GFP_EMERG;
> > + }
> > +
> > + return err;
> > +}
> > +EXPORT_SYMBOL_GPL(sk_clear_vmio);
>
> It seems wiser to always reset the flags, even if sk_adjust_memalloc fails.
So much for symmetry.
> This patch looks much better than the previous one, not much cruft left.
You seem to have found enough :-(
Thanks though.
On Mon, August 28, 2006 12:22, Peter Zijlstra said:
> On Sat, 2006-08-26 at 04:37 +0200, Indan Zupancic wrote:
>> Why not 'emergency'? Looks like 'emerge' with a typo now. ;-)
>
> hehe, me lazy, you gentoo ;-)
> sed -i -e 's/emerg/emergency/g' -e 's/EMERG/EMERGENCY/g' *.patch
I used it for a while, long ago, until I figured out that there were better
alternatives. I didn't like the overly complex init and portage system though.
But if you say "emerg" it will sound as "emerge", and all other fields in that
struct aren't abbreviated either and often longer, so it just makes more sense
to use the full name.
>> > @@ -391,6 +391,7 @@ enum sock_flags {
>> > SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
>> > SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
>> > SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
>> > + SOCK_VMIO, /* promise to never block on receive */
>>
>> It might be used for IO related to the VM, but that doesn't tell _what_ it does.
>> It also does much more than just not blocking on receive, so overal, aren't
>> both the vmio name and the comment slightly misleading?
>
> I'm so having trouble with this name; I had SOCK_NONBLOCKING for a
> while, but that is a very bad name because nonblocking has this well
> defined meaning when talking about sockets, and this is not that.
>
> Hence I came up with the VMIO, because that is the only selecting
> criteria for being special. - I'll fix up the comment.
It's nice and short, but it might be weird if someone after a while finds another way
of using this stuff. And its relation to 'emergency' looks unclear. So maybe calling
both the same makes most sense, no matter how you name it.
>> > @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
>> >
>> > static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
>> > int min_free_kbytes = 1024;
>> > +int var_free_kbytes;
>>
>> Using var_free_pages makes the code slightly simpler, as all that needless
>> conversion isn't needed anymore. Perhaps the same is true for min_free_kbytes...
>
> 't seems I'm a bit puzzled as to what you mean here.
I mean to store the variable reserve in pages instead of kilobytes. Currently you're
converting from the one to the other both when setting and when using the value. That
doesn't make much sense and can be avoided by storing the value in pages from the start.
>> > +noskb:
>> > + /* Attempt emergency allocation when RX skb. */
>> > + if (!(flags & SKB_ALLOC_RX))
>> > + goto out;
>>
>> So only incoming skb allocation is guaranteed? What about outgoing skbs?
>> What am I missing? Or can we sleep then, and increasing var_free_kbytes is
>> sufficient to guarantee it?
>
> ->sk_allocation |= __GFP_EMERGENCY - will take care of the outgoing
> packets. Also, since one only sends a limited number of packets out and
> then will wait for answers, we do not need to worry about fragmentation
> issues that much in this case.
Ah, missed that one. Didn't know that the alloc flags were stored in the sock.
>> > +static void emerg_free_skb(struct kmem_cache *cache, void *objp)
>> > +{
>> > + free_page((unsigned long)objp);
>> > + emerg_rx_pages_dec();
>> > +}
>> > +
>> > /*
>> > * Free an skbuff by memory without cleaning the state.
>> > */
>> > @@ -326,17 +373,21 @@ void kfree_skbmem(struct sk_buff *skb)
>> > {
>> > struct sk_buff *other;
>> > atomic_t *fclone_ref;
>> > + void (*free_skb)(struct kmem_cache *, void *);
>> >
>> > skb_release_data(skb);
>> > +
>> > + free_skb = skb->emerg ? emerg_free_skb : kmem_cache_free;
>> > +
>> > switch (skb->fclone) {
>> > case SKB_FCLONE_UNAVAILABLE:
>> > - kmem_cache_free(skbuff_head_cache, skb);
>> > + free_skb(skbuff_head_cache, skb);
>> > break;
>> >
>> > case SKB_FCLONE_ORIG:
>> > fclone_ref = (atomic_t *) (skb + 2);
>> > if (atomic_dec_and_test(fclone_ref))
>> > - kmem_cache_free(skbuff_fclone_cache, skb);
>> > + free_skb(skbuff_fclone_cache, skb);
>> > break;
>> >
>> > case SKB_FCLONE_CLONE:
>> > @@ -349,7 +400,7 @@ void kfree_skbmem(struct sk_buff *skb)
>> > skb->fclone = SKB_FCLONE_UNAVAILABLE;
>> >
>> > if (atomic_dec_and_test(fclone_ref))
>> > - kmem_cache_free(skbuff_fclone_cache, other);
>> > + free_skb(skbuff_fclone_cache, other);
>> > break;
>> > };
>> > }
>>
>> I don't have the original code in front of me, but isn't it possible to
>> add a "goto free" which has all the freeing in one place? That would get
>> rid of the function pointer stuff and emerg_free_skb.
>
> Perhaps, yes; however I prefer this one, as it allows access to the size.
What size are you talking about? What I had in mind is probably less readable,
but it avoids a bunch of function calls and that indirect function call, so
with luck it has less overhead and smaller object size:
void kfree_skbmem(struct sk_buff *skb)
{
	struct sk_buff *other;
	atomic_t *fclone_ref;
	struct kmem_cache *cache = skbuff_head_cache;
	struct sk_buff *free = skb;

	skb_release_data(skb);

	switch (skb->fclone) {
	case SKB_FCLONE_UNAVAILABLE:
		goto free;

	case SKB_FCLONE_ORIG:
		fclone_ref = (atomic_t *) (skb + 2);
		if (atomic_dec_and_test(fclone_ref)) {
			cache = skbuff_fclone_cache;
			goto free;
		}
		break;

	case SKB_FCLONE_CLONE:
		fclone_ref = (atomic_t *) (skb + 1);
		other = skb - 1;

		/* The clone portion is available for
		 * fast-cloning again.
		 */
		skb->fclone = SKB_FCLONE_UNAVAILABLE;

		if (atomic_dec_and_test(fclone_ref)) {
			cache = skbuff_fclone_cache;
			free = other;
			goto free;
		}
		break;
	}

	return;
free:
	if (!skb->emergency)
		kmem_cache_free(cache, free);
	else
		emergency_rx_free(free, kmem_cache_size(cache));
}
>> Are these functions only called for incoming skbs? If not, calling
>> emerg_rx_pages_try_inc() is the wrong thing to do. A quick search says
>> they aren't. Add a RX check? Or is that fixed by SKB_FCLONE_UNAVAILABLE?
>> If so, why are skb_clone() and pskb_expand_head() modified at all?
>> (Probably ignorance on my end.)
>
> Well, no, however only incoming skbs can have skb->emergency set to
> begin with.
Yes, that solves it.
>> You can get rid of the memalloc_reserve and vmio_request_queues variables
>> if you want, they aren't really needed for anything. If using them reduces
>> the total code size I'd keep them though.
>
> I find my version easier to read, but that might just be the way my
> brain works.
Maybe true, but I believe my version is more natural in the sense that it makes
more clear what the code is doing. Less bookkeeping, more real work, so to speak.
But after another look things seem a bit shaky, in the locking corner anyway.
sk_adjust_memalloc() calls adjust_memalloc_reserve(), which changes var_free_kbytes
and then calls setup_per_zone_pages_min(), which does the real work. But it reads
min_free_kbytes without holding any locks. In mainline that's fine as the function
is only called by the proc handler and in obscure memory hotplug stuff. But with
your code it can also be called at any moment when a VMIO socket is made, which now
races with the proc callback. More a theoretical than a real problem, but still
slightly messy.
adjust_memalloc_reserve() has no locking at all, while it might be called concurrently
from different sources. Luckily sk_adjust_memalloc() is the only user, and it uses
its own spinlock for synchronization, so things go well by accident now. It seems
cleaner to move that spinlock so that it protects var|min_free_kbytes instead.
+int adjust_memalloc_reserve(int pages)
+{
+ int kbytes;
+ int err = 0;
+
+ kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
+ if (kbytes < 0) {
+ err = -EINVAL;
+ goto out;
+ }
Shouldn't that be a BUG_ON instead?
Greetings,
Indan
On Mon, 2006-08-28 at 18:03 +0200, Indan Zupancic wrote:
> On Mon, August 28, 2006 12:22, Peter Zijlstra said:
> >> > @@ -391,6 +391,7 @@ enum sock_flags {
> >> > SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
> >> > SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
> >> > SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
> >> > + SOCK_VMIO, /* promise to never block on receive */
> >>
> >> It might be used for IO related to the VM, but that doesn't tell _what_ it does.
> >> It also does much more than just not blocking on receive, so overall, aren't
> >> both the vmio name and the comment slightly misleading?
> >
> > I'm having a lot of trouble with this name; I had SOCK_NONBLOCKING for a
> > while, but that is a very bad name because nonblocking has a well-defined
> > meaning when talking about sockets, and this is not that.
> >
> > Hence I came up with VMIO, because that is the only selection
> > criterion for being special. - I'll fix up the comment.
>
> It's nice and short, but it might be weird if someone after a while finds another way
> of using this stuff. And its relation to 'emergency' looks unclear. So maybe calling
> both the same makes most sense, no matter how you name it.
I've tried to come up with another use-case, but failed (of course that
doesn't mean there is none). Also, I'm really past caring what the thing
is called ;-) But if people object I guess it's easy enough to run yet
another sed command over the patches.
> >> > @@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);
> >> >
> >> > static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
> >> > int min_free_kbytes = 1024;
> >> > +int var_free_kbytes;
> >>
> >> Using var_free_pages makes the code slightly simpler, as all that needless
> >> conversion isn't needed anymore. Perhaps the same is true for min_free_kbytes...
> >
> > It seems I'm a bit puzzled as to what you mean here.
>
> I mean to store the variable reserve in pages instead of kilobytes. Currently you're
> converting from the one to the other both when setting and when using the value. That
> doesn't make much sense and can be avoided by storing the value in pages from the start.
right, will have a peek.
> void kfree_skbmem(struct sk_buff *skb)
> {
> struct sk_buff *other;
> atomic_t *fclone_ref;
> struct kmem_cache *cache = skbuff_head_cache;
> struct sk_buff *free = skb;
>
> skb_release_data(skb);
> switch (skb->fclone) {
> case SKB_FCLONE_UNAVAILABLE:
> goto free;
>
> case SKB_FCLONE_ORIG:
> fclone_ref = (atomic_t *) (skb + 2);
> if (atomic_dec_and_test(fclone_ref)){
> cache = skbuff_fclone_cache;
> goto free;
> }
> break;
>
> case SKB_FCLONE_CLONE:
> fclone_ref = (atomic_t *) (skb + 1);
> other = skb - 1;
>
> /* The clone portion is available for
> * fast-cloning again.
> */
> skb->fclone = SKB_FCLONE_UNAVAILABLE;
>
> if (atomic_dec_and_test(fclone_ref)){
> cache = skbuff_fclone_cache;
> free = other;
> goto free;
> }
> break;
> };
> return;
> free:
> if (!skb->emergency)
> kmem_cache_free(cache, free);
> else
> emergency_rx_free(free, kmem_cache_size(cache));
> }
Ah, like so, sure, that looks good.
> >> You can get rid of the memalloc_reserve and vmio_request_queues variables
> >> if you want, they aren't really needed for anything. If using them reduces
> >> the total code size I'd keep them though.
> >
> > I find my version easier to read, but that might just be the way my
> > brain works.
>
> Maybe true, but I believe my version is more natural in the sense that it makes
> more clear what the code is doing. Less bookkeeping, more real work, so to speak.
Ok, I'll have another look at it, perhaps my gray matter has shifted ;-)
> But after another look things seem a bit shaky, in the locking corner anyway.
>
> sk_adjust_memalloc() calls adjust_memalloc_reserve(), which changes var_free_kbytes
> and then calls setup_per_zone_pages_min(), which does the real work. But it reads
> min_free_kbytes without holding any locks. In mainline that's fine as the function
> is only called by the proc handler and in obscure memory hotplug stuff. But with
> your code it can also be called at any moment when a VMIO socket is made, which now
> races with the proc callback. More a theoretical than a real problem, but still
> slightly messy.
Knew about that, hadn't made up my mind on a fix yet. Good spot nevertheless.
Time to actually fix it I guess.
> adjust_memalloc_reserve() has no locking at all, while it might be called concurrently
> from different sources. Luckily sk_adjust_memalloc() is the only user, and it uses
> its own spinlock for synchronization, so things go well by accident now. It seems
> cleaner to move that spinlock so that it protects var|min_free_kbytes instead.
Ah, no accident there, I'm fully aware that there would need to be a
spinlock in adjust_memalloc_reserve() if there were another caller.
(I even had it there for some time) - added comment.
> +int adjust_memalloc_reserve(int pages)
> +{
> + int kbytes;
> + int err = 0;
> +
> + kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
> + if (kbytes < 0) {
> + err = -EINVAL;
> + goto out;
> + }
>
> Shouldn't that be a BUG_ON instead?
Yeah, might as well be.
On Mon, August 28, 2006 19:32, Peter Zijlstra said:
> Also, I'm really past caring what the thing
> is called ;-) But if people object I guess it's easy enough to run yet
> another sed command over the patches.
True, same here.
>> >> You can get rid of the memalloc_reserve and vmio_request_queues variables
>> >> if you want, they aren't really needed for anything. If using them reduces
>> >> the total code size I'd keep them though.
>> >
>> > I find my version easier to read, but that might just be the way my
>> > brain works.
>>
>> Maybe true, but I believe my version is more natural in the sense that it makes
>> more clear what the code is doing. Less bookkeeping, more real work, so to speak.
>
> Ok, I'll have another look at it, perhaps my gray matter has shifted ;-)
I don't care either way, just providing an alternative. I'd compile both and see
which one is smaller.
> Ah, no accident there, I'm fully aware that there would need to be a
> spinlock in adjust_memalloc_reserve() if there were another caller.
> (I even had it there for some time) - added comment.
Good that you're aware of it. Thing is, how much sense does the split-up into
adjust_memalloc_reserve() and sk_adjust_memalloc() make at this point? Why not
merge the code of adjust_memalloc_reserve() with sk_adjust_memalloc() and only
add adjust_memalloc_reserve() when it's really needed? It saves an export.
Feedback on the 28-Aug-2006 19:24 version from
programming.kicks-ass.net/kernel-patches/vm_deadlock/current/
> +void setup_per_zone_pages_min(void)
> +{
> + static DEFINE_SPINLOCK(lock);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&lock, flags);
> + __setup_per_zone_pages_min();
> + spin_unlock_irqrestore(&lock, flags);
> +}
Better to put the lock next to min_free_kbytes, both for readability and
cache behaviour. And it satisfies the "lock data, not code" mantra.
> +static inline void * emergency_rx_alloc(size_t size, gfp_t gfp_mask)
> +{
> + void * page = NULL;
> +
> + if (size > PAGE_SIZE)
> + return page;
> +
> + if (atomic_add_unless(&emergency_rx_pages_used, 1, RX_RESERVE_PAGES)) {
> + page = (void *)__get_free_page(gfp_mask);
> + if (!page) {
> + WARN_ON(1);
> + atomic_dec(&emergency_rx_pages_used);
> + }
> + }
> +
> + return page;
> +}
If you prefer to avoid cmpxchg (which is often used in atomic_add_unless
and can be expensive) then you can use something like:
static inline void *emergency_rx_alloc(size_t size, gfp_t gfp_mask)
{
	void *page;

	if (size > PAGE_SIZE)
		return NULL;

	/* >= instead of ==: a concurrent increment may push the counter
	 * past RX_RESERVE_PAGES, and that overshoot must fail too. */
	if (atomic_inc_return(&emergency_rx_pages_used) >= RX_RESERVE_PAGES)
		goto out;

	page = (void *)__get_free_page(gfp_mask);
	if (page)
		return page;

	WARN_ON(1);
out:
	atomic_dec(&emergency_rx_pages_used);
	return NULL;
}
The tiny race should be totally harmless. Both versions are a bit big
to inline though.
> @@ -195,6 +196,86 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
> /* Maximal space eaten by iovec or ancilliary data plus some space */
> int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
>
> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int memalloc_reserve;
> +static unsigned int vmio_request_queues;
> +
> +atomic_t vmio_socks;
> +atomic_t emergency_rx_pages_used;
> +EXPORT_SYMBOL_GPL(vmio_socks);
Is this export needed? It's only used in net/core/skbuff.c and net/core/sock.c,
which are compiled into one module.
> +EXPORT_SYMBOL_GPL(emergency_rx_pages_used);
Same here. It's only used by code in sock.c and skbuff.c, and no external
code calls emergency_rx_alloc(), nor emergency_rx_free().
--
I think I've depleted my usefulness; there isn't much left for me to say.
It's up to the big guys to decide about the merit of this patch.
If Evgeniy's network allocator fixes all deadlocks and also has other
advantages, then great.
IMHO:
- This patch isn't really a framework, more a minimal fix for one specific,
though important problem. But it's small and doesn't have much impact
(numbers would be nice, e.g. vmlinux/modules size before and after, and
some network benchmark results).
- If Evgeniy's network allocator is as good as it looks, then why can't it
replace the existing one? Just adding private subsystem specific memory
allocators seems wrong. I might be missing the big picture, but it looks
like memory allocator things should be at least synchronized and discussed
with Christoph Lameter and his "modular slab allocator" patch.
All in all it seems it will take a while before Evgeniy's code is merged,
so I think applying Peter's patch soonish and removing it again the moment it
becomes unnecessary is reasonable.
Greetings,
Indan
On Tue, 2006-08-29 at 02:01 +0200, Indan Zupancic wrote:
> On Mon, August 28, 2006 19:32, Peter Zijlstra said:
> > Ah, no accident there, I'm fully aware that there would need to be a
> > spinlock in adjust_memalloc_reserve() if there were another caller.
> > (I even had it there for some time) - added comment.
>
> Good that you're aware of it. Thing is, how much sense does the split-up into
> adjust_memalloc_reserve() and sk_adjust_memalloc() make at this point? Why not
> merge the code of adjust_memalloc_reserve() with sk_adjust_memalloc() and only
> add adjust_memalloc_reserve() when it's really needed? It saves an export.
mm/ vs net/core/
> Better to put the lock next to min_free_kbytes, both for readability and
> cache behaviour. And it satisfies the "lock data, not code" mantra.
True enough.
> If you prefer to avoid cmpxchg (which is often used in atomic_add_unless
> and can be expensive) then you can use something like:
Yes, way too large; I've moved it out of line already. Don't care about the
cmpxchg, it's not a fast path anyway.
> > @@ -195,6 +196,86 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
> > /* Maximal space eaten by iovec or ancilliary data plus some space */
> > int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);
> >
> > +static DEFINE_SPINLOCK(memalloc_lock);
> > +static int memalloc_reserve;
> > +static unsigned int vmio_request_queues;
> > +
> > +atomic_t vmio_socks;
> > +atomic_t emergency_rx_pages_used;
> > +EXPORT_SYMBOL_GPL(vmio_socks);
>
> Is this export needed? It's only used in net/core/skbuff.c and net/core/sock.c,
> which are compiled into one module.
>
> > +EXPORT_SYMBOL_GPL(emergency_rx_pages_used);
>
> Same here. It's only used by code in sock.c and skbuff.c, and no external
> code calls emergency_rx_alloc(), nor emergency_rx_free().
Good point, I've gone over the link relations of these things and was
indeed capable of removing several EXPORTs. Thanks.
> I think I've depleted my usefulness; there isn't much left for me to say.
> It's up to the big guys to decide about the merit of this patch.
Thanks for all your feedback.
> IMHO:
>
> - This patch isn't really a framework, more a minimal fix for one specific,
> though important problem. But it's small and doesn't have much impact
Well, perhaps; its merit is that it allows full service for a few
sockets even under severe memory pressure. And it provides the
primitives to solve this problem for all instances (NBD, iSCSI, NFS,
AoE, ...), hence framework.
Evgeniy's allocator does not cater for this, so even if it were to
replace all the allocation stuff, we would still need the SOCK_VMIO and
all protocol hooks this patch introduces.
> - If Evgeniy's network allocator is as good as it looks, then why can't it
> replace the existing one? Just adding private subsystem specific memory
> allocators seems wrong. I might be missing the big picture, but it looks
> like memory allocator things should be at least synchronized and discussed
> with Christoph Lameter and his "modular slab allocator" patch.
SLAB is very, very good in that it will not suffer from external
fragmentation (one could suffer from external fragmentation when viewing
the slab allocator from the page allocation layer - but most of that is
avoidable by allocation strategies in the slab layer); it does however
suffer from internal fragmentation - by design.
For variable size allocators it has been proven that for each allocator
there is an allocation pattern that will defeat it. And figuring the
pattern out and proving it will not happen in a long-running system is
hard, hard work.
(free block coalescence is not a guarantee against fragmentation; there
is even evidence that delayed coalescence will reduce fragmentation - it
introduces history and this extra information can help predict the
future.)
This is exactly why long running systems (like our kernel) love slabs.
For those interested in memory allocators, this paper is a good (albeit
a bit dated) introduction:
http://citeseer.ist.psu.edu/wilson95dynamic.html
That said, it might be that Evgeniy's allocator works out for our
network load - only time will tell, the math is not tractable afaik.
> All in all it seems it will take a while before Evgeniy's code is merged,
> so I think applying Peter's patch soonish and removing it again the moment it
> becomes unnecessary is reasonable.
Thanks, and like I said, I think even then most of this patch will need to
survive.
On Tue, August 29, 2006 11:49, Peter Zijlstra said:
> On Tue, 2006-08-29 at 02:01 +0200, Indan Zupancic wrote:
>> Good that you're aware of it. Thing is, how much sense does the split-up into
>> adjust_memalloc_reserve() and sk_adjust_memalloc() make at this point? Why not
>> merge the code of adjust_memalloc_reserve() with sk_adjust_memalloc() and only
>> add adjust_memalloc_reserve() when it's really needed? It saves an export.
>
> mm/ vs net/core/
I thought var_free_kbytes was exported too, but it isn't, so merging those two
won't save an export, scrap the idea.
If sk_adjust_memalloc() can be called with socks == 0 then a check for that needs
to be added, or else bugginess ensues (my fault ;-):
void sk_adjust_memalloc(int socks, int tx_reserve_pages)
{
	unsigned long flags;
	int nr_socks;

	if (socks) {
		nr_socks = atomic_add_return(socks, &vmio_socks);
		BUG_ON(nr_socks < 0);
		if (nr_socks == socks)
			tx_reserve_pages += RX_RESERVE_PAGES;
		if (nr_socks == 0)
			tx_reserve_pages -= RX_RESERVE_PAGES;
	}

	spin_lock_irqsave(&memalloc_lock, flags);
	adjust_memalloc_reserve(tx_reserve_pages);
	spin_unlock_irqrestore(&memalloc_lock, flags);
}
> Well, perhaps; its merit is that it allows full service for a few
> sockets even under severe memory pressure. And it provides the
> primitives to solve this problem for all instances (NBD, iSCSI, NFS,
> AoE, ...), hence framework.
>
> Evgeniy's allocator does not cater for this, so even if it were to
> replace all the allocation stuff, we would still need the SOCK_VMIO and
> all protocol hooks this patch introduces.
From what I understood of it, the appealing thing was that it would make
the whole networking stack robust, as network operations will never fail.
So not only for VMIO sockets, but for all. The only thing missing then is
a way to guarantee that there's always forward progress. The ironic thing
is that to achieve that, skbs/memory need to be dropped/refused again in
the case of blocking senders.
VMIO makes the choice what to refuse by making some skbs/sockets more important
than others and, indirectly, the kernel more important than userspace. A
different choice could be made. If one can be made which works for all, then
the VM deadlock problem is solved too.
Great if the VM deadlock problem is solved, but what about user space applications
that want to keep working under heavy memory pressure? Say, more or less all
decently written network servers. For those incoming data in general means more
work to do, and hence more memory to use and data to send. What they need is
receive throttling under memory pressure so that enough free memory is available
to handle the data that is received. A reliable network stack gives them nothing
if after receiving the data they can't do anything with it because free
memory has run out. So in that regard splitting the network allocator from the
normal allocator might only increase the problem.
(Swap only makes matters worse, as a program will slow down when it's used under
heavy memory pressure without knowing that there's any memory pressure. Failed
memory allocations are a good sign that there's memory pressure, but unfortunately
when that point is reached all performance is gone already and the application
can't really do anything useful, as it can't allocate nay memory. Something like
MAP_NONBLOCK, but which works with MAP_ANONYMOUS and let the mmap fail if swap
would be used may be helpful, because then the flow of new work can be throttled
earlier by the app. But coordination with malloc would be better.
Of course most can already be done by tuning the system and careful coding.
Sucks to be userspace I guess.)
>> - If Evgeniy's network allocator is as good as it looks, then why can't it
>> replace the existing one? Just adding private subsystem specific memory
>> allocators seems wrong. I might be missing the big picture, but it looks
>> like memory allocator things should be at least synchronized and discussed
>> with Christoph Lameter and his "modular slab allocator" patch.
>
> SLAB is very, very good in that it will not suffer from external
> fragmentation (one could suffer from external fragmentation when viewing
> the slab allocator from the page allocation layer - but most of that is
> avoidable by allocation strategies in the slab layer); it does however
> suffer from internal fragmentation - by design.
>
> For variable size allocators it has been proven that for each allocator
> there is an allocation pattern that will defeat it. And figuring the
> pattern out and proving it will not happen in a long-running system is
> hard, hard work.
>
> (free block coalescence is not a guarantee against fragmentation; there
> is even evidence that delayed coalescence will reduce fragmentation - it
> introduces history and this extra information can help predict the
> future.)
>
> This is exactly why long running systems (like our kernel) love slabs.
>
> For those interested in memory allocators, this paper is a good (albeit
> a bit dated) introduction:
> http://citeseer.ist.psu.edu/wilson95dynamic.html
>
> That said, it might be that Evgeniy's allocator works out for our
> network load - only time will tell, the math is not tractable afaik.
Thanks for the pointer; the paper is interesting, but doesn't give a lot
of new info, more a shocking insight into the absence of quantitative
allocator research. Lots of words, but few numbers to back them up, slightly
disappointing. Besides, more or less everything in the paper is only
relevant for allocation of unmovable, variable-sized objects. Things like
page handling aren't researched; they limited themselves to allocators
using one big lump of virtual memory. Worth the read though.
Sorry for the rambling. :-)
Greetings,
Indan