2006-08-12 14:15:05

by Peter Zijlstra

Subject: [RFC][PATCH 0/4] VM deadlock prevention -v4

Hi,

here is the latest effort; it includes a whole new trivial allocator with a
horrid name and an almost full rewrite of the deadlock prevention core.
This version does not do anything per device and hence does not depend
on the new netdev_alloc_skb() API.

The reason to add a second allocator to the receive side is twofold:
1) it allows easy detection of the memory pressure / OOM situation;
2) it allows the receive path to be unbounded and go at full speed when
resources permit.

The choice of using the global memalloc reserve as a mempool means that
the new allocator has to release pages as soon as possible; if we were
to hoard pages in the allocator, the memalloc reserve would not be
replenished readily.

Peter


2006-08-12 14:15:45

by Peter Zijlstra

Subject: [RFC][PATCH 3/4] deadlock prevention core


The core of the VM deadlock avoidance framework.

From the 'user' side of things it provides a function to mark a 'struct sock'
as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on
the receive side.

When *dev_alloc_skb() finds it cannot allocate a struct sk_buff the regular
way, it will grab some memory from the memalloc reserve.

Network paths will drop !SOCK_MEMALLOC packets ASAP when the reserve is in use.

Memalloc sk_buff allocations are not done from the SLAB but are done using
the new SROG allocator. sk_buff::memalloc records this exception so that
kfree_skbmem()/skb_clone() and others can do the right thing.
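
As an illustration (a hypothetical sketch, not part of this patch), a
block-over-network driver would opt its transport socket into the reserve
roughly as follows; sk_set_memalloc() is the function added below, the
driver context is made up. Patch 4 does exactly this for NBD.

/*
 * Mark the transport socket SOCK_MEMALLOC so the receive path may dip
 * into the memalloc reserve for this socket's packets. The reserve is
 * given back when the socket is destructed (see the inet_sock_destruct()
 * hunk below).
 */
static int my_netblock_bind(struct my_netblock_dev *dev,
                            struct socket *sock)
{
        int err = sk_set_memalloc(sock->sk);
        if (err)
                return err;
        dev->sock = sock;
        return 0;
}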

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
include/linux/gfp.h | 3 -
include/linux/mmzone.h | 1
include/linux/skbuff.h | 6 +-
include/net/sock.h | 40 +++++++++++++++
mm/page_alloc.c | 41 ++++++++++++++-
net/core/skbuff.c | 127 ++++++++++++++++++++++++++++++++++++++++++++-----
net/core/sock.c | 74 ++++++++++++++++++++++++++++
net/ipv4/af_inet.c | 3 +
net/ipv4/icmp.c | 3 +
net/ipv4/tcp_ipv4.c | 3 +
net/ipv4/udp.c | 8 ++-
11 files changed, 290 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/linux/gfp.h 2006-08-12 12:56:09.000000000 +0200
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_MEMALLOC)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/linux/mmzone.h 2006-08-12 12:56:09.000000000 +0200
@@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);

#include <linux/topology.h>
/* Returns the number of the current Node. */
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/linux/skbuff.h 2006-08-12 15:25:33.000000000 +0200
@@ -282,7 +282,8 @@ struct sk_buff {
nfctinfo:3;
__u8 pkt_type:3,
fclone:2,
- ipvs_property:1;
+ ipvs_property:1,
+ memalloc:1;
__be16 protocol;

void (*destructor)(struct sk_buff *skb);
@@ -1086,7 +1087,8 @@ static inline void __skb_queue_purge(str
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD,
+ gfp_mask | __GFP_MEMALLOC);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/include/net/sock.h 2006-08-12 12:56:38.000000000 +0200
@@ -391,6 +391,7 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* protocol can use memalloc reserve */
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,45 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}

+static inline int sk_is_memalloc(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Is this high enough, or do we want it to depend on the number of
+ * online devices and online CPUs?
+ *
+ * #define MAX_CONCURRENT_SKBS (64*nr_devices*num_online_cpus())
+ */
+#define MAX_CONCURRENT_SKBS 128
+
+/*
+ * Used to count skb payloads.
+ *
+ * The assumption is that the sk_buffs themselves are small enough to fit
+ * in the remaining page space.
+ */
+extern atomic_t memalloc_skbs_used;
+
+static inline int memalloc_skbs_try_inc(void)
+{
+ return atomic_add_unless(&memalloc_skbs_used, 1, MAX_CONCURRENT_SKBS);
+}
+
+static inline void memalloc_skbs_dec(void)
+{
+ atomic_dec(&memalloc_skbs_used);
+}
+
+static inline int memalloc_skbs_available(void)
+{
+ return atomic_read(&memalloc_skbs_used) < MAX_CONCURRENT_SKBS;
+}
+
+extern int sk_adjust_memalloc(int);
+extern int sk_set_memalloc(struct sock *sk);
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/mm/page_alloc.c 2006-08-12 12:56:09.000000000 +0200
@@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);

static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
+int var_free_kbytes;

unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
@@ -970,8 +971,8 @@ restart:

/* This allocation should allow future memory freeing. */

- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
+ if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
@@ -2196,7 +2197,8 @@ static void setup_per_zone_lowmem_reserv
*/
void setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned long pages_min = (min_free_kbytes + var_free_kbytes)
+ >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -2248,6 +2250,39 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
}

+/**
+ * adjust_memalloc_reserve - adjust the memalloc reserve
+ * @pages: number of pages to add
+ *
+ * Adds a number of pages to the memalloc reserve; if the
+ * number is positive it kicks kswapd into action to
+ * satisfy the higher watermarks.
+ */
+int adjust_memalloc_reserve(int pages)
+{
+ int kbytes;
+ int err = 0;
+
+ kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
+ if (kbytes < 0) {
+ err = -EINVAL;
+ goto out;
+ }
+ var_free_kbytes = kbytes;
+ setup_per_zone_pages_min();
+ if (pages > 0) {
+ struct zone *zone;
+ for_each_zone(zone)
+ wakeup_kswapd(zone, 0);
+ }
+ printk(KERN_DEBUG "RX reserve: %d\n", var_free_kbytes);
+
+out:
+ return err;
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
/*
* Initialise min_free_kbytes.
*
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/core/skbuff.c 2006-08-12 15:28:15.000000000 +0200
@@ -43,6 +43,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mm.h>
+#include <linux/srog.h>
#include <linux/interrupt.h>
#include <linux/in.h>
#include <linux/inet.h>
@@ -126,7 +127,7 @@ EXPORT_SYMBOL(skb_truesize_bug);
*/

/**
- * __alloc_skb - allocate a network buffer
+ * ___alloc_skb - allocate a network buffer
* @size: size to allocate
* @gfp_mask: allocation mask
* @fclone: allocate from fclone cache instead of head cache
@@ -139,14 +140,45 @@ EXPORT_SYMBOL(skb_truesize_bug);
* Buffers may only be allocated from interrupts using a @gfp_mask of
* %GFP_ATOMIC.
*/
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone)
+static
+struct sk_buff *___alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
{
kmem_cache_t *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;

+ size = SKB_DATA_ALIGN(size);
+
+ if (gfp_mask & __GFP_MEMALLOC) {
+ cache = NULL;
+ skb = NULL;
+ if (!memalloc_skbs_try_inc())
+ goto out;
+
+ /*
+ * Allocate the data section first because we know the first
+ * SROG alloc is a valid SROG entry point and skb->head is
+ * shared between clones. This saves us from tracking SROGs.
+ */
+ data = srog_alloc(NULL, size + sizeof(struct skb_shared_info),
+ gfp_mask);
+ if (!data)
+ goto dec_out;
+
+ skb = srog_alloc(data, fclone
+ ? 2*sizeof(struct sk_buff) + sizeof(atomic_t)
+ : sizeof(struct sk_buff), gfp_mask);
+ if (!skb) {
+ srog_free(NULL, data);
+dec_out:
+ memalloc_skbs_dec();
+ goto out;
+ }
+
+ goto allocated;
+ }
+
cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

/* Get the HEAD */
@@ -155,12 +187,13 @@ struct sk_buff *__alloc_skb(unsigned int
goto out;

/* Get the DATA. Size must match skb_add_mtu(). */
- size = SKB_DATA_ALIGN(size);
data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

+allocated:
memset(skb, 0, offsetof(struct sk_buff, truesize));
+ skb->memalloc = !cache;
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -185,6 +218,7 @@ struct sk_buff *__alloc_skb(unsigned int
atomic_set(fclone_ref, 1);

child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->memalloc = skb->memalloc;
}
out:
return skb;
@@ -194,6 +228,18 @@ nodata:
goto out;
}

+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
+{
+ struct sk_buff *skb;
+
+ skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);
+
+ if (!skb && (gfp_mask & __GFP_MEMALLOC) && memalloc_skbs_available())
+ skb = ___alloc_skb(size, gfp_mask, fclone);
+
+ return skb;
+}
+
/**
* alloc_skb_from_cache - allocate a network buffer
* @cp: kmem_cache from which to allocate the data area
@@ -267,7 +313,7 @@ struct sk_buff *__netdev_alloc_skb(struc
{
struct sk_buff *skb;

- skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ skb = alloc_skb(length + NET_SKB_PAD, gfp_mask | __GFP_MEMALLOC);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
@@ -299,7 +345,7 @@ static void skb_clone_fraglist(struct sk
skb_get(list);
}

-static void skb_release_data(struct sk_buff *skb)
+static int skb_release_data(struct sk_buff *skb)
{
if (!skb->cloned ||
!atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
@@ -313,8 +359,34 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);

- kfree(skb->head);
+ return 1;
}
+ return 0;
+}
+
+static void memalloc_free_skb(struct kmem_cache *cache, void *objp, int free)
+{
+ /*
+ * This complication is necessary because we need a valid SROG
+ * pointer. If we let skb_release_data() free the skb data, we
+ * lose the only valid SROG entry point we know about.
+ */
+ struct sk_buff *skb = objp;
+ u8 *data = skb->head;
+ srog_free(data, objp);
+ if (free) {
+ srog_free(NULL, data);
+ memalloc_skbs_dec();
+ }
+}
+
+static void kmem_cache_free_skb(struct kmem_cache *cache, void *objp, int free)
+{
+ struct sk_buff *skb = objp;
+ u8 *data = skb->head;
+ kmem_cache_free(cache, objp);
+ if (free)
+ kfree(data);
}

/*
@@ -324,17 +396,22 @@ void kfree_skbmem(struct sk_buff *skb)
{
struct sk_buff *other;
atomic_t *fclone_ref;
+ void (*free_skb)(struct kmem_cache *, void *, int);
+ int free;
+
+ free = skb_release_data(skb);
+
+ free_skb = skb->memalloc ? memalloc_free_skb : kmem_cache_free_skb;

- skb_release_data(skb);
switch (skb->fclone) {
case SKB_FCLONE_UNAVAILABLE:
- kmem_cache_free(skbuff_head_cache, skb);
+ free_skb(skbuff_head_cache, skb, free);
break;

case SKB_FCLONE_ORIG:
fclone_ref = (atomic_t *) (skb + 2);
if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, skb);
+ free_skb(skbuff_fclone_cache, skb, free);
break;

case SKB_FCLONE_CLONE:
@@ -347,7 +424,7 @@ void kfree_skbmem(struct sk_buff *skb)
skb->fclone = SKB_FCLONE_UNAVAILABLE;

if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, other);
+ free_skb(skbuff_fclone_cache, other, free);
break;
};
}
@@ -433,6 +510,11 @@ struct sk_buff *skb_clone(struct sk_buff
atomic_t *fclone_ref = (atomic_t *) (n + 1);
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
+ } else if (skb->memalloc) {
+ n = srog_alloc(skb->head, sizeof(struct sk_buff), gfp_mask);
+ if (!n)
+ return NULL;
+ n->fclone = SKB_FCLONE_UNAVAILABLE;
} else {
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
@@ -468,6 +550,7 @@ struct sk_buff *skb_clone(struct sk_buff
#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
C(ipvs_property);
#endif
+ C(memalloc);
C(protocol);
n->destructor = NULL;
#ifdef CONFIG_NETFILTER
@@ -688,7 +771,27 @@ int pskb_expand_head(struct sk_buff *skb

size = SKB_DATA_ALIGN(size);

- data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ if (skb->memalloc) {
+ if (!memalloc_skbs_try_inc())
+ goto nodata;
+ /*
+ * Unfortunately we have to assume skb->head is in the first
+ * page of the SROG, hence we cannot reuse the old one.
+ */
+ data = srog_alloc(NULL,
+ size + sizeof(struct skb_shared_info),
+ gfp_mask | __GFP_MEMALLOC);
+ if (!data) {
+ memalloc_skbs_dec();
+ goto nodata;
+ }
+ /*
+ * But they must end up in the same SROG otherwise we cannot
+ * reliably free clones.
+ */
+ srog_link(skb->head, data);
+ } else
+ data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

Index: linux-2.6/net/ipv4/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/icmp.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/icmp.c 2006-08-12 12:56:09.000000000 +0200
@@ -938,6 +938,9 @@ int icmp_rcv(struct sk_buff *skb)
goto error;
}

+ if (unlikely(skb->memalloc))
+ goto drop;
+
if (!pskb_pull(skb, sizeof(struct icmphdr)))
goto error;

Index: linux-2.6/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_ipv4.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/tcp_ipv4.c 2006-08-12 12:56:09.000000000 +0200
@@ -1093,6 +1093,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;

+ if (unlikely(skb->memalloc && !sk_is_memalloc(sk)))
+ goto discard_and_relse;
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/udp.c 2006-08-12 12:56:09.000000000 +0200
@@ -1136,7 +1136,12 @@ int udp_rcv(struct sk_buff *skb)
sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);

if (sk != NULL) {
- int ret = udp_queue_rcv_skb(sk, skb);
+ int ret;
+
+ if (unlikely(skb->memalloc && !sk_is_memalloc(sk)))
+ goto drop_noncritical;
+
+ ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);

/* a return value > 0 means to resubmit the input, but
@@ -1147,6 +1152,7 @@ int udp_rcv(struct sk_buff *skb)
return 0;
}

+drop_noncritical:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto drop;
nf_reset(skb);
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/ipv4/af_inet.c 2006-08-12 12:56:09.000000000 +0200
@@ -132,6 +132,9 @@ void inet_sock_destruct(struct sock *sk)
{
struct inet_sock *inet = inet_sk(sk);

+ if (sk_is_memalloc(sk))
+ sk_adjust_memalloc(-1);
+
__skb_queue_purge(&sk->sk_receive_queue);
__skb_queue_purge(&sk->sk_error_queue);

Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/core/sock.c 2006-08-12 13:02:59.000000000 +0200
@@ -111,6 +111,8 @@
#include <linux/poll.h>
#include <linux/tcp.h>
#include <linux/init.h>
+#include <linux/inetdevice.h>
+#include <linux/blkdev.h>

#include <asm/uaccess.h>
#include <asm/system.h>
@@ -195,6 +197,78 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);

+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+static unsigned long memalloc_reserve;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for this device
+ * @dev: device that has memalloc demands
+ * @nr_socks: number of new %SOCK_MEMALLOC sockets
+ *
+ * This function adjusts the memalloc reserve based on device
+ * demand. For each %SOCK_MEMALLOC socket this device will reserve
+ * 2 * %MAX_PHYS_SEGMENTS pages for outbound traffic (assumption:
+ * each %SOCK_MEMALLOC socket will have a %request_queue associated)
+ * and 5 * %MAX_CONCURRENT_SKBS pages.
+ *
+ * 2 * %MAX_PHYS_SEGMENTS - the request queue can hold up to 150%;
+ * the remaining 50% goes to making sure we can write packets for
+ * the outgoing pages.
+ *
+ * 5 * %MAX_CONCURRENT_SKBS - for each skb 4 pages for high order
+ * jumbo frame allocs, and 1 for good measure.
+ */
+int sk_adjust_memalloc(int nr_socks)
+{
+ unsigned long flags;
+ unsigned long reserve;
+ int err;
+
+ spin_lock_irqsave(&memalloc_lock, flags);
+
+ memalloc_socks += nr_socks;
+ BUG_ON(memalloc_socks < 0);
+
+ reserve = memalloc_socks * 2 * MAX_PHYS_SEGMENTS + /* outbound */
+ MAX_CONCURRENT_SKBS * 5; /* inbound */
+
+ err = adjust_memalloc_reserve(reserve - memalloc_reserve);
+ if (err) {
+ printk(KERN_WARNING
+ "Unable to change RX reserve to: %lu, error: %d\n",
+ reserve, err);
+ goto unlock;
+ }
+ memalloc_reserve = reserve;
+
+unlock:
+ spin_unlock_irqrestore(&memalloc_lock, flags);
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);
+
+/**
+ * sk_set_memalloc - sets %SOCK_MEMALLOC
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
+ * accordingly.
+ */
+int sk_set_memalloc(struct sock *sk)
+{
+ int err = 0;
+
+ if (!(err = sk_adjust_memalloc(1)))
+ sock_set_flag(sk, SOCK_MEMALLOC);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;

2006-08-12 14:15:17

by Peter Zijlstra

Subject: [RFC][PATCH 1/4] pfn_to_kaddr() for UML


Update UML with a proper 'pfn_to_kaddr()' definition; the SROG allocator
uses it.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
include/asm-um/page.h | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6/include/asm-um/page.h
===================================================================
--- linux-2.6.orig/include/asm-um/page.h
+++ linux-2.6/include/asm-um/page.h
@@ -111,6 +111,8 @@ extern unsigned long uml_physmem;
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#define virt_addr_valid(v) pfn_valid(phys_to_pfn(__pa(v)))

+#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
+
extern struct page *arch_validate(struct page *page, gfp_t mask, int order);
#define HAVE_ARCH_VALIDATE

2006-08-12 14:15:54

by Peter Zijlstra

Subject: [RFC][PATCH 2/4] SROG allocator


A simple memory allocator with a horrid name.
(if someone knows the proper name of this thing, please speak up.
It's too trivial to not be described somewhere)

Its use is for cases where you have multiple objects of various sizes
that have similar lifetimes and should release pages as soon as possible.

In a few words: it allocates pages and packs the objects into them; the
pages are kept on a list and the free blocks are kept in address order.
On free, an insertion sort is used and front and back merges are attempted.
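
For a feel of the interface (a minimal sketch against the API below, not
code from this patch): passing NULL as the group pointer creates a new
SROG around the first object, and any object already in the group serves
as the entry point for further allocations and frees.

#include <linux/srog.h>

static void srog_example(void)
{
        /* NULL group pointer: create a new SROG for the first object. */
        char *a = srog_alloc(NULL, 100, GFP_ATOMIC);
        char *b;

        if (!a)
                return;

        /* Pack a second object into the same group. */
        b = srog_alloc(a, 200, GFP_ATOMIC);
        if (b)
                srog_free(a, b); /* free b, using a as the entry point */

        srog_free(NULL, a); /* free a; a fully empty page is released */
}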

Signed-off-by: Peter Zijlstra <[email protected]>

---
include/linux/srog.h | 14 ++
mm/Makefile | 3
mm/srog.c | 290 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 306 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/srog.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/srog.h 2006-08-12 11:49:10.000000000 +0200
@@ -0,0 +1,14 @@
+#ifndef _LINUX_SROG_H_
+#define _LINUX_SROG_H_
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+
+extern void *srog_alloc(void *srogp, unsigned long size, gfp_t gfp_mask);
+extern void srog_free(void *srogp, void *obj);
+extern void srog_link(void *srogp1, void *srogp2);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_SROG_H_ */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2006-08-12 11:49:05.000000000 +0200
+++ linux-2.6/mm/Makefile 2006-08-12 11:49:10.000000000 +0200
@@ -10,7 +10,8 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
- prio_tree.o util.o mmzone.o vmstat.o $(mmu-y)
+ prio_tree.o util.o mmzone.o vmstat.o srog.o \
+ $(mmu-y)

obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
Index: linux-2.6/mm/srog.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/srog.c 2006-08-12 11:54:27.000000000 +0200
@@ -0,0 +1,289 @@
+/*
+ * mm/srog.c
+ *
+ * Written by Peter Zijlstra <[email protected]>
+ * Released under the GPLv2, see the file COPYING for details.
+ *
+ * Small Related Object Group Allocator
+ *
+ * Trivial allocator that has to release pages ASAP and pack objects
+ * of various sizes. It need not be fast, just reliable and relatively
+ * small.
+ *
+ * The approach is a list of pages (need not be 0-order); each of
+ * these pages is subdivided on demand. The free blocks in each
+ * of these pages are kept in address order.
+ *
+ * On free, the block tries to merge with the previous and next block.
+ *
+ * When a page is found to be empty, it's taken off the list and returned
+ * to the system.
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+
+/**
+ * srog_free_area - structure describing a free block of memory
+ * @size: size of the block
+ * @next: pointer to the next free block
+ */
+struct srog_free_area {
+ unsigned int size;
+ union {
+ struct srog_free_area *next;
+ char obj[0];
+ };
+};
+
+#define SROG_MAGIC 0xf00fde3d
+
+/**
+ * srog_head - structure managing a page
+ * @list: list of all related pages
+ * @free: first free area descriptor
+ * @order: allocation order of this page
+ * @gfp_mask: allocation flags
+ * @magic: simple sanity check
+ */
+struct srog_head {
+ struct list_head list;
+ struct srog_free_area free;
+ unsigned int order;
+ gfp_t gfp_mask;
+ unsigned int magic;
+};
+
+#define ceiling_log2(x) fls_long((x) - 1)
+
+#define SROG_SIZE(srog) \
+ ((PAGE_SIZE << (srog)->order) - sizeof(struct srog_head))
+
+/**
+ * __srog_alloc - allocate a new page and initialize the head
+ * @size: minimum size that must fit
+ * @gfp_mask: allocation flags
+ */
+static struct srog_head *__srog_alloc(unsigned long size, gfp_t gfp_mask)
+{
+ unsigned long pages;
+ unsigned int order;
+ struct page *page;
+ struct srog_head *srog = NULL;
+
+ size += sizeof(struct srog_head);
+
+ pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ order = ceiling_log2(pages);
+
+ page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
+ if (!page)
+ goto out;
+
+ srog = (struct srog_head *)pfn_to_kaddr(page_to_pfn(page));
+
+ INIT_LIST_HEAD(&srog->list);
+ srog->order = order;
+ srog->free.next = (struct srog_free_area *)(srog + 1);
+ srog->free.size = 0; /* avoid back merge */
+ srog->free.next->next = NULL;
+ srog->free.next->size = SROG_SIZE(srog);
+ srog->gfp_mask = gfp_mask;
+ srog->magic = SROG_MAGIC;
+
+out:
+ return srog;
+}
+
+#define next_srog(s) list_entry((s)->list.next, typeof(*(s)), list)
+
+#define INC_PTR(p, a) ((typeof(p))(((void *)p) + a))
+
+/*
+ * Tricky macro; will go funny on higher order allocations.
+ */
+#define PTR_TO_SROG(p) ((struct srog_head *)((unsigned long)(p) & PAGE_MASK))
+
+/**
+ * srog_alloc - allocate an object from a SROG; possibly creating one
+ * @srogp: pointer to a SROG to use; NULL means create a new one
+ * @size: size of the object to allocate
+ * @gfp_mask: allocation flags to use on create
+ *
+ * Create a SROG when needed.
+ *
+ * Allocate an object of the requested size from the free space; if
+ * not enough free space is found, add another page to this SROG.
+ */
+void *srog_alloc(void *srogp, unsigned long size, gfp_t gfp_mask)
+{
+ struct srog_head *srog = PTR_TO_SROG(srogp);
+ struct srog_head *srog_iter;
+ void *obj = NULL;
+
+ /*
+ * Minimum size requirement, we need to store a free area
+ * descriptor in there. Increment size by one int to store
+ * the object size in, and make sure we stay aligned.
+ */
+ size = max(size, (unsigned long)sizeof(struct srog_free_area));
+ size += sizeof(unsigned int);
+ size = ALIGN(size, sizeof(void *));
+
+ if (!srog)
+ srog = __srog_alloc(size, gfp_mask);
+ if (!srog)
+ goto out;
+
+ BUG_ON(srog->magic != SROG_MAGIC);
+
+ srog_iter = srog;
+do_alloc:
+ do {
+ /*
+ * Walk the free block list, take the first block that
+ * is large enough to accommodate the requested size.
+ */
+ struct srog_free_area *sfa, *prev = &srog_iter->free;
+ for (sfa = prev->next; sfa; prev = sfa, sfa = sfa->next) {
+ if (sfa->size < size)
+ continue;
+
+ obj = (void *)sfa->obj;
+ sfa->size -= size;
+
+ /*
+ * Any remaining space is split into a new free block.
+ */
+ if (sfa->size) {
+ struct srog_free_area *new_sfa =
+ INC_PTR(sfa, size);
+ new_sfa->next = sfa->next;
+ new_sfa->size = sfa->size;
+ prev->next = new_sfa;
+ } else {
+ prev->next = sfa->next;
+ }
+
+ sfa->size = size;
+ goto out;
+ }
+
+ srog_iter = next_srog(srog_iter);
+ } while (srog_iter != srog);
+
+ if (!obj) {
+ struct srog_head *new_srog;
+ /*
+ * We cannot fail allocation when we just created a new SROG.
+ */
+ BUG_ON(!srogp);
+
+ new_srog = __srog_alloc(size, srog->gfp_mask);
+ list_add(&new_srog->list, &srog->list);
+ srog_iter = new_srog;
+ goto do_alloc;
+ }
+
+out:
+ return obj;
+}
+EXPORT_SYMBOL_GPL(srog_alloc);
+
+#define OBJ_SIZE_REF(o) (((unsigned int *)(o)) - 1)
+#define OBJ_SFA(o) ((struct srog_free_area *)OBJ_SIZE_REF(o))
+
+/**
+ * srog_free - free an object
+ * @srogp: pointer to a known object in the SROG; may be NULL.
+ * @obj: object to free.
+ */
+void srog_free(void *srogp, void *obj)
+{
+ struct srog_head *srog = PTR_TO_SROG(srogp);
+ struct srog_free_area *sfa, *prev, *new_sfa;
+
+ if (!srog)
+ srog = PTR_TO_SROG(obj);
+
+ BUG_ON(srog->magic != SROG_MAGIC);
+
+ /*
+ * Walk the page list when the object does not belong to this page.
+ */
+ if (((unsigned long)(obj - (void *)srog)) >
+ (PAGE_SIZE << srog->order)) {
+ struct srog_head *srog_iter;
+ list_for_each_entry(srog_iter, &srog->list, list) {
+ if (((unsigned long)(obj - (void *)srog_iter)) <=
+ (PAGE_SIZE << srog_iter->order)) {
+ srog = srog_iter;
+ break;
+ }
+ }
+ }
+
+ /*
+ * It doesn't seem to belong here at all!?
+ */
+ BUG_ON(((unsigned long)(obj - (void *)srog)) >
+ (PAGE_SIZE << srog->order));
+
+ /*
+ * Insertion sort: walk to the first free block above obj so
+ * the list stays in address order.
+ */
+ for (prev = &srog->free, sfa = prev->next;
+ sfa; prev = sfa, sfa = sfa->next) {
+ if ((void *)sfa > obj)
+ break;
+ }
+
+ new_sfa = OBJ_SFA(obj);
+ new_sfa->next = sfa;
+ prev->next = new_sfa;
+
+ /*
+ * Merge forward.
+ */
+ if (INC_PTR(new_sfa, new_sfa->size) == sfa) {
+ new_sfa->size += sfa->size;
+ new_sfa->next = sfa->next;
+ }
+
+ /*
+ * Merge backward.
+ */
+ if (INC_PTR(prev, prev->size) == new_sfa) {
+ prev->size += new_sfa->size;
+ prev->next = new_sfa->next;
+ }
+
+ /*
+ * If the page just became unused, free it.
+ */
+ if (srog->free.next->size == SROG_SIZE(srog)) {
+ list_del(&srog->list);
+ free_pages((unsigned long)srog, srog->order);
+ }
+}
+EXPORT_SYMBOL_GPL(srog_free);
+
+/**
+ * srog_link - link two separately allocated SROGs into one
+ * @srogp1: pointer into one SROG
+ * @srogp2: pointer into another SROG
+ */
+void srog_link(void *srogp1, void *srogp2)
+{
+ struct srog_head *srog1 = PTR_TO_SROG(srogp1);
+ struct srog_head *srog2 = PTR_TO_SROG(srogp2);
+
+ BUG_ON(srog1->magic != SROG_MAGIC);
+ BUG_ON(srog2->magic != SROG_MAGIC);
+
+ __list_splice(&srog2->list, &srog1->list);
+}
+EXPORT_SYMBOL_GPL(srog_link);

2006-08-12 14:16:36

by Peter Zijlstra

Subject: [RFC][PATCH 4/4] deadlock prevention for NBD



Use sk_set_memalloc() on the nbd socket.

Limit each request to 1 page, so that the request throttling also limits the
number of in-flight pages, and force the IO scheduler to NOOP since anything
else doesn't make sense anyway.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
block/elevator.c | 5 +++++
block/ll_rw_blk.c | 12 ++++++++++--
drivers/block/nbd.c | 12 +++++++++++-
include/linux/blkdev.h | 9 +++++++++
4 files changed, 35 insertions(+), 3 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c 2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c 2006-08-12 15:38:11.000000000 +0200
@@ -1899,6 +1899,14 @@ EXPORT_SYMBOL(blk_init_queue);
request_queue_t *
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
+ return blk_init_queue_node_elv(rfn, lock, node_id, NULL);
+}
+EXPORT_SYMBOL(blk_init_queue_node);
+
+request_queue_t *
+blk_init_queue_node_elv(request_fn_proc *rfn, spinlock_t *lock, int node_id,
+ char *elv_name)
+{
request_queue_t *q = blk_alloc_queue_node(GFP_KERNEL, node_id);

if (!q)
@@ -1939,7 +1947,7 @@ blk_init_queue_node(request_fn_proc *rfn
/*
* all done
*/
- if (!elevator_init(q, NULL)) {
+ if (!elevator_init(q, elv_name)) {
blk_queue_congestion_threshold(q);
return q;
}
@@ -1947,7 +1955,7 @@ blk_init_queue_node(request_fn_proc *rfn
blk_put_queue(q);
return NULL;
}
-EXPORT_SYMBOL(blk_init_queue_node);
+EXPORT_SYMBOL(blk_init_queue_node_elv);

int blk_get_queue(request_queue_t *q)
{
Index: linux-2.6/drivers/block/nbd.c
===================================================================
--- linux-2.6.orig/drivers/block/nbd.c 2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/drivers/block/nbd.c 2006-08-12 15:50:33.000000000 +0200
@@ -361,8 +361,13 @@ static void nbd_do_it(struct nbd_device

BUG_ON(lo->magic != LO_MAGIC);

+ if (sk_set_memalloc(lo->sock->sk))
+ printk(KERN_WARNING
+ "failed to set SO_MEMALLOC on NBD socket\n");
+
while ((req = nbd_read_stat(lo)) != NULL)
nbd_end_request(req);
+
return;
}

@@ -628,11 +633,16 @@ static int __init nbd_init(void)
* every gendisk to have its very own request_queue struct.
* These structs are big so we dynamically allocate them.
*/
- disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
+ disk->queue = blk_init_queue_node_elv(do_nbd_request,
+ &nbd_lock, -1, "noop");
if (!disk->queue) {
put_disk(disk);
goto out;
}
+ blk_queue_pin_elevator(disk->queue);
+ blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
+ blk_queue_max_hw_segments(disk->queue, 1);
+ blk_queue_max_phys_segments(disk->queue, 1);
}

if (register_blkdev(NBD_MAJOR, "nbd")) {
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h 2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/include/linux/blkdev.h 2006-08-12 15:38:11.000000000 +0200
@@ -444,6 +444,12 @@ struct request_queue
#define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */
#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */
#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_ELVPINNED 9 /* pin the current elevator */
+
+static inline void blk_queue_pin_elevator(struct request_queue *q)
+{
+ set_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags);
+}

enum {
/*
@@ -696,6 +702,9 @@ static inline void elv_dispatch_add_tail
/*
* Access functions for manipulating queue properties
*/
+extern request_queue_t *blk_init_queue_node_elv(request_fn_proc *rfn,
+ spinlock_t *lock, int node_id,
+ char *elv_name);
extern request_queue_t *blk_init_queue_node(request_fn_proc *rfn,
spinlock_t *lock, int node_id);
extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c 2006-08-12 15:38:01.000000000 +0200
+++ linux-2.6/block/elevator.c 2006-08-12 15:38:11.000000000 +0200
@@ -861,6 +861,11 @@ ssize_t elv_iosched_store(request_queue_
size_t len;
struct elevator_type *e;

+ if (test_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags)) {
+ printk(KERN_ERR "elevator: cannot switch elevator, pinned\n");
+ return count;
+ }
+
elevator_name[sizeof(elevator_name) - 1] = '\0';
strncpy(elevator_name, name, sizeof(elevator_name) - 1);
len = strlen(elevator_name);

2006-08-12 14:42:09

by Jeff Garzik

Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

Peter Zijlstra wrote:
> Index: linux-2.6/include/linux/gfp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/gfp.h 2006-08-12 12:56:06.000000000 +0200
> +++ linux-2.6/include/linux/gfp.h 2006-08-12 12:56:09.000000000 +0200
> @@ -46,6 +46,7 @@ struct vm_area_struct;
> #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
> #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> #define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> +#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */

This symbol name has nothing to do with its purpose. The entire area of
code you are modifying could be described as having something to do with
'memalloc'.

GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
symbol name.

I recognize that it matches GFP_NOMEMALLOC, but that doesn't change
the situation anyway. In fact, a cleanup patch to rename GFP_NOMEMALLOC
would be nice.

Jeff


2006-08-12 15:07:31

by Peter Zijlstra

Subject: rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core)

On Sat, 2006-08-12 at 10:41 -0400, Jeff Garzik wrote:
> Peter Zijlstra wrote:
> > Index: linux-2.6/include/linux/gfp.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/gfp.h 2006-08-12 12:56:06.000000000 +0200
> > +++ linux-2.6/include/linux/gfp.h 2006-08-12 12:56:09.000000000 +0200
> > @@ -46,6 +46,7 @@ struct vm_area_struct;
> > #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
> > #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> > #define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> > +#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */
>
> This symbol name has nothing to do with its purpose. The entire area of
> code you are modifying could be described as having something to do with
> 'memalloc'.
>
> GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
> symbol name.
>
> I recognize that it matches GFP_NOMEMALLOC, but that doesn't change
> the situation anyway. In fact, a cleanup patch to rename GFP_NOMEMALLOC
> would be nice.

I'm rather bad at picking names, but here goes:

PF_MEMALLOC -> PF_EMERGALLOC
__GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
__GFP_MEMALLOC -> __GFP_EMERGALLOC

Is that suitable and shall I prepare patches? Or do we want more ppl to
chime in and have a few more rounds?

2006-08-12 15:29:08

by Indan Zupancic

[permalink] [raw]
Subject: Re: rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core)

On Sat, August 12, 2006 17:06, Peter Zijlstra said:
> On Sat, 2006-08-12 at 10:41 -0400, Jeff Garzik wrote:
>> Peter Zijlstra wrote:
>> > Index: linux-2.6/include/linux/gfp.h
>> > ===================================================================
>> > --- linux-2.6.orig/include/linux/gfp.h 2006-08-12 12:56:06.000000000 +0200
>> > +++ linux-2.6/include/linux/gfp.h 2006-08-12 12:56:09.000000000 +0200
>> > @@ -46,6 +46,7 @@ struct vm_area_struct;
>> > #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
>> > #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
>> > #define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
>> > +#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */
>>
>> This symbol name has nothing to do with its purpose. The entire area of
>> code you are modifying could be described as having something to do with
>> 'memalloc'.
>>
>> GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
>> symbol name.
>>
>> I recognize that it matches GFP_NOMEMALLOC, but that doesn't change
>> the situation anyway. In fact, a cleanup patch to rename GFP_NOMEMALLOC
>> would be nice.
>
> I'm rather bad at picking names, but here goes:
>
> PF_MEMALLOC -> PF_EMERGALLOC
> __GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
> __GFP_MEMALLOC -> __GFP_EMERGALLOC
>
> Is that suitable and shall I prepare patches? Or do we want more ppl to
> chime in and have a few more rounds?

Pardon my ignorance, but if we're doing cleanup anyway, why not use only one flag instead of two?
Why is __GFP_NOMEMALLOC needed when not setting __GFP_MEMALLOC could mean the same? Or else what
is the expected behaviour if both flags are set?


2006-08-12 15:35:26

by Peter Zijlstra

Subject: Re: rename *MEMALLOC flags (was: Re: [RFC][PATCH 3/4] deadlock prevention core)

On Sat, 2006-08-12 at 17:28 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 17:06, Peter Zijlstra said:
> > On Sat, 2006-08-12 at 10:41 -0400, Jeff Garzik wrote:
> >> Peter Zijlstra wrote:
> >> > Index: linux-2.6/include/linux/gfp.h
> >> > ===================================================================
> >> > --- linux-2.6.orig/include/linux/gfp.h 2006-08-12 12:56:06.000000000 +0200
> >> > +++ linux-2.6/include/linux/gfp.h 2006-08-12 12:56:09.000000000 +0200
> >> > @@ -46,6 +46,7 @@ struct vm_area_struct;
> >> > #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
> >> > #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
> >> > #define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
> >> > +#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */
> >>
> >> This symbol name has nothing to do with its purpose. The entire area of
> >> code you are modifying could be described as having something to do with
> >> 'memalloc'.
> >>
> >> GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
> >> symbol name.
> >>
> >> I recognize that it matches GFP_NOMEMALLOC, but that doesn't change
> >> the situation anyway. In fact, a cleanup patch to rename GFP_NOMEMALLOC
> >> would be nice.
> >
> > I'm rather bad at picking names, but here goes:
> >
> > PF_MEMALLOC -> PF_EMERGALLOC
> > __GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
> > __GFP_MEMALLOC -> __GFP_EMERGALLOC
SOCK_MEMALLOC -> SOCK_EMERGALLOC
> >
> > Is that suitable and shall I prepare patches? Or do we want more ppl to
> > chime in and have a few more rounds?
>
> Pardon my ignorance, but if we're doing cleanup anyway, why not use only one flag instead of two?
> Why is __GFP_NOMEMALLOC needed when not setting __GFP_MEMALLOC could mean the same? Or else what
> is the expected behaviour if both flags are set?

__GFP_NOMEMALLOC is most authoritative; its use is (afaik) to negate
PF_MEMALLOC.

I agree that having both seems odd, but I haven't spent any significant
time on trying to find a 'nicer' solution.
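
For reference, the precedence as implemented by the page_alloc.c hunk in
patch 3 boils down to something like the sketch below (a paraphrase, not
the literal kernel code; TIF_MEMDIE is treated like PF_MEMALLOC and
omitted here):

static int may_use_reserves(gfp_t gfp_mask)
{
        /* __GFP_NOMEMALLOC always wins... */
        if (gfp_mask & __GFP_NOMEMALLOC)
                return 0;
        /* ...otherwise the per-call flag grants access, even in IRQ context... */
        if (gfp_mask & __GFP_MEMALLOC)
                return 1;
        /* ...as does the process-wide flag, outside interrupts. */
        return (current->flags & PF_MEMALLOC) && !in_interrupt();
}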

2006-08-12 16:51:37

by Indan Zupancic

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> Hi,
>
> here is the latest effort; it includes a whole new trivial allocator with a
> horrid name and an almost full rewrite of the deadlock prevention core.
> This version does not do anything per device and hence does not depend
> on the new netdev_alloc_skb() API.
>
> The reason to add a second allocator to the receive side is twofold:
> 1) it allows easy detection of the memory pressure / OOM situation;
> 2) it allows the receive path to be unbounded and go at full speed when
> resources permit.
>
> The choice of using the global memalloc reserve as a mempool means that
> the new allocator has to release pages as soon as possible; if we were
> to hoard pages in the allocator, the memalloc reserve would not be
> replenished readily.

Version 2 had about 250 new lines of code, while v3 has close to 600, when
including the SROG code. And that while things should have become simpler.
So why use SROG instead of the old alloc_pages() based code? And why couldn't
you use a slightly modified SLOB instead of writing a new allocator?
It looks like overkill to me.

Greetings,

Indan


2006-08-12 17:31:17

by Indan Zupancic

Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> +struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
> +{
> + struct sk_buff *skb;
> +
> + skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);
> +
> + if (!skb && (gfp_mask & __GFP_MEMALLOC) && memalloc_skbs_available())
> + skb = ___alloc_skb(size, gfp_mask, fclone);
> +
> + return skb;
> +}
> +

I'd drop the memalloc_skbs_available() check, as that's already done by
___alloc_skb.

> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int memalloc_socks;
> +static unsigned long memalloc_reserve;

Why is this a long? adjust_memalloc_reserve() takes an int.
Is it needed at all, considering var_free_kbytes already exists?

Greetings,

Indan


2006-08-12 17:34:25

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

On Sat, 2006-08-12 at 18:51 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> > Hi,
> >
> > here is the latest effort; it includes a whole new trivial allocator with a
> > horrid name and an almost full rewrite of the deadlock prevention core.
> > This version does not do anything per device and hence does not depend
> > on the new netdev_alloc_skb() API.
> >
> > The reason to add a second allocator to the receive side is twofold:
> > 1) it allows easy detection of the memory pressure / OOM situation;
> > 2) it allows the receive path to be unbounded and go at full speed when
> > resources permit.
> >
> > The choice of using the global memalloc reserve as a mempool means that
> > the new allocator has to release pages as soon as possible; if we were
> > to hoard pages in the allocator, the memalloc reserve would not be
> > replenished readily.
>
> Version 2 had about 250 new lines of code, while v3 has close to 600, when
> including the SROG code. And that while things should have become simpler.
> So why use SROG instead of the old alloc_pages() based code? And why couldn't
> you use a slightly modified SLOB instead of writing a new allocator?
> It looks like overkill to me.

Of the 611 new lines about 150 are new comments.

Simpler yes, but also more complete; the old patches had serious issues
with the alternative allocation scheme.

As for why SROG: trying to stick all the semantics needed for
all skb operations into the old approach was nasty; I had it almost
complete but it was horror (and more code than the SROG approach).

Why not SLOB? Well, I mistakenly assumed that it was a simpler SLAB
allocator, my bad. However, after having had a quick peek at it: while
it seems similar in intent, it does not provide the things I look for.
Making it do so, keeping the old semantics, and making it compile
alongside SLAB would take quite some effort.





2006-08-12 17:45:58

by Peter Zijlstra

Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

On Sat, 2006-08-12 at 19:31 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 16:14, Peter Zijlstra said:
> > +struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, int fclone)
> > +{
> > + struct sk_buff *skb;
> > +
> > + skb = ___alloc_skb(size, gfp_mask & ~__GFP_MEMALLOC, fclone);
> > +
> > + if (!skb && (gfp_mask & __GFP_MEMALLOC) && memalloc_skbs_available())
> > + skb = ___alloc_skb(size, gfp_mask, fclone);
> > +
> > + return skb;
> > +}
> > +
>
> I'd drop the memalloc_skbs_available() check, as that's already done by
> ___alloc_skb.

Right, thanks. Hmm, it's the last occurrence of that function; even
better.

> > +static DEFINE_SPINLOCK(memalloc_lock);
> > +static int memalloc_socks;
> > +static unsigned long memalloc_reserve;
>
> Why is this a long? adjust_memalloc_reserve() takes an int.

Euhm, right :-) long comes naturally when I think about quantities of
pages. The adjust_memalloc_reserve() argument is an increment, a delta;
perhaps I should change that to long.

> Is it needed at all, considering var_free_kbytes already exists?

Having them separate would allow adjust_memalloc_reserve() to be used by
other callers too (would need some extra locking).

2006-08-12 17:54:53

by Indan Zupancic

Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

On Sat, August 12, 2006 19:44, Peter Zijlstra said:
> Euhm, right :-) long comes naturally when I think about quantities of
> pages. The adjust_memalloc_reserve() argument is an increment, a delta;
> perhaps I should change that to long.

Maybe, but having 16 TB of reserved memory seems plenty for a while.

> Having them separate would allow adjust_memalloc_reserve() to be used by
> other callers too (would need some extra locking).

True, but currently memalloc_reserve isn't used in a sensible way,
or I'm missing something.

Greetings,

Indan


2006-08-12 18:09:30

by Peter Zijlstra

Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

On Sat, 2006-08-12 at 19:54 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 19:44, Peter Zijlstra said:
> > Euhm, right :-) long comes naturally when I think about quantities of
> > pages. The adjust_memalloc_reserve() argument is an increment, a delta;
> > perhaps I should change that to long.
>
> Maybe, but having 16 TB of reserved memory seems plenty for a while.

Oh, for sure, but since it doesn't really matter all that much, I'd
rather go for proper.

> > Having them separate would allow adjust_memalloc_reserve() to be used by
> > other callers too (would need some extra locking).
>
> True, but currently memalloc_reserve isn't used in a sensible way,
> or I'm missing something.

Well, I'm somewhat reluctant to stick network related code into mm/, it
seems well separated now.


2006-08-12 18:17:10

by Indan Zupancic

Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

On Sat, August 12, 2006 19:33, Peter Zijlstra said:
> Simpler yes, but also more complete; the old patches had serious issues
> with the alternative allocation scheme.

It sure is more complete, and looks nicer, but the price is IMHO too high.
I'm curious what those serious issues are, and if they can't be fixed.

> As for why SROG, because trying to stick all the semantics needed for
> all skb operations into the old approach was nasty, I had it almost
> complete but it was horror (and more code than the SROG approach).

What was missing or wrong in the old approach? Can't you use the new
approach, but use alloc_pages() instead of SROG?

Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)

Greetings,

Indan


2006-08-12 18:32:28

by Indan Zupancic

Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

On Sat, August 12, 2006 20:08, Peter Zijlstra said:
> On Sat, 2006-08-12 at 19:54 +0200, Indan Zupancic wrote:
>> True, but currently memalloc_reserve isn't used in a sensible way,
>> or I'm missing something.
>
> Well, I'm somewhat reluctant to stick network related code into mm/, it
> seems well separated now.

What I had in mind was something like:

+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+int sk_adjust_memalloc(int nr_socks)
+{
+ unsigned long flags;
+ unsigned int reserve;
+ int err;
+
+ spin_lock_irqsave(&memalloc_lock, flags);
+
+ memalloc_socks += nr_socks;
+ BUG_ON(memalloc_socks < 0);
+
+ reserve = nr_socks * (2 * MAX_PHYS_SEGMENTS + /* outbound */
+ 5 * MAX_CONCURRENT_SKBS); /* inbound */
+
+ err = adjust_memalloc_reserve(reserve);
+ spin_unlock_irqrestore(&memalloc_lock, flags);
+ if (err) {
+ printk(KERN_WARNING
+ "Unable to change RX reserve to: %lu, error: %d\n",
+ reserve, err);
+ }
+ return err;
+}

The original code missed the brackets, so 5 * MAX_CONCURRENT_SKBS wasn't done
per socket. But the comment said it was per socket, so I added them in this version.

Greetings,

Indan


2006-08-12 18:48:04

by Peter Zijlstra

Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

On Sat, 2006-08-12 at 20:32 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 20:08, Peter Zijlstra said:
> > On Sat, 2006-08-12 at 19:54 +0200, Indan Zupancic wrote:
> >> True, but currently memalloc_reserve isn't used in a sensible way,
> >> or I'm missing something.
> >
> > Well, I'm somewhat reluctant to stick network related code into mm/, it
> > seems well separated now.
>
> What I had in mind was something like:
>
> +static DEFINE_SPINLOCK(memalloc_lock);
> +static int memalloc_socks;
> +
> +atomic_t memalloc_skbs_used;
> +EXPORT_SYMBOL_GPL(memalloc_skbs_used);
> +
> +int sk_adjust_memalloc(int nr_socks)
> +{
> + unsigned long flags;
> + unsigned int reserve;
> + int err;
> +
> + spin_lock_irqsave(&memalloc_lock, flags);
> +
> + memalloc_socks += nr_socks;
> + BUG_ON(memalloc_socks < 0);
> +
> + reserve = nr_socks * (2 * MAX_PHYS_SEGMENTS + /* outbound */
> + 5 * MAX_CONCURRENT_SKBS); /* inbound */
> +
> + err = adjust_memalloc_reserve(reserve);
> + spin_unlock_irqrestore(&memalloc_lock, flags);
> + if (err) {
> + printk(KERN_WARNING
> + "Unable to change RX reserve to: %lu, error: %d\n",
> + reserve, err);
> + }
> + return err;
> +}
>
> The original code missed the brackets, so 5 * MAX_CONCURRENT_SKBS wasn't done
> per socket. But the comment said it was per socket, so I added them in this version.

Ah right, I did that in v3, with a similar comment, but I realised that
the inbound reserve need not be per socket, and that the comment was
ambiguous enough to allow this reading.

Why not per socket? It's used to place an upper bound; it's not
calculating one, it's setting one.

As you can see in memalloc_skbs_try_inc(), it limits to
MAX_CONCURRENT_SKBS.

2006-08-12 18:54:57

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

On Sat, 2006-08-12 at 20:16 +0200, Indan Zupancic wrote:
> On Sat, August 12, 2006 19:33, Peter Zijlstra said:
> > Simpler yes, but also more complete; the old patches had serious issues
> > with the alternative allocation scheme.
>
> It sure is more complete, and looks nicer, but the price is IMHO too high.
> I'm curious what those serious issues are, and if they can't be fixed.
>
> > As for why SROG, because trying to stick all the semantics needed for
> > all skb operations into the old approach was nasty, I had it almost
> > complete but it was horror (and more code than the SROG approach).
>
> What was missing or wrong in the old approach? Can't you use the new
> approach, but use alloc_pages() instead of SROG?
>
> Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)

I'm almost sorry I threw that code out; you'd understand instantly...

Lemme see what I can do to explain; what I need/want is:
- single allocation group per packet - that is, when I free a packet
and all its associated objects I get my memory back.
- not waste too much space managing the various objects

skb operations want to allocate various sk_buffs for the same data:
clones. Also, they want to be able to break the COW or realloc the data.

The trivial approach would be one page (or higher-order page) per
object, and that will work quite well, except that it'll waste a _lot_
of memory.

So I tried manual packing (parts of that you have seen in previous
attempts). This gets hard when you want to do unlimited clones and COW
breaks. To do either you need to link several pages.

So needing a list of pages and wanting packing gave me SROG. The biggest
wart is having to deal with higher order pages. Explicitly coding in
knowledge of the object you're packing just makes the code bigger - such
is the power of abstraction.

2006-08-12 19:45:12

by Indan Zupancic

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/4] deadlock prevention core

On Sat, August 12, 2006 20:47, Peter Zijlstra said:
> Ah right, I did that in v3, with a similar comment, but I realised that
> the inbound reserve need not be per socket, and that the comment was
> ambiguous enough to allow this reading.

True, but better to change the comment than to confuse people.
Lots of it is outdated because reservations aren't per device anymore.

Changes to your version:
- Got rid of memalloc_socks.
- Don't include inetdevice.h (it isn't needed anymore, right?)
- Updated comment.

(I'm editing the diff, so this won't apply)

Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c 2006-08-12 12:56:06.000000000 +0200
+++ linux-2.6/net/core/sock.c 2006-08-12 13:02:59.000000000 +0200
@@ -111,6 +111,8 @@
#include <linux/poll.h>
#include <linux/tcp.h>
#include <linux/init.h>
+#include <linux/blkdev.h>

#include <asm/uaccess.h>
#include <asm/system.h>
@@ -195,6 +197,78 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);

+static DEFINE_SPINLOCK(memalloc_lock);
+static int memalloc_socks;
+static unsigned long memalloc_reserve;
+
+atomic_t memalloc_skbs_used;
+EXPORT_SYMBOL_GPL(memalloc_skbs_used);
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve
+ * @nr_socks: number of new %SOCK_MEMALLOC sockets
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * For each %SOCK_MEMALLOC socket 2 * %MAX_PHYS_SEGMENTS pages are
+ * reserved for outbound traffic (assumption: each %SOCK_MEMALLOC
+ * socket will have a %request_queue associated).
+ *
+ * Pages for inbound traffic are already reserved.
+ *
+ * 2 * %MAX_PHYS_SEGMENTS - the request queue can hold up to 150%,
+ * the remaining 50% goes to being sure we can write packets
+ * for the outgoing pages.
+ */
+int sk_adjust_memalloc(int nr_socks)
+{
+ unsigned long flags;
+ unsigned int reserve;
+ int err;
+
+ spin_lock_irqsave(&memalloc_lock, flags);
+
+ memalloc_socks += nr_socks;
+ BUG_ON(memalloc_socks < 0);
+
+ reserve = nr_socks * 2 * MAX_PHYS_SEGMENTS; /* outbound */
+
+ err = adjust_memalloc_reserve(reserve);
+ spin_unlock_irqrestore(&memalloc_lock, flags);
+ if (err) {
+ printk(KERN_WARNING
+ "Unable to adjust RX reserve by %lu, error: %d\n",
+ reserve, err);
+ }
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_adjust_memalloc);

What's missing now is an adjust_memalloc_reserve(5 * MAX_CONCURRENT_SKBS)
call in some init code.
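
Something like this, presumably (a sketch only; the initcall level is a
guess, and adjust_memalloc_reserve()/MAX_CONCURRENT_SKBS are the ones
from your patches):

static int __init memalloc_reserve_init(void)
{
	/* reserve room for the inbound skbs up front */
	return adjust_memalloc_reserve(5 * MAX_CONCURRENT_SKBS);
}
core_initcall(memalloc_reserve_init);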

Greetings,

Indan


2006-08-12 20:06:06

by Indan Zupancic

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

On Sat, August 12, 2006 20:54, Peter Zijlstra said:
> - single allocation group per packet - that is, when I free a packet
> and all its associated objects I get my memory back.

This is easy.

> - not waste too much space managing the various objects

This too, when ignoring clones and COW.

> skb operations want to allocate various sk_buffs for the same data:
> clones. Also, they want to be able to break the COW or realloc the data.

So this seems to be what adds all the complexity.

> So I tried manual packing (parts of that you have seen in previous
> attempts). This gets hard when you want to do unlimited clones and COW
> breaks. To do either you need to link several pages.

It gets messy quite quickly, yes.

> So needing a list of pages and wanting packing gave me SROG. The biggest
> wart is having to deal with higher order pages. Explicitly coding in
> knowledge of the object you're packing just makes the code bigger - such
> is the power of abstraction.

I assume you meant "Not explicitly coding in", or else I'm tempted to disagree.
An abstraction with only one user, used in only one way, only adds bloat.
But looking at the code a bit more I'm afraid you're right.

Greetings,

Indan


2006-08-14 00:07:58

by Daniel Phillips

[permalink] [raw]
Subject: Re: rename *MEMALLOC flags

Peter Zijlstra wrote:
>Jeff Garzik in his infinite wisdom spake thusly:
>>Peter Zijlstra wrote:
>>
>>>Index: linux-2.6/include/linux/gfp.h
>>>===================================================================
>>>--- linux-2.6.orig/include/linux/gfp.h 2006-08-12 12:56:06.000000000 +0200
>>>+++ linux-2.6/include/linux/gfp.h 2006-08-12 12:56:09.000000000 +0200
>>>@@ -46,6 +46,7 @@ struct vm_area_struct;
>>> #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
>>> #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
>>> #define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
>>>+#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */
>>
>>This symbol name has nothing to do with its purpose. The entire area of
>>code you are modifying could be described as having something to do with
>>'memalloc'.
>>
>>GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better
>>symbol name.
>>
>>I recognize that it matches GFP_NOMEMALLOC, but that doesn't change
>>the situation anyway. In fact, a cleanup patch to rename GFP_NOMEMALLOC
>>would be nice.
>
> I'm rather bad at picking names, but here goes:
>
> PF_MEMALLOC -> PF_EMERGALLOC
> __GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
> __GFP_MEMALLOC -> __GFP_EMERGALLOC
>
> Is that suitable and shall I prepare patches? Or do we want more ppl to
> chime in and have a few more rounds?

MEMALLOC is the name Linus chose for exactly the reserve from which we
are allocating. Perhaps that was just Linus being denser than jgarzik and
not realizing that he should have called it EMERGALLOC right from the start.

BUT since Linus did call it MEMALLOC, we should too. Or just email Linus
and tell him how much better EMERGALLOC rolls off the tongue, and could we
please change all occurrences of MEMALLOC to EMERGALLOC. Then don't read
your email for a week ;-)

Inventing a new name for an existing thing is very poor taste on grounds of
grepability alone.

Regards,

Daniel

2006-08-14 00:43:11

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

Peter Zijlstra wrote:
> On Sat, 2006-08-12 at 20:16 +0200, Indan Zupancic wrote:
>>What was missing or wrong in the old approach? Can't you use the new
>>approach, but use alloc_pages() instead of SROG?
>>
>>Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)
>
> I'm almost sorry I threw that code out...

Good instinct :-)

> Lemme see what I can do to explain; what I need/want is:
> - single allocation group per packet - that is, when I free a packet
> and all its associated objects I get my memory back.

First, try to recast all your objects as pages, as Evgeniy Polyakov
suggested. Then if there is some place where that just doesn't work
(please point it out), put a mempool there and tweak the high-level
reservation setup accordingly.
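
For the odd object that really cannot be recast as a page, the mempool
route is nearly a one-liner; a hypothetical sketch (names made up):

#include <linux/mempool.h>
#include <linux/slab.h>

/* my_object_cache would be the slab cache of whatever
 * object refuses to be a page */
static kmem_cache_t *my_object_cache;
static mempool_t *my_pool;

static int my_pool_setup(void)
{
	/* guarantee a minimum of 16 objects, even under pressure */
	my_pool = mempool_create(16, mempool_alloc_slab,
				 mempool_free_slab, my_object_cache);
	return my_pool ? 0 : -ENOMEM;
}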

> - not waste too much space managing the various objects

If we waste a little space only where the network would have otherwise
dropped a packet, that is still a big win. We just need to be sure the
normal path does not become more wasteful.

> skb operations want to allocate various sk_buffs for the same data:
> clones. Also, they want to be able to break the COW or realloc the data.
>
> The trivial approach would be one page (or higher alloc page) per
> object, and that will work quite well, except that it'll waste a _lot_
> of memory.

High order allocations are just way too undependable without active
defragmentation, which isn't even on the horizon at the moment. We
just need to treat any network hardware that can't scatter/gather into
single pages as too broken to use for network block io.

As for sk_buff cow break, we need to look at which network paths do it
(netfilter obviously, probably others) and decide whether we just want
to declare that the feature breaks network block IO, or fix the feature
so it plays well with reserve accounting.
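
The pattern in question looks roughly like this (a simplified sketch of
a netfilter-style writer, not any actual in-tree path; skb_cow()
reallocates the data when it is shared or cloned, and on a SOCK_MEMALLOC
path that allocation would have to come out of the reserve):

#include <linux/skbuff.h>

/* hypothetical helper, for illustration only */
static int mangle_packet(struct sk_buff *skb)
{
	if (skb_cow(skb, 0))	/* COW break: get a private copy */
		return -ENOMEM;
	/* now safe to scribble on skb->data */
	return 0;
}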

> So I tried manual packing (parts of that you have seen in previous
> attempts). This gets hard when you want to do unlimited clones and COW
> breaks. To do either you need to link several pages.

You need to avoid the temptation to fix the entire world on the first
attempt. Sure, there will be people who call for gobs of overengineering
right from the start, but simple has always worked better for Linux than
lathering on layers of complexity just to support some feature that may
arguably be broken by design. For example, swapping through a firewall.

Changing from per-interface to a global block IO reserve was a great
simplification; we need more of those.

Looking forward to -v5 ;-)

Regards,

Daniel

2006-08-14 01:01:32

by Paul Jackson

[permalink] [raw]
Subject: Re: rename *MEMALLOC flags

Daniel wrote:
> Inventing a new name for an existing thing is very poor taste on grounds of
> grepability alone.

I wouldn't say 'very poor taste' -- just something that should be
done infrequently, with good reason, and with reasonable consensus,
especially from the key maintainers in the affected area.

Good names are good taste, in my book. But stable naming is good too.

I wonder what Nick thinks of this? Looks like he added
__GFP_NOMEMALLOC a year ago, following the naming style of PF_MEMALLOC.

I added him to the cc list.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-14 03:43:39

by Nick Piggin

[permalink] [raw]
Subject: Re: rename *MEMALLOC flags

Paul Jackson wrote:
> Daniel wrote:
>
>>Inventing a new name for an existing thing is very poor taste on grounds of
>>grepability alone.
>
>
> I wouldn't say 'very poor taste' -- just something that should be
> done infrequently, with good reason, and with reasonable concensus,
> especially from the key maintainers in the affected area.
>
> Good names are good taste, in my book. But stable naming is good too.
>
> I wonder what Nick thinks of this? Looks like he added
> __GFP_NOMEMALLOC a year ago, following the naming style of PF_MEMALLOC.
>
> I added him to the cc list.
>

__GFP_NOMEMALLOC was added to prevent mempool-backed allocations from
accessing the emergency reserve, because that would just shift deadlocks
from mempool "safe" sites to those which have not been converted.

PF_MEMALLOC is a good name: PF_MEMALLOC says that the task is currently
allocating memory. It does not say anything about the actual allocator
implementation details to handle this (1. don't recurse into reclaim; 2.
allow access to reserves), but that is a good thing.

__GFP_NOMEMALLOC and __GFP_MEMALLOC are poorly named (I take the blame).
It isn't that the task is suddenly no longer allocating in the context
of an allocation, it is just that you want to allow or deny access to
the reserve.

__GFP_NOMEMALLOC should be something like __GFP_EMERG_NEVER and
__GFP_MEMALLOC should be _ALWAYS. Or something like that.

NOMEMALLOC is specific enough that I don't mind a rename at this stage.
Renaming PF_MEMALLOC would be wrong, however.
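
Concretely, keeping the bit values from Peter's patch, the rename would
be little more than (a proposal, not a patch):

#define __GFP_EMERG_NEVER  ((__force gfp_t)0x10000u) /* was __GFP_NOMEMALLOC */
#define __GFP_EMERG_ALWAYS ((__force gfp_t)0x40000u) /* was __GFP_MEMALLOC */

plus a tree-wide rename of the users.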

--
SUSE Labs, Novell Inc.

2006-08-14 05:21:12

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

On Sun, Aug 13, 2006 at 05:42:47PM -0700, Daniel Phillips ([email protected]) wrote:
> High order allocations are just way too undependable without active
> defragmentation, which isn't even on the horizon at the moment. We
> just need to treat any network hardware that can't scatter/gather into
> single pages as too broken to use for network block io.

A bit of free advertisement for the network tree allocator: per-CPU
self-defragmentation works reliably in that allocator, and one can even
find graphs of memory usage for NTA and a SLAB-like allocator.

> As for sk_buff cow break, we need to look at which network paths do it
> (netfilter obviously, probably others) and decide whether we just want
> to declare that the feature breaks network block IO, or fix the feature
> so it plays well with reserve accounting.

I would suggest considering skb COW (cloning) a must.


> Regards,
>
> Daniel

--
Evgeniy Polyakov

2006-08-14 12:21:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

Evgeniy Polyakov wrote:
> On Sun, Aug 13, 2006 at 05:42:47PM -0700, Daniel Phillips ([email protected]) wrote:

>> As for sk_buff cow break, we need to look at which network paths do it
>> (netfilter obviously, probably others) and decide whether we just want
>> to declare that the feature breaks network block IO, or fix the feature
>> so it plays well with reserve accounting.
>
> I would suggest considering skb COW (cloning) a must.

That should not be any problem, since skb's (including cowed ones)
are short lived anyway. Allocating a little bit more memory is
fine when we have a guarantee that the memory will be freed again
shortly.

--
What is important? What you want to be true, or what is true?

2006-08-14 12:52:20

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

Rik van Riel <[email protected]> wrote:
>
> That should not be any problem, since skb's (including cowed ones)
> are short lived anyway. Allocating a little bit more memory is
> fine when we have a guarantee that the memory will be freed again
> shortly.

I'm not sure about the context the comment applies to, but skb's are
not necessarily short-lived. For example, they could be queued for
a few seconds for ARP/NDISC and even longer for IPsec SA resolution.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2006-08-14 14:23:07

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

Herbert Xu wrote:
> Rik van Riel <[email protected]> wrote:
>> That should not be any problem, since skb's (including cowed ones)
>> are short lived anyway. Allocating a little bit more memory is
>> fine when we have a guarantee that the memory will be freed again
>> shortly.
>
> I'm not sure about the context the comment applies to, but skb's are
> not necessarily short-lived. For example, they could be queued for
> a few seconds for ARP/NDISC and even longer for IPsec SA resolution.

That's still below the threshold where it should cause problems
with the VM going OOM. Especially if there aren't too many of
these packets.

--
All Rights Reversed

2006-08-24 14:43:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/4] deadlock prevention for NBD

Hi!

> Limit each request to 1 page, so that the request throttling also limits the
> number of in-flight pages, and force the IO scheduler to NOOP, as anything
> else doesn't make sense anyway.

I'd like to understand why it breaks with other schedulers before
merging this. Maybe the failure in NOOP is just harder to trigger?

Pavel

--
Thanks for all the (sleeping) penguins.