2006-08-14 11:04:31

by Evgeniy Polyakov

Subject: [PATCH 1/1] network memory allocator.

Hello.

Network tree allocator can be used to allocate memory for all network
operations from any context. Its main design features are:
* reduced fragmentation (self-defragmentation)
* support for zero-copy sending and receiving (groundwork for userspace
netchannels and high-performance sniffers)
* higher speed than SLAB
* fully per-CPU allocation and freeing (objects are never freed on a
different CPU)
* separation of network allocations from the main system's ones

Design notes.
The original idea was to store the meta information used for allocation in
an AVL tree [1], but since I found a way to use some "unused" fields in
struct page, the tree is not used by the allocator anymore.

Terms.
node - container for per-page (of any order) meta information.
chunk - region of free or allocated memory.

Each allocation is aligned to the minimum allocation size (32 bytes, though
it should really be the L1 cache line size).
Each node contains a bitmask of used/unused chunks of memory for the memory
region bound to that node. Each free chunk is placed into a doubly linked
list inside an array indexed by size (divided by the minimum allocation size).
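
A minimal sketch of the size-to-index mapping implied above (the constant
and helper name are illustrative, not taken from the patch):

#define NTA_MIN_SIZE	32	/* minimum allocation unit, as described above */

static inline unsigned int nta_size_to_index(unsigned int size)
{
	/* round up to the minimum allocation size, then index by unit count */
	unsigned int aligned = (size + NTA_MIN_SIZE - 1) & ~(NTA_MIN_SIZE - 1);

	return aligned / NTA_MIN_SIZE - 1;
}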

Allocation algorithm.
When the user requests a memory region, its size is rounded up to the
minimum allocation size (instead of a power of two as in SLAB) and the
appropriate entry in the array of free-chunk lists is selected. If that
list contains free elements, the first one is returned; otherwise the next
list, for a bigger size, is checked, until a non-empty list is found.
The allocator then determines the node for that chunk (using fields in the
corresponding struct page->lru), and the bits corresponding to the selected
chunk are marked as used. If the chunk is bigger than requested, the rest
is placed back into the list for the appropriate size.
Thus the allocator returns memory aligned to the minimum allocation size.
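
A hedged sketch of that lookup (the function name is made up; struct
avl_container and its fields are the ones from the patch below, and the
bitmask update/split step is only indicated by a comment):

static void *nta_alloc_sketch(struct list_head *free_lists,
			      unsigned int nr_lists, unsigned int size)
{
	/* index of the smallest list that can satisfy the request */
	unsigned int units = (size + 31) / 32;
	unsigned int i;

	for (i = units - 1; i < nr_lists; ++i) {
		struct list_head *head = &free_lists[i];
		struct avl_container *c;

		if (list_empty(head))
			continue;

		c = list_entry(head->next, struct avl_container, centry);
		list_del(&c->centry);
		/*
		 * Here the owning node's bitmask gets 'units' bits cleared,
		 * and if the chunk spans i+1 units the remaining tail is put
		 * back on free_lists[i - units].
		 */
		return c->ptr;
	}

	return NULL;
}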

Freeing algorithm.
When the user frees a chunk of memory, the appropriate node and allocation
CPU are determined for the given memory region (by dereferencing the
page->lru fields). If the current CPU does not match the allocation CPU,
the chunk is turned into a singly linked list entry and placed into the
list of semi-free objects, to be freed on the original (allocation) CPU.
If the freeing CPU is the same as the allocating one, the appropriate node
is checked and the neighbours of the chunk being freed are looked up. Free
neighbour(s) are then turned into doubly linked list entries and removed
from the appropriate lists of free elements. All free chunks are then
combined into one and the corresponding bitmask is updated in the node.
The resulting (possibly bigger) chunk is turned into a doubly linked list
entry and placed into the list of free objects for the appropriate size.
The freeing path then checks whether the list of "semi-free" objects
(objects whose freeing was started on a different CPU, but which must
actually be freed on the current one) is non-empty, and if so, all chunks
from that list are freed as described above.
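
The cross-CPU branch condenses to something like the following (a sketch
that mirrors the start of __avl_free() in the patch; the function name is
made up):

static void nta_free_remote_sketch(void *ptr, unsigned int size, int alloc_cpu)
{
	/* the chunk being freed is reused as the list entry itself */
	struct avl_free_list *entry = ptr;

	entry->cpu = smp_processor_id();
	entry->size = size;

	spin_lock(&avl_free_lock[alloc_cpu]);
	entry->next = avl_free_list_head[alloc_cpu];
	avl_free_list_head[alloc_cpu] = entry;
	spin_unlock(&avl_free_lock[alloc_cpu]);
}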

None of the above lists and arrays are accessed by different CPUs, except
the "semi-free" list, which is accessed under a lock (each CPU has its own
lock and list).

Defragmentation is a part of the freeing algorithm, and initial fragmentation
avoidance is done at allocation time by removing power-of-two allocations.
The rate of fragmentation can be seen in userspace modelling tests done for
both power-of-two SLAB-like and NTA allocators (more details on the
project's homepage [4]).

Benchmarks with a trivial epoll-based web server showed noticeable (more
than 40%) improvements in the request rate (1600-1800 requests per second
vs. more than 2300). This can be explained by the more cache-friendly
freeing algorithm, tighter object packing and thus fewer cache line
ping-pongs, fewer lookups into higher-layer caches, and so on.

The design of the allocator allows all of a node's pages to be mapped into
userspace, thus allowing true zero-copy support for both sending and
receiving dataflows.

As described in recent threads [3], it is also possible to eliminate any
influence of main-system OOM on network dataflow processing, and thus to
prevent deadlocks on systems which use the network as memory storage (swap
over network, iSCSI, NBD and so on).

The only changes in the core network stack are a new field in struct
sk_buff to store the size of the original skb (skb->truesize is not
constant anymore) and the replacement of kmalloc()/kfree() with
avl_alloc()/avl_free().

Things in the TODO list:
* grow the cache when a predefined threshold of memory is used (simple task)
* implement userspace mapping and move netchannels [2] into userspace
(moving netchannels to userspace is the more complex task)

1. AVL tree.
http://en.wikipedia.org/wiki/AVL_tree

2. Netchannels implementation with alternative TCP/IP stack in kernel.
http://tservice.net.ru/~s0mbre/old/index.php?section=projects

3. Discussion about deadlock prevention in network based storage devices
http://thread.gmane.org/gmane.linux.kernel/434411/focus=434411
http://thread.gmane.org/gmane.linux.kernel/435986/focus=435986
http://thread.gmane.org/gmane.linux.kernel.mm/11682/focus=11682

4. Network tree allocator homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=nta

The current patch includes the AVL tree implementation and other unused
bits for reference; they will be removed in future releases.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19c96d4..ded8b31 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -311,7 +311,7 @@ #endif


/* These elements must be at the end, see alloc_skb() for details. */
- unsigned int truesize;
+ unsigned int truesize, __tsize;
atomic_t users;
unsigned char *head,
*data,
@@ -327,6 +327,10 @@ #include <linux/slab.h>

#include <asm/system.h>

+extern void *avl_alloc(unsigned int size, gfp_t gfp_mask);
+extern void avl_free(void *ptr, unsigned int size);
+extern int avl_init(void);
+
extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
diff --git a/net/core/Makefile b/net/core/Makefile
index 2645ba4..d86d468 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -10,6 +10,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.
obj-y += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o

+obj-y += alloc/
+
obj-$(CONFIG_XFRM) += flow.o
obj-$(CONFIG_SYSFS) += net-sysfs.o
obj-$(CONFIG_NET_DIVERT) += dv.o
diff --git a/net/core/alloc/Makefile b/net/core/alloc/Makefile
new file mode 100644
index 0000000..a98ee88
--- /dev/null
+++ b/net/core/alloc/Makefile
@@ -0,0 +1,3 @@
+obj-y := allocator.o
+
+allocator-y := avl.o
diff --git a/net/core/alloc/avl.c b/net/core/alloc/avl.c
new file mode 100644
index 0000000..ff85519
--- /dev/null
+++ b/net/core/alloc/avl.c
@@ -0,0 +1,882 @@
+/*
+ * avl.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/skbuff.h>
+
+#include "avl.h"
+
+#define AVL_CONTAINER_ARRAY_SIZE (AVL_MAX_SIZE/AVL_MIN_SIZE)
+#define AVL_NODES_ON_PAGE (PAGE_SIZE/sizeof(struct avl_node))
+#define AVL_NODE_PAGES ((AVL_NODE_NUM+AVL_NODES_ON_PAGE-1)/AVL_NODES_ON_PAGE)
+
+struct avl_free_list
+{
+ struct avl_free_list *next;
+ unsigned int size;
+ unsigned int cpu;
+};
+
+static avl_t avl_node_id[NR_CPUS];
+static struct avl_node **avl_node_array[NR_CPUS];
+static struct list_head *avl_container_array[NR_CPUS];
+static struct avl_node *avl_root[NR_CPUS];
+static struct avl_free_list *avl_free_list_head[NR_CPUS];
+static spinlock_t avl_free_lock[NR_CPUS];
+
+static inline struct avl_node *avl_get_node(avl_t id, int cpu)
+{
+ avl_t idx, off;
+
+ if (id >= AVL_NODE_NUM)
+ return NULL;
+
+ idx = id/AVL_NODES_ON_PAGE;
+ off = id%AVL_NODES_ON_PAGE;
+
+ return &(avl_node_array[cpu][idx][off]);
+}
+
+static inline struct avl_node *avl_get_node_ptr(unsigned long ptr)
+{
+ struct page *page = virt_to_page(ptr);
+ struct avl_node *node = (struct avl_node *)(page->lru.next);
+
+ return node;
+}
+
+static inline void avl_set_node_ptr(unsigned long ptr, struct avl_node *node, int order)
+{
+ int nr_pages = 1<<order, i;
+ struct page *page = virt_to_page(ptr);
+
+ for (i=0; i<nr_pages; ++i) {
+ page->lru.next = (void *)node;
+ page++;
+ }
+}
+
+static inline int avl_get_cpu_ptr(unsigned long ptr)
+{
+ struct page *page = virt_to_page(ptr);
+ int cpu = (int)(unsigned long)(page->lru.prev);
+
+ return cpu;
+}
+
+static inline void avl_set_cpu_ptr(unsigned long ptr, int cpu, int order)
+{
+ int nr_pages = 1<<order, i;
+ struct page *page = virt_to_page(ptr);
+
+ for (i=0; i<nr_pages; ++i) {
+ page->lru.prev = (void *)(unsigned long)cpu;
+ page++;
+ }
+}
+
+static inline enum avl_balance avl_compare(struct avl_node *a, struct avl_node *b)
+{
+ if (a->value == b->value)
+ return AVL_BALANCED;
+ else if (a->value > b->value)
+ return AVL_RIGHT;
+ else
+ return AVL_LEFT;
+}
+
+static inline void avl_set_left(struct avl_node *node, avl_t val)
+{
+ node->left = val;
+}
+
+static inline void avl_set_right(struct avl_node *node, avl_t val)
+{
+ node->right = val;
+}
+
+static inline void avl_set_parent(struct avl_node *node, avl_t val)
+{
+ node->parent = val;
+}
+
+static inline void avl_rotate_complete(struct avl_node *parent, struct avl_node *node, int cpu)
+{
+ avl_set_parent(node, parent->parent);
+ if (parent->parent != AVL_NODE_EMPTY) {
+ if (parent->pos == AVL_RIGHT)
+ avl_set_right(avl_get_node(parent->parent, cpu), node->id);
+ else
+ avl_set_left(avl_get_node(parent->parent, cpu), node->id);
+ }
+ avl_set_parent(parent, node->id);
+}
+
+static inline void avl_ll(struct avl_node *node, int cpu)
+{
+ struct avl_node *parent = avl_get_node(node->parent, cpu);
+ struct avl_node *left = avl_get_node(node->left, cpu);
+
+ avl_rotate_complete(parent, node, cpu);
+ avl_set_left(node, parent->id);
+ node->pos = parent->pos;
+ parent->pos = AVL_LEFT;
+ if (!left) {
+ avl_set_right(parent, AVL_NODE_EMPTY);
+ } else {
+ avl_set_parent(left, parent->id);
+ left->pos = AVL_RIGHT;
+ avl_set_right(parent, left->id);
+ }
+}
+
+static inline void avl_rr(struct avl_node *node, int cpu)
+{
+ struct avl_node *parent = avl_get_node(node->parent, cpu);
+ struct avl_node *right = avl_get_node(node->right, cpu);
+
+ avl_rotate_complete(parent, node, cpu);
+ avl_set_right(node, parent->id);
+ node->pos = parent->pos;
+ parent->pos = AVL_RIGHT;
+ if (!right)
+ avl_set_left(parent, AVL_NODE_EMPTY);
+ else {
+ avl_set_parent(right, parent->id);
+ right->pos = AVL_LEFT;
+ avl_set_left(parent, right->id);
+ }
+}
+
+static inline void avl_rl_balance(struct avl_node *node, int cpu)
+{
+ avl_rr(node, cpu);
+ avl_ll(node, cpu);
+}
+
+static inline void avl_lr_balance(struct avl_node *node, int cpu)
+{
+ avl_ll(node, cpu);
+ avl_rr(node, cpu);
+}
+
+static inline void avl_balance_single(struct avl_node *node, struct avl_node *parent)
+{
+ node->balance = parent->balance = AVL_BALANCED;
+}
+
+static inline void avl_ll_balance(struct avl_node *node, int cpu)
+{
+ struct avl_node *parent = avl_get_node(node->parent, cpu);
+
+ avl_ll(node, cpu);
+ avl_balance_single(node, parent);
+}
+
+static inline void avl_rr_balance(struct avl_node *node, int cpu)
+{
+ struct avl_node *parent = avl_get_node(node->parent, cpu);
+
+ avl_rr(node, cpu);
+ avl_balance_single(node, parent);
+}
+
+static void avl_calc_balance_insert(struct avl_node *a, struct avl_node *b, struct avl_node *x, int cpu)
+{
+ int ao;
+
+ if (!a || !b || !x)
+ goto out;
+
+ ao = (a->balance == avl_compare(x, a));
+
+ if (ao) {
+ if (a->balance == AVL_LEFT) {
+ if (b->balance == AVL_LEFT) {
+ avl_rr_balance(b, cpu);
+ } else if (b->balance == AVL_RIGHT) {
+ avl_lr_balance(x, cpu);
+ switch (x->balance) {
+ case AVL_LEFT:
+ a->balance = AVL_RIGHT;
+ b->balance = AVL_BALANCED;
+ break;
+ case AVL_RIGHT:
+ a->balance = AVL_BALANCED;
+ b->balance = AVL_LEFT;
+ break;
+ case AVL_BALANCED:
+ a->balance = b->balance = AVL_BALANCED;
+ break;
+ }
+ x->balance = AVL_BALANCED;
+ }
+ } else if (a->balance == AVL_RIGHT){
+ if (b->balance == AVL_RIGHT) {
+ avl_ll_balance(b, cpu);
+ } else if (b->balance == AVL_LEFT) {
+ avl_rl_balance(x, cpu);
+ switch (x->balance) {
+ case AVL_LEFT:
+ a->balance = AVL_BALANCED;
+ b->balance = AVL_RIGHT;
+ break;
+ case AVL_RIGHT:
+ a->balance = AVL_LEFT;
+ b->balance = AVL_BALANCED;
+ break;
+ case AVL_BALANCED:
+ a->balance = b->balance = AVL_BALANCED;
+ break;
+ }
+ x->balance = AVL_BALANCED;
+ }
+ }
+ }
+out:
+ return;
+}
+
+static void avl_combine_nodes(struct avl_node *root, struct avl_node *node)
+{
+}
+
+static struct avl_node *__avl_set_balance(struct avl_node *node, enum avl_balance type)
+{
+ if (node->balance == AVL_BALANCED)
+ node->balance = type;
+ else if (node->balance != type)
+ node->balance = AVL_BALANCED;
+ else
+ return node;
+
+ return NULL;
+}
+
+static struct avl_node *avl_set_balance(struct avl_node *node, enum avl_balance type, int cpu)
+{
+ struct avl_node *root = avl_root[cpu];
+
+ if (__avl_set_balance(node, type))
+ return node;
+
+ if (node->balance == AVL_BALANCED)
+ return NULL;
+
+ type = node->pos;
+ while ((node = avl_get_node(node->parent, cpu)) && node != root) {
+ if (__avl_set_balance(node, type))
+ break;
+ if (node->balance == AVL_BALANCED) {
+ node = NULL;
+ break;
+ }
+ type = node->pos;
+ }
+
+ if (node == root)
+ node = NULL;
+
+ return node;
+}
+
+static struct avl_node *avl_try_insert(struct avl_node *r, struct avl_node *root,
+ struct avl_node *node, enum avl_balance type, int cpu)
+{
+ struct avl_node *s = NULL, *c = NULL, *R;
+ enum avl_balance t;
+
+ if (type == AVL_BALANCED) {
+ avl_combine_nodes(root, node);
+ return NULL;
+ }
+
+ if (type == AVL_RIGHT && root->right != AVL_NODE_EMPTY)
+ return avl_get_node(root->right, cpu);
+ if (type == AVL_LEFT && root->left != AVL_NODE_EMPTY)
+ return avl_get_node(root->left, cpu);
+
+ R = avl_set_balance(root, type, cpu);
+ if (type == AVL_RIGHT)
+ avl_set_right(root, node->id);
+ else
+ avl_set_left(root, node->id);
+ avl_set_parent(node, root->id);
+ node->pos = type;
+
+ if (!R)
+ return NULL;
+
+ t = avl_compare(node, r);
+ if (t == AVL_RIGHT)
+ s = avl_get_node(r->right, cpu);
+ else if (t == AVL_LEFT)
+ s = avl_get_node(r->left, cpu);
+
+ if (s) {
+ t = avl_compare(node, s);
+
+ if (t == AVL_RIGHT)
+ c = avl_get_node(s->right, cpu);
+ else if (t == AVL_LEFT)
+ c = avl_get_node(s->left, cpu);
+ }
+
+ avl_calc_balance_insert(r, s, c, cpu);
+
+ return NULL;
+}
+
+static void avl_insert(struct avl_node *root, struct avl_node *node, int cpu)
+{
+ struct avl_node *r, *s;
+
+ r = s = root;
+
+ while (root) {
+ enum avl_balance type = avl_compare(node, root);
+
+ s = avl_try_insert(r, root, node, type, cpu);
+
+ if (!s)
+ break;
+ if (type != AVL_BALANCED && root->balance == type)
+ r = root;
+ root = s;
+ }
+}
+
+static struct avl_node *avl_node_alloc(value_t value, int cpu)
+{
+ struct avl_node *node = avl_get_node(avl_node_id[cpu], cpu);
+
+ if (!node) {
+ avl_t idx, off;
+
+ idx = avl_node_id[cpu]/AVL_NODES_ON_PAGE;
+ off = avl_node_id[cpu]%AVL_NODES_ON_PAGE;
+ printk("%s: value: %lx, id: %u, max: %lu, cpu: %d, idx: %u, off: %u, on_page: %lu, node: %zu, pages: %lu.\n",
+ __func__, value, avl_node_id[cpu], AVL_NODE_NUM, cpu, idx, off,
+ AVL_NODES_ON_PAGE, sizeof(struct avl_node), AVL_NODE_PAGES);
+ }
+ BUG_ON(!node);
+
+ node->value = value;
+ node->balance = AVL_BALANCED;
+ node->left = node->right = node->parent = AVL_NODE_EMPTY;
+ node->id = avl_node_id[cpu]++;
+ memset(node->mask, 0xff, sizeof(node->mask));
+
+ return node;
+}
+
+static struct avl_node *avl_search(value_t value, int cpu)
+{
+ struct avl_node *node = avl_root[cpu];
+
+ while (node) {
+ if (value < node->value)
+ node = avl_get_node(node->left, cpu);
+ else if (value > node->value)
+ node = avl_get_node(node->right, cpu);
+ else
+ break;
+ }
+
+ return node;
+}
+
+static struct avl_node *avl_node_search_alloc(value_t value, int cpu)
+{
+ struct avl_node *node;
+
+ node = avl_search(value, cpu);
+ if (!node) {
+ node = avl_node_alloc(value, cpu);
+ if (node) {
+ avl_set_cpu_ptr(value, cpu, AVL_ORDER);
+ avl_set_node_ptr(value, node, AVL_ORDER);
+ avl_insert(avl_get_node(avl_root[cpu]->right, cpu), node, cpu);
+ }
+ }
+
+ return node;
+}
+
+static inline value_t avl_ptr_to_value(void *ptr)
+{
+ struct avl_node *node = avl_get_node_ptr((unsigned long)ptr);
+ return node->value;
+}
+
+static inline int avl_ptr_to_offset(void *ptr)
+{
+ return ((value_t)ptr - avl_ptr_to_value(ptr))/AVL_MIN_SIZE;
+}
+
+unsigned int avl_count_set_down(unsigned long *mask, unsigned int pos)
+{
+ unsigned int stop, bits = 0;
+ int idx;
+ unsigned long p, m;
+
+ idx = pos/BITS_PER_LONG;
+ pos = pos%BITS_PER_LONG;
+
+ while (idx >= 0) {
+ m = (~0UL>>pos)<<pos;
+ p = mask[idx] | m;
+
+ if (!(mask[idx] & m))
+ break;
+
+ stop = fls(~p);
+
+ if (!stop) {
+ bits += pos + 1;
+ pos = BITS_PER_LONG - 1;
+ idx--;
+ } else {
+ bits += pos - stop + 1;
+ break;
+ }
+ }
+
+ return bits;
+}
+
+unsigned int avl_count_set_up(unsigned long *mask, unsigned int mask_num,
+ unsigned int pos)
+{
+ unsigned int idx, stop, bits = 0;
+ unsigned long p, m;
+
+ idx = pos/BITS_PER_LONG;
+ pos = pos%BITS_PER_LONG;
+
+ while (idx < mask_num) {
+ if (!pos)
+ m = 0;
+ else
+ m = (~0UL<<(BITS_PER_LONG-pos))>>(BITS_PER_LONG-pos);
+ p = mask[idx] | m;
+
+ if (!(mask[idx] & ~m))
+ break;
+
+ stop = ffs(~p);
+
+ if (!stop) {
+ bits += BITS_PER_LONG - pos;
+ pos = 0;
+ idx++;
+ } else {
+ bits += stop - pos - 1;
+ break;
+ }
+ }
+
+ return bits;
+}
+
+static void avl_fill_bits(struct avl_node *node, unsigned long *mask, unsigned int mask_size,
+ unsigned int pos, unsigned int num, unsigned int bit)
+{
+ unsigned int idx, start;
+
+ idx = pos/BITS_PER_LONG;
+ start = pos%BITS_PER_LONG;
+
+ while (num && idx < mask_size) {
+ unsigned long m = ((~0UL)>>start)<<start;
+
+ if (start + num <= BITS_PER_LONG) {
+ unsigned long upper_bits = BITS_PER_LONG - (start+num);
+
+ m = (m<<upper_bits)>>upper_bits;
+ }
+
+ if (bit)
+ mask[idx] |= m;
+ else
+ mask[idx] &= ~m;
+
+ if (start + num <= BITS_PER_LONG)
+ num = 0;
+ else {
+ num -= BITS_PER_LONG - start;
+ start = 0;
+ idx++;
+ }
+ }
+}
+
+static inline void avl_container_insert(struct avl_container *c, unsigned int pos, int cpu)
+{
+ list_add_tail(&c->centry, &avl_container_array[cpu][pos]);
+}
+
+static void avl_update_node(struct avl_container *c, unsigned int cpos, unsigned int size)
+{
+ struct avl_node *node = avl_get_node_ptr((unsigned long)c->ptr);
+ unsigned int num = AVL_ALIGN(size)/AVL_MIN_SIZE;
+
+ BUG_ON(cpos < num - 1);
+
+ avl_fill_bits(node, node->mask, ARRAY_SIZE(node->mask), avl_ptr_to_offset(c->ptr), num, 0);
+
+ if (cpos != num-1) {
+ void *ptr = c->ptr + size;
+ c = ptr;
+ c->ptr = ptr;
+
+ cpos -= num;
+
+ avl_container_insert(c, cpos, smp_processor_id());
+ }
+}
+
+static int avl_container_add(void *ptr, unsigned int size, gfp_t gfp_mask, int cpu)
+{
+ struct avl_container *c = ptr;
+ unsigned int pos = AVL_ALIGN(size)/AVL_MIN_SIZE-1;
+
+ if (!size)
+ return -EINVAL;
+
+ c->ptr = ptr;
+ avl_container_insert(c, pos, cpu);
+
+ return 0;
+}
+
+static inline struct avl_container *avl_dequeue(struct list_head *head)
+{
+ struct avl_container *cnt;
+
+ cnt = list_entry(head->next, struct avl_container, centry);
+ list_del(&cnt->centry);
+
+ return cnt;
+}
+
+void *avl_alloc(unsigned int size, gfp_t gfp_mask)
+{
+ unsigned int i;
+ void *ptr = NULL;
+ unsigned long flags;
+
+ size = AVL_ALIGN(size);
+
+ if (size > AVL_MAX_SIZE || size < AVL_MIN_SIZE)
+ return NULL;
+
+ local_irq_save(flags);
+ for (i=size/AVL_MIN_SIZE-1; i<AVL_CONTAINER_ARRAY_SIZE; ++i) {
+ struct list_head *head = &avl_container_array[smp_processor_id()][i];
+ struct avl_container *c;
+
+ if (!list_empty(head)) {
+ c = avl_dequeue(head);
+ ptr = c->ptr;
+ avl_update_node(c, i, size);
+ break;
+ }
+ }
+ local_irq_restore(flags);
+
+ if (!ptr) {
+ printk("%s: Failed to allocate %u bytes with %02x mode.\n", __func__, size, gfp_mask);
+ WARN_ON(1);
+ }
+
+ return ptr;
+}
+
+static inline struct avl_container *avl_search_container(void *ptr, unsigned int idx, int cpu)
+{
+ struct avl_container *c = ptr;
+
+ list_del(&c->centry);
+ c->ptr = ptr;
+
+ return c;
+}
+
+static void avl_combine(struct avl_node *node, void *lp, unsigned int lbits, void *rp, unsigned int rbits,
+ void *cur_ptr, unsigned int cur_bits, gfp_t gfp_mask, int cpu)
+{
+ struct avl_container *lc, *rc, *c;
+ unsigned int idx;
+ void *ptr;
+
+ lc = rc = c = NULL;
+ idx = cur_bits - 1;
+ ptr = cur_ptr;
+
+ c = (struct avl_container *)cur_ptr;
+ c->ptr = cur_ptr;
+
+ if (rp) {
+ rc = avl_search_container(rp, rbits-1, cpu);
+ if (!rc) {
+ printk(KERN_ERR "%p.%p: Failed to find a container for right pointer %p, rbits: %u.\n",
+ node, cur_ptr, rp, rbits);
+ BUG();
+ }
+
+ c = rc;
+ idx += rbits;
+ ptr = c->ptr;
+ }
+
+ if (lp) {
+ lc = avl_search_container(lp, lbits-1, cpu);
+ if (!lc) {
+ printk(KERN_ERR "%p.%p: Failed to find a container for left pointer %p, lbits: %u.\n",
+ node, cur_ptr, lp, lbits);
+ BUG();
+ }
+
+ if (!c) {
+ c = lc;
+ c->ptr = cur_ptr;
+ }
+ idx += lbits;
+ ptr = c->ptr;
+ }
+ if (!c) {
+ if (avl_container_add(cur_ptr, cur_bits*AVL_MIN_SIZE, gfp_mask, cpu))
+ return;
+ } else {
+ avl_container_insert(c, idx, cpu);
+ }
+}
+
+static void __avl_free(void *ptr, unsigned int size, gfp_t gfp_mask)
+{
+ value_t val = avl_ptr_to_value(ptr);
+ unsigned int pos, idx, sbits = AVL_ALIGN(size)/AVL_MIN_SIZE;
+ unsigned int rbits, lbits, cpu = avl_get_cpu_ptr(val);
+ struct avl_node *node;
+ unsigned long p;
+ void *lp, *rp;
+
+ if (cpu != smp_processor_id()) {
+ struct avl_free_list *l, *this = ptr;
+
+ this->cpu = smp_processor_id();
+ this->size = size;
+
+ spin_lock(&avl_free_lock[cpu]);
+ l = avl_free_list_head[cpu];
+ avl_free_list_head[cpu] = this;
+ this->next = l;
+ spin_unlock(&avl_free_lock[cpu]);
+ return;
+ }
+
+ node = avl_get_node_ptr((unsigned long)ptr);
+
+ pos = avl_ptr_to_offset(ptr);
+ idx = pos/BITS_PER_LONG;
+
+ p = node->mask[idx] >> (pos%BITS_PER_LONG);
+
+ if ((p & 1)) {
+ printk(KERN_ERR "%p.%p: Broken pointer: value: %lx, pos: %u, idx: %u, mask: %lx, p: %lx.\n",
+ node, ptr, val, pos, idx, node->mask[idx], p);
+ BUG();
+ }
+
+ avl_fill_bits(node, node->mask, ARRAY_SIZE(node->mask), pos, sbits, 1);
+
+ lp = rp = NULL;
+ rbits = lbits = 0;
+
+ idx = (pos+sbits)/BITS_PER_LONG;
+ p = (pos+sbits)%BITS_PER_LONG;
+
+ if ((node->mask[idx] >> p) & 1) {
+ lbits = avl_count_set_up(node->mask, ARRAY_SIZE(node->mask), pos+sbits);
+ if (lbits) {
+ lp = (void *)(val + (pos + sbits)*AVL_MIN_SIZE);
+ }
+ }
+
+ if (pos) {
+ idx = (pos-1)/BITS_PER_LONG;
+ p = (pos-1)%BITS_PER_LONG;
+ if ((node->mask[idx] >> p) & 1) {
+ rbits = avl_count_set_down(node->mask, pos-1);
+ if (rbits) {
+ rp = (void *)(val + (pos-rbits)*AVL_MIN_SIZE);
+ }
+ }
+ }
+
+ avl_combine(node, lp, lbits, rp, rbits, ptr, sbits, gfp_mask, cpu);
+}
+
+void avl_free(void *ptr, unsigned int size)
+{
+ unsigned long flags;
+ struct avl_free_list *l;
+ int cpu;
+
+ local_irq_save(flags);
+ __avl_free(ptr, size, GFP_ATOMIC);
+
+ cpu = smp_processor_id();
+
+ while (avl_free_list_head[cpu]) {
+ spin_lock(&avl_free_lock[cpu]);
+ l = avl_free_list_head[cpu];
+ avl_free_list_head[cpu] = l->next;
+ spin_unlock(&avl_free_lock[cpu]);
+ __avl_free(l, l->size, GFP_ATOMIC);
+ };
+ local_irq_restore(flags);
+}
+
+static void avl_fini_cpu(int cpu)
+{
+ int i;
+
+ for (i=0; i<AVL_NODE_NUM; ++i) {
+ struct avl_node *node = avl_get_node(i, cpu);
+ if (node->value)
+ free_page(node->value);
+ node->value = 0;
+ }
+
+ kfree(avl_container_array[cpu]);
+
+ for (i=0; i<AVL_NODE_PAGES; ++i)
+ kfree(avl_node_array[cpu][i]);
+}
+
+static int avl_init_cpu(int cpu)
+{
+ unsigned int i;
+ int err;
+ unsigned long ptr;
+
+ spin_lock_init(&avl_free_lock[cpu]);
+
+ avl_node_array[cpu] = kzalloc(AVL_NODE_PAGES * sizeof(void *), GFP_KERNEL);
+ if (!avl_node_array[cpu])
+ return -ENOMEM;
+
+ for (i=0; i<AVL_NODE_PAGES; ++i) {
+ avl_node_array[cpu][i] = (struct avl_node *)__get_free_page(GFP_KERNEL);
+ if (!avl_node_array[cpu][i])
+ goto err_out_free_node;
+ }
+
+ avl_container_array[cpu] = kzalloc(sizeof(struct list_head) * AVL_CONTAINER_ARRAY_SIZE, GFP_KERNEL);
+ if (!avl_container_array[cpu])
+ goto err_out_free_node;
+
+ for (i=0; i<AVL_CONTAINER_ARRAY_SIZE; ++i)
+ INIT_LIST_HEAD(&avl_container_array[cpu][i]);
+
+ /*
+ * NTA steals pages and never return them back to the system.
+ */
+ ptr = __get_free_pages(GFP_KERNEL, AVL_ORDER);
+ if (!ptr)
+ goto err_out_free_container;
+
+ avl_root[cpu] = avl_node_alloc(ptr, cpu);
+ avl_set_cpu_ptr(ptr, cpu, AVL_ORDER);
+ avl_set_node_ptr(ptr, avl_root[cpu], AVL_ORDER);
+
+ avl_insert(avl_root[cpu], avl_node_alloc(ptr, cpu), cpu);
+ err = avl_container_add((void *)ptr, PAGE_SIZE*(1<<AVL_ORDER), GFP_KERNEL, cpu);
+ if (err) {
+ printk(KERN_ERR "Failed to add new container: ptr: %lx, size: %lu, err: %d.\n", ptr, PAGE_SIZE, err);
+ goto err_out_free_values;
+ }
+
+ for (i=0; i<AVL_NODE_NUM-2; ++i) {
+ ptr = __get_free_pages(GFP_KERNEL, AVL_ORDER);
+ if (!ptr) {
+ printk(KERN_ERR "Failed to allocate %d'th page.\n", i);
+ goto err_out_free_values;
+ }
+
+ avl_node_search_alloc(ptr, cpu);
+ err = avl_container_add((void *)ptr, PAGE_SIZE*(1<<AVL_ORDER), GFP_KERNEL, cpu);
+ if (err) {
+ printk(KERN_ERR "Failed to add new container: ptr: %lx, size: %lu, err: %d.\n", ptr, PAGE_SIZE, err);
+ goto err_out_free_values;
+ }
+ }
+
+ return 0;
+
+err_out_free_values:
+ for (i=0; i<AVL_NODE_NUM - 1; ++i) {
+ struct avl_node *node = avl_get_node(i, cpu);
+ if (node->value)
+ free_page(node->value);
+ node->value = 0;
+ }
+
+err_out_free_container:
+ kfree(avl_container_array[cpu]);
+err_out_free_node:
+ for (i=0; i<AVL_NODE_PAGES; ++i)
+ kfree(avl_node_array[cpu][i]);
+
+ kfree(avl_node_array[cpu]);
+
+ return -ENOMEM;
+}
+
+int avl_init(void)
+{
+ int err, cpu;
+
+ for_each_possible_cpu(cpu) {
+ err = avl_init_cpu(cpu);
+ if (err)
+ goto err_out;
+ }
+
+ printk(KERN_INFO "Network tree allocator has been initialized.\n");
+ return 0;
+
+err_out:
+ for_each_possible_cpu(cpu)
+ avl_fini_cpu(cpu);
+
+ return -ENOMEM;
+}
diff --git a/net/core/alloc/avl.h b/net/core/alloc/avl.h
new file mode 100644
index 0000000..e6f8b6c
--- /dev/null
+++ b/net/core/alloc/avl.h
@@ -0,0 +1,84 @@
+/*
+ * avl.h
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __AVL_H
+#define __AVL_H
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <asm/page.h>
+
+//#define AVL_DEBUG
+
+#ifdef AVL_DEBUG
+#define ulog(f, a...) printk(f, ##a)
+#else
+#define ulog(f, a...)
+#endif
+
+/*
+ * Network tree allocator variables.
+ */
+
+typedef unsigned long value_t;
+typedef u16 avl_t;
+
+#define AVL_ALIGN_SIZE 32
+#define AVL_ALIGN(x) ALIGN(x, AVL_ALIGN_SIZE)
+
+#define AVL_ORDER 2
+#define AVL_BITS 10 /* Must cover maximum number of pages used for allocation pools */
+#define AVL_NODE_EMPTY ((1UL<<AVL_BITS) - 1)
+
+#define AVL_MIN_SIZE AVL_ALIGN_SIZE
+#define AVL_MAX_SIZE (AVL_MIN_SIZE * AVL_NODE_EMPTY)
+#define AVL_NODE_NUM (AVL_NODE_EMPTY - 1)
+
+enum avl_balance {
+ AVL_LEFT = 0,
+ AVL_RIGHT,
+ AVL_BALANCED,
+};
+
+struct avl_node
+{
+ u64 left:AVL_BITS,
+ right:AVL_BITS,
+ parent:AVL_BITS,
+ id:AVL_BITS,
+ balance:2,
+ pos:1,
+ res:(64-4*AVL_BITS)-3;
+ value_t value;
+ DECLARE_BITMAP(mask, (1<<AVL_ORDER)*PAGE_SIZE/AVL_MIN_SIZE);
+};
+
+struct avl_container
+{
+ void *ptr;
+ struct list_head centry;
+};
+
+struct avl_container *avl_container_alloc(gfp_t gfp_mask);
+void avl_container_free(struct avl_container *);
+int avl_container_cache_init(void);
+
+#endif /* __AVL_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 022d889..7239208 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -156,7 +156,7 @@ struct sk_buff *__alloc_skb(unsigned int

/* Get the DATA. Size must match skb_add_mtu(). */
size = SKB_DATA_ALIGN(size);
- data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ data = avl_alloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

@@ -176,6 +176,8 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->gso_type = 0;
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;
+
+ skb->__tsize = size + sizeof(struct skb_shared_info);

if (fclone) {
struct sk_buff *child = skb + 1;
@@ -223,7 +225,7 @@ struct sk_buff *alloc_skb_from_cache(kme

/* Get the DATA. */
size = SKB_DATA_ALIGN(size);
- data = kmem_cache_alloc(cp, gfp_mask);
+ data = avl_alloc(size, gfp_mask);
if (!data)
goto nodata;

@@ -234,6 +236,7 @@ struct sk_buff *alloc_skb_from_cache(kme
skb->data = data;
skb->tail = data;
skb->end = data + size;
+ skb->__tsize = size;

atomic_set(&(skb_shinfo(skb)->dataref), 1);
skb_shinfo(skb)->nr_frags = 0;
@@ -313,7 +316,7 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);

- kfree(skb->head);
+ avl_free(skb->head, skb->__tsize);
}
}

@@ -495,6 +498,7 @@ #endif
skb_copy_secmark(n, skb);
#endif
C(truesize);
+ C(__tsize);
atomic_set(&n->users, 1);
C(head);
C(data);
@@ -688,7 +692,7 @@ int pskb_expand_head(struct sk_buff *skb

size = SKB_DATA_ALIGN(size);

- data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ data = avl_alloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

@@ -707,6 +711,7 @@ int pskb_expand_head(struct sk_buff *skb

off = (data + nhead) - skb->head;

+ skb->__tsize = size + sizeof(struct skb_shared_info);
skb->head = data;
skb->end = data + size;
skb->data += off;
@@ -2057,6 +2062,9 @@ void __init skb_init(void)
NULL, NULL);
if (!skbuff_fclone_cache)
panic("cannot create skbuff cache");
+
+ if (avl_init())
+ panic("Failed to initialize network tree allocator.\n");
}

EXPORT_SYMBOL(___pskb_trim);

--
Evgeniy Polyakov


2006-08-14 11:22:10

by David Miller

Subject: Re: [PATCH 1/1] network memory allocator.

From: Evgeniy Polyakov <[email protected]>
Date: Mon, 14 Aug 2006 15:04:03 +0400

> /* These elements must be at the end, see alloc_skb() for details. */
> - unsigned int truesize;
> + unsigned int truesize, __tsize;

There is no real need for new member.

> - kfree(skb->head);
> + avl_free(skb->head, skb->__tsize);

Just use "skb->end - skb->head + sizeof(struct skb_shared_info)"
as the size argument.

Then, there is no reason for skb->__tsize :-)
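
In skb_release_data() that would amount to something like the following
(a sketch of the suggested replacement, not a tested change):

	avl_free(skb->head,
		 skb->end - skb->head + sizeof(struct skb_shared_info));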

2006-08-14 11:33:23

by Evgeniy Polyakov

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, Aug 14, 2006 at 04:22:06AM -0700, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Mon, 14 Aug 2006 15:04:03 +0400
>
> > /* These elements must be at the end, see alloc_skb() for details. */
> > - unsigned int truesize;
> > + unsigned int truesize, __tsize;
>
> There is no real need for new member.
>
> > - kfree(skb->head);
> > + avl_free(skb->head, skb->__tsize);
>
> Just use "skb->end - skb->head + sizeof(struct skb_shared_info)"
> as the size argument.
>
> Then, there is no reason for skb->__tsize :-)

Oh, my fault - that simple calculation dropped out of my head...

--
Evgeniy Polyakov

2006-08-14 11:40:24

by Andi Kleen

Subject: Re: [PATCH 1/1] network memory allocator.

Evgeniy Polyakov <[email protected]> writes:

> Design notes.
> Original idea was to store meta information used for allocation in an
> AVL tree [1], but since I found a way to use some "unused" fields in struct page,
> tree is unused in the allocator.

But there seems to be still an AVL tree in there?


> Benchmarks with trivial epoll based web server showed noticeable (more
> than 40%) improvements of the request rates (1600-1800 requests per
> second vs. more than 2300 ones). It can be described by more
> cache-friendly freeing algorithm, by tighter objects packing and thus
> reduced cache line ping-pongs, reduced lookups into higher-layer caches
> and so on.

So what are its drawbacks compared to slab/kmalloc?

Also if it really performs that much better it might be a good
idea to replace all of kmalloc() with it, but doing that
would require a lot more benchmarks with various workloads
and small and big machines first.

-Andi

2006-08-14 11:46:36

by Evgeniy Polyakov

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, Aug 14, 2006 at 01:40:21PM +0200, Andi Kleen ([email protected]) wrote:
> Evgeniy Polyakov <[email protected]> writes:
>
> > Design notes.
> > Original idea was to store meta information used for allocation in an
> > AVL tree [1], but since I found a way to use some "unused" fields in struct page,
> > tree is unused in the allocator.
>
> But there seems to be still an AVL tree in there?

Yep.
The tree structure could be used for simpler memory addition/removal with
hotplug, but I do not have that in mind.
It will be removed soon.

> > Benchmarks with trivial epoll based web server showed noticeable (more
> > than 40%) improvements of the request rates (1600-1800 requests per
> > second vs. more than 2300 ones). It can be described by more
> > cache-friendly freeing algorithm, by tighter objects packing and thus
> > reduced cache line ping-pongs, reduced lookups into higher-layer caches
> > and so on.
>
> So what are its drawbacks compared to slab/kmalloc?

Hmm... Bigger per-page overhead (an additional bitmask of free/used
objects), and a more complex algorithm behind freeing.

> Also if it really performs that much better it might be a good
> idea to replace all of kmalloc() with it, but doing that
> would require a lot more benchmarks with various workloads
> and small and big machines first.

The first users could be MMU-less systems, which suffer noticeably from
fragmentation and power-of-two overhead.

> -Andi

--
Evgeniy Polyakov

2006-08-14 12:07:50

by Keith Owens

Subject: Re: [PATCH 1/1] network memory allocator.

Evgeniy Polyakov (on Mon, 14 Aug 2006 15:04:03 +0400) wrote:
>Network tree allocator can be used to allocate memory for all network
>operations from any context....
>...
>Design of allocator allows to map all node's pages into userspace thus
>allows to have true zero-copy support for both sending and receiving
>dataflows.

Is that true for architectures with virtually indexed caches? How do
you avoid the cache aliasing problems?

2006-08-14 12:21:23

by Evgeniy Polyakov

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, Aug 14, 2006 at 10:07:48PM +1000, Keith Owens ([email protected]) wrote:
> Evgeniy Polyakov (on Mon, 14 Aug 2006 15:04:03 +0400) wrote:
> >Network tree allocator can be used to allocate memory for all network
> >operations from any context....
> >...
> >Design of allocator allows to map all node's pages into userspace thus
> >allows to have true zero-copy support for both sending and receiving
> >dataflows.
>
> Is that true for architectures with virtually indexed caches? How do
> you avoid the cache aliasing problems?

Pages are preallocated and stolen from the main memory allocator, so what
is the problem with those caches? Userspace can provide enough offset so
that the pages do not create aliases - it is the usual mmap.

--
Evgeniy Polyakov

2006-08-14 12:26:26

by Peter Zijlstra

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, 2006-08-14 at 15:04 +0400, Evgeniy Polyakov wrote:

> Defragmentation is a part of freeing algorithm and initial fragmentation
> avoidance is being done at allocation time by removing power-of-two
> allocations. Rate of fragmentation can be found in some userspace
> modelling tests being done for both power-of-two SLAB-like and NTA
> allocators. (more details on project's homepage [4]).

Only with a perfect allocation pattern. And then still only internal
fragmentation; your allocator is still vulnerable to external
fragmentation - you cannot move allocated chunks around because there
are pointers into them, hence you will suffer from external fragmentation.

http://en.wikipedia.org/wiki/Fragmentation_%28computer%29

> Benchmarks with trivial epoll based web server showed noticeable (more
> than 40%) improvements of the request rates (1600-1800 requests per
> second vs. more than 2300 ones). It can be described by more
> cache-friendly freeing algorithm, by tighter objects packing and thus
> reduced cache line ping-pongs, reduced lookups into higher-layer caches
> and so on.

Nice :-)

> Design of allocator allows to map all node's pages into userspace thus
> allows to have true zero-copy support for both sending and receiving
> dataflows.

I'm still not clear on how you want to do this, only the trivial case of
a sniffer was mentioned by you. To be able to do true zero-copy receive
each packet will have to have its own page(s). Simply because you do not
know the destination before you receive it, the packet could end up
going to a whole different socket than the prev/next one. As soon as you
start packing multiple packets on 1 page, you've lost the zero-copy
receive game.

> As described in recent threads [3] it is also possible to eliminate any
> kind of main system OOM influence on network dataflow processing, thus
> it is possible to prevent deadlock for systems, which use network as
> memory storage (swap over network, iSCSI, NBD and so on).

How? You have never stated how you will avoid getting all packets stuck
in blocked sockets.


On another note, I think you misunderstand our SLAB allocator; we do not
round up to the nearest order page alloc per object; SLAB is built to avoid
that and is designed to pack equal-size objects into pages. The kmalloc
allocator is built on top of several SLAB allocators, each with its
specific size objects to serve.

For example, the 64 byte SLAB will serve 64 byte objects, and packs
about PAGE_SIZE/64 per page ("about" since there is some overhead).

So the actual internal fragmentation of the current kmalloc/SLAB
allocator is not as bad as you paint it. The biggest problem we have
with the SLAB thing is getting pages back from it. (And the horrific
complexity of the current implementation)

2006-08-14 12:35:53

by Evgeniy Polyakov

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, Aug 14, 2006 at 02:25:13PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Mon, 2006-08-14 at 15:04 +0400, Evgeniy Polyakov wrote:
>
> > Defragmentation is a part of freeing algorithm and initial fragmentation
> > avoidance is being done at allocation time by removing power-of-two
> > allocations. Rate of fragmentation can be found in some userspace
> > modelling tests being done for both power-of-two SLAB-like and NTA
> > allocators. (more details on project's homepage [4]).
>
> Only with a perfect allocation pattern. And then still only internal
> fragmentation; your allocator is still vulnerable to external
> fragmentation - you cannot move allocated chunks around because there
> are pointers into it, hence you will suffer from external fragmentation.
>
> http://en.wikipedia.org/wiki/Fragmentation_%28computer%29

The nature of network dataflow does not obey those requirements - all
chunks are "short-lived", i.e. sooner or later they will be put back and
thus the initial region will be repaired.
With the existing allocator that never happens, by design.

> > Benchmarks with trivial epoll based web server showed noticeable (more
> > than 40%) improvements of the request rates (1600-1800 requests per
> > second vs. more than 2300 ones). It can be described by more
> > cache-friendly freeing algorithm, by tighter objects packing and thus
> > reduced cache line ping-pongs, reduced lookups into higher-layer caches
> > and so on.
>
> Nice :-)
>
> > Design of allocator allows to map all node's pages into userspace thus
> > allows to have true zero-copy support for both sending and receiving
> > dataflows.
>
> I'm still not clear on how you want to do this, only the trivial case of
> a sniffer was mentioned by you. To be able to do true zero-copy receive
> each packet will have to have its own page(s). Simply because you do not
> know the destination before you receive it, the packet could end up
> going to a whole different socket that the prev/next. As soon as you
> start packing multiple packets on 1 page, you've lost the zero-copy
> receive game.

Userspace can ask for the next packet and the pointer to the new location
will be removed.

> > As described in recent threads [3] it is also possible to eliminate any
> > kind of main system OOM influence on network dataflow processing, thus
> > it is possible to prevent deadlock for systems, which use network as
> > memory storage (swap over network, iSCSI, NBD and so on).
>
> How? You have never stated how you will avoid getting all packets stuck
> in blocked sockets.

Each socket has its limit, so if the allocator has enough memory, blocked
sockets will not affect its behaviour.

> On another note, I think you misunderstand our SLAB allocator; we do not
> round up to nearest order page alloc per object; SLAB is build to avoid
> that and is designed to pack equal size objects into pages. The kmalloc
> allocator is build on top of several SLAB allocators; each with its
> specific size objects to serve.
>
> For example, the 64 byte SLAB will serve 64 byte objects, and packs
> about PAGE_SIZE/64 per page (about since there is some overhead).
>
> So the actual internal fragmentation of the current kmalloc/SLAB
> allocator is not as bad as you paint it. The biggest problem we have
> with the SLAB thing is getting pages back from it. (And the horrific
> complexity of the current implementation)

Ok, not SLAB, but kmalloc/SLAB.
That allocator uses power-of-two allocation, so there is an extremely large
overhead for several (and in some cases all) usage cases (e1000 with jumbo
frames and unix sockets).
SLAB allows having chunks of memory from a different CPU, so it is
impossible to do defragmentation; thus kmalloc/SLAB will by design suffer
from fragmentation.
Graphs of power-of-two vs. NTA overhead are shown on the project's homepage
- the overhead is extremely large.
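
A rough userspace illustration of the overhead argument (the 9018-byte
request size is an assumed jumbo-frame allocation, not a figure from the
patch):

#include <stdio.h>

int main(void)
{
	unsigned int size = 9018;			/* assumed jumbo-frame request */
	unsigned int p2 = 1;
	unsigned int aligned = (size + 31) & ~31u;	/* NTA-style 32-byte rounding */

	while (p2 < size)
		p2 <<= 1;				/* power-of-two rounding */

	printf("request %u: power-of-two %u (overhead %u), "
	       "32-byte aligned %u (overhead %u)\n",
	       size, p2, p2 - size, aligned, aligned - size);
	return 0;
}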

--
Evgeniy Polyakov

2006-08-14 12:39:25

by Evgeniy Polyakov

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, Aug 14, 2006 at 04:35:30PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> > I'm still not clear on how you want to do this, only the trivial case of
> > a sniffer was mentioned by you. To be able to do true zero-copy receive
> > each packet will have to have its own page(s). Simply because you do not
> > know the destination before you receive it, the packet could end up
> > going to a whole different socket that the prev/next. As soon as you
> > start packing multiple packets on 1 page, you've lost the zero-copy
> > receive game.
>
> Userspace can sak for next packet and pointer to the new location will
> be removed.

... returned.

The same will apply to sending support - userspace will request a new
packet of a given size and a pointer to some chunk inside the mapped area
will be returned.

--
Evgeniy Polyakov

2006-08-14 17:43:05

by Rick Jones

Subject: Re: [PATCH 1/1] network memory allocator.

Evgeniy Polyakov wrote:
> On Mon, Aug 14, 2006 at 10:07:48PM +1000, Keith Owens ([email protected]) wrote:
>
>>Evgeniy Polyakov (on Mon, 14 Aug 2006 15:04:03 +0400) wrote:
>>
>>>Network tree allocator can be used to allocate memory for all network
>>>operations from any context....
>>>...
>>>Design of allocator allows to map all node's pages into userspace thus
>>>allows to have true zero-copy support for both sending and receiving
>>>dataflows.
>>
>>Is that true for architectures with virtually indexed caches? How do
>>you avoid the cache aliasing problems?
>
>
> Pages are preallocated and stolen from main memory allocator, what is
> the problem with that caches? Userspace can provide enough offset so
> that pages would not create aliases - it is usuall mmap.

That may depend heavily on the architecture. PA-RISC has the concept of
spaceids, and bits from the spaceid can be included in the hash along
with bits from the offset. So it is not possible to simply match the
offset; one has to make sure that the hash bits from the spaceid hash the
same as well.

Now, PA-RISC CPUs have the ability to disable spaceid hashing, and it is
entirely possible that the PA-RISC linux port does that, but I thought I
would mention it as an example. I'm sure the "official" PA-RISC linux
folks can expand on that much much better than I can.

rick jones

2006-08-14 17:46:20

by Rick Jones

Subject: Re: [PATCH 1/1] network memory allocator.

> Benchmarks with trivial epoll based web server showed noticeable (more
> than 40%) improvements of the request rates (1600-1800 requests per
> second vs. more than 2300 ones). It can be described by more
> cache-friendly freeing algorithm, by tighter objects packing and thus
> reduced cache line ping-pongs, reduced lookups into higher-layer caches
> and so on.

Is that an hypothesis, or did you get a chance to gather cache stats
with something like http://www.hp.com/go/Caliper or the like on the
platform(s) you were testing?

rick jones

2006-08-14 19:42:38

by Evgeniy Polyakov

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, Aug 14, 2006 at 10:46:17AM -0700, Rick Jones ([email protected]) wrote:
> >Benchmarks with trivial epoll based web server showed noticeable (more
> >than 40%) improvements of the request rates (1600-1800 requests per
> >second vs. more than 2300 ones). It can be described by more
> >cache-friendly freeing algorithm, by tighter objects packing and thus
> >reduced cache line ping-pongs, reduced lookups into higher-layer caches
> >and so on.
>
> Is that an hypothesis, or did you get a chance to gather cache stats
> with something like http://www.hp.com/go/Caliper or the like on the
> platform(s) you were testing?

It is a theory, based on code observation, design comparison and logic.

> rick jones

--
Evgeniy Polyakov

2006-08-14 20:15:21

by David Miller

Subject: Re: [PATCH 1/1] network memory allocator.

From: Rick Jones <[email protected]>
Date: Mon, 14 Aug 2006 10:42:55 -0700

> Now, PA-RISC CPUs have the ability to disable spaceid hashing, and it is
> entirely possible that the PA-RISC linux port does that, but I thought I
> would mention it as an example. I'm sure the "official" PA-RISC linux
> folks can expand on that much much better than I can.

Regardless, the "offset" it usually taken care of transparently
by the kernel in order to avoid cache aliasing issues.

It is definitely something we'll need to deal with for zero-copy
I/O using NTA. We'll have to make sure that the user mapping of
the page is of the same color as the mapping the kernel uses.

2006-08-15 07:27:41

by Andrew Morton

Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, 14 Aug 2006 15:04:03 +0400
Evgeniy Polyakov <[email protected]> wrote:

>
> Design of allocator allows to map all node's pages into userspace thus
> allows to have true zero-copy support for both sending and receiving
> dataflows.

If the pages can be order-1 or higher then they'll need to be compound
pages so that the kernel gets the page's refcounting correct if the user
tries to perform direct-io against them (get_user_pages()).

And compound pages use ->lru for internal metadata - see prep_compound_page().

> +static avl_t avl_node_id[NR_CPUS];
> +static struct avl_node **avl_node_array[NR_CPUS];
> +static struct list_head *avl_container_array[NR_CPUS];
> +static struct avl_node *avl_root[NR_CPUS];
> +static struct avl_free_list *avl_free_list_head[NR_CPUS];
> +static spinlock_t avl_free_lock[NR_CPUS];

There will be heaps of cacheline pingpong accessing these arrays. I'd have
thought that

static struct whatever {
	avl_t avl_node_id;
	struct avl_node **avl_node_array;
	struct list_head *avl_container_array;
	struct avl_node *avl_root;
	struct avl_free_list *avl_free_list_head;
	spinlock_t avl_free_lock;
} __cacheline_aligned_in_smp whatevers[NR_CPUS];

would be better.

> +typedef unsigned long value_t;
> +typedef u16 avl_t;

Do we really need the typedefs?

If so, these have too generic names for globally-scoped identifiers.

> +static inline struct avl_node *avl_get_node(avl_t id, int cpu)
> +{
> + avl_t idx, off;
> +
> + if (id >= AVL_NODE_NUM)
> + return NULL;
> +
> + idx = id/AVL_NODES_ON_PAGE;
> + off = id%AVL_NODES_ON_PAGE;
> +
> + return &(avl_node_array[cpu][idx][off]);
> +}

Too big to inline.

> +static inline struct avl_node *avl_get_node_ptr(unsigned long ptr)
> +{
> + struct page *page = virt_to_page(ptr);
> + struct avl_node *node = (struct avl_node *)(page->lru.next);
> +
> + return node;
> +}

Probably OK.

> +static inline void avl_set_node_ptr(unsigned long ptr, struct avl_node *node, int order)
> +{
> + int nr_pages = 1<<order, i;
> + struct page *page = virt_to_page(ptr);
> +
> + for (i=0; i<nr_pages; ++i) {
> + page->lru.next = (void *)node;
> + page++;
> + }
> +}

Too big

> +static inline int avl_get_cpu_ptr(unsigned long ptr)
> +{
> + struct page *page = virt_to_page(ptr);
> + int cpu = (int)(unsigned long)(page->lru.prev);
> +
> + return cpu;
> +}

Too big

> +static inline void avl_set_cpu_ptr(unsigned long ptr, int cpu, int order)
> +{
> + int nr_pages = 1<<order, i;
> + struct page *page = virt_to_page(ptr);
> +
> + for (i=0; i<nr_pages; ++i) {
> + page->lru.prev = (void *)(unsigned long)cpu;
> + page++;
> + }
> +}

Too big

> +static inline enum avl_balance avl_compare(struct avl_node *a, struct avl_node *b)
> +{
> + if (a->value == b->value)
> + return AVL_BALANCED;
> + else if (a->value > b->value)
> + return AVL_RIGHT;
> + else
> + return AVL_LEFT;
> +}

Might be too big.

> +static inline void avl_set_left(struct avl_node *node, avl_t val)
> +{
> + node->left = val;
> +}
> +
> +static inline void avl_set_right(struct avl_node *node, avl_t val)
> +{
> + node->right = val;
> +}
> +
> +static inline void avl_set_parent(struct avl_node *node, avl_t val)
> +{
> + node->parent = val;
> +}

Not too big ;)

> +static inline void avl_rotate_complete(struct avl_node *parent, struct avl_node *node, int cpu)
> +{
> + avl_set_parent(node, parent->parent);
> + if (parent->parent != AVL_NODE_EMPTY) {
> + if (parent->pos == AVL_RIGHT)
> + avl_set_right(avl_get_node(parent->parent, cpu), node->id);
> + else
> + avl_set_left(avl_get_node(parent->parent, cpu), node->id);
> + }
> + avl_set_parent(parent, node->id);
> +}
> +
> +static inline void avl_ll(struct avl_node *node, int cpu)
> +{
> + struct avl_node *parent = avl_get_node(node->parent, cpu);
> + struct avl_node *left = avl_get_node(node->left, cpu);
> +
> + avl_rotate_complete(parent, node, cpu);
> + avl_set_left(node, parent->id);
> + node->pos = parent->pos;
> + parent->pos = AVL_LEFT;
> + if (!left) {
> + avl_set_right(parent, AVL_NODE_EMPTY);
> + } else {
> + avl_set_parent(left, parent->id);
> + left->pos = AVL_RIGHT;
> + avl_set_right(parent, left->id);
> + }
> +}
> +
> +static inline void avl_rr(struct avl_node *node, int cpu)
> +{
> + struct avl_node *parent = avl_get_node(node->parent, cpu);
> + struct avl_node *right = avl_get_node(node->right, cpu);
> +
> + avl_rotate_complete(parent, node, cpu);
> + avl_set_right(node, parent->id);
> + node->pos = parent->pos;
> + parent->pos = AVL_RIGHT;
> + if (!right)
> + avl_set_left(parent, AVL_NODE_EMPTY);
> + else {
> + avl_set_parent(right, parent->id);
> + right->pos = AVL_LEFT;
> + avl_set_left(parent, right->id);
> + }
> +}
> +
> +static inline void avl_rl_balance(struct avl_node *node, int cpu)
> +{
> + avl_rr(node, cpu);
> + avl_ll(node, cpu);
> +}
> +
> +static inline void avl_lr_balance(struct avl_node *node, int cpu)
> +{
> + avl_ll(node, cpu);
> + avl_rr(node, cpu);
> +}
> +
> +static inline void avl_balance_single(struct avl_node *node, struct avl_node *parent)
> +{
> + node->balance = parent->balance = AVL_BALANCED;
> +}
> +
> +static inline void avl_ll_balance(struct avl_node *node, int cpu)
> +{
> + struct avl_node *parent = avl_get_node(node->parent, cpu);
> +
> + avl_ll(node, cpu);
> + avl_balance_single(node, parent);
> +}
> +
> +static inline void avl_rr_balance(struct avl_node *node, int cpu)
> +{
> + struct avl_node *parent = avl_get_node(node->parent, cpu);
> +
> + avl_rr(node, cpu);
> + avl_balance_single(node, parent);
> +}

All way too big.

(Yes, avl_rr_balance() has a single callsite, as does avl_ll_balance(), but
they each have separate copies of the very large avl_rr())

> + /*
> + * NTA steals pages and never return them back to the system.
> + */

That's the only code comment in this entire 882-line file?

2006-08-15 08:08:36

by Andi Kleen

Subject: Re: [PATCH 1/1] network memory allocator.

Andrew Morton <[email protected]> writes:
>
> There will be heaps of cacheline pingpong accessing these arrays. I'd have
> though that
>
> static struct whatever {
> avl_t avl_node_id;
> struct avl_node **avl_node_array;
> struct list_head *avl_container_array;
> struct avl_node *avl_root;
> struct avl_free_list *avl_free_list_head;
> spinlock_t avl_free_lock;
> } __cacheline_aligned_in_smp whatevers[NR_CPUS];
>
> would be better.

Or, even better, per-CPU data. New global/static NR_CPUS arrays should really be discouraged.
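
A minimal sketch of that suggestion, reusing the struct above (the names
and helper functions are illustrative only; local access would need
preemption disabled):

#include <linux/percpu.h>

static DEFINE_PER_CPU(struct whatever, avl_pcpu);

static struct whatever *avl_local(void)
{
	/* this CPU's instance, e.g. inside avl_alloc()/avl_free() */
	return &__get_cpu_var(avl_pcpu);
}

static struct whatever *avl_remote(int cpu)
{
	/* another CPU's instance, e.g. for the cross-CPU "semi-free" list */
	return &per_cpu(avl_pcpu, cpu);
}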

-Andi

2006-08-15 09:21:33

by Evgeniy Polyakov

Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 12:27:24AM -0700, Andrew Morton ([email protected]) wrote:
> On Mon, 14 Aug 2006 15:04:03 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> >
> > Design of allocator allows to map all node's pages into userspace thus
> > allows to have true zero-copy support for both sending and receiving
> > dataflows.
>
> If the pages can be order-1 or higher then they'll need to be compound
> pages so that the kernel gets the page's refcounting correct if the user
> tries to perform direct-io against them (get_user_pages()).
>
> And compound pages use ->lru for internal metadata - see prep_compound_page().

It is tricky - the pages are not marked as compound; I split them manually
into 0-order ones after they were allocated as a higher order.
I do that exactly for the purpose of freeing the lru list.

> > +static avl_t avl_node_id[NR_CPUS];
> > +static struct avl_node **avl_node_array[NR_CPUS];
> > +static struct list_head *avl_container_array[NR_CPUS];
> > +static struct avl_node *avl_root[NR_CPUS];
> > +static struct avl_free_list *avl_free_list_head[NR_CPUS];
> > +static spinlock_t avl_free_lock[NR_CPUS];
>
> There will be heaps of cacheline pingpong accessing these arrays. I'd have
> though that
>
> static struct whatever {
> avl_t avl_node_id;
> struct avl_node **avl_node_array;
> struct list_head *avl_container_array;
> struct avl_node *avl_root;
> struct avl_free_list *avl_free_list_head;
> spinlock_t avl_free_lock;
> } __cacheline_aligned_in_smp whatevers[NR_CPUS];
>
> would be better.

Yes, it is better.

> > +typedef unsigned long value_t;
> > +typedef u16 avl_t;
>
> Do we really need the typedefs?
>
> If so, these have too generic names for globally-scoped identifiers.

It came from the AVL tree implementation - generally it is better to operate
with one type so it can be replaced easily - for example, if someone
decides to use it not only for networking, and thus 16 bits are no longer
enough, one only has to change the typedef above.
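
For instance, widening the node id would then be a one-line change (illustrative only, not part of the patch):

#include <linux/types.h>

/* if 16 bits of node ids ever become too small: */
typedef u32 avl_t;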

> > +static inline struct avl_node *avl_get_node(avl_t id, int cpu)
> > +{
> > + avl_t idx, off;
> > +
> > + if (id >= AVL_NODE_NUM)
> > + return NULL;
> > +
> > + idx = id/AVL_NODES_ON_PAGE;
> > + off = id%AVL_NODES_ON_PAGE;
> > +
> > + return &(avl_node_array[cpu][idx][off]);
> > +}
>
> Too big to inline.

If you think so...

> > +static inline struct avl_node *avl_get_node_ptr(unsigned long ptr)
> > +{
> > + struct page *page = virt_to_page(ptr);
> > + struct avl_node *node = (struct avl_node *)(page->lru.next);
> > +
> > + return node;
> > +}
>
> Probably OK.
>
> > +static inline void avl_set_node_ptr(unsigned long ptr, struct avl_node *node, int order)
> > +{
> > + int nr_pages = 1<<order, i;
> > + struct page *page = virt_to_page(ptr);
> > +
> > + for (i=0; i<nr_pages; ++i) {
> > + page->lru.next = (void *)node;
> > + page++;
> > + }
> > +}
>
> Too big

Ugh, it has a loop, will uninline.

> > +static inline int avl_get_cpu_ptr(unsigned long ptr)
> > +{
> > + struct page *page = virt_to_page(ptr);
> > + int cpu = (int)(unsigned long)(page->lru.prev);
> > +
> > + return cpu;
> > +}
>
> Too big

I think it is ok - it is the same as avl_get_node_ptr() above, only the
return value differs.

> > +static inline void avl_set_cpu_ptr(unsigned long ptr, int cpu, int order)
> > +{
> > + int nr_pages = 1<<order, i;
> > + struct page *page = virt_to_page(ptr);
> > +
> > + for (i=0; i<nr_pages; ++i) {
> > + page->lru.prev = (void *)(unsigned long)cpu;
> > + page++;
> > + }
> > +}
>
> Too big

Yep.

> > +static inline enum avl_balance avl_compare(struct avl_node *a, struct avl_node *b)
> > +{
> > + if (a->value == b->value)
> > + return AVL_BALANCED;
> > + else if (a->value > b->value)
> > + return AVL_RIGHT;
> > + else
> > + return AVL_LEFT;
> > +}
>
> Might be too big.
>
> > +static inline void avl_set_left(struct avl_node *node, avl_t val)
> > +{
> > + node->left = val;
> > +}
> > +
> > +static inline void avl_set_right(struct avl_node *node, avl_t val)
> > +{
> > + node->right = val;
> > +}
> > +
> > +static inline void avl_set_parent(struct avl_node *node, avl_t val)
> > +{
> > + node->parent = val;
> > +}
>
> Not too big ;)

That's good :)

> > +static inline void avl_rotate_complete(struct avl_node *parent, struct avl_node *node, int cpu)
> > +{
> > + avl_set_parent(node, parent->parent);
> > + if (parent->parent != AVL_NODE_EMPTY) {
> > + if (parent->pos == AVL_RIGHT)
> > + avl_set_right(avl_get_node(parent->parent, cpu), node->id);
> > + else
> > + avl_set_left(avl_get_node(parent->parent, cpu), node->id);
> > + }
> > + avl_set_parent(parent, node->id);
> > +}
> > +
> > +static inline void avl_ll(struct avl_node *node, int cpu)
> > +{
> > + struct avl_node *parent = avl_get_node(node->parent, cpu);
> > + struct avl_node *left = avl_get_node(node->left, cpu);
> > +
> > + avl_rotate_complete(parent, node, cpu);
> > + avl_set_left(node, parent->id);
> > + node->pos = parent->pos;
> > + parent->pos = AVL_LEFT;
> > + if (!left) {
> > + avl_set_right(parent, AVL_NODE_EMPTY);
> > + } else {
> > + avl_set_parent(left, parent->id);
> > + left->pos = AVL_RIGHT;
> > + avl_set_right(parent, left->id);
> > + }
> > +}
> > +
> > +static inline void avl_rr(struct avl_node *node, int cpu)
> > +{
> > + struct avl_node *parent = avl_get_node(node->parent, cpu);
> > + struct avl_node *right = avl_get_node(node->right, cpu);
> > +
> > + avl_rotate_complete(parent, node, cpu);
> > + avl_set_right(node, parent->id);
> > + node->pos = parent->pos;
> > + parent->pos = AVL_RIGHT;
> > + if (!right)
> > + avl_set_left(parent, AVL_NODE_EMPTY);
> > + else {
> > + avl_set_parent(right, parent->id);
> > + right->pos = AVL_LEFT;
> > + avl_set_left(parent, right->id);
> > + }
> > +}
> > +
> > +static inline void avl_rl_balance(struct avl_node *node, int cpu)
> > +{
> > + avl_rr(node, cpu);
> > + avl_ll(node, cpu);
> > +}
> > +
> > +static inline void avl_lr_balance(struct avl_node *node, int cpu)
> > +{
> > + avl_ll(node, cpu);
> > + avl_rr(node, cpu);
> > +}
> > +
> > +static inline void avl_balance_single(struct avl_node *node, struct avl_node *parent)
> > +{
> > + node->balance = parent->balance = AVL_BALANCED;
> > +}
> > +
> > +static inline void avl_ll_balance(struct avl_node *node, int cpu)
> > +{
> > + struct avl_node *parent = avl_get_node(node->parent, cpu);
> > +
> > + avl_ll(node, cpu);
> > + avl_balance_single(node, parent);
> > +}
> > +
> > +static inline void avl_rr_balance(struct avl_node *node, int cpu)
> > +{
> > + struct avl_node *parent = avl_get_node(node->parent, cpu);
> > +
> > + avl_rr(node, cpu);
> > + avl_balance_single(node, parent);
> > +}
>
> All way too big.
>
> (Yes, avl_rr_balance() has a single callsite, as does avl_ll_balance(), but
> they each have separate copies of the very large avl_rr())

Ok, it should be even better for icache to uninline them.

> > + /*
> > + * NTA steals pages and never return them back to the system.
> > + */
>
> That's the only code comment in this entire 882-line file?

:) understood.

--
Evgeniy Polyakov

2006-08-15 10:03:15

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 10:08:23AM +0200, Andi Kleen ([email protected]) wrote:
> Andrew Morton <[email protected]> writes:
> >
> > There will be heaps of cacheline pingpong accessing these arrays. I'd have
> > though that
> >
> > static struct whatever {
> > avl_t avl_node_id;
> > struct avl_node **avl_node_array;
> > struct list_head *avl_container_array;
> > struct avl_node *avl_root;
> > struct avl_free_list *avl_free_list_head;
> > spinlock_t avl_free_lock;
> > } __cacheline_aligned_in_smp whatevers[NR_CPUS];
> >
> > would be better.
>
> Or even better per cpu data. New global/static NR_CPUS arrays should be really discouraged.

I had a version with per-cpu data - it is not very convenient to use here with its
per_cpu_ptr() dereferences....

> -Andi

--
Evgeniy Polyakov

2006-08-15 10:27:40

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

From: Evgeniy Polyakov <[email protected]>
Date: Tue, 15 Aug 2006 14:02:28 +0400

> I had a version with per-cpu data - it is not very convenient to use
> here with it's per_cpu_ptr dereferencings....

It does eat lots of space though, even for non-present cpus, and for
local cpu case the access may even get optimized to a single register
+ offset computation on some platforms :-)

2006-08-15 10:56:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Mon, 2006-08-14 at 16:35 +0400, Evgeniy Polyakov wrote:
> On Mon, Aug 14, 2006 at 02:25:13PM +0200, Peter Zijlstra ([email protected]) wrote:
> > On Mon, 2006-08-14 at 15:04 +0400, Evgeniy Polyakov wrote:
> >
> > > Defragmentation is a part of freeing algorithm and initial fragmentation
> > > avoidance is being done at allocation time by removing power-of-two
> > > allocations. Rate of fragmentation can be found in some userspace
> > > modlling tests being done for both power-of-two SLAB-like and NTA
> > > allocators. (more details on project's homepage [4]).
> >
> > Only with a perfect allocation pattern. And then still only internal
> > fragmentation; your allocator is still vulnerable to external
> > fragmentation - you cannot move allocated chunks around because there
> > are pointers into it, hence you will suffer from external fragmentation.
> >
> > http://en.wikipedia.org/wiki/Fragmentation_%28computer%29
>
> Nature of network dataflow does not obey to that requirements - all
> chunks are "short-lives", i.e. sooner or later they will be put back and
> thus initial region will be repaired.
> With existing allocator it never happens by deisgn.
>
> > > Benchmarks with trivial epoll based web server showed noticeble (more
> > > than 40%) imrovements of the request rates (1600-1800 requests per
> > > second vs. more than 2300 ones). It can be described by more
> > > cache-friendly freeing algorithm, by tighter objects packing and thus
> > > reduced cache line ping-pongs, reduced lookups into higher-layer caches
> > > and so on.
> >
> > Nice :-)
> >
> > > Design of allocator allows to map all node's pages into userspace thus
> > > allows to have true zero-copy support for both sending and receiving
> > > dataflows.
> >
> > I'm still not clear on how you want to do this, only the trivial case of
> > a sniffer was mentioned by you. To be able to do true zero-copy receive
> > each packet will have to have its own page(s). Simply because you do not
> > know the destination before you receive it, the packet could end up
> > going to a whole different socket that the prev/next. As soon as you
> > start packing multiple packets on 1 page, you've lost the zero-copy
> > receive game.
>
> Userspace can sak for next packet and pointer to the new location will
> be removed.

/sak/ask/?

I'm not understanding, if you have a page A with two packets, a and b;
once you map that page into user-space that process has access to both
packets, which is a security problem. How are you going to solve this?

Also note that zero-copy sending does not have this problem, since data
is already grouped by socket.

> > > As described in recent threads [3] it is also possible to eliminate any
> > > kind of main system OOM influence on network dataflow processing, thus
> > > it is possible to prevent deadlock for systems, which use network as
> > > memory storage (swap over network, iSCSI, NBD and so on).
> >
> > How? You have never stated how you will avoid getting all packets stuck
> > in blocked sockets.
>
> Each socket has it's limit, so if allocator got enough memory, blocked
> sockets will not affect it's behaviour.

But isn't the total capacity of the network stack much larger than any
allocator can provide?

> > On another note, I think you misunderstand our SLAB allocator; we do not
> > round up to nearest order page alloc per object; SLAB is build to avoid
> > that and is designed to pack equal size objects into pages. The kmalloc
> > allocator is build on top of several SLAB allocators; each with its
> > specific size objects to serve.
> >
> > For example, the 64 byte SLAB will serve 64 byte objects, and packs
> > about PAGE_SIZE/64 per page (about since there is some overhead).
> >
> > So the actual internal fragmentation of the current kmalloc/SLAB
> > allocator is not as bad as you paint it. The biggest problem we have
> > with the SLAB thing is getting pages back from it. (And the horrific
> > complexity of the current implementation)
>
> Ok, not SLAB, but kmaloc/SLAB.

The page-allocator does what you describe, but hardly anybody uses that
to store small objects.

Page allocator - buddy allocator, gives out memory in 1<<n pages.

SLAB allocator - uses the page allocator for backing, each SLAB issues
objects of a fixed, predetermined size, packed in pages.

kmalloc - uses a collection of SLAB allocators to issue 'variable' size
objects (see kmalloc_sizes.h - as you will see internal fragmentation
can become quite large for larger objects, but small objects do rather
well - and one could always add a frequently used size if it shows to be
beneficial).
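
To make the internal-fragmentation point concrete, here is a small stand-alone model (an editorial illustration, not code from the thread): the generic kmalloc classes are approximated by power-of-two rounding, which is roughly what they degenerate to for larger objects, and compared against the 32-byte rounding described for the network tree allocator; the sizes are arbitrary examples.

#include <stdio.h>
#include <stddef.h>

static size_t round_pow2(size_t n)
{
        size_t r = 32;

        while (r < n)
                r <<= 1;
        return r;
}

static size_t round_32(size_t n)
{
        return (n + 31) & ~(size_t)31;
}

int main(void)
{
        /*
         * Example object sizes: small control data, an MTU-sized payload,
         * payload plus some header/shared-info overhead, a jumbo frame.
         */
        size_t sizes[] = { 128, 1500, 1700, 9018 };
        size_t i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                printf("%5zu bytes: pow2 -> %6zu (waste %5zu), 32-byte -> %6zu (waste %3zu)\n",
                       sizes[i],
                       round_pow2(sizes[i]), round_pow2(sizes[i]) - sizes[i],
                       round_32(sizes[i]), round_32(sizes[i]) - sizes[i]);
        return 0;
}

The waste is negligible for small objects but grows quickly once sizes pass a power-of-two boundary, which is the case the jumbo-frame discussion below is about.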

> That allocator uses power-of-two allocation, so there is extremely
> large overhead for several (and in some cases for all) usage cases
> (e1000 with jumbo frames and unix sockets).

Wrong example :-), e1000 is the only driver that doesn't do high order
allocs for jumbo frames. But yes, the other drivers should be fixed,
relying on higher order allocations is unsound.

> SLAB allows to have chunks of memory from differenct CPU, so it is
> impossible to create defragmentation, thus kmalloc/SLAB by design will
> suffer from fragmentation.

*confused* memory is not bound to CPUs other than by NUMA, but even
there there is only a single address space.

> Graphs of power-of-two vs. NTA overhead is shown on projects' homepage
> - overhead is extremely large.

Yes seen that, but as stated, hardly anybody uses the page allocator to
store small objects. However if you do, you get large internal
fragmentation but zero external fragmentation (on that allocation
level).

This is where my SROG allocator comes in, it is used to group objects by
lifetime and returns the pages to the page allocator. This makes the
whole allocator short-lived and hence cannot add (external)
fragmentation on this level. The use I have for that is that I can then
properly gauge how much memory there is available. External
fragmentation and guarantees can be difficult to reconcile.

I have no idea how fast/slow the SROG allocator is, and don't really
care since it's only used as a fallback allocator; what I do care about
is determinism (in space).

However, I do have a patch that converts the whole skb layer to use the
SROG allocator, not only the payload, so I could do some test. But this
is not a serious candidate until all jumbo frame capable drivers have
been converted to skb fragments instead of high order allocations - a
Good Thing [tm].

2006-08-15 11:27:04

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 12:55:02PM +0200, Peter Zijlstra ([email protected]) wrote:
> > Userspace can sak for next packet and pointer to the new location will
> > be removed.
>
> /sak/ask/?
>
> I'm not understanding, if you have a page A with two packets, a and b;
> once you map that page into user-space that process has access to both
> packets, which is a security problem. How are you going to solve this?

Yep, there is such an issue.
But no one is ever going to replace the socket code with zero-copy
interfaces - Linux has backward compatibility no one else ever had, so
send()/recv() will be there.
It is a new interface which can be changed as described in previous
e-mails - copy if the next chunk belongs to a different socket, and so on.

The initial user will be a sniffer, which should get all packets.

> Also note that zero-copy sending does not have this problem, since data
> is already grouped by socket.
>
> > > > As described in recent threads [3] it is also possible to eliminate any
> > > > kind of main system OOM influence on network dataflow processing, thus
> > > > it is possible to prevent deadlock for systems, which use network as
> > > > memory storage (swap over network, iSCSI, NBD and so on).
> > >
> > > How? You have never stated how you will avoid getting all packets stuck
> > > in blocked sockets.
> >
> > Each socket has it's limit, so if allocator got enough memory, blocked
> > sockets will not affect it's behaviour.
>
> But isn't the total capacity of the network stack much larger than any
> allocator can provide?

TCP has a 768kb limit on my amd64 with 1gb of ram, so I expect the
allocator can handle all requests.
And there is a simple task in the TODO list to dynamically grow the cache
when a threshold of memory is in use. It is a really simple task and will be
implemented as soon as I complete the suggestions mentioned by Andrew Morton.

> > > On another note, I think you misunderstand our SLAB allocator; we do not
> > > round up to nearest order page alloc per object; SLAB is build to avoid
> > > that and is designed to pack equal size objects into pages. The kmalloc
> > > allocator is build on top of several SLAB allocators; each with its
> > > specific size objects to serve.
> > >
> > > For example, the 64 byte SLAB will serve 64 byte objects, and packs
> > > about PAGE_SIZE/64 per page (about since there is some overhead).
> > >
> > > So the actual internal fragmentation of the current kmalloc/SLAB
> > > allocator is not as bad as you paint it. The biggest problem we have
> > > with the SLAB thing is getting pages back from it. (And the horrific
> > > complexity of the current implementation)
> >
> > Ok, not SLAB, but kmaloc/SLAB.
>
> The page-allocator does what you describe, but hardly anybody uses that
> to store small objects.

Network stack uses kmalloc.

> Page allocator - buddy allocator, gives out memory in 1<<n pages.
>
> SLAB allocator - uses the page allocator for backing, each SLAB issues
> objects of a fixed, predetermined size, packed in pages.
>
> kmalloc - uses a collection of SLAB allocators to issue 'variable' size
> objects (see kmalloc_sizes.h - as you will see internal fragmentation
> can become quite large for larger objects, but small objects do rather
> well - and one could always add a frequently used size if it shows to be
> beneficial).

There is no "frequently used size", kmalloc() does not know what size is
frequent and what is not. And there are other mentioned problems with
kmalloc/SLAB besides power-of-two, which prevent fragmentation problem
resolution.

> > That allocator uses power-of-two allocation, so there is extremely
> > large overhead for several (and in some cases for all) usage cases
> > (e1000 with jumbo frames and unix sockets).
>
> Wrong example :-), e1000 is the only driver that doesn't do high order
> allocs for jumbo frames. But yes, the other drivers should be fixed,
> relying on higher order allocations is unsound.

:) do you read netdev@? There are several quite long recent discussions
where network hackers blame exactly e1000 for its hardware problems and
ugly memory usage model.
We are even thinking about how to change struct sk_buff - the Holy Grail of
network code - just to help the e1000 driver (well, not exactly for e1000,
but that driver was the cause).

> > SLAB allows to have chunks of memory from differenct CPU, so it is
> > impossible to create defragmentation, thus kmalloc/SLAB by design will
> > suffer from fragmentation.
>
> *confused* memory is not bound to CPUs other than by NUMA, but even
> there there is only a single address space.

Each slab can have objects allocated on different CPUs; it was done to
keep the freeing algorithm cheap. If the system wants to defragment several
objects into a bigger one, it must check all CPUs and find in which cache
those objects are placed, which is extremely expensive, so SLAB cannot
perform defragmentation.

> > Graphs of power-of-two vs. NTA overhead is shown on projects' homepage
> > - overhead is extremely large.
>
> Yes seen that, but as stated, hardly anybody uses the page allocator to
> store small objects. However if you do, you get large internal
> fragmentation but zero external fragmentation (on that allocation
> level).

Truncated cat /proc/slabinfo on my machine (usual desktop):
size-32 1170 1232 32
size-128 663 780 128
size-64 4239 9558 64

> This is where my SROG allocator comes in, it is used to group objects by
> lifetime and returns the pages to the page allocator. This makes the
> whole allocator short-lived and hence cannot add (external)
> fragmentation on this level. The use I have for that is that I can then
> properly gauge how much memory there is available. External
> fragmentation and guarantees can be difficult to reconcile.
>
> I have no idea how fast/slow the SROG allocator is, and don't really
> care since its only used as a fallback allocator; what I do care about
> is determinism (in space).
>
> However, I do have a patch that converts the whole skb layer to use the
> SROG allocator, not only the payload, so I could do some test. But this
> is not a serious candidate until all jumbo frame capable drivers have
> been converted to skb fragments instead of high order allocations - a
> Good Thing [tm].

You created SROG after my suggestion and the discussion about NTA, and it
works well for its purpose (doesn't it?); further extension could lead to the
creation of NTA (or could not).
SROG is a wrapper on top of alloc_pages() and a list of free objects;
there are "several" differences between the allocators and I do not see how
they can compete right now.

--
Evgeniy Polyakov

2006-08-15 12:04:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, 2006-08-15 at 15:26 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 15, 2006 at 12:55:02PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > Userspace can sak for next packet and pointer to the new location will
> > > be removed.
> >
> > /sak/ask/?
> >
> > I'm not understanding, if you have a page A with two packets, a and b;
> > once you map that page into user-space that process has access to both
> > packets, which is a security problem. How are you going to solve this?
>
> Yep, there is such issue.
> But no one is ever going to replace socket code with zero-copy
> interfaces - Linux has backward compatibility noone ever had, so
> send()/recv() will be there.

The new AIO network API should be able to provide the needed userspace
changes.

> It is new interface which can be changed as described in previous
> e-mails - copy if next chunk belongs to different socket and so on.

But if you copy you're not zero-copy anymore. If you copy every second
packet you have a 1/2 copy receive, but not zero.

> Initial user will be sniffer, which should get all packets.

Only the root user may get all packets. And most enterprise systems I've
seen don't generally run a sniffer. That is usually done by
redirecting/copying the data stream in a router and attaching a second host
to analyse the data.

> > > > > As described in recent threads [3] it is also possible to eliminate any
> > > > > kind of main system OOM influence on network dataflow processing, thus
> > > > > it is possible to prevent deadlock for systems, which use network as
> > > > > memory storage (swap over network, iSCSI, NBD and so on).
> > > >
> > > > How? You have never stated how you will avoid getting all packets stuck
> > > > in blocked sockets.
> > >
> > > Each socket has it's limit, so if allocator got enough memory, blocked
> > > sockets will not affect it's behaviour.
> >
> > But isn't the total capacity of the network stack much larger than any
> > allocator can provide?
>
> TCP has 768kb limit on my amd64 with 1gb of ram, so I expect allocator
> can handle all requests.

But the capacity of the network stack is larger than this (arbitrary)
limit. It is possible to have all 768kb worth of packets stuck on
blocked sockets.

> And there is a simple task in TODO list to dynamically grow cache when
> threshold of memory is in use. It is really simple task and will be
> implemented as soon as I complete suggestions mentioned by Andrew Morton.

Growing will not help, the problem is you are out of memory, you cannot
grow at that point.

> > > > On another note, I think you misunderstand our SLAB allocator; we do not
> > > > round up to nearest order page alloc per object; SLAB is build to avoid
> > > > that and is designed to pack equal size objects into pages. The kmalloc
> > > > allocator is build on top of several SLAB allocators; each with its
> > > > specific size objects to serve.
> > > >
> > > > For example, the 64 byte SLAB will serve 64 byte objects, and packs
> > > > about PAGE_SIZE/64 per page (about since there is some overhead).
> > > >
> > > > So the actual internal fragmentation of the current kmalloc/SLAB
> > > > allocator is not as bad as you paint it. The biggest problem we have
> > > > with the SLAB thing is getting pages back from it. (And the horrific
> > > > complexity of the current implementation)
> > >
> > > Ok, not SLAB, but kmaloc/SLAB.
> >
> > The page-allocator does what you describe, but hardly anybody uses that
> > to store small objects.
>
> Network stack uses kmalloc.

skbuff_head_cache and skbuff_fclone_cache are SLABs.

> > Page allocator - buddy allocator, gives out memory in 1<<n pages.
> >
> > SLAB allocator - uses the page allocator for backing, each SLAB issues
> > objects of a fixed, predetermined size, packed in pages.
> >
> > kmalloc - uses a collection of SLAB allocators to issue 'variable' size
> > objects (see kmalloc_sizes.h - as you will see internal fragmentation
> > can become quite large for larger objects, but small objects do rather
> > well - and one could always add a frequently used size if it shows to be
> > beneficial).
>
> There is no "frequently used size", kmalloc() does not know what size is
> frequent and what is not. And there are other mentioned problems with
> kmalloc/SLAB besides power-of-two, which prevent fragmentation problem
> resolution.

Yes SLAB is a horrid thing on some points but very good at a lot of
other things. But surely there are frequently used sizes, kmalloc will
not know, but a developer with profiling tools might.

> > > That allocator uses power-of-two allocation, so there is extremely
> > > large overhead for several (and in some cases for all) usage cases
> > > (e1000 with jumbo frames and unix sockets).
> >
> > Wrong example :-), e1000 is the only driver that doesn't do high order
> > allocs for jumbo frames. But yes, the other drivers should be fixed,
> > relying on higher order allocations is unsound.
>
> :) do you read netdev@? There are several quite long recent discussions
> where network hackers blame exactly e1000 for it's hardware problems and
> ugly memory usage model.
> We even think how to change struct sk_buff - Holy Grail of network code
> - just to help e1000 driver (well, not exactly for e1000, but that
> driver was a cause).

Have you seen the latest code? It allocates single pages and puts them
in the skb_shared_info fragments. Surely it might have been the one
pushing for these changes, but they are done. Current status.
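
A hedged sketch of the pattern being referred to (not the actual e1000 code; the helper name rx_attach_page_frag() is made up): the driver allocates 0-order pages and attaches them to the skb as paged fragments instead of asking for one big high-order buffer.

#include <linux/skbuff.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/errno.h>

/* attach one freshly allocated page as paged data of an existing skb */
static int rx_attach_page_frag(struct sk_buff *skb, unsigned int len)
{
        struct page *page = alloc_page(GFP_ATOMIC);

        if (!page)
                return -ENOMEM;

        /* fragment 0, offset 0, 'len' bytes of payload live on this page */
        skb_fill_page_desc(skb, 0, page, 0, len);

        skb->len      += len;
        skb->data_len += len;
        skb->truesize += PAGE_SIZE;

        return 0;
}

Because every receive buffer is a single page, the driver never depends on the buddy allocator finding contiguous higher-order blocks under memory pressure.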

> > > SLAB allows to have chunks of memory from differenct CPU, so it is
> > > impossible to create defragmentation, thus kmalloc/SLAB by design will
> > > suffer from fragmentation.
> >
> > *confused* memory is not bound to CPUs other than by NUMA, but even
> > there there is only a single address space.
>
> Each slab can have objects allocated in different CPUs, it was done to
> reduce freeing algorithm. If system wants to defragment several objects
> into bigger one, it must check all CPUs and find in which cache those
> objects are placed, which is extremely expensive, so SLAB can not
> perform defragmentation.

What you are referring to is coalescence, and yes coalescing over page
boundaries is hard in the SLAB layer, the page allocator does that.

> > > Graphs of power-of-two vs. NTA overhead is shown on projects' homepage
> > > - overhead is extremely large.
> >
> > Yes seen that, but as stated, hardly anybody uses the page allocator to
> > store small objects. However if you do, you get large internal
> > fragmentation but zero external fragmentation (on that allocation
> > level).
>
> Truncated cat /proc/slabinfo on my machine (usual desktop):
> size-32 1170 1232 32
> size-128 663 780 128
> size-64 4239 9558 64

Sure, point being?

size-64 4497 4602 64 59 1 : tunables 120 60 8 : slabdata 78 78 0

4497 objects used out of 4602 available, object size 64 bytes, 59
objects per slab of 1 page. .... 78 pages with active objects out of 78
pages allocated.


> > This is where my SROG allocator comes in, it is used to group objects by
> > lifetime and returns the pages to the page allocator. This makes the
> > whole allocator short-lived and hence cannot add (external)
> > fragmentation on this level. The use I have for that is that I can then
> > properly gauge how much memory there is available. External
> > fragmentation and guarantees can be difficult to reconcile.
> >
> > I have no idea how fast/slow the SROG allocator is, and don't really
> > care since its only used as a fallback allocator; what I do care about
> > is determinism (in space).
> >
> > However, I do have a patch that converts the whole skb layer to use the
> > SROG allocator, not only the payload, so I could do some test. But this
> > is not a serious candidate until all jumbo frame capable drivers have
> > been converted to skb fragments instead of high order allocations - a
> > Good Thing [tm].
>
> You created SROG after my suggestion and discussion about NTA and it works
> well for it's purpose (doesn't it?), further extension could lead to creation
> of NTA (or could not).

I started with a very broken in-situ allocator (that tried to do the
same thing) in the very first patch. It was only later that I realised
the full extent of the skbuff requirements.

And no, NTA is too complex an allocator to do what I need. And more
specifically its design is quite contrary to what I have done. I
create/destroy an allocator instance per packet, you have one allocator
instance and serve multiple packets.

> SROG is a wrapper on top of alloc_pages and list of free objects,
> there are "several" differencies between allocators and I do not see how
> they can compete right now.

Yes, allocators are built in layers; the page allocator is the basic
building block in Linux.

2006-08-15 12:35:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 02:03:25PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Tue, 2006-08-15 at 15:26 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 15, 2006 at 12:55:02PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > > Userspace can sak for next packet and pointer to the new location will
> > > > be removed.
> > >
> > > /sak/ask/?
> > >
> > > I'm not understanding, if you have a page A with two packets, a and b;
> > > once you map that page into user-space that process has access to both
> > > packets, which is a security problem. How are you going to solve this?
> >
> > Yep, there is such issue.
> > But no one is ever going to replace socket code with zero-copy
> > interfaces - Linux has backward compatibility noone ever had, so
> > send()/recv() will be there.
>
> The new AIO network API should be able to provide the needed userspace
> changes.

Kevent-based network AIO already handles it.
These changes are different.

> > It is new interface which can be changed as described in previous
> > e-mails - copy if next chunk belongs to different socket and so on.
>
> But if you copy you're not zero-copy anymore. If you copy every second
> packet you have a 1/2 copy receive, but not zero.
>
> > Initial user will be sniffer, which should get all packets.
>
> Only the root user may get all packets. And most enterprise systems I've
> seen don't generally run a sniffer. That is usually done by
> redirecting/copying the data stream in a router and attach a second host
> to analyse the data.

Nice words. And what is the "second host" supposed to do with that traffic?
I expect it should not be Linux (not an "enterprise system"?). Will it
copy to userspace or have some kind of zero-copy?

> > > > > > As described in recent threads [3] it is also possible to eliminate any
> > > > > > kind of main system OOM influence on network dataflow processing, thus
> > > > > > it is possible to prevent deadlock for systems, which use network as
> > > > > > memory storage (swap over network, iSCSI, NBD and so on).
> > > > >
> > > > > How? You have never stated how you will avoid getting all packets stuck
> > > > > in blocked sockets.
> > > >
> > > > Each socket has it's limit, so if allocator got enough memory, blocked
> > > > sockets will not affect it's behaviour.
> > >
> > > But isn't the total capacity of the network stack much larger than any
> > > allocator can provide?
> >
> > TCP has 768kb limit on my amd64 with 1gb of ram, so I expect allocator
> > can handle all requests.
>
> But the capacity of the network stack is larger than this (arbitrary)
> limit. It is possible to have all 768kb worth of packets stuck on
> blocked sockets.

And so what? You said that an allocator separated from the main one cannot
handle all memory requests being made by the stack; as you can see, it
easily can.

> > And there is a simple task in TODO list to dynamically grow cache when
> > threshold of memory is in use. It is really simple task and will be
> > implemented as soon as I complete suggestions mentioned by Andrew Morton.
>
> Growing will not help, the problem is you are out of memory, you cannot
> grow at that point.

You do not see the point of the network tree allocator.

It can live with main system OOM since it has a preallocated separate
pool, which can be increased when there is a requirement for that, for
example when the system is not in OOM.

> > > > > On another note, I think you misunderstand our SLAB allocator; we do not
> > > > > round up to nearest order page alloc per object; SLAB is build to avoid
> > > > > that and is designed to pack equal size objects into pages. The kmalloc
> > > > > allocator is build on top of several SLAB allocators; each with its
> > > > > specific size objects to serve.
> > > > >
> > > > > For example, the 64 byte SLAB will serve 64 byte objects, and packs
> > > > > about PAGE_SIZE/64 per page (about since there is some overhead).
> > > > >
> > > > > So the actual internal fragmentation of the current kmalloc/SLAB
> > > > > allocator is not as bad as you paint it. The biggest problem we have
> > > > > with the SLAB thing is getting pages back from it. (And the horrific
> > > > > complexity of the current implementation)
> > > >
> > > > Ok, not SLAB, but kmaloc/SLAB.
> > >
> > > The page-allocator does what you describe, but hardly anybody uses that
> > > to store small objects.
> >
> > Network stack uses kmalloc.
>
> skbuff_head_cache and skbuff_fclone_cache are SLABs.

It is quite a small part of the stack, isn't it?
And btw, they still suffer from the SLAB design, since it is possible to get
another smaller object right after all skbs are allocated from a given page.
It is a minor thing of course, but nevertheless worth mentioning.

> > > Page allocator - buddy allocator, gives out memory in 1<<n pages.
> > >
> > > SLAB allocator - uses the page allocator for backing, each SLAB issues
> > > objects of a fixed, predetermined size, packed in pages.
> > >
> > > kmalloc - uses a collection of SLAB allocators to issue 'variable' size
> > > objects (see kmalloc_sizes.h - as you will see internal fragmentation
> > > can become quite large for larger objects, but small objects do rather
> > > well - and one could always add a frequently used size if it shows to be
> > > beneficial).
> >
> > There is no "frequently used size", kmalloc() does not know what size is
> > frequent and what is not. And there are other mentioned problems with
> > kmalloc/SLAB besides power-of-two, which prevent fragmentation problem
> > resolution.
>
> Yes SLAB is a horrid thing on some points but very good at a lot of
> other things. But surely there are frequently used sizes, kmalloc will
> not know, but a developer with profiling tools might.

Does not scale - an admin must run the system under profiling, add new
entries into kmalloc_sizes.h, recompile the kernel... No way.

> > > > That allocator uses power-of-two allocation, so there is extremely
> > > > large overhead for several (and in some cases for all) usage cases
> > > > (e1000 with jumbo frames and unix sockets).
> > >
> > > Wrong example :-), e1000 is the only driver that doesn't do high order
> > > allocs for jumbo frames. But yes, the other drivers should be fixed,
> > > relying on higher order allocations is unsound.
> >
> > :) do you read netdev@? There are several quite long recent discussions
> > where network hackers blame exactly e1000 for it's hardware problems and
> > ugly memory usage model.
> > We even think how to change struct sk_buff - Holy Grail of network code
> > - just to help e1000 driver (well, not exactly for e1000, but that
> > driver was a cause).
>
> Have you seen the latest code? It allocates single pages and puts them
> in the skb_shared_info fragments. Surely it might have been the one
> pushing for these changes, but they are done. Current status.

Please check the e1000_alloc_rx_buffers() function and the rx_buffer_len value.
You are wrong.

> > > > SLAB allows to have chunks of memory from differenct CPU, so it is
> > > > impossible to create defragmentation, thus kmalloc/SLAB by design will
> > > > suffer from fragmentation.
> > >
> > > *confused* memory is not bound to CPUs other than by NUMA, but even
> > > there there is only a single address space.
> >
> > Each slab can have objects allocated in different CPUs, it was done to
> > reduce freeing algorithm. If system wants to defragment several objects
> > into bigger one, it must check all CPUs and find in which cache those
> > objects are placed, which is extremely expensive, so SLAB can not
> > perform defragmentation.
>
> What you are referring to is coalescence, and yes coalescing over page
> boundaries is hard in the SLAB layer, the page allocator does that.

Page boundaries are only a minor part of the problem.
Objects from the same page can live in different (per-cpu) SLAB caches.

> > > > Graphs of power-of-two vs. NTA overhead is shown on projects' homepage
> > > > - overhead is extremely large.
> > >
> > > Yes seen that, but as stated, hardly anybody uses the page allocator to
> > > store small objects. However if you do, you get large internal
> > > fragmentation but zero external fragmentation (on that allocation
> > > level).
> >
> > Truncated cat /proc/slabinfo on my machine (usual desktop):
> > size-32 1170 1232 32
> > size-128 663 780 128
> > size-64 4239 9558 64
>
> Sure, point being?
>
> size-64 4497 4602 64 59 1 : tunables 120 60
> 8 : slabdata 78 78 0
>
> 4497 objects used out of 4602 available, object size 64 bytes, 59
> objects per slab of 1 page. .... 78 pages with active objects out of 78
> pages allocated.

Those were your words quoted above, that "hardly anybody uses the page
allocator to store small objects"; as you can see, there are quite a few
of them allocated through kmalloc but not through kmem_cache.

> > > This is where my SROG allocator comes in, it is used to group objects by
> > > lifetime and returns the pages to the page allocator. This makes the
> > > whole allocator short-lived and hence cannot add (external)
> > > fragmentation on this level. The use I have for that is that I can then
> > > properly gauge how much memory there is available. External
> > > fragmentation and guarantees can be difficult to reconcile.
> > >
> > > I have no idea how fast/slow the SROG allocator is, and don't really
> > > care since its only used as a fallback allocator; what I do care about
> > > is determinism (in space).
> > >
> > > However, I do have a patch that converts the whole skb layer to use the
> > > SROG allocator, not only the payload, so I could do some test. But this
> > > is not a serious candidate until all jumbo frame capable drivers have
> > > been converted to skb fragments instead of high order allocations - a
> > > Good Thing [tm].
> >
> > You created SROG after my suggestion and discussion about NTA and it works
> > well for it's purpose (doesn't it?), further extension could lead to creation
> > of NTA (or could not).
>
> I started with a very broken in-situ allocator (that tried to do the
> same thing) in the very first patch. It was only later that I realised
> the full extend of the skbuff requirements.
>
> And no, NTA is too complex an allocator to do what I need. And more
> specifically its design is quite contrary to what I have done. I
> create/destroy an allocator instance per packet, you have one allocator
> instance and serve multiple packets.
>
> > SROG is a wrapper on top of alloc_pages and list of free objects,
> > there are "several" differencies between allocators and I do not see how
> > they can compete right now.
>
> Yes allocators are build in layers, the page allocator the the basic
> building block in Linux.

Ok, it's your point.
Actum est, ilicet - it is done, that's it.

--
Evgeniy Polyakov

2006-08-15 13:51:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, 2006-08-15 at 16:34 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 15, 2006 at 02:03:25PM +0200, Peter Zijlstra ([email protected]) wrote:
> > On Tue, 2006-08-15 at 15:26 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 15, 2006 at 12:55:02PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > > > Userspace can sak for next packet and pointer to the new location will
> > > > > be removed.
> > > >
> > > > /sak/ask/?
> > > >
> > > > I'm not understanding, if you have a page A with two packets, a and b;
> > > > once you map that page into user-space that process has access to both
> > > > packets, which is a security problem. How are you going to solve this?
> > >
> > > Yep, there is such issue.
> > > But no one is ever going to replace socket code with zero-copy
> > > interfaces - Linux has backward compatibility noone ever had, so
> > > send()/recv() will be there.
> >
> > The new AIO network API should be able to provide the needed userspace
> > changes.
>
> Kevent based network AIO already handle it.
> This changes are different.

You said you couldn't do zero-copy because the userspace interface
didn't allow it.
I assume the network AIO interface does, but still you have your initial
allocation problem, one cannot put more than 1 packet on each page.

> > > It is new interface which can be changed as described in previous
> > > e-mails - copy if next chunk belongs to different socket and so on.
> >
> > But if you copy you're not zero-copy anymore. If you copy every second
> > packet you have a 1/2 copy receive, but not zero.
> >
> > > Initial user will be sniffer, which should get all packets.
> >
> > Only the root user may get all packets. And most enterprise systems I've
> > seen don't generally run a sniffer. That is usually done by
> > redirecting/copying the data stream in a router and attach a second host
> > to analyse the data.
>
> Nice words. And what "second host" is supposed to do with that traffic?
> I expect it should not be Linux (not an "enterprise system"?). WIll it
> copy to userspace or have some kind of a zero-copy?

Of course that will sniff, but having zero-copy only for a sniffer is not of
much general use.

> > > > > > > As described in recent threads [3] it is also possible to eliminate any
> > > > > > > kind of main system OOM influence on network dataflow processing, thus
> > > > > > > it is possible to prevent deadlock for systems, which use network as
> > > > > > > memory storage (swap over network, iSCSI, NBD and so on).
> > > > > >
> > > > > > How? You have never stated how you will avoid getting all packets stuck
> > > > > > in blocked sockets.
> > > > >
> > > > > Each socket has it's limit, so if allocator got enough memory, blocked
> > > > > sockets will not affect it's behaviour.
> > > >
> > > > But isn't the total capacity of the network stack much larger than any
> > > > allocator can provide?
> > >
> > > TCP has 768kb limit on my amd64 with 1gb of ram, so I expect allocator
> > > can handle all requests.
> >
> > But the capacity of the network stack is larger than this (arbitrary)
> > limit. It is possible to have all 768kb worth of packets stuck on
> > blocked sockets.
>
> And so what?

Well, if all packets are stuck, you have a dead system.

> You said, that separated from main allocator can not handle
> all memory requests being done by the stack, as you can see it easily
> can.

It could if you can provide adequate detection of memory pressure and
fallback to a degraded mode within the same allocator/stack and can
guarantee limited service to critical parts.

> > > And there is a simple task in TODO list to dynamically grow cache when
> > > threshold of memory is in use. It is really simple task and will be
> > > implemented as soon as I complete suggestions mentioned by Andrew Morton.
> >
> > Growing will not help, the problem is you are out of memory, you cannot
> > grow at that point.
>
> You do not see the point of network tree allocator.
>
> It can live with main system OOM since it has preallocated separate
> pool, which can be increased when there is a requirement for that, for
> example when system is not in OOM.

It cannot increase enough, ever. The total capacity of the network stack
is huge.
And the sole problem I'm addressing is getting the system to work
reliably in tight memory situations, that is during reclaim; one cannot
decide to grow then, nor postpone, too late.

> > > > > > On another note, I think you misunderstand our SLAB allocator; we do not
> > > > > > round up to nearest order page alloc per object; SLAB is build to avoid
> > > > > > that and is designed to pack equal size objects into pages. The kmalloc
> > > > > > allocator is build on top of several SLAB allocators; each with its
> > > > > > specific size objects to serve.
> > > > > >
> > > > > > For example, the 64 byte SLAB will serve 64 byte objects, and packs
> > > > > > about PAGE_SIZE/64 per page (about since there is some overhead).
> > > > > >
> > > > > > So the actual internal fragmentation of the current kmalloc/SLAB
> > > > > > allocator is not as bad as you paint it. The biggest problem we have
> > > > > > with the SLAB thing is getting pages back from it. (And the horrific
> > > > > > complexity of the current implementation)
> > > > >
> > > > > Ok, not SLAB, but kmaloc/SLAB.
> > > >
> > > > The page-allocator does what you describe, but hardly anybody uses that
> > > > to store small objects.
> > >
> > > Network stack uses kmalloc.
> >
> > skbuff_head_cache and skbuff_fclone_cache are SLABs.
>
> It is quite small part of the stack, isn't it?
> And btw, they still suffer from SLAB design, since it is possibly to get
> another smaller object right after all skbs are allocated from given page.
> It is a minor thing of course, but nevertheless worh mentioning.

Small but crucial; that is why I've been replacing them all.

> > > > Page allocator - buddy allocator, gives out memory in 1<<n pages.
> > > >
> > > > SLAB allocator - uses the page allocator for backing, each SLAB issues
> > > > objects of a fixed, predetermined size, packed in pages.
> > > >
> > > > kmalloc - uses a collection of SLAB allocators to issue 'variable' size
> > > > objects (see kmalloc_sizes.h - as you will see internal fragmentation
> > > > can become quite large for larger objects, but small objects do rather
> > > > well - and one could always add a frequently used size if it shows to be
> > > > beneficial).
> > >
> > > There is no "frequently used size", kmalloc() does not know what size is
> > > frequent and what is not. And there are other mentioned problems with
> > > kmalloc/SLAB besides power-of-two, which prevent fragmentation problem
> > > resolution.
> >
> > Yes SLAB is a horrid thing on some points but very good at a lot of
> > other things. But surely there are frequently used sizes, kmalloc will
> > not know, but a developer with profiling tools might.
>
> Does not scale - admin must run system under profiling, add new
> entries into kmalloc_sizes.h recompile the kernel... No way.

s/admin/developer/
It has been the way so far.

> > > > > That allocator uses power-of-two allocation, so there is extremely
> > > > > large overhead for several (and in some cases for all) usage cases
> > > > > (e1000 with jumbo frames and unix sockets).
> > > >
> > > > Wrong example :-), e1000 is the only driver that doesn't do high order
> > > > allocs for jumbo frames. But yes, the other drivers should be fixed,
> > > > relying on higher order allocations is unsound.
> > >
> > > :) do you read netdev@? There are several quite long recent discussions
> > > where network hackers blame exactly e1000 for it's hardware problems and
> > > ugly memory usage model.
> > > We even think how to change struct sk_buff - Holy Grail of network code
> > > - just to help e1000 driver (well, not exactly for e1000, but that
> > > driver was a cause).
> >
> > Have you seen the latest code? It allocates single pages and puts them
> > in the skb_shared_info fragments. Surely it might have been the one
> > pushing for these changes, but they are done. Current status.
>
> Plese check e1000_alloc_rx_buffers() function and rx_buffer_len value.
> You are wrong.

e1000_alloc_rx_buffers_ps()

but it does not take away that any card/driver that does depend on
higher order allocations is unreliable.

> > > > > SLAB allows to have chunks of memory from differenct CPU, so it is
> > > > > impossible to create defragmentation, thus kmalloc/SLAB by design will
> > > > > suffer from fragmentation.
> > > >
> > > > *confused* memory is not bound to CPUs other than by NUMA, but even
> > > > there there is only a single address space.
> > >
> > > Each slab can have objects allocated in different CPUs, it was done to
> > > reduce freeing algorithm. If system wants to defragment several objects
> > > into bigger one, it must check all CPUs and find in which cache those
> > > objects are placed, which is extremely expensive, so SLAB can not
> > > perform defragmentation.
> >
> > What you are referring to is coalescence, and yes coalescing over page
> > boundaries is hard in the SLAB layer, the page allocator does that.
>
> Page boundaries is only minor part of the problem.
> Objects from the same page can live in different (per-cpu) SLAB caches.

My bad, SLAB never needs to coalesce objects, by design. What SLAB could
benefit from is compaction.

> > > > > Graphs of power-of-two vs. NTA overhead is shown on projects' homepage
> > > > > - overhead is extremely large.
> > > >
> > > > Yes seen that, but as stated, hardly anybody uses the page allocator to
> > > > store small objects. However if you do, you get large internal
> > > > fragmentation but zero external fragmentation (on that allocation
> > > > level).
> > >
> > > Truncated cat /proc/slabinfo on my machine (usual desktop):
> > > size-32 1170 1232 32
> > > size-128 663 780 128
> > > size-64 4239 9558 64
> >
> > Sure, point being?
> >
> > size-64 4497 4602 64 59 1 : tunables 120 60
> > 8 : slabdata 78 78 0
> >
> > 4497 objects used out of 4602 available, object size 64 bytes, 59
> > objects per slab of 1 page. .... 78 pages with active objects out of 78
> > pages allocated.
>
> It was your words quoted above that "hardly anybody uses the page
> allocator to store small objects", as you can see, there are quite a few
> of them allocated through kmalloc but not through kmem_cache.

kmalloc != page allocator. And as I have been trying to explain, kmalloc
uses kmem_cache.

> > > > This is where my SROG allocator comes in, it is used to group objects by
> > > > lifetime and returns the pages to the page allocator. This makes the
> > > > whole allocator short-lived and hence cannot add (external)
> > > > fragmentation on this level. The use I have for that is that I can then
> > > > properly gauge how much memory there is available. External
> > > > fragmentation and guarantees can be difficult to reconcile.
> > > >
> > > > I have no idea how fast/slow the SROG allocator is, and don't really
> > > > care since its only used as a fallback allocator; what I do care about
> > > > is determinism (in space).
> > > >
> > > > However, I do have a patch that converts the whole skb layer to use the
> > > > SROG allocator, not only the payload, so I could do some test. But this
> > > > is not a serious candidate until all jumbo frame capable drivers have
> > > > been converted to skb fragments instead of high order allocations - a
> > > > Good Thing [tm].
> > >
> > > You created SROG after my suggestion and discussion about NTA and it works
> > > well for it's purpose (doesn't it?), further extension could lead to creation
> > > of NTA (or could not).
> >
> > I started with a very broken in-situ allocator (that tried to do the
> > same thing) in the very first patch. It was only later that I realised
> > the full extend of the skbuff requirements.
> >
> > And no, NTA is too complex an allocator to do what I need. And more
> > specifically its design is quite contrary to what I have done. I
> > create/destroy an allocator instance per packet, you have one allocator
> > instance and serve multiple packets.
> >
> > > SROG is a wrapper on top of alloc_pages and list of free objects,
> > > there are "several" differencies between allocators and I do not see how
> > > they can compete right now.
> >
> > Yes allocators are build in layers, the page allocator the the basic
> > building block in Linux.
>
> Ok, it's your point.
> Actum est, ilicet.



2006-08-15 14:15:33

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 03:49:28PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Tue, 2006-08-15 at 16:34 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 15, 2006 at 02:03:25PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > On Tue, 2006-08-15 at 15:26 +0400, Evgeniy Polyakov wrote:
> > > > On Tue, Aug 15, 2006 at 12:55:02PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > > > > Userspace can sak for next packet and pointer to the new location will
> > > > > > be removed.
> > > > >
> > > > > /sak/ask/?
> > > > >
> > > > > I'm not understanding, if you have a page A with two packets, a and b;
> > > > > once you map that page into user-space that process has access to both
> > > > > packets, which is a security problem. How are you going to solve this?
> > > >
> > > > Yep, there is such issue.
> > > > But no one is ever going to replace socket code with zero-copy
> > > > interfaces - Linux has backward compatibility noone ever had, so
> > > > send()/recv() will be there.
> > >
> > > The new AIO network API should be able to provide the needed userspace
> > > changes.
> >
> > Kevent based network AIO already handle it.
> > This changes are different.
>
> You said you couldn't do zero-copy because the userspace interface
> didn't allow it.
> I assume the network AIO interface does, but still you have your initial
> allocation problem, one cannot put more than 1 packet on each page.

Kevent network AIO is completely different from network tree allocator.

> > > > TCP has 768kb limit on my amd64 with 1gb of ram, so I expect allocator
> > > > can handle all requests.
> > >
> > > But the capacity of the network stack is larger than this (arbitrary)
> > > limit. It is possible to have all 768kb worth of packets stuck on
> > > blocked sockets.
> >
> > And so what?
>
> Well, if all packets are stuck, you have a dead system.

Do you remember what we are talking about?
About the network tree allocator - it does not know how and where packets
are processed, it only provides memory.
It was shown several times that with an appropriate reserve it is possible
to satisfy all of the system's network requests, even if the system is in
OOM conditions.

> > You said, that separated from main allocator can not handle
> > all memory requests being done by the stack, as you can see it easily
> > can.
>
> It could if you can provide adequate detection of memory pressure and
> fallback to a degraded mode within the same allocator/stack and can
> guarantee limited service to critical parts.

It is not needed, since network allocations are separated from main
system ones.
I think I need to show an example here.

Let's say the main system works only with TCP, for simplicity.
Let's say the maximum allowed memory is limited to 1mb (it is 768k on a
machine with 1gb of ram).

So the network allocator reserves the above megabyte and works with it in a
smart way (without too much overhead).
Then the system goes into OOM and needs to swap out a page, and a
notification is sent to the remote swap storage.
The swap storage then sends an ack for that data; since network allocations
are separated from the main system ones, the network allocator easily gets 60
(or 4k, since it has a reserve, which exceeds the maximum allowed TCP memory
limit) bytes for the ack and processes that notification, thus "freeing" the
acked data, and the main system can work with that freed memory.
No need to detect OOM or anything else - it just works.

I expect you will give me an example where all of the above megabyte is
going to be stuck somewhere.
But... if it is not acked, each new packet takes the slow path, since VJ
header prediction fails, and hits the memory limit check, which will drop
that packet immediately without even trying to select a socket.

> > > > And there is a simple task in TODO list to dynamically grow cache when
> > > > threshold of memory is in use. It is really simple task and will be
> > > > implemented as soon as I complete suggestions mentioned by Andrew Morton.
> > >
> > > Growing will not help, the problem is you are out of memory, you cannot
> > > grow at that point.
> >
> > You do not see the point of network tree allocator.
> >
> > It can live with main system OOM since it has preallocated separate
> > pool, which can be increased when there is a requirement for that, for
> > example when system is not in OOM.
>
> It cannot increase enough, ever. The total capacity of the network stack
> is huge.
> And the sole problem I'm addressing is getting the system to work
> reliably in tight memory situations, that is during reclaim; one cannot
> decide to grow then, nor postpone, too late.

The network *is* limited; it is not a terabyte array which is going to be
placed into the VFS cache.

> > > skbuff_head_cache and skbuff_fclone_cache are SLABs.
> >
> > It is quite small part of the stack, isn't it?
> > And btw, they still suffer from SLAB design, since it is possibly to get
> > another smaller object right after all skbs are allocated from given page.
> > It is a minor thing of course, but nevertheless worh mentioning.
>
> Small but crucial, that is why I've been replacing all.

Sigh, replace kmem_cache_alloc() with avl_alloc() - it does not matter.

> > > Yes SLAB is a horrid thing on some points but very good at a lot of
> > > other things. But surely there are frequently used sizes, kmalloc will
> > > not know, but a developer with profiling tools might.
> >
> > Does not scale - admin must run system under profiling, add new
> > entries into kmalloc_sizes.h recompile the kernel... No way.
>
> s/admin/developer/
> It has been the way so far.

Could you say what the preferred sizes are on my testing machines here? :)
For example, MMIO-read based chips (the excellent realtek 8139 adapter) can
allocate not just 1500 bytes of memory, but the real size of the received frame.
I even used it for zero-copy receiving (really excellent hardware
for its price) into a VFS cache implementation (without any kind of
page-per-packet stuff), but that is not related to our discussion.

> > Plese check e1000_alloc_rx_buffers() function and rx_buffer_len value.
> > You are wrong.
>
> e1000_alloc_rx_buffers_ps()
>
> but it does not take away that any card/driver that does depend on
> higher order allocations is unreliable.

Have you seen how many adapters support packet split?

--
Evgeniy Polyakov

2006-08-15 14:50:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, 2006-08-15 at 18:15 +0400, Evgeniy Polyakov wrote:

> Kevent network AIO is completely different from network tree allocator.

How can that be? packets still need to be received, yes?

> So network allocator reserves above megabyte and works with it in a
> smart way (without too much overhead).
> Then system goes into OOM and requires to swap a page, which
> notification was sent to remote swap storage.
> Swap storage then sends an ack for that data, since network allocations
> are separated from main system ones, network allocator easily gets 60
> (or 4k, since it has a reserve, which exeeds maximum allowed TCP memory
> limit) bytes for ack and process than notification thus "freeing" acked
> data and main system can work with that free memory.
> No need to detect OOM or something other - it just works.
>
> I expect you will give me an example, when all above megabyte is going
> to be stuck somewhere.
> But... If it is not acked, each new packet goes slow path since VJ header
> prediction fails and falls into memory limit check which will drop that
> packet immediately without event trying to select a socket.

Not sure on the details; but you say: when we reach the threshold all
following packets will be dropped. So if you provide enough memory to
exceed the limit, you have some extra. If you then use that extra bit to
allow ACKs to pass through, then you're set.

Sounds good, but you'd have to carve a path for the ACKs, right? Or is
that already there?

Also, I'm worried about the effects of external fragmentation, esp. after
long run times. Analysing non-trivial memory allocators is hard, very
often too hard.

> > > > > And there is a simple task in TODO list to dynamically grow cache when
> > > > > threshold of memory is in use. It is really simple task and will be
> > > > > implemented as soon as I complete suggestions mentioned by Andrew Morton.
> > > >
> > > > Growing will not help, the problem is you are out of memory, you cannot
> > > > grow at that point.
> > >
> > > You do not see the point of network tree allocator.
> > >
> > > It can live with main system OOM since it has preallocated separate
> > > pool, which can be increased when there is a requirement for that, for
> > > example when system is not in OOM.
> >
> > It cannot increase enough, ever. The total capacity of the network stack
> > is huge.
> > And the sole problem I'm addressing is getting the system to work
> > reliably in tight memory situations, that is during reclaim; one cannot
> > decide to grow then, nor postpone, too late.
>
> Network *is* limited, it is not terabyte array which is going to be
> placed into VFS cache.

No it is not, but you bound it.

> > > > skbuff_head_cache and skbuff_fclone_cache are SLABs.
> > >
> > > It is quite small part of the stack, isn't it?
> > > And btw, they still suffer from SLAB design, since it is possibly to get
> > > another smaller object right after all skbs are allocated from given page.
> > > It is a minor thing of course, but nevertheless worh mentioning.
> >
> > Small but crucial, that is why I've been replacing all.
>
> Sigh, replace kmem_cache_alloc() with avl_alloc() - it does not matter.

It does matter: you need the whole packet, and if you cannot allocate an
sk_buff you're still stuck.

> > > > Yes SLAB is a horrid thing on some points but very good at a lot of
> > > > other things. But surely there are frequently used sizes, kmalloc will
> > > > not know, but a developer with profiling tools might.
> > >
> > > Does not scale - admin must run system under profiling, add new
> > > entries into kmalloc_sizes.h recompile the kernel... No way.
> >
> > s/admin/developer/
> > It has been the way so far.
>
> Could you say what are preferred sizes in my testing machines here? :)
> For example MMIO-read based chips (excellent realtek 8139 adapter) can
> allocate not only 1500 bytes of memory, but real size of received frame.
> I even used it for receiving zero-copy (really excellent hardware
> for it's price) into VFS cache implementation (without any kind of
> page-per-packet stuff), but it is not related to our discussion.

Well, generally the developer of the driver can say, and very often it
just doesn't matter, but look at the widespread use of private SLABs to see
there is benefit in manually tuning stuff.

> Have you seen how many adapters support packet split?

Not many I guess. That does not make higher order allocations any more
reliable.

2006-08-15 15:05:29

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 04:48:59PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Tue, 2006-08-15 at 18:15 +0400, Evgeniy Polyakov wrote:
>
> > Kevent network AIO is completely different from network tree allocator.
>
> How can that be? packets still need to be received, yes?

Kevent network AIO uses the usual alloc_skb(); naio is called when a packet
is received and checked at the IP layer, similar to where tcp_v4_do_rcv()
is called.

> > So network allocator reserves above megabyte and works with it in a
> > smart way (without too much overhead).
> > Then system goes into OOM and requires to swap a page, which
> > notification was sent to remote swap storage.
> > Swap storage then sends an ack for that data, since network allocations
> > are separated from main system ones, network allocator easily gets 60
> > (or 4k, since it has a reserve, which exeeds maximum allowed TCP memory
> > limit) bytes for ack and process than notification thus "freeing" acked
> > data and main system can work with that free memory.
> > No need to detect OOM or something other - it just works.
> >
> > I expect you will give me an example, when all above megabyte is going
> > to be stuck somewhere.
> > But... If it is not acked, each new packet goes slow path since VJ header
> > prediction fails and falls into memory limit check which will drop that
> > packet immediately without event trying to select a socket.

I mean without trying to queue data into the socket.

> Not sure on the details; but you say: when we reach the threshold all
> following packets will be dropped. So if you provide enough memory to
> exceed the limit, you have some extra. If you then use that extra bit to
> allow ACKs to pass through, then you're set.
>
> Sounds good, but you'd have to carve a path for the ACKs, right? Or is
> that already there?

Acks with or without attached data are processed before data queueing.
See tcp_rcv_established().
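Roughly, the ordering I mean looks like this (a userspace toy sketch only,
not the real tcp_rcv_established(); every name and number here is invented):

#include <stdio.h>

struct toy_segment {
	int acked_bytes;	/* bytes this segment acknowledges */
	int data_len;		/* payload it carries */
};

static int snd_outstanding = 100000;	/* unacked bytes we have sent */
static int rcv_queued, rcv_limit = 4096;

static void toy_rcv(struct toy_segment *seg)
{
	/* ack information is consumed first and releases sender-side data ... */
	snd_outstanding -= seg->acked_bytes;

	/* ... only the payload is subject to the receive memory limit */
	if (seg->data_len && rcv_queued + seg->data_len > rcv_limit) {
		printf("payload dropped, %d bytes still acked\n",
		       seg->acked_bytes);
		return;
	}
	rcv_queued += seg->data_len;
}

int main(void)
{
	struct toy_segment seg = { .acked_bytes = 60000, .data_len = 8192 };

	toy_rcv(&seg);
	printf("outstanding after ack: %d\n", snd_outstanding);
	return 0;
}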

> Also, I'm worried with the effects of external fragmentation esp. after
> long run times. Analysing non trivial memory allocators is hard, very
> often too hard.
>
> > > > > > And there is a simple task in TODO list to dynamically grow cache when
> > > > > > threshold of memory is in use. It is really simple task and will be
> > > > > > implemented as soon as I complete suggestions mentioned by Andrew Morton.
> > > > >
> > > > > Growing will not help, the problem is you are out of memory, you cannot
> > > > > grow at that point.
> > > >
> > > > You do not see the point of network tree allocator.
> > > >
> > > > It can live with main system OOM since it has preallocated separate
> > > > pool, which can be increased when there is a requirement for that, for
> > > > example when system is not in OOM.
> > >
> > > It cannot increase enough, ever. The total capacity of the network stack
> > > is huge.
> > > And the sole problem I'm addressing is getting the system to work
> > > reliably in tight memory situations, that is during reclaim; one cannot
> > > decide to grow then, nor postpone, too late.
> >
> > Network *is* limited, it is not terabyte array which is going to be
> > placed into VFS cache.
>
> No it is not, but you bound it.

I expect we will not convince each other :)

> > > > > skbuff_head_cache and skbuff_fclone_cache are SLABs.
> > > >
> > > > It is quite small part of the stack, isn't it?
> > > > And btw, they still suffer from SLAB design, since it is possibly to get
> > > > another smaller object right after all skbs are allocated from given page.
> > > > It is a minor thing of course, but nevertheless worh mentioning.
> > >
> > > Small but crucial, that is why I've been replacing all.
> >
> > Sigh, replace kmem_cache_alloc() with avl_alloc() - it does not matter.
>
> It does matter, you need the whole packet, if you cannot allocate a
> sk_buff you're still stuck.

Multiply the above megabyte by 2.5 (if we assume the smallest packet is
100 bytes and add 150 bytes of sk_buff overhead) and you win.
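Spelled out (plain userspace arithmetic; the 100 and 150 byte figures are
just the assumptions above):

#include <stdio.h>

int main(void)
{
	const unsigned long data_limit = 1UL << 20;	/* 1mb of packet data  */
	const unsigned long pkt = 100;			/* smallest packet     */
	const unsigned long skb_overhead = 150;		/* per-packet sk_buff  */
	unsigned long packets = data_limit / pkt;
	unsigned long reserve = packets * (pkt + skb_overhead);

	/* prints ~2.5mb: data plus metadata for every packet */
	printf("packets: %lu, reserve: %lu bytes\n", packets, reserve);
	return 0;
}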

> > > > > Yes SLAB is a horrid thing on some points but very good at a lot of
> > > > > other things. But surely there are frequently used sizes, kmalloc will
> > > > > not know, but a developer with profiling tools might.
> > > >
> > > > Does not scale - admin must run system under profiling, add new
> > > > entries into kmalloc_sizes.h recompile the kernel... No way.
> > >
> > > s/admin/developer/
> > > It has been the way so far.
> >
> > Could you say what are preferred sizes in my testing machines here? :)
> > For example MMIO-read based chips (excellent realtek 8139 adapter) can
> > allocate not only 1500 bytes of memory, but real size of received frame.
> > I even used it for receiving zero-copy (really excellent hardware
> > for it's price) into VFS cache implementation (without any kind of
> > page-per-packet stuff), but it is not related to our discussion.
>
> Well generally the developer of the driver can say, and very often it
> just doesn't matter, but see the wide spread use of private SLABs to see
> there is benefit in manually tuning stuff.

It cannot be tuned for networking, since a packet can have any size from
about 60 bytes to 16k. Doing tricky stuff with allocation will end up as
NTA in one way or another.

--
Evgeniy Polyakov

2006-08-15 15:08:05

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 07:05:07PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> > > So network allocator reserves above megabyte and works with it in a
> > > smart way (without too much overhead).
> > > Then system goes into OOM and requires to swap a page, which
> > > notification was sent to remote swap storage.
> > > Swap storage then sends an ack for that data, since network allocations
> > > are separated from main system ones, network allocator easily gets 60
> > > (or 4k, since it has a reserve, which exeeds maximum allowed TCP memory
> > > limit) bytes for ack and process than notification thus "freeing" acked
> > > data and main system can work with that free memory.
> > > No need to detect OOM or something other - it just works.
> > >
> > > I expect you will give me an example, when all above megabyte is going
> > > to be stuck somewhere.
> > > But... If it is not acked, each new packet goes slow path since VJ header
> > > prediction fails and falls into memory limit check which will drop that
> > > packet immediately without event trying to select a socket.
>
> I mean without trying to queue data into socket.
>
> > Not sure on the details; but you say: when we reach the threshold all
> > following packets will be dropped. So if you provide enough memory to
> > exceed the limit, you have some extra. If you then use that extra bit to
> > allow ACKs to pass through, then you're set.
> >
> > Sounds good, but you'd have to carve a path for the ACKs, right? Or is
> > that already there?
>
> Acks with or without attached data are processed before data queueing.
> See tcp_rcv_established().

Just for clarification: we are talking about the slow path above.

--
Evgeniy Polyakov

2006-08-15 17:42:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, 2006-08-15 at 19:05 +0400, Evgeniy Polyakov wrote:

> > Not sure on the details; but you say: when we reach the threshold all
> > following packets will be dropped. So if you provide enough memory to
> > exceed the limit, you have some extra. If you then use that extra bit to
> > allow ACKs to pass through, then you're set.
> >
> > Sounds good, but you'd have to carve a path for the ACKs, right? Or is
> > that already there?
>
> Acks with or without attached data are processed before data queueing.
> See tcp_rcv_established().

Right, however I just realised that most storage protocols (layer 7)
have their own ACK msgs and do not rely on TCP (layer 4) ACKs like this.

So I would like to come back on this: I do need a full data channel
open.

2006-08-15 17:49:43

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 07:42:16PM +0200, Peter Zijlstra ([email protected]) wrote:
> Right, however I just realised that most storage protocols (level 7)
> have their own ACK msgs and do not rely on TCP (level 4) ACKs like this.
>
> So I would like to come back on this, I do need a full data channel
> open.

In that case you cannot solve the problem with an emergency pool until you
mark all needed sockets as capable of using it.
Global socket limits are still there in sk_stream_alloc_pskb().

--
Evgeniy Polyakov

2006-08-16 02:52:48

by Bill Fink

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, 15 Aug 2006, Evgeniy Polyakov wrote:

> On Tue, Aug 15, 2006 at 03:49:28PM +0200, Peter Zijlstra ([email protected]) wrote:
>
> > It could if you can provide adequate detection of memory pressure and
> > fallback to a degraded mode within the same allocator/stack and can
> > guarantee limited service to critical parts.
>
> It is not needed, since network allocations are separated from main
> system ones.
> I think I need to show an example here.
>
> Let's main system works only with TCP for simplicity.
> Let's maximum allowed memory is limited by 1mb (it is 768k on machine
> with 1gb of ram).

The maximum amount of memory available for TCP on a system with 1 GB
of memory is 768 MB (not 768 KB).

[bill@chance4 ~]$ cat /proc/meminfo
MemTotal: 1034924 kB
...

[bill@chance4 ~]$ cat /proc/sys/net/ipv4/tcp_mem
98304 131072 196608

Since tcp_mem is in pages (4K in this case), maximum TCP memory
is 196608*4K or 768 MB.

Or am I missing something obvious?

-Bill

2006-08-16 05:39:26

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Tue, Aug 15, 2006 at 10:52:37PM -0400, Bill Fink ([email protected]) wrote:
> > Let's main system works only with TCP for simplicity.
> > Let's maximum allowed memory is limited by 1mb (it is 768k on machine
> > with 1gb of ram).
>
> The maximum amount of memory available for TCP on a system with 1 GB
> of memory is 768 MB (not 768 KB).

It does not matter; let it be 100mb or any other number, since the
allocation is separated and does not depend on the main system one.
The network allocator can steal pages from the main one, but it does not
suffer from SLAB OOM.

Btw, I have a system with 1gb of ram, a 1.5gb low-mem tcp limit and a
3gb high-mem tcp memory limit, calculated automatically.

--
Evgeniy Polyakov

2006-08-16 07:52:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: [PATCH2 1/1] network memory allocator.

Hello.

Network tree allocator can be used to allocate memory for all network
operations from any context.

Changes from previous release:
* added dynamically grown cache
* changed some inline issues
* reduced code size
* removed AVL tree implementation from the sources
* changed minimum allocation size to l1 cache line size (some arches require that)
* removed skb->__tsize parameter
* added a lot of comments
* a lot of small cleanups

A trivial epoll based web server achieved more than 2450 requests per
second with this version (the usual numbers are about 1600-1800 when plain
kmalloc is used for all network operations).

Network allocator design and implementation notes, as well as performance
and fragmentation analysis, can be found at the project homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=nta

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19c96d4..f550f95 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -327,6 +327,10 @@ #include <linux/slab.h>

#include <asm/system.h>

+extern void *avl_alloc(unsigned int size, gfp_t gfp_mask);
+extern void avl_free(void *ptr, unsigned int size);
+extern int avl_init(void);
+
extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
diff --git a/net/core/Makefile b/net/core/Makefile
index 2645ba4..d86d468 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -10,6 +10,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.
obj-y += dev.o ethtool.o dev_mcast.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o

+obj-y += alloc/
+
obj-$(CONFIG_XFRM) += flow.o
obj-$(CONFIG_SYSFS) += net-sysfs.o
obj-$(CONFIG_NET_DIVERT) += dv.o
diff --git a/net/core/alloc/Makefile b/net/core/alloc/Makefile
new file mode 100644
index 0000000..21b7c51
--- /dev/null
+++ b/net/core/alloc/Makefile
@@ -0,0 +1,3 @@
+obj-y := allocator.o
+
+allocator-y := avl.o
diff --git a/net/core/alloc/avl.c b/net/core/alloc/avl.c
new file mode 100644
index 0000000..c404cbe
--- /dev/null
+++ b/net/core/alloc/avl.c
@@ -0,0 +1,651 @@
+/*
+ * avl.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/percpu.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/skbuff.h>
+
+#include "avl.h"
+
+static struct avl_allocator_data avl_allocator[NR_CPUS];
+
+/*
+ * Get node pointer from address.
+ */
+static inline struct avl_node *avl_get_node_ptr(unsigned long ptr)
+{
+ struct page *page = virt_to_page(ptr);
+ struct avl_node *node = (struct avl_node *)(page->lru.next);
+
+ return node;
+}
+
+/*
+ * Set node pointer for page for given address.
+ */
+static void avl_set_node_ptr(unsigned long ptr, struct avl_node *node, int order)
+{
+ int nr_pages = 1<<order, i;
+ struct page *page = virt_to_page(ptr);
+
+ for (i=0; i<nr_pages; ++i) {
+ page->lru.next = (void *)node;
+ page++;
+ }
+}
+
+/*
+ * Get allocation CPU from address.
+ */
+static inline int avl_get_cpu_ptr(unsigned long ptr)
+{
+ struct page *page = virt_to_page(ptr);
+ int cpu = (int)(unsigned long)(page->lru.prev);
+
+ return cpu;
+}
+
+/*
+ * Set allocation cpu for page for given address.
+ */
+static void avl_set_cpu_ptr(unsigned long ptr, int cpu, int order)
+{
+ int nr_pages = 1<<order, i;
+ struct page *page = virt_to_page(ptr);
+
+ for (i=0; i<nr_pages; ++i) {
+ page->lru.prev = (void *)(unsigned long)cpu;
+ page++;
+ }
+}
+
+/*
+ * Convert pointer to node's value.
+ * Node's value is a start address for contiguous chunk bound to given node.
+ */
+static inline unsigned long avl_ptr_to_value(void *ptr)
+{
+ struct avl_node *node = avl_get_node_ptr((unsigned long)ptr);
+ return node->value;
+}
+
+/*
+ * Convert pointer into offset from start address of the contiguous chunk
+ * allocated for appropriate node.
+ */
+static inline int avl_ptr_to_offset(void *ptr)
+{
+ return ((unsigned long)ptr - avl_ptr_to_value(ptr))/AVL_MIN_SIZE;
+}
+
+/*
+ * Count number of bits set down (until first unset is met in a mask)
+ * to the smaller addresses including bit at @pos in @mask.
+ */
+unsigned int avl_count_set_down(unsigned long *mask, unsigned int pos)
+{
+ unsigned int stop, bits = 0;
+ int idx;
+ unsigned long p, m;
+
+ idx = pos/BITS_PER_LONG;
+ pos = pos%BITS_PER_LONG;
+
+ while (idx >= 0) {
+ m = (~0UL>>pos)<<pos;
+ p = mask[idx] | m;
+
+ if (!(mask[idx] & m))
+ break;
+
+ stop = fls(~p);
+
+ if (!stop) {
+ bits += pos + 1;
+ pos = BITS_PER_LONG - 1;
+ idx--;
+ } else {
+ bits += pos - stop + 1;
+ break;
+ }
+ }
+
+ return bits;
+}
+
+/*
+ * Count number of bits set up (until first unset is met in a mask)
+ * to the bigger addresses including bit at @pos in @mask.
+ */
+unsigned int avl_count_set_up(unsigned long *mask, unsigned int mask_num,
+ unsigned int pos)
+{
+ unsigned int idx, stop, bits = 0;
+ unsigned long p, m;
+
+ idx = pos/BITS_PER_LONG;
+ pos = pos%BITS_PER_LONG;
+
+ while (idx < mask_num) {
+ if (!pos)
+ m = 0;
+ else
+ m = (~0UL<<(BITS_PER_LONG-pos))>>(BITS_PER_LONG-pos);
+ p = mask[idx] | m;
+
+ if (!(mask[idx] & ~m))
+ break;
+
+ stop = ffs(~p);
+
+ if (!stop) {
+ bits += BITS_PER_LONG - pos;
+ pos = 0;
+ idx++;
+ } else {
+ bits += stop - pos - 1;
+ break;
+ }
+ }
+
+ return bits;
+}
+
+/*
+ * Fill @num bits from position @pos up with bit value @bit in a @mask.
+ */
+
+static void avl_fill_bits(unsigned long *mask, unsigned int mask_size,
+ unsigned int pos, unsigned int num, unsigned int bit)
+{
+ unsigned int idx, start;
+
+ idx = pos/BITS_PER_LONG;
+ start = pos%BITS_PER_LONG;
+
+ while (num && idx < mask_size) {
+ unsigned long m = ((~0UL)>>start)<<start;
+
+ if (start + num <= BITS_PER_LONG) {
+ unsigned long upper_bits = BITS_PER_LONG - (start+num);
+
+ m = (m<<upper_bits)>>upper_bits;
+ }
+
+ if (bit)
+ mask[idx] |= m;
+ else
+ mask[idx] &= ~m;
+
+ if (start + num <= BITS_PER_LONG)
+ num = 0;
+ else {
+ num -= BITS_PER_LONG - start;
+ start = 0;
+ idx++;
+ }
+ }
+}
+
+/*
+ * Add free chunk into array.
+ */
+static inline void avl_container_insert(struct avl_container *c, unsigned int pos, int cpu)
+{
+ list_add_tail(&c->centry, &avl_allocator[cpu].avl_container_array[pos]);
+}
+
+/*
+ * Update node's bitmask of free/used chunks.
+ * If processed chunk size is bigger than requested one,
+ * split it and add the rest into list of free chunks with appropriate size.
+ */
+static void avl_update_node(struct avl_container *c, unsigned int cpos, unsigned int size)
+{
+ struct avl_node *node = avl_get_node_ptr((unsigned long)c->ptr);
+ unsigned int num = AVL_ALIGN(size)/AVL_MIN_SIZE;
+
+ BUG_ON(cpos < num - 1);
+
+ avl_fill_bits(node->mask, ARRAY_SIZE(node->mask), avl_ptr_to_offset(c->ptr), num, 0);
+
+ if (cpos != num-1) {
+ void *ptr = c->ptr + size;
+ c = ptr;
+ c->ptr = ptr;
+
+ cpos -= num;
+
+ avl_container_insert(c, cpos, smp_processor_id());
+ }
+}
+
+/*
+ * Dereference free chunk into container and add it into list of free
+ * chunks with appropriate size.
+ */
+static int avl_container_add(void *ptr, unsigned int size, int cpu)
+{
+ struct avl_container *c = ptr;
+ unsigned int pos = AVL_ALIGN(size)/AVL_MIN_SIZE-1;
+
+ if (!size)
+ return -EINVAL;
+
+ c->ptr = ptr;
+ avl_container_insert(c, pos, cpu);
+
+ return 0;
+}
+
+/*
+ * Dequeue first free chunk from the list.
+ */
+static inline struct avl_container *avl_dequeue(struct list_head *head)
+{
+ struct avl_container *cnt;
+
+ cnt = list_entry(head->next, struct avl_container, centry);
+ list_del(&cnt->centry);
+
+ return cnt;
+}
+
+/*
+ * Add new node entry into network allocator.
+ * Must be called with preemption disabled.
+ */
+static void avl_node_entry_commit(struct avl_node_entry *entry, int cpu)
+{
+ int i, idx, off;
+
+ idx = off = 0;
+ for (i=0; i<entry->avl_node_num; ++i) {
+ struct avl_node *node;
+
+ node = &entry->avl_node_array[idx][off];
+
+ if (++off >= AVL_NODES_ON_PAGE) {
+ idx++;
+ off = 0;
+ }
+
+ avl_set_cpu_ptr(node->value, cpu, entry->avl_node_order);
+ avl_set_node_ptr(node->value, node, entry->avl_node_order);
+ avl_container_add((void *)node->value, (1<<entry->avl_node_order)<<PAGE_SHIFT, cpu);
+ }
+}
+
+/*
+ * Simple cache growing function - allocate as much as possible,
+ * but no more than @AVL_NODE_NUM blocks when there is a need for that.
+ */
+static struct avl_node_entry *avl_node_entry_alloc(gfp_t gfp_mask, int order)
+{
+ struct avl_node_entry *entry;
+ int i, num = 0, idx, off;
+ unsigned long ptr;
+
+ entry = kzalloc(sizeof(struct avl_node_entry), gfp_mask);
+ if (!entry)
+ return NULL;
+
+ entry->avl_node_array = kzalloc(AVL_NODE_PAGES * sizeof(void *), gfp_mask);
+ if (!entry->avl_node_array)
+ goto err_out_free_entry;
+
+ for (i=0; i<AVL_NODE_PAGES; ++i) {
+ entry->avl_node_array[i] = (struct avl_node *)__get_free_page(gfp_mask);
+ if (!entry->avl_node_array[i]) {
+ num = i;
+ goto err_out_free;
+ }
+ }
+
+ idx = off = 0;
+
+ for (i=0; i<AVL_NODE_NUM; ++i) {
+ struct avl_node *node;
+
+ ptr = __get_free_pages(gfp_mask, order);
+ if (!ptr)
+ break;
+
+ node = &entry->avl_node_array[idx][off];
+
+ if (++off >= AVL_NODES_ON_PAGE) {
+ idx++;
+ off = 0;
+ }
+
+ node->value = ptr;
+ memset(node->mask, 0, sizeof(node->mask));
+ avl_fill_bits(node->mask, ARRAY_SIZE(node->mask), 0, ((1<<order)<<PAGE_SHIFT)/AVL_MIN_SIZE, 1);
+ }
+
+ ulog("%s: entry: %p, node: %u, node_pages: %lu, node_num: %lu, order: %d, allocated: %d, container: %u, max_size: %u, min_size: %u, bits: %u.\n",
+ __func__, entry, sizeof(struct avl_node), AVL_NODE_PAGES, AVL_NODE_NUM, order,
+ i, AVL_CONTAINER_ARRAY_SIZE, AVL_MAX_SIZE, AVL_MIN_SIZE, ((1<<order)<<PAGE_SHIFT)/AVL_MIN_SIZE);
+
+ if (i == 0)
+ goto err_out_free;
+
+ entry->avl_node_num = i;
+ entry->avl_node_order = order;
+
+ return entry;
+
+err_out_free:
+ for (i=0; i<AVL_NODE_PAGES; ++i)
+ free_page((unsigned long)entry->avl_node_array[i]);
+err_out_free_entry:
+ kfree(entry);
+ return NULL;
+}
+
+/*
+ * Allocate memory region with given size and mode.
+ * If allocation fails due to unsupported order, NULL is returned;
+ * otherwise a new node entry is allocated with given mode and the
+ * allocation is retried. Cache growing happens only with 0-order allocations.
+ */
+void *avl_alloc(unsigned int size, gfp_t gfp_mask)
+{
+ unsigned int i, try = 0;
+ void *ptr = NULL;
+ unsigned long flags;
+
+ size = AVL_ALIGN(size);
+
+ if (size > AVL_MAX_SIZE || size < AVL_MIN_SIZE) {
+ /*
+ * Print info about unsupported order so user could send a "bug report"
+ * or increase initial allocation order.
+ */
+ if (get_order(size) > AVL_ORDER && net_ratelimit()) {
+ printk(KERN_INFO "%s: Failed to allocate %u bytes with %02x mode, order %u, max order %u.\n",
+ __func__, size, gfp_mask, get_order(size), AVL_ORDER);
+ WARN_ON(1);
+ }
+
+ return NULL;
+ }
+
+ local_irq_save(flags);
+repeat:
+ for (i=size/AVL_MIN_SIZE-1; i<AVL_CONTAINER_ARRAY_SIZE; ++i) {
+ struct list_head *head = &avl_allocator[smp_processor_id()].avl_container_array[i];
+ struct avl_container *c;
+
+ if (!list_empty(head)) {
+ c = avl_dequeue(head);
+ ptr = c->ptr;
+ avl_update_node(c, i, size);
+ break;
+ }
+ }
+ local_irq_restore(flags);
+
+ if (!ptr && !try) {
+ struct avl_node_entry *entry;
+
+ try = 1;
+
+ entry = avl_node_entry_alloc(gfp_mask, 0);
+ if (entry) {
+ local_irq_save(flags);
+ avl_node_entry_commit(entry, smp_processor_id());
+ goto repeat;
+ }
+
+ }
+
+
+ return ptr;
+}
+
+/*
+ * Remove free chunk from the list.
+ */
+static inline struct avl_container *avl_search_container(void *ptr, unsigned int idx, int cpu)
+{
+ struct avl_container *c = ptr;
+
+ list_del(&c->centry);
+ c->ptr = ptr;
+
+ return c;
+}
+
+/*
+ * Combine neighbour free chunks into the one with bigger size
+ * and put new chunk into list of free chunks with appropriate size.
+ */
+static void avl_combine(struct avl_node *node, void *lp, unsigned int lbits, void *rp, unsigned int rbits,
+ void *cur_ptr, unsigned int cur_bits, int cpu)
+{
+ struct avl_container *lc, *rc, *c;
+ unsigned int idx;
+ void *ptr;
+
+ lc = rc = c = NULL;
+ idx = cur_bits - 1;
+ ptr = cur_ptr;
+
+ c = (struct avl_container *)cur_ptr;
+ c->ptr = cur_ptr;
+
+ if (rp) {
+ rc = avl_search_container(rp, rbits-1, cpu);
+ if (!rc) {
+ printk(KERN_ERR "%p.%p: Failed to find a container for right pointer %p, rbits: %u.\n",
+ node, cur_ptr, rp, rbits);
+ BUG();
+ }
+
+ c = rc;
+ idx += rbits;
+ ptr = c->ptr;
+ }
+
+ if (lp) {
+ lc = avl_search_container(lp, lbits-1, cpu);
+ if (!lc) {
+ printk(KERN_ERR "%p.%p: Failed to find a container for left pointer %p, lbits: %u.\n",
+ node, cur_ptr, lp, lbits);
+ BUG();
+ }
+
+ idx += lbits;
+ ptr = c->ptr;
+ }
+ avl_container_insert(c, idx, cpu);
+}
+
+/*
+ * Free memory region of given size.
+ * Must be called on the same CPU where the allocation happened,
+ * with interrupts disabled.
+ */
+static void __avl_free_local(void *ptr, unsigned int size)
+{
+ unsigned long val = avl_ptr_to_value(ptr);
+ unsigned int pos, idx, sbits = AVL_ALIGN(size)/AVL_MIN_SIZE;
+ unsigned int rbits, lbits, cpu = avl_get_cpu_ptr(val);
+ struct avl_node *node;
+ unsigned long p;
+ void *lp, *rp;
+
+ node = avl_get_node_ptr((unsigned long)ptr);
+
+ pos = avl_ptr_to_offset(ptr);
+ idx = pos/BITS_PER_LONG;
+
+ p = node->mask[idx] >> (pos%BITS_PER_LONG);
+
+ if ((p & 1)) {
+ printk(KERN_ERR "%p.%p: Broken pointer: value: %lx, pos: %u, idx: %u, mask: %lx, p: %lx.\n",
+ node, ptr, val, pos, idx, node->mask[idx], p);
+ BUG();
+ }
+
+ avl_fill_bits(node->mask, ARRAY_SIZE(node->mask), pos, sbits, 1);
+
+ lp = rp = NULL;
+ rbits = lbits = 0;
+
+ idx = (pos+sbits)/BITS_PER_LONG;
+ p = (pos+sbits)%BITS_PER_LONG;
+
+ if ((node->mask[idx] >> p) & 1) {
+ lbits = avl_count_set_up(node->mask, ARRAY_SIZE(node->mask), pos+sbits);
+ if (lbits) {
+ lp = (void *)(val + (pos + sbits)*AVL_MIN_SIZE);
+ }
+ }
+
+ if (pos) {
+ idx = (pos-1)/BITS_PER_LONG;
+ p = (pos-1)%BITS_PER_LONG;
+ if ((node->mask[idx] >> p) & 1) {
+ rbits = avl_count_set_down(node->mask, pos-1);
+ if (rbits) {
+ rp = (void *)(val + (pos-rbits)*AVL_MIN_SIZE);
+ }
+ }
+ }
+
+ avl_combine(node, lp, lbits, rp, rbits, ptr, sbits, cpu);
+}
+
+/*
+ * Free memory region of given size.
+ * If freeing CPU is not the same as allocation one, chunk will
+ * be placed into list of to-be-freed objects on allocation CPU,
+ * otherwise chunk will be freed and combined with neighbours.
+ * Must be called with disabled interrupts.
+ */
+static void __avl_free(void *ptr, unsigned int size)
+{
+ int cpu = avl_get_cpu_ptr((unsigned long)ptr);
+
+ if (cpu != smp_processor_id()) {
+ struct avl_free_list *l, *this = ptr;
+ struct avl_allocator_data *alloc = &avl_allocator[cpu];
+
+ this->cpu = smp_processor_id();
+ this->size = size;
+
+ spin_lock(&alloc->avl_free_lock);
+ l = alloc->avl_free_list_head;
+ alloc->avl_free_list_head = this;
+ this->next = l;
+ spin_unlock(&alloc->avl_free_lock);
+ return;
+ }
+
+ __avl_free_local(ptr, size);
+}
+
+/*
+ * Free memory region of given size.
+ */
+void avl_free(void *ptr, unsigned int size)
+{
+ unsigned long flags;
+ struct avl_free_list *l;
+ struct avl_allocator_data *alloc;
+
+ local_irq_save(flags);
+ __avl_free(ptr, size);
+
+ alloc = &avl_allocator[smp_processor_id()];
+
+ while (alloc->avl_free_list_head) {
+ spin_lock(&alloc->avl_free_lock);
+ l = alloc->avl_free_list_head;
+ alloc->avl_free_list_head = l->next;
+ spin_unlock(&alloc->avl_free_lock);
+ __avl_free_local(l, l->size);
+ };
+ local_irq_restore(flags);
+}
+
+/*
+ * Initialize per-cpu allocator data.
+ */
+static int avl_init_cpu(int cpu)
+{
+ unsigned int i;
+ struct avl_allocator_data *alloc = &avl_allocator[cpu];
+ struct avl_node_entry *entry;
+
+ spin_lock_init(&alloc->avl_free_lock);
+
+ alloc->avl_container_array = kzalloc(sizeof(struct list_head) * AVL_CONTAINER_ARRAY_SIZE, GFP_KERNEL);
+ if (!alloc->avl_container_array)
+ goto err_out_exit;
+
+ for (i=0; i<AVL_CONTAINER_ARRAY_SIZE; ++i)
+ INIT_LIST_HEAD(&alloc->avl_container_array[i]);
+
+ entry = avl_node_entry_alloc(GFP_KERNEL, AVL_ORDER);
+ if (!entry)
+ goto err_out_free_container;
+
+ avl_node_entry_commit(entry, cpu);
+
+ return 0;
+
+err_out_free_container:
+ kfree(alloc->avl_container_array);
+err_out_exit:
+ return -ENOMEM;
+}
+
+/*
+ * Initialize network allocator.
+ */
+int avl_init(void)
+{
+ int err, cpu;
+
+ for_each_possible_cpu(cpu) {
+ err = avl_init_cpu(cpu);
+ if (err)
+ goto err_out;
+ }
+
+ printk(KERN_INFO "Network tree allocator has been initialized.\n");
+ return 0;
+
+err_out:
+ panic("Failed to initialize network allocator.\n");
+
+ return -ENOMEM;
+}
diff --git a/net/core/alloc/avl.h b/net/core/alloc/avl.h
new file mode 100644
index 0000000..600b66a
--- /dev/null
+++ b/net/core/alloc/avl.h
@@ -0,0 +1,117 @@
+/*
+ * avl.h
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __AVL_H
+#define __AVL_H
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <asm/page.h>
+
+//#define AVL_DEBUG
+
+#ifdef AVL_DEBUG
+#define ulog(f, a...) printk(f, ##a)
+#else
+#define ulog(f, a...)
+#endif
+
+/*
+ * Network tree allocator variables.
+ */
+
+#define AVL_ALIGN_SIZE L1_CACHE_BYTES
+#define AVL_ALIGN(x) ALIGN(x, AVL_ALIGN_SIZE)
+
+#define AVL_ORDER 3 /* Maximum allocation order */
+#define AVL_BITS 3 /* Must cover maximum number of pages used for allocation pools */
+
+#define AVL_NODES_ON_PAGE (PAGE_SIZE/sizeof(struct avl_node))
+#define AVL_NODE_NUM (1UL<<AVL_BITS)
+#define AVL_NODE_PAGES ((AVL_NODE_NUM+AVL_NODES_ON_PAGE-1)/AVL_NODES_ON_PAGE)
+
+#define AVL_MIN_SIZE AVL_ALIGN_SIZE
+#define AVL_MAX_SIZE ((1<<AVL_ORDER) << PAGE_SHIFT)
+
+#define AVL_CONTAINER_ARRAY_SIZE (AVL_MAX_SIZE/AVL_MIN_SIZE)
+
+/*
+ * Meta-information container for each contiguous block used in allocation.
+ * @value - start address of the contiguous block.
+ * @mask - bitmask of free and used chunks [1 - free, 0 - used].
+ */
+struct avl_node
+{
+ unsigned long value;
+ DECLARE_BITMAP(mask, AVL_MAX_SIZE/AVL_MIN_SIZE);
+};
+
+/*
+ * Free chunks are dereferenced into this structure and placed into LIFO list.
+ */
+
+struct avl_container
+{
+ void *ptr;
+ struct list_head centry;
+};
+
+/*
+ * When freeing happens on a CPU different from the allocation one,
+ * the chunk is dereferenced into this structure and placed into a
+ * single-linked list in the allocation CPU's private area.
+ */
+
+struct avl_free_list
+{
+ struct avl_free_list *next;
+ unsigned int size;
+ unsigned int cpu;
+};
+
+/*
+ * Each array of nodes is placed into a dynamically grown list.
+ * @avl_node_num - number of nodes in @avl_node_array
+ * @avl_node_order - allocation order of the chunk bound to each node
+ * @avl_node_array - array of nodes (linked into pages)
+ */
+
+struct avl_node_entry
+{
+ unsigned int avl_node_order, avl_node_num;
+ struct avl_node **avl_node_array;
+};
+
+/*
+ * Main per-cpu allocator structure.
+ * @avl_container_array - array of lists of free chunks indexed by size of the elements
+ * @avl_free_list_head - single-linked list of objects whose freeing was started on a different CPU
+ * @avl_free_lock - lock protecting avl_free_list_head
+ */
+struct avl_allocator_data
+{
+ struct list_head *avl_container_array;
+ struct avl_free_list *avl_free_list_head;
+ spinlock_t avl_free_lock;
+};
+
+
+#endif /* __AVL_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 022d889..d10af88 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -156,7 +156,7 @@ struct sk_buff *__alloc_skb(unsigned int

/* Get the DATA. Size must match skb_add_mtu(). */
size = SKB_DATA_ALIGN(size);
- data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ data = avl_alloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

@@ -223,7 +223,7 @@ struct sk_buff *alloc_skb_from_cache(kme

/* Get the DATA. */
size = SKB_DATA_ALIGN(size);
- data = kmem_cache_alloc(cp, gfp_mask);
+ data = avl_alloc(size, gfp_mask);
if (!data)
goto nodata;

@@ -313,7 +313,7 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);

- kfree(skb->head);
+ avl_free(skb->head, skb->end - skb->head + sizeof(struct skb_shared_info));
}
}

@@ -688,7 +688,7 @@ int pskb_expand_head(struct sk_buff *skb

size = SKB_DATA_ALIGN(size);

- data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ data = avl_alloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

@@ -2057,6 +2057,9 @@ void __init skb_init(void)
NULL, NULL);
if (!skbuff_fclone_cache)
panic("cannot create skbuff cache");
+
+ if (avl_init())
+ panic("Failed to initialize network tree allocator.\n");
}

EXPORT_SYMBOL(___pskb_trim);


--
Evgeniy Polyakov

2006-08-16 08:48:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 09:35:46AM +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 15, 2006 at 10:21:22PM +0200, Arnd Bergmann ([email protected]) wrote:
> > Am Monday 14 August 2006 13:04 schrieb Evgeniy Polyakov:
> > > ?* full per CPU allocation and freeing (objects are never freed on
> > > ????????different CPU)
> >
> > Many of your data structures are per cpu, but your underlying allocations
> > are all using regular kzalloc/__get_free_page/__get_free_pages functions.
> > Shouldn't these be converted to calls to kmalloc_node and alloc_pages_node
> > in order to get better locality on NUMA systems?
> >
> > OTOH, we have recently experimented with doing the dev_alloc_skb calls
> > with affinity to the NUMA node that holds the actual network adapter, and
> > got significant improvements on the Cell blade server. That of course
> > may be a conflicting goal since it would mean having per-cpu per-node
> > page pools if any CPU is supposed to be able to allocate pages for use
> > as DMA buffers on any node.
>
> Doesn't alloc_pages() automatically switches to alloc_pages_node() or
> alloc_pages_current()?

That's not what's wanted. If you have a slow interconnect you always want
to allocate memory on the node the network device is attached to.

2006-08-16 09:04:33

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 09:48:08AM +0100, Christoph Hellwig ([email protected]) wrote:
> > Doesn't alloc_pages() automatically switches to alloc_pages_node() or
> > alloc_pages_current()?
>
> That's not what's wanted. If you have a slow interconnect you always want
> to allocate memory on the node the network device is attached to.

There is a drawback here - if data was allocated on the CPU where the NIC
is "closer" and then processed on a different CPU, it will cost more than
the case where the buffer was allocated on the CPU where it will be processed.

But from the other point of view, most adapters preallocate a set of
skbs, and with MSI-X's help there will be a possibility to bind the irq and
processing to the CPU where the data was originally allocated.

So I would like to know how to determine which node should be used for
allocation. Changes of __get_user_pages() to alloc_pages_node() are
trivial.

--
Evgeniy Polyakov

2006-08-16 09:05:50

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

From: Evgeniy Polyakov <[email protected]>
Date: Wed, 16 Aug 2006 13:00:31 +0400

> So I would like to know how to determine which node should be used for
> allocation. Changes of __get_user_pages() to alloc_pages_node() are
> trivial.

netdev_alloc_skb() knows the netdevice, and therefore you can
obtain the "struct device;" referenced inside of the netdev,
and therefore you can determine the node using the struct
device.

Christophe is working on adding support for this using existing
allocator.

2006-08-16 09:10:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Wed, 16 Aug 2006 13:00:31 +0400
>
> > So I would like to know how to determine which node should be used for
> > allocation. Changes of __get_user_pages() to alloc_pages_node() are
> > trivial.
>
> netdev_alloc_skb() knows the netdevice, and therefore you can
> obtain the "struct device;" referenced inside of the netdev,
> and therefore you can determine the node using the struct
> device.

It's not that easy unfortunately. I did what you describe above in my
first prototype but then found out the hard way that the struct device
in the netdevice can be a non-pci one, e.g. for pcmcia. In that case
the kernel will crash on you because you can only get the node information
for pci devices. My current patchkit adds an "int node" member to struct
net_device instead. I can repost the patchkit on top of -mm (which has
the required slab memory leak tracking changes) if anyone cares.

2006-08-16 09:32:51

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 10:10:29AM +0100, Christoph Hellwig ([email protected]) wrote:
> On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
> > From: Evgeniy Polyakov <[email protected]>
> > Date: Wed, 16 Aug 2006 13:00:31 +0400
> >
> > > So I would like to know how to determine which node should be used for
> > > allocation. Changes of __get_user_pages() to alloc_pages_node() are
> > > trivial.
> >
> > netdev_alloc_skb() knows the netdevice, and therefore you can
> > obtain the "struct device;" referenced inside of the netdev,
> > and therefore you can determine the node using the struct
> > device.
>
> It's not that easy unfortunately. I did what you describe above in my
> first prototype but then found out the hard way that the struct device
> in the netdevice can be a non-pci one, e.g. for pcmcia. Im that case
> the kernel will crash on you becuase you can only get the node infortmation
> for pci devices. My current patchkit adds an "int node" member to struct
> net_device instead. I can repost the patchkit ontop of -mm (which is
> the required slab memory leak tracking changes) if anyone cares.

Can we check device->bus_type or device->driver->bus against
&pci_bus_type for that?
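Something along these lines (just a rough, untested sketch to show what I
mean; the helper name is made up, and dev is assumed to be the struct device
referenced by the netdevice):

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/topology.h>

static int netdev_guess_node(struct device *dev)
{
	/* only PCI devices carry usable node information here */
	if (dev && dev->bus == &pci_bus_type)
		return pcibus_to_node(to_pci_dev(dev)->bus);

	return -1;	/* "any node" for everything else */
}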

--
Evgeniy Polyakov

2006-08-16 09:38:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 01:32:02PM +0400, Evgeniy Polyakov wrote:
> On Wed, Aug 16, 2006 at 10:10:29AM +0100, Christoph Hellwig ([email protected]) wrote:
> > On Wed, Aug 16, 2006 at 02:05:03AM -0700, David Miller wrote:
> > > From: Evgeniy Polyakov <[email protected]>
> > > Date: Wed, 16 Aug 2006 13:00:31 +0400
> > >
> > > > So I would like to know how to determine which node should be used for
> > > > allocation. Changes of __get_user_pages() to alloc_pages_node() are
> > > > trivial.
> > >
> > > netdev_alloc_skb() knows the netdevice, and therefore you can
> > > obtain the "struct device;" referenced inside of the netdev,
> > > and therefore you can determine the node using the struct
> > > device.
> >
> > It's not that easy unfortunately. I did what you describe above in my
> > first prototype but then found out the hard way that the struct device
> > in the netdevice can be a non-pci one, e.g. for pcmcia. Im that case
> > the kernel will crash on you becuase you can only get the node infortmation
> > for pci devices. My current patchkit adds an "int node" member to struct
> > net_device instead. I can repost the patchkit ontop of -mm (which is
> > the required slab memory leak tracking changes) if anyone cares.
>
> Can we check device->bus_type or device->driver->bus against
> &pci_bus_type for that?

We could, but I'd rather waste 4 bytes in struct net_device than having
such ugly warts in common code.

2006-08-16 09:40:55

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

From: Christoph Hellwig <[email protected]>
Date: Wed, 16 Aug 2006 10:38:37 +0100

> We could, but I'd rather waste 4 bytes in struct net_device than
> having such ugly warts in common code.

Why not instead have struct device store some default node value?
The node decision will be sub-optimal on non-pci but it won't crash.

2006-08-16 09:42:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 01:00:31PM +0400, Evgeniy Polyakov wrote:
> On Wed, Aug 16, 2006 at 09:48:08AM +0100, Christoph Hellwig ([email protected]) wrote:
> > > Doesn't alloc_pages() automatically switches to alloc_pages_node() or
> > > alloc_pages_current()?
> >
> > That's not what's wanted. If you have a slow interconnect you always want
> > to allocate memory on the node the network device is attached to.
>
> There is drawback here - if data was allocated on CPU wheere NIC is
> "closer" and then processed on different CPU it will cost more than
> in case where buffer was allocated on CPU where it will be processed.
>
> But from other point of view, most of the adapters preallocate set of
> skbs, and with msi-x help there will be a possibility to bind irq and
> processing to the CPU where data was origianlly allocated.

The case we've benchmarked (spidernet) is the common preallocated case.
For allocate on demand I'd expect the slab allocator to get things right.
We do have the irq on the right node, not through MSI but due to the odd
interrupt architecture of the Cell blades.

> So I would like to know how to determine which node should be used for
> allocation. Changes of __get_user_pages() to alloc_pages_node() are
> trivial.

The patches I have add the node field to struct net_device and use it.
It's set via alloc_netdev_node, a function I add, and for the normal case
of PCI adapters the node argument comes from pcibus_to_node(). It's
arguable we should add an alloc_foodev_pdev variant that hides that detail,
but I'm not entirely sure whether it's worth the effort.

2006-08-16 09:44:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 02:40:08AM -0700, David Miller wrote:
> From: Christoph Hellwig <[email protected]>
> Date: Wed, 16 Aug 2006 10:38:37 +0100
>
> > We could, but I'd rather waste 4 bytes in struct net_device than
> > having such ugly warts in common code.
>
> Why not instead have struct device store some default node value?
> The node decision will be sub-optimal on non-pci but it won't crash.

Right now we don't even have the node stored in the pci_dev structure but
only arch-specific accessor functions/macros. We could change those to
take a struct device instead and make them return -1 for everything non-pci
as we already do in architectures that don't support those helpers. -1
means 'any node' for all common allocators.

2006-08-16 11:27:28

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wednesday 16 August 2006 11:00, Evgeniy Polyakov wrote:
> There is drawback here - if data was allocated on CPU wheere NIC is
> "closer" and then processed on different CPU it will cost more than
> in case where buffer was allocated on CPU where it will be processed.
>
> But from other point of view, most of the adapters preallocate set of
> skbs, and with msi-x help there will be a possibility to bind irq and
> processing to the CPU where data was origianlly allocated.
>
> So I would like to know how to determine which node should be used for
> allocation. Changes of __get_user_pages() to alloc_pages_node() are
> trivial.

There are two separate memory areas here: Your own metadata used by the
allocator and the memory used for skb data.

avl_node_array[cpu] and avl_container_array[cpu] are designed to
be accessed only by the local cpu, so these should be done like

avl_node_array[cpu] = kmalloc_node(AVL_NODE_PAGES * sizeof(void *),
                                   GFP_KERNEL, cpu_to_node(cpu));

or you could make the whole array DEFINE_PER_CPU(void *, ...), which would
waste some space in the kernel object file.

Now for the actual pages you get with __get_free_pages(): doing the
same (alloc_pages_node) will help with accessing your avl_container
members, but may not be the best solution for getting the data
next to the network adapter.
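Spelled out against avl_init_cpu() from the posted patch, the metadata part
could look roughly like this (untested, the function name is made up, and it
is only meant to illustrate where the node argument would go):

/* variant of avl_init_cpu(); would also need <linux/topology.h> in avl.c */
static int avl_init_cpu_node(int cpu)
{
	int node = cpu_to_node(cpu);
	struct avl_allocator_data *alloc = &avl_allocator[cpu];
	unsigned int i;

	spin_lock_init(&alloc->avl_free_lock);

	/* per-cpu metadata allocated on the node of the cpu that will use it */
	alloc->avl_container_array = kmalloc_node(sizeof(struct list_head) *
						  AVL_CONTAINER_ARRAY_SIZE,
						  GFP_KERNEL, node);
	if (!alloc->avl_container_array)
		return -ENOMEM;

	for (i = 0; i < AVL_CONTAINER_ARRAY_SIZE; ++i)
		INIT_LIST_HEAD(&alloc->avl_container_array[i]);

	/*
	 * The data pages themselves could likewise come from
	 * alloc_pages_node(node, gfp_mask, order) instead of
	 * __get_free_pages(), whatever node policy ends up being chosen.
	 */
	return 0;
}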

Arnd <><

2006-08-16 12:03:55

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 01:27:02PM +0200, Arnd Bergmann ([email protected]) wrote:
> On Wednesday 16 August 2006 11:00, Evgeniy Polyakov wrote:
> > There is drawback here - if data was allocated on CPU wheere NIC is
> > "closer" and then processed on different CPU it will cost more than
> > in case where buffer was allocated on CPU where it will be processed.
> >
> > But from other point of view, most of the adapters preallocate set of
> > skbs, and with msi-x help there will be a possibility to bind irq and
> > processing to the CPU where data was origianlly allocated.
> >
> > So I would like to know how to determine which node should be used for
> > allocation. Changes of __get_user_pages() to alloc_pages_node() are
> > trivial.
>
> There are two separate memory areas here: Your own metadata used by the
> allocator and the memory used for skb data.
>
> avl_node_array[cpu] and avl_container_array[cpu] are only designed to
> be accessed only by the local cpu, so these should be done like
>
> avl_node_array[cpu] = kmalloc_node(AVL_NODE_PAGES * sizeof(void *),
> GFP_KERNEL, cpu_to_node(cpu));
>
> or you could make the whole array DEFINE_PER_CPU(void *, which would
> waste some space in the kernel object file.
>
> Now for the actual pages you get with __get_free_pages(), doing the
> same (alloc_pages_node), will help accessing your avl_container
> members, but may not be the best solution for getting the data
> next to the network adapter.

I can create it with numa_node_id() right now, and later, if some helper
to match netdev->node exists, it can be used instead.

> Arnd <><

--
Evgeniy Polyakov

2006-08-16 12:26:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, 16 Aug 2006 09:48:08 +0100
Christoph Hellwig <[email protected]> wrote:

> On Wed, Aug 16, 2006 at 09:35:46AM +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 15, 2006 at 10:21:22PM +0200, Arnd Bergmann ([email protected]) wrote:
> > > Am Monday 14 August 2006 13:04 schrieb Evgeniy Polyakov:
> > > > ?* full per CPU allocation and freeing (objects are never freed on
> > > > ????????different CPU)
> > >
> > > Many of your data structures are per cpu, but your underlying allocations
> > > are all using regular kzalloc/__get_free_page/__get_free_pages functions.
> > > Shouldn't these be converted to calls to kmalloc_node and alloc_pages_node
> > > in order to get better locality on NUMA systems?
> > >
> > > OTOH, we have recently experimented with doing the dev_alloc_skb calls
> > > with affinity to the NUMA node that holds the actual network adapter, and
> > > got significant improvements on the Cell blade server. That of course
> > > may be a conflicting goal since it would mean having per-cpu per-node
> > > page pools if any CPU is supposed to be able to allocate pages for use
> > > as DMA buffers on any node.
> >
> > Doesn't alloc_pages() automatically switches to alloc_pages_node() or
> > alloc_pages_current()?
>
> That's not what's wanted. If you have a slow interconnect you always want
> to allocate memory on the node the network device is attached to.

That's not true on all NUMA systems (that they have a slow interconnect)
I think on x86-64 I would prefer if it was distributed evenly or maybe even
on the CPU who is finally going to process it.

-Andi "not all NUMA is an Altix"

2006-08-16 16:57:43

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH2 1/1] network memory allocator.

IMHO the network memory allocator is a little too focused on one problem,
rather than aiming at a general enhancement.

Have you looked into something like the talloc used by Samba (and others)?
http://talloc.samba.org/
http://samba.org/ftp/unpacked/samba4/source/lib/talloc/talloc_guide.txt

By having a context, we could do better resource tracking and also cleanup
would be easier on removal.

2006-08-16 19:28:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH2 1/1] network memory allocator.

On Wed, Aug 16, 2006 at 09:57:12AM -0700, Stephen Hemminger ([email protected]) wrote:
> IMHO the network memory allocator is being a little too focused on one problem,
> rather than looking at a general enhancement.
>
> Have you looked into something like the talloc used by Samba (and others)?
> http://talloc.samba.org/
> http://samba.org/ftp/unpacked/samba4/source/lib/talloc/talloc_guide.txt
>
> By having a context, we could do better resource tracking and also cleanup
> would be easier on removal.

Yes, I saw it - it is slow (not that big an overhead, but this is definitely
not a case where we can afford to slow things down further).
Network tree allocator can be used by other users too without any
problems; mmu-less systems will greatly benefit from it.
There is nothing which prevents uses other than the network case, so I see
no problems there.

--
Evgeniy Polyakov

2006-08-18 02:30:13

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Wed, 16 Aug 2006, Andi Kleen wrote:

> That's not true on all NUMA systems (that they have a slow interconnect)
> I think on x86-64 I would prefer if it was distributed evenly or maybe even
> on the CPU who is finally going to process it.
>
> -Andi "not all NUMA is an Altix"

The Altix NUMA interconnect has, as far as I can recall, the same speed as
Hypertransport. It is the distance (real physical cable length) that
creates latencies for huge systems. Sadly, Hypertransport is designed
to stay on the motherboard. Hypertransport can only be said to be fast
because it's only used for teensy weensy systems of a few processors. Are
you saying that the design limitations of Hypertransport are an
advantage?


2006-08-18 08:24:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Friday 18 August 2006 04:25, Christoph Lameter wrote:
> On Wed, 16 Aug 2006, Andi Kleen wrote:
> > That's not true on all NUMA systems (that they have a slow interconnect)
> > I think on x86-64 I would prefer if it was distributed evenly or maybe
> > even on the CPU who is finally going to process it.
> >
> > -Andi "not all NUMA is an Altix"
>
> The Altix NUMA interconnect has the same speed as far as I can recall as
> Hypertransport. It is the distance (real physical cable length) that
> creates latencies for huge systems. Sadly the Hypertransport is designed
> to stay on the motherboard. Hypertransport can only be said to be fast
> because its only used for tinzy winzy systems of a few processors. Are
> you saying that the design limitations of Hypertransport are an
> advantage?

Sorry, didn't want to state anything particular about advantages
or disadvantages of different interconnects. I just wanted to say
that there are a lot of NUMA systems out there which have a very low
NUMA factor (for whatever reason, including them being quite small)
and that they should be considered for NUMA optimizations too.

So if you really want strict IO placement at least allow an easy way
to turn it off even when CONFIG_NUMA is defined.

BTW there are large x86-64 NUMA systems that don't use HyperTransport
and have a varying NUMA factor, and also even HyperTransport
based systems have a widely varying NUMA factor depending on machine
size and hop distance (2-8 sockets and larger systems are in development)

So ideal would be something dynamic to turn on/off io placement, maybe based
on node_distance() again, with the threshold tweakable per architecture?

Also I must say it's still not quite clear to me if it's better to place
network packets on the node the device is connected to or on the
node which contains the CPU who processes the packet data
For RX this can be three different nodes in the worst case
(CPU processing is often split on different CPUs between softirq
and user context), for TX two. Do you have some experience that shows
that a particular placement is better than the other?

-Andi

2006-08-18 08:52:10

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

From: Andi Kleen <[email protected]>
Date: Fri, 18 Aug 2006 11:29:14 +0200

> So ideal would be something dynamic to turn on/off io placement, maybe based
> on node_distance() again, with the threshold tweakable per architecture?

We have this ugly 'hashdist' thing, let's remove the __initdata tag
on it, give it a useful name, and let architectures set it as
they deem appropriate.

2006-08-18 17:08:44

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/1] network memory allocator.

On Fri, 18 Aug 2006, Andi Kleen wrote:

> Also I must say it's still not quite clear to me if it's better to place
> network packets on the node the device is connected to or on the
> node which contains the CPU who processes the packet data
> For RX this can be three different nodes in the worst case
> (CPU processing is often split on different CPUs between softirq
> and user context), for TX two. Do you have some experience that shows
> that a particular placement is better than the other?

The more nodes are involved the more numa traffic and the slower the
overall performance. It is best to place all control information on the
node of the network card. The actual data is read via DMA and one may
place that local to the executing process. The DMA transfer will then have
to cross the NUMA interlink but that DMA transfer is a linear and
predictable stream that can often be optimized by hardware. If you
created the data on the network node then you would have off-node
overhead when creating the data (random access to cache lines?), which
is likely worse.

One should be able to place the process near the node that has the network
card. Our machines typically have multiple network cards. It would be best
if the network stack could choose the one that is nearest to the process.