2006-08-08 19:34:06

by Peter Zijlstra

Subject: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD


From: Daniel Phillips <[email protected]>

Recently, Peter Zijlstra and I have been busily collaborating on a
solution to the memory deadlock problem described here:

http://lwn.net/Articles/144273/
"Kernel Summit 2005: Convergence of network and storage paths"

We believe that an approach very much like today's patch set is
necessary for NBD, iSCSI, AoE or the like ever to work reliably.
We further believe that a properly working version of at least one of
these subsystems is critical to the viability of Linux as a modern
storage platform.

Today's patch set builds on a patch I posted a while back:

http://lwn.net/Articles/146061/
"Net vm deadlock fix (preliminary)"

Peter Zijlstra got the ball rolling again by posting this patch:

http://lkml.org/lkml/2006/7/7/164

Which I gently flamed:

http://lkml.org/lkml/2006/7/7/245

Peter turned out to be right: any elevator other than NOOP is wrong for
NBD on the client side. It turns out that something in the elevator
itself deadlocks under heavy load. This is a bug: io_schedule() waits
forever under certain conditions. Maybe it is another memory deadlock,
who knows. It is for sure a bug for another day, in a piece of machinery
that we have now discarded as far as the NBD client is concerned.

However, the mere possibility that Peter's deadlock might be in the
network receive path was enough to motivate us to get down to work and
close that subtle hole. To be sure, the network receive path deadlock
is not so much a deadlock as a severe performance bug. When it bites,
valuable packets will be lost just when the system is least able to
afford the loss. Under rare circumstances a true deadlock will occur.

Our immediate objective is to prevent that rare deadlock and to go
further: we also prevent the performance bug in all but the most exotic
of circumstances. And we go further yet by attempting to keep unrelated
traffic flowing more or less smoothly on the same interface as the block
IO traffic, even under conditions of heavy memory stress.

Peter will be doing the heavy lifting on this patch from here on, so I
suppose we can call this a changing of the hats.

The problem:

Sometimes our Linux VMM needs to allocate memory in order to free
memory, for example when writing dirty file pages to disk or writing
process pages to swap. In the case of dirty file pages we may need to
read in some file metadata in order to know where to write a given page.
We avoid deadlock in this case by providing the filesystem access to a
special "memalloc" reserve of pages whenever it is doing work on behalf
of the VMM. This is implemented via a PF_MEMALLOC process flag to
tell alloc_pages that it may satisfy a memory allocation request by
dipping into the memalloc reserve if it must. A PF_MEMALLOC
process is thus able to complete its work without recursing into
the memory-freeing code and risking deadlock. This simple strategy
has served us well for years.

A difficulty that has arisen in recent years is that sometimes code
involved in writing out pages needs to rely on a process other than
itself to help it do its work. Such a "memhelper" process does not
automatically inherit the PF_MEMALLOC process flag because it is
not called directly, and so it ends up competing for the same memory as
any other process. There is a clear and present danger that such a
process may block forever waiting for memory to be freed: a deadlock.
Clearly we must either get rid of such processes entirely by refactoring
code, or where that is impossible, we must provide access to a special
memory reserve in order to avoid falling into the deep black pit of VM
recursion. This is not particularly hard. We just need to:

a) Run such a process in PF_MEMALLOC mode

and

b) Throttle the activity of the process so that it never exceeds
some pre-calculated threshold of memory usage, and thus does not
exhaust the memalloc reserve.

Other solutions such as mempool (similarly combined with throttling) are
certainly possible, but the memalloc solution is particularly simple and
obviously correct.
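
As a rough sketch of that pattern (every name below is invented for
illustration; none of this code appears in the patch set):

/* (b): a trivial in-flight throttle with a made-up bound */
static atomic_t helper_inflight = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(helper_wait);
#define HELPER_LIMIT 32

extern void do_one_item(void *item);	/* the actual writeout work */

static void memhelper_work(void *item)
{
	unsigned long pflags = current->flags;

	/* block until we are under the precalculated limit */
	wait_event(helper_wait,
		atomic_add_unless(&helper_inflight, 1, HELPER_LIMIT));

	/* (a): do the work itself in PF_MEMALLOC mode */
	current->flags |= PF_MEMALLOC;
	do_one_item(item);
	if (!(pflags & PF_MEMALLOC))
		current->flags &= ~PF_MEMALLOC;

	atomic_dec(&helper_inflight);
	wake_up(&helper_wait);
}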

Unfortunately, a particularly nasty form of memory deadlock arises from
the fact that the receive side of the network stack is also a sort of
separate process. If we operate a block device remotely over the
network, as many of us do these days, we must provide reserve memory to
the network receive path in order to avoid deadlock. But we cannot
provide such reserve memory willy-nilly to every arriving network
packet, otherwise the reserve may be exhausted just as essential
memory-freeing protocol traffic shows up. This is the dread network
memory deadlock that has plagued such subsystems as the network block
device since time immemorial.

Our solution:

Today we consider only the network receive aspect of the memory deadlock
problem, because it is by far the most difficult manifestation and
because a solution to just this aspect is all that stands in the way of
a reliable network block device.

We observe that the crux of the problem is not knowing whether to
provide reserve memory to an incoming packet or not. We therefore
provide reserve memory to every incoming IP packet up to the point where
we can decode the IP header and learn whether the packet belongs to a
socket known to be carrying block IO traffic. If this short-hop reserve
memory has fallen below some threshold then we will drop incoming
network traffic that is not involved in network block IO. We never
drop block IO traffic unless there is a bug in the throttling scheme
such that the memory reserve becomes completely depleted,
which would prevent the network interface's DMA buffer ring from
being replenished.

Following rule (b) above, we throttle all network block IO traffic to
avoid overloading our device-level reserve. Then we set the reserve to
be higher than the sum of maximum possible unidentified packets and the
amount of in-flight block IO permitted by our throttling scheme.
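
As a worked example (illustrative only) using the ethernet default that
appears in patch 2/9 of 384 reserve sk_bufs per device, with a 1500
byte MTU:

	skb_pages(1500)  = 1 page per emergency sk_buf
	memalloc_reserve = 384 sk_bufs * 1 page = 384 pages
	                 = 1536KB with 4KB pages

sk_set_memalloc() feeds that figure to adjust_memalloc_reserve(), which
folds it into the global reserve watermark.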

The interval over which unclassified packets exist is actually quite
small: it exists only from the device packet receive interrupt to where
the owning socket is identified in ip_local_deliver_finish. Just to be
safe, we allow a hundred or so packets to be in flight over this small
interval. If by some colossal accident we actually exceeded that number
then we would not deadlock, we would just risk dropping a packet. We
still have some large number of block IO packets happily in flight, and
that must be so or we could not have exhausted the reserve. Thus, the
VMM continues to make forward progress, moreover it continues to
progress at nearly its full throughput.

We believe that a suitable choice of reserve size and throttling limit
for any given configuration will permit completely reliable remote block
IO operation even with unrelated traffic sharing the network interface.
Further refinements to this work can be expected to focus on the
following aspects:

1) Calculate reserve sizes and throttling limits accurately
and automatically in response to configuration changes.

2) Incorporate possible additional memory consumers such as
netfilter into the scheme.

3) Place a tight bound on the amount of traffic that can be
in flight between the receive interrupt and protocol
identification, thus allowing reserve requirements to be
lowered.

4) Change the skb memory resource accounting from per skb to
per page for greater accuracy.

How it works

(Heavy reading ahead, please skip this if you do not intend to analyze
the code.) Let us jump right into the middle of things: a network
interface such as e1000 DMAs its latest packet to one of the sk_bufs in
its receive ring, then generates a receive interrupt. At the end of the
interrupt the driver attempts to refill the sk_buf ring using an atomic
allocation of a new sk_buf. Nothing bad happens if such a refill
attempt fails once, but if there are many failures in a row then the
interface will eventually hit the wall and run out of buffers, forcing
incoming packets to be dropped and severely degrading the performance of
protocols such as TCP.

We have changed the sk_buf allocation API slightly to allow the
allocator to act more intelligently. It first tries to allocate an
sk_buf normally, then if that fails it takes special action by redoing
the allocation from the PF_MEMALLOC emergency memory reserve. We
introduce a new __GFP_MEMALLOC flag to allow such emergency allocations
from an interrupt. Under these severe conditions we take care to perform
each sk_buf allocation as full pages, not from the slab, which avoids
some nasty fragmentation issues. This is arguably a little wasteful,
however in the grand scheme of things it is just a minor blip.
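
In condensed form the new allocator behaves as follows; the full
version is __netdev_alloc_skb() in patch 2/9 below:

	/* condensed from patch 2/9: normal attempt first, the reserve
	 * only as a fallback and only while the per-device accounting
	 * says there is room */
	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
	if (!skb && dev_reserve_used(dev) < dev->rx_reserve) {
		skb = ___netdev_alloc_skb(dev, length,
					  gfp_mask | __GFP_MEMALLOC);
		if (skb)
			atomic_inc(&dev->rx_reserve_used);
	}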

We place a bound on the number of such emergency sk_bufs that can be in
flight. This bound is set to be larger than the sum of the maximum
block IO traffic on the interface and the maximum number of packets
that can possibly have been received at the interface but not yet
classified as to protocol or socket. The latter is a somewhat
hard-to-pin-down quantity, but we know from the code structure that it
cannot be very large. So we make the emergency pool way, way bigger
than we think it could ever be.

Each sk_buf, emergency-allocated or otherwise, proceeds up through
the protocol layers, now in softirq context rather than in a true
interrupt. On SMP machines there can be some queuing behaviour here,
so more than one packet can be in flight on this section of the packet
receive path simultaneously. It probably will not be more than a
handful, so we allow for something like 20 or 50 in this state.

As soon as the sk_buf hits the protocol decode handler, suddenly we are
able to tell which protocol is involved, and which socket. We then
check to see how deep the interface is into its reserve and drop (or, in
a future version, optionally hold) non-block IO traffic at this point.
Actually, we have such a generous reserve and such a nice way of getting
at it that we may never drop any packets here either.
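
The check itself is small; condensed from the tcp_v4_rcv() hunk in
patch 2/9:

	/* a nonzero reserve count means emergency sk_bufs are in
	 * flight: give back one reserve slot and drop anything not
	 * destined for a SOCK_MEMALLOC socket */
	if (unlikely(dev_reserve_used(skb->dev))) {
		dev_unreserve_skb(skb->dev);
		if (!sk_is_memalloc(sk))
			goto discard_and_relse;
	}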

Having pushed the sk_buf onto some socket's incoming queue, the
softirq has done its work. Now, an emergency sk_buf may continue to
live an arbitrarily long time, since the NBD driver may be delayed an
arbitrarily long time for one reason or another. But this is OK because
we have throttled the virtual block device to some maximum amount of IO
that can possibly be in flight, and made sure that the limit is high
enough to ensure that not only are we making forward progress, but we
are making forward progress as fast as the remote disk and network can
handle it. This is exactly what we want in a low memory situation.
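
Concretely, patch 9/9 below enforces that bound by pinning the NOOP
elevator and capping every NBD request at a single page, so the
ordinary request queue limit doubles as the page throttle:

	disk->queue = blk_init_queue_node_elv(do_nbd_request,
			&nbd_lock, -1, "noop");
	blk_queue_pin_elevator(disk->queue);
	blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
	blk_queue_max_hw_segments(disk->queue, 1);
	blk_queue_max_phys_segments(disk->queue, 1);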

Eventually, the NBD client pulls the sk_buf off the incoming queue,
copies the data somewhere, and deallocates the sk_buf. In the case of
an emergency sk_buf, this ensures that the share of the PF_MEMALLOC
reserve dedicated to a particular NBD device will never be exceeded.
Then the birds come out and angel choirs sing (sometimes).

Some implementation details:

Departing somewhat from recent kernel fashion trends, we have built this
patch set around the traditional memalloc reserve and not the newer,
fancier mempool API. This is because the memalloc approach is
considerably more lightweight and just as obviously correct. It works
because we are able to express all our resource requirements in units of
zero order pages, precisely what the memalloc reserve provides us. We
think that less is more.

Arguably, when coding mistakes are made mempool will make the problem
obvious more quickly than the global memalloc reserve, which gives no
clue of which user went over its limit. This is most probably a reason
to write some fancy resource leak detection tools, not to throw out the
memalloc idea.

I did make a special effort to eliminate one of the problems with the
traditional memalloc reserve: how big should it be? We now make it
resizable, and make it the responsibility of whichever module uses the
functionality to expand the global reserve and shrink it as necessary.
For example, the global reserve may be increased when we initialize a
network block device and reduced by the same amount if the device is
destroyed. Or it might be increased in some module's init function and
reduced on exit.
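
For instance, a hypothetical module (names invented) would bracket its
lifetime like this; adjust_memalloc_reserve() takes a page count,
negative to shrink:

	static int __init mymod_init(void)
	{
		/* grow the global memalloc reserve by our worst case */
		return adjust_memalloc_reserve(MYMOD_RESERVE_PAGES);
	}

	static void __exit mymod_exit(void)
	{
		/* give the pages back */
		adjust_memalloc_reserve(-MYMOD_RESERVE_PAGES);
	}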

I introduced a new allocation flag, __GFP_MEMALLOC, which combined with
GFP_ATOMIC gives an atomic user such as an interrupt access to the
memalloc reserves, just as PF_MEMALLOC does for a task in normal
context.
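
Patch 3/9 shows the intended usage in e1000's receive path:

	ps_page->ps_page[j] = alloc_page(GFP_ATOMIC | __GFP_MEMALLOC);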

Other implementation details are apparent from the code, and if they
are not then we will happily amend the patches with comments and answer
questions here.

The spectre of IP fragmentation has been raised. Because the reserve
safety factor is so large, we do not think that anything special needs
to be done to handle fragmentation under this scheme, however we are
open to suggestions and further analysis.

Credits

I would like to thank the following gentlemen for aiding this work in
one way or another:

Peter Zijlstra, who had the good grace to pick up and improve an
existing patch instead of reinventing his own as is traditional.

Rik van Riel, for inciting the two of us to finish the patch, and also
for providing a very nice wiki page discussing the problem.

Jon Corbet, for keeping score and explaining my earlier work better
than I ever could.

Joseph Dries, who realized exactly why this work is important and
encouraged me to complete it.

Andrew Morton, Ted Ts'o and Peter Anvin for listening patiently to
my early ramblings about how I would solve this problem, which
were sometimes way off the mark.

Google Inc, for giving me one of the best hacking jobs in the world
and for not being evil.

Further Reading

Memory deadlock thread from kernel-summit-discuss

http://thread.gmane.org/gmane.linux.network/24165
A fascinating discussion of the problem prior to the 2005 kernel
summit.

Netdev threads

Early versions of the patch and an embarrassing mistake I made as I
learned about the network packet delivery path.

http://marc.theaimsgroup.com/?l=linux-netdev&m=112305260502201&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=112331377806222&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=112336517131046&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=112349600610884&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=112355867312022&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=112379949818293&w=2

Jon Corbet's writeup of the original patch

"Toward more robust network-based block I/O"
http://lwn.net/Articles/146689/

Rik van Riel's network memory deadlock wiki

http://linux-mm.org/NetworkStorageDeadlock


2006-08-08 19:34:47

by Peter Zijlstra

Subject: [RFC][PATCH 1/9] pfn_to_kaddr() for UML


Update UML with a proper 'pfn_to_kaddr()' definition; the VM deadlock
avoidance framework uses it.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
include/asm-um/page.h | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6/include/asm-um/page.h
===================================================================
--- linux-2.6.orig/include/asm-um/page.h
+++ linux-2.6/include/asm-um/page.h
@@ -111,6 +111,8 @@ extern unsigned long uml_physmem;
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#define virt_addr_valid(v) pfn_valid(phys_to_pfn(__pa(v)))

+#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
+
extern struct page *arch_validate(struct page *page, gfp_t mask, int order);
#define HAVE_ARCH_VALIDATE

2006-08-08 19:34:51

by Peter Zijlstra

Subject: [RFC][PATCH 4/9] e100 driver conversion


Update the driver to make use of the netdev_alloc_skb() API and the
NETIF_F_MEMALLOC feature.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
drivers/net/e100.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/net/e100.c
===================================================================
--- linux-2.6.orig/drivers/net/e100.c
+++ linux-2.6/drivers/net/e100.c
@@ -1763,7 +1763,7 @@ static inline void e100_start_receiver(s
#define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
{
- if(!(rx->skb = dev_alloc_skb(RFD_BUF_LEN + NET_IP_ALIGN)))
+ if(!(rx->skb = netdev_alloc_skb(nic->netdev, RFD_BUF_LEN + NET_IP_ALIGN)))
return -ENOMEM;

/* Align, init, and map the RFD. */
@@ -2143,7 +2143,7 @@ static int e100_loopback_test(struct nic

e100_start_receiver(nic, NULL);

- if(!(skb = dev_alloc_skb(ETH_DATA_LEN))) {
+ if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
err = -ENOMEM;
goto err_loopback_none;
}
@@ -2573,6 +2573,7 @@ static int __devinit e100_probe(struct p
#ifdef CONFIG_NET_POLL_CONTROLLER
netdev->poll_controller = e100_netpoll;
#endif
+ netdev->features |= NETIF_F_MEMALLOC;
strcpy(netdev->name, pci_name(pdev));

nic = netdev_priv(netdev);

2006-08-08 19:34:41

by Peter Zijlstra

Subject: [RFC][PATCH 3/9] e1000 driver conversion


Update the driver to make use of the NETIF_F_MEMALLOC feature.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
drivers/net/e1000/e1000_main.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

Index: linux-2.6/drivers/net/e1000/e1000_main.c
===================================================================
--- linux-2.6.orig/drivers/net/e1000/e1000_main.c
+++ linux-2.6/drivers/net/e1000/e1000_main.c
@@ -822,7 +822,7 @@ e1000_probe(struct pci_dev *pdev,
if (pci_using_dac)
netdev->features |= NETIF_F_HIGHDMA;

- netdev->features |= NETIF_F_LLTX;
+ netdev->features |= NETIF_F_LLTX | NETIF_F_MEMALLOC;

adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);

@@ -4020,8 +4020,6 @@ e1000_alloc_rx_buffers(struct e1000_adap
*/
skb_reserve(skb, NET_IP_ALIGN);

- skb->dev = netdev;
-
buffer_info->skb = skb;
buffer_info->length = adapter->rx_buffer_len;
map_skb:
@@ -4099,8 +4097,11 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
for (j = 0; j < PS_PAGE_BUFFERS; j++) {
if (j < adapter->rx_ps_pages) {
if (likely(!ps_page->ps_page[j])) {
+ /* Perhaps we should alloc the skb first
+ * and use something like sk_buff_gfp().
+ */
ps_page->ps_page[j] =
- alloc_page(GFP_ATOMIC);
+ alloc_page(GFP_ATOMIC | __GFP_MEMALLOC);
if (unlikely(!ps_page->ps_page[j])) {
adapter->alloc_rx_buff_failed++;
goto no_buffers;
@@ -4135,8 +4136,6 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
*/
skb_reserve(skb, NET_IP_ALIGN);

- skb->dev = netdev;
-
buffer_info->skb = skb;
buffer_info->length = adapter->rx_ps_bsize0;
buffer_info->dma = pci_map_single(pdev, skb->data,

2006-08-08 19:35:45

by Peter Zijlstra

Subject: [RFC][PATCH 8/9] 3c59x driver conversion


Update the driver to make use of the netdev_alloc_skb() API and the
NETIF_F_MEMALLOC feature.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
drivers/net/3c59x.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/drivers/net/3c59x.c
===================================================================
--- linux-2.6.orig/drivers/net/3c59x.c
+++ linux-2.6/drivers/net/3c59x.c
@@ -1383,6 +1383,8 @@ static int __devinit vortex_probe1(struc
(dev->features & NETIF_F_IP_CSUM) ? "en":"dis");
}

+ dev->features |= NETIF_F_MEMALLOC;
+
dev->stop = vortex_close;
dev->get_stats = vortex_get_stats;
#ifdef CONFIG_PCI
@@ -1680,7 +1682,7 @@ vortex_open(struct net_device *dev)
vp->rx_ring[i].next = cpu_to_le32(vp->rx_ring_dma + sizeof(struct boom_rx_desc) * (i+1));
vp->rx_ring[i].status = 0; /* Clear complete bit. */
vp->rx_ring[i].length = cpu_to_le32(PKT_BUF_SZ | LAST_FRAG);
- skb = dev_alloc_skb(PKT_BUF_SZ);
+ skb = netdev_alloc_skb(dev, PKT_BUF_SZ);
vp->rx_skbuff[i] = skb;
if (skb == NULL)
break; /* Bad news! */
@@ -2405,7 +2407,7 @@ static int vortex_rx(struct net_device *
int pkt_len = rx_status & 0x1fff;
struct sk_buff *skb;

- skb = dev_alloc_skb(pkt_len + 5);
+ skb = netdev_alloc_skb(dev, pkt_len + 5);
if (vortex_debug > 4)
printk(KERN_DEBUG "Receiving packet size %d status %4.4x.\n",
pkt_len, rx_status);
@@ -2486,7 +2488,7 @@ boomerang_rx(struct net_device *dev)

/* Check if the packet is long enough to just accept without
copying to a properly sized skbuff. */
- if (pkt_len < rx_copybreak && (skb = dev_alloc_skb(pkt_len + 2)) != 0) {
+ if (pkt_len < rx_copybreak && (skb = netdev_alloc_skb(dev, pkt_len + 2)) != 0) {
skb->dev = dev;
skb_reserve(skb, 2); /* Align IP on 16 byte boundaries */
pci_dma_sync_single_for_cpu(VORTEX_PCI(vp), dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE);
@@ -2525,7 +2527,7 @@ boomerang_rx(struct net_device *dev)
struct sk_buff *skb;
entry = vp->dirty_rx % RX_RING_SIZE;
if (vp->rx_skbuff[entry] == NULL) {
- skb = dev_alloc_skb(PKT_BUF_SZ);
+ skb = netdev_alloc_skb(dev, PKT_BUF_SZ);
if (skb == NULL) {
static unsigned long last_jif;
if (time_after(jiffies, last_jif + 10 * HZ)) {

2006-08-08 19:35:38

by Peter Zijlstra

Subject: [RFC][PATCH 7/9] UML eth driver conversion


Update the driver to make use of the netdev_alloc_skb() API and the
NETIF_F_MEMALLOC feature.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
arch/um/drivers/net_kern.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/um/drivers/net_kern.c
===================================================================
--- linux-2.6.orig/arch/um/drivers/net_kern.c
+++ linux-2.6/arch/um/drivers/net_kern.c
@@ -43,7 +43,7 @@ static int uml_net_rx(struct net_device
struct sk_buff *skb;

/* If we can't allocate memory, try again next round. */
- skb = dev_alloc_skb(dev->mtu);
+ skb = netdev_alloc_skb(dev, dev->mtu);
if (skb == NULL) {
lp->stats.rx_dropped++;
return 0;
@@ -377,6 +377,7 @@ static int eth_configure(int n, void *in
dev->ethtool_ops = &uml_net_ethtool_ops;
dev->watchdog_timeo = (HZ >> 1);
dev->irq = UM_ETH_IRQ;
+ dev->features |= NETIF_F_MEMALLOC;

rtnl_lock();
err = register_netdevice(dev);

2006-08-08 19:36:04

by Peter Zijlstra

Subject: [RFC][PATCH 9/9] deadlock prevention for NBD



Use sk_set_memalloc() on the nbd socket.

Limit each request to 1 page, so that the request throttling also limits
the number of in-flight pages, and force the IO scheduler to NOOP, as
anything else doesn't make sense anyway.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
block/elevator.c | 5 +++++
block/ll_rw_blk.c | 12 ++++++++++--
drivers/block/nbd.c | 11 ++++++++++-
include/linux/blkdev.h | 9 +++++++++
4 files changed, 34 insertions(+), 3 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -1899,6 +1899,14 @@ EXPORT_SYMBOL(blk_init_queue);
request_queue_t *
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
+ return blk_init_queue_node_elv(rfn, lock, node_id, NULL);
+}
+EXPORT_SYMBOL(blk_init_queue_node);
+
+request_queue_t *
+blk_init_queue_node_elv(request_fn_proc *rfn, spinlock_t *lock, int node_id,
+ char *elv_name)
+{
request_queue_t *q = blk_alloc_queue_node(GFP_KERNEL, node_id);

if (!q)
@@ -1939,7 +1947,7 @@ blk_init_queue_node(request_fn_proc *rfn
/*
* all done
*/
- if (!elevator_init(q, NULL)) {
+ if (!elevator_init(q, elv_name)) {
blk_queue_congestion_threshold(q);
return q;
}
@@ -1947,7 +1955,7 @@ blk_init_queue_node(request_fn_proc *rfn
blk_put_queue(q);
return NULL;
}
-EXPORT_SYMBOL(blk_init_queue_node);
+EXPORT_SYMBOL(blk_init_queue_node_elv);

int blk_get_queue(request_queue_t *q)
{
Index: linux-2.6/drivers/block/nbd.c
===================================================================
--- linux-2.6.orig/drivers/block/nbd.c
+++ linux-2.6/drivers/block/nbd.c
@@ -361,6 +361,10 @@ static void nbd_do_it(struct nbd_device

BUG_ON(lo->magic != LO_MAGIC);

+ if (sk_set_memalloc(lo->sock->sk))
+ printk(KERN_WARNING
+ "failed to set SO_MEMALLOC on NBD socket\n");
+
while ((req = nbd_read_stat(lo)) != NULL)
nbd_end_request(req);
return;
@@ -628,11 +632,16 @@ static int __init nbd_init(void)
* every gendisk to have its very own request_queue struct.
* These structs are big so we dynamically allocate them.
*/
- disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
+ disk->queue = blk_init_queue_node_elv(do_nbd_request,
+ &nbd_lock, -1, "noop");
if (!disk->queue) {
put_disk(disk);
goto out;
}
+ blk_queue_pin_elevator(disk->queue);
+ blk_queue_max_segment_size(disk->queue, PAGE_SIZE);
+ blk_queue_max_hw_segments(disk->queue, 1);
+ blk_queue_max_phys_segments(disk->queue, 1);
}

if (register_blkdev(NBD_MAJOR, "nbd")) {
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h
+++ linux-2.6/include/linux/blkdev.h
@@ -444,6 +444,12 @@ struct request_queue
#define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */
#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */
#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_ELVPINNED 9 /* pin the current elevator */
+
+static inline void blk_queue_pin_elevator(struct request_queue *q)
+{
+ set_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags);
+}

enum {
/*
@@ -696,6 +702,9 @@ static inline void elv_dispatch_add_tail
/*
* Access functions for manipulating queue properties
*/
+extern request_queue_t *blk_init_queue_node_elv(request_fn_proc *rfn,
+ spinlock_t *lock, int node_id,
+ char *elv_name);
extern request_queue_t *blk_init_queue_node(request_fn_proc *rfn,
spinlock_t *lock, int node_id);
extern request_queue_t *blk_init_queue(request_fn_proc *, spinlock_t *);
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -861,6 +861,11 @@ ssize_t elv_iosched_store(request_queue_
size_t len;
struct elevator_type *e;

+ if (test_bit(QUEUE_FLAG_ELVPINNED, &q->queue_flags)) {
+ printk(KERN_ERR "elevator: cannot switch elevator, pinned\n");
+ return count;
+ }
+
elevator_name[sizeof(elevator_name) - 1] = '\0';
strncpy(elevator_name, name, sizeof(elevator_name) - 1);
len = strlen(elevator_name);

2006-08-08 19:37:48

by Peter Zijlstra

Subject: [RFC][PATCH 2/9] deadlock prevention core


The core of the VM deadlock avoidance framework.

From the 'user' side of things it provides a function to mark a 'struct sock'
as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on
the receive side.

From the net_device side of things, the extra 'struct net_device *' argument
to {,__}netdev_alloc_skb() is used to attribute/account the memalloc usage.
Converted drivers will make use of this new API and will set NETIF_F_MEMALLOC
to indicate the driver fully supports this feature.

When a SOCK_MEMALLOC socket is marked, the device is checked for this feature
and tries to increase the memalloc pool; if both succeed, the device is marked
with IFF_MEMALLOC, indicating to {,__}netdev_alloc_skb() that it is OK to dip
into the memalloc pool.

Memalloc sk_buff allocations are not done from the SLAB but are done using
alloc_pages(). sk_buff::memalloc records this exception so that kfree_skbmem()
can do the right thing.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
include/linux/gfp.h | 3 -
include/linux/if.h | 1
include/linux/mmzone.h | 1
include/linux/netdevice.h | 48 ++++++++++++++-----
include/linux/skbuff.h | 3 -
include/net/sock.h | 8 +++
mm/page_alloc.c | 29 ++++++++++-
net/core/dev.c | 1
net/core/skbuff.c | 114 +++++++++++++++++++++++++++++++++++++++++++---
net/core/sock.c | 54 +++++++++++++++++++++
net/ethernet/eth.c | 1
net/ipv4/af_inet.c | 14 +++++
net/ipv4/icmp.c | 5 ++
net/ipv4/tcp_ipv4.c | 6 ++
net/ipv4/udp.c | 11 ++++
15 files changed, 274 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_MEMALLOC)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int pages);

#include <linux/topology.h>
/* Returns the number of the current Node. */
Index: linux-2.6/include/linux/netdevice.h
===================================================================
--- linux-2.6.orig/include/linux/netdevice.h
+++ linux-2.6/include/linux/netdevice.h
@@ -298,18 +298,22 @@ struct net_device

/* Net device features */
unsigned long features;
-#define NETIF_F_SG 1 /* Scatter/gather IO. */
-#define NETIF_F_IP_CSUM 2 /* Can checksum only TCP/UDP over IPv4. */
-#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */
-#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */
-#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
-#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */
-#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
-#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
-#define NETIF_F_GSO 2048 /* Enable software GSO. */
-#define NETIF_F_LLTX 4096 /* LockLess TX */
+#define NETIF_F_SG 0x0001 /* Scatter/gather IO. */
+#define NETIF_F_IP_CSUM 0x0002 /* Can checksum only TCP/UDP over IPv4. */
+#define NETIF_F_NO_CSUM 0x0004 /* Does not require checksum. F.e. loopack. */
+#define NETIF_F_HW_CSUM 0x0008 /* Can checksum all the packets. */
+
+#define NETIF_F_HIGHDMA 0x0010 /* Can DMA to high memory. */
+#define NETIF_F_FRAGLIST 0x0020 /* Scatter/gather IO. */
+#define NETIF_F_HW_VLAN_TX 0x0040 /* Transmit VLAN hw acceleration */
+#define NETIF_F_HW_VLAN_RX 0x0080 /* Receive VLAN hw acceleration */
+
+#define NETIF_F_HW_VLAN_FILTER 0x0100 /* Receive filtering on VLAN */
+#define NETIF_F_VLAN_CHALLENGED 0x0200 /* Device cannot handle VLAN packets */
+#define NETIF_F_GSO 0x0400 /* Enable software GSO. */
+#define NETIF_F_LLTX 0x0800 /* LockLess TX */
+
+#define NETIF_F_MEMALLOC 0x1000 /* Supports {SOCK,__GFP}_MEMALLOC */

/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
@@ -409,6 +413,12 @@ struct net_device
struct Qdisc *qdisc_sleeping;
struct list_head qdisc_list;
unsigned long tx_queue_len; /* Max frames per queue allowed */
+ int rx_reserve;
+ atomic_t rx_reserve_used;
+
+ atomic_t memalloc_socks;
+ unsigned long memalloc_reserve;
+ spinlock_t memalloc_lock;

/* Partially transmitted GSO packet. */
struct sk_buff *gso_skb;
@@ -686,6 +696,20 @@ static inline void dev_kfree_skb_irq(str
*/
extern void dev_kfree_skb_any(struct sk_buff *skb);

+/*
+ * Support for critical network IO under low memory conditions
+ */
+static inline int dev_reserve_used(struct net_device *dev)
+{
+ return atomic_read(&dev->rx_reserve_used);
+}
+
+static inline void dev_unreserve_skb(struct net_device *dev)
+{
+ if (atomic_dec_return(&dev->rx_reserve_used) < 0)
+ atomic_inc(&dev->rx_reserve_used);
+}
+
#define HAVE_NETIF_RX 1
extern int netif_rx(struct sk_buff *skb);
extern int netif_rx_ni(struct sk_buff *skb);
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -282,7 +282,8 @@ struct sk_buff {
nfctinfo:3;
__u8 pkt_type:3,
fclone:2,
- ipvs_property:1;
+ ipvs_property:1,
+ memalloc:1;
__be16 protocol;

void (*destructor)(struct sk_buff *skb);
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -391,6 +391,7 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* protocol can use memalloc reserve */
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,13 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}

+static inline int sk_is_memalloc(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+extern int sk_set_memalloc(struct sock *sk);
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);

static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
+int var_free_kbytes;

unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
@@ -970,8 +971,8 @@ restart:

/* This allocation should allow future memory freeing. */

- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
+ if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
@@ -2196,7 +2197,8 @@ static void setup_per_zone_lowmem_reserv
*/
void setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned pages_min = (min_free_kbytes + var_free_kbytes)
+ >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -2248,6 +2250,27 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
}

+int adjust_memalloc_reserve(int pages)
+{
+ int kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
+ if (kbytes < 0)
+ return -EINVAL;
+ var_free_kbytes = kbytes;
+ setup_per_zone_pages_min();
+ if (pages > 0) {
+ int i;
+ pg_data_t *pgdat;
+ for_each_online_pgdat(pgdat) {
+ for (i = 0; i < pgdat->nr_zones; ++i)
+ wakeup_kswapd(&pgdat->node_zones[i], 0);
+ }
+ }
+ printk(KERN_DEBUG "RX reserve: %d\n", var_free_kbytes);
+ return 0;
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
/*
* Initialise min_free_kbytes.
*
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -43,6 +43,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/interrupt.h>
#include <linux/in.h>
#include <linux/inet.h>
@@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
*
*/

+#define ceiling_log2(x) fls((x) - 1)
+
/**
* __alloc_skb - allocate a network buffer
* @size: size to allocate
@@ -147,6 +150,49 @@ struct sk_buff *__alloc_skb(unsigned int
struct sk_buff *skb;
u8 *data;

+ size = SKB_DATA_ALIGN(size);
+
+ if (gfp_mask & __GFP_MEMALLOC) {
+ /*
+ * We have to do higher order allocations for icky jumbo
+ * frame drivers :-(
+ * They really should be migrated to scatter/gather DMA
+ * and use skb fragments.
+ */
+ unsigned int data_offset =
+ sizeof(struct sk_buff) + sizeof(unsigned int);
+ unsigned long length = size + data_offset +
+ sizeof(struct skb_shared_info);
+ unsigned int pages;
+ unsigned int order;
+ struct page *page;
+ void *kaddr;
+
+ /*
+ * force fclone alloc to work around a shortcoming in skb_clone().
+ */
+ fclone = 1;
+ if (fclone) {
+ data_offset += sizeof(struct sk_buff) + sizeof(atomic_t);
+ length += sizeof(struct sk_buff) + sizeof(atomic_t);
+ }
+ pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ order = ceiling_log2(pages);
+
+ skb = NULL;
+ if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
+ goto out;
+
+ kaddr = pfn_to_kaddr(page_to_pfn(page));
+ skb = (struct sk_buff *)kaddr;
+
+ *((unsigned int *)(kaddr + data_offset -
+ sizeof(unsigned int))) = order;
+ data = (u8 *)(kaddr + data_offset);
+
+ goto allocated;
+ }
+
cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

/* Get the HEAD */
@@ -155,12 +201,13 @@ struct sk_buff *__alloc_skb(unsigned int
goto out;

/* Get the DATA. Size must match skb_add_mtu(). */
- size = SKB_DATA_ALIGN(size);
data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

+allocated:
memset(skb, 0, offsetof(struct sk_buff, truesize));
+ skb->memalloc = !!(gfp_mask & __GFP_MEMALLOC);
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -185,6 +232,7 @@ struct sk_buff *__alloc_skb(unsigned int
atomic_set(fclone_ref, 1);

child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->memalloc = skb->memalloc;
}
out:
return skb;
@@ -250,7 +298,7 @@ nodata:
}

/**
- * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
+ * ___netdev_alloc_skb - allocate an skbuff for rx on a specific device
* @dev: network device to receive on
* @length: length to allocate
* @gfp_mask: get_free_pages mask, passed to alloc_skb
@@ -262,7 +310,7 @@ nodata:
*
* %NULL is returned if there is no free memory.
*/
-struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+static struct sk_buff *___netdev_alloc_skb(struct net_device *dev,
unsigned int length, gfp_t gfp_mask)
{
struct sk_buff *skb;
@@ -273,6 +321,34 @@ struct sk_buff *__netdev_alloc_skb(struc
return skb;
}

+struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+ unsigned length, gfp_t gfp_mask)
+{
+ struct sk_buff *skb;
+
+ if (dev && (dev->flags & IFF_MEMALLOC)) {
+ WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
+ gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
+
+ if ((skb = ___netdev_alloc_skb(dev, length,
+ gfp_mask | __GFP_NOMEMALLOC)))
+ goto done;
+ if (dev_reserve_used(dev) >= dev->rx_reserve)
+ goto out;
+ if (!(skb = ___netdev_alloc_skb(dev, length,
+ gfp_mask | __GFP_MEMALLOC)))
+ goto out;
+ atomic_inc(&dev->rx_reserve_used);
+ } else
+ if (!(skb = ___netdev_alloc_skb(dev, length, gfp_mask)))
+ goto out;
+
+done:
+ skb->dev = dev;
+out:
+ return skb;
+}
+
static void skb_drop_list(struct sk_buff **listp)
{
struct sk_buff *list = *listp;
@@ -313,10 +389,19 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);

- kfree(skb->head);
+ if (!skb->memalloc)
+ kfree(skb->head);
}
}

+static void free_skb_pages(struct kmem_cache *cache, void *objp)
+{
+ struct sk_buff *skb = (struct sk_buff *)objp;
+ unsigned int order =
+ *(unsigned int *)(skb->head - sizeof(unsigned int));
+ free_pages((unsigned long)skb, order);
+}
+
/*
* Free an skbuff by memory without cleaning the state.
*/
@@ -324,17 +409,21 @@ void kfree_skbmem(struct sk_buff *skb)
{
struct sk_buff *other;
atomic_t *fclone_ref;
+ void (*free_skb)(struct kmem_cache *, void *);

skb_release_data(skb);
+
+ free_skb = skb->memalloc ? free_skb_pages : kmem_cache_free;
+
switch (skb->fclone) {
case SKB_FCLONE_UNAVAILABLE:
- kmem_cache_free(skbuff_head_cache, skb);
+ free_skb(skbuff_head_cache, skb);
break;

case SKB_FCLONE_ORIG:
fclone_ref = (atomic_t *) (skb + 2);
if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, skb);
+ free_skb(skbuff_fclone_cache, skb);
break;

case SKB_FCLONE_CLONE:
@@ -347,7 +436,7 @@ void kfree_skbmem(struct sk_buff *skb)
skb->fclone = SKB_FCLONE_UNAVAILABLE;

if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, other);
+ free_skb(skbuff_fclone_cache, other);
break;
};
}
@@ -363,6 +452,8 @@ void kfree_skbmem(struct sk_buff *skb)

void __kfree_skb(struct sk_buff *skb)
{
+ struct net_device *dev = skb->dev;
+
dst_release(skb->dst);
#ifdef CONFIG_XFRM
secpath_put(skb->sp);
@@ -389,6 +480,8 @@ void __kfree_skb(struct sk_buff *skb)
#endif

kfree_skbmem(skb);
+ if (dev && (dev->flags & IFF_MEMALLOC))
+ dev_unreserve_skb(dev);
}

/**
@@ -434,10 +527,15 @@ struct sk_buff *skb_clone(struct sk_buff
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ /*
+ * should we special-case skb->memalloc cloning?
+ * for now fudge it by forcing fast-clone alloc.
+ */
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
n->fclone = SKB_FCLONE_UNAVAILABLE;
+ n->memalloc = 0;
}

#define C(x) n->x = skb->x
@@ -686,6 +784,8 @@ int pskb_expand_head(struct sk_buff *skb
if (skb_shared(skb))
BUG();

+ BUG_ON(skb->memalloc);
+
size = SKB_DATA_ALIGN(size);

data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
Index: linux-2.6/net/ethernet/eth.c
===================================================================
--- linux-2.6.orig/net/ethernet/eth.c
+++ linux-2.6/net/ethernet/eth.c
@@ -275,6 +275,7 @@ void ether_setup(struct net_device *dev)
dev->mtu = ETH_DATA_LEN;
dev->addr_len = ETH_ALEN;
dev->tx_queue_len = 1000; /* Ethernet wants good queues */
+ dev->rx_reserve = 384;
dev->flags = IFF_BROADCAST|IFF_MULTICAST;

memset(dev->broadcast,0xFF, ETH_ALEN);
Index: linux-2.6/net/ipv4/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/icmp.c
+++ linux-2.6/net/ipv4/icmp.c
@@ -938,6 +938,11 @@ int icmp_rcv(struct sk_buff *skb)
goto error;
}

+ if (unlikely(dev_reserve_used(skb->dev))) {
+ dev_unreserve_skb(skb->dev);
+ goto drop;
+ }
+
if (!pskb_pull(skb, sizeof(struct icmphdr)))
goto error;

Index: linux-2.6/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_ipv4.c
+++ linux-2.6/net/ipv4/tcp_ipv4.c
@@ -1093,6 +1093,12 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;

+ if (unlikely(dev_reserve_used(skb->dev))) {
+ dev_unreserve_skb(skb->dev);
+ if (!sk_is_memalloc(sk))
+ goto discard_and_relse;
+ }
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -1136,7 +1136,15 @@ int udp_rcv(struct sk_buff *skb)
sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);

if (sk != NULL) {
- int ret = udp_queue_rcv_skb(sk, skb);
+ int ret;
+
+ if (unlikely(dev_reserve_used(skb->dev))) {
+ dev_unreserve_skb(skb->dev);
+ if (!sk_is_memalloc(sk))
+ goto drop_noncritical;
+ }
+
+ ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);

/* a return value > 0 means to resubmit the input, but
@@ -1147,6 +1155,7 @@ int udp_rcv(struct sk_buff *skb)
return 0;
}

+drop_noncritical:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto drop;
nf_reset(skb);
Index: linux-2.6/include/linux/if.h
===================================================================
--- linux-2.6.orig/include/linux/if.h
+++ linux-2.6/include/linux/if.h
@@ -49,6 +49,7 @@

#define IFF_LOWER_UP 0x10000 /* driver signals L1 up */
#define IFF_DORMANT 0x20000 /* driver signals dormant */
+#define IFF_MEMALLOC 0x40000 /* driver has a reserve allocated */

#define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|\
IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -2900,6 +2900,7 @@ int register_netdevice(struct net_device
#ifdef CONFIG_NET_CLS_ACT
spin_lock_init(&dev->ingress_lock);
#endif
+ spin_lock_init(&dev->memalloc_lock);

ret = alloc_divert_blk(dev);
if (ret)
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c
+++ linux-2.6/net/ipv4/af_inet.c
@@ -131,6 +131,20 @@ static DEFINE_SPINLOCK(inetsw_lock);
void inet_sock_destruct(struct sock *sk)
{
struct inet_sock *inet = inet_sk(sk);
+ struct net_device *dev = ip_dev_find(inet->rcv_saddr);
+
+ if (dev && (dev->flags & IFF_MEMALLOC) &&
+ sk_is_memalloc(sk) &&
+ (atomic_read(&dev->memalloc_socks) == 1)) {
+ spin_lock(&dev->memalloc_lock);
+ if (atomic_dec_and_test(&dev->memalloc_socks)) {
+ dev->flags &= ~IFF_MEMALLOC;
+ WARN_ON(dev_reserve_used(dev));
+ atomic_set(&dev->rx_reserve_used, 0);
+ adjust_memalloc_reserve(-dev->memalloc_reserve);
+ }
+ spin_unlock(&dev->memalloc_lock);
+ }

__skb_queue_purge(&sk->sk_receive_queue);
__skb_queue_purge(&sk->sk_error_queue);
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -111,6 +111,7 @@
#include <linux/poll.h>
#include <linux/tcp.h>
#include <linux/init.h>
+#include <linux/inetdevice.h>

#include <asm/uaccess.h>
#include <asm/system.h>
@@ -195,6 +196,59 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);

+#define ceiling_log2(x) fls((x) - 1)
+
+static inline unsigned int skb_pages(unsigned int mtu)
+{
+ unsigned int pages = (mtu + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ unsigned int order = ceiling_log2(pages);
+ pages = 1 << order;
+ if (pages > 1) ++pages;
+
+ return pages;
+}
+
+int sk_set_memalloc(struct sock *sk)
+{
+ struct inet_sock *inet = inet_sk(sk);
+ struct net_device *dev = ip_dev_find(inet->rcv_saddr);
+ int err = 0;
+
+ if (!dev)
+ return -ENODEV;
+
+ if (!(dev->features & NETIF_F_MEMALLOC)) {
+ err = -EPERM;
+ goto out;
+ }
+
+ if (atomic_read(&dev->memalloc_socks) == 0) {
+ spin_lock(&dev->memalloc_lock);
+ if (atomic_read(&dev->memalloc_socks) == 0) {
+ dev->memalloc_reserve =
+ dev->rx_reserve * skb_pages(dev->mtu);
+ err = adjust_memalloc_reserve(dev->memalloc_reserve);
+ if (err) {
+ spin_unlock(&dev->memalloc_lock);
+ printk(KERN_WARNING
+ "%s: Unable to allocate RX reserve, error: %d\n",
+ dev->name, err);
+ goto out;
+ }
+ sock_set_flag(sk, SOCK_MEMALLOC);
+ dev->flags |= IFF_MEMALLOC;
+ }
+ atomic_inc(&dev->memalloc_socks);
+ spin_unlock(&dev->memalloc_lock);
+ } else
+ atomic_inc(&dev->memalloc_socks);
+
+out:
+ dev_put(dev);
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;

2006-08-08 19:36:52

by Peter Zijlstra

Subject: [RFC][PATCH 5/9] r8169 driver conversion


Update the driver to make use of the netdev_alloc_skb() API and the
NETIF_F_MEMALLOC feature.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
drivers/net/r8169.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

Index: linux-2.6/drivers/net/r8169.c
===================================================================
--- linux-2.6.orig/drivers/net/r8169.c
+++ linux-2.6/drivers/net/r8169.c
@@ -1480,6 +1480,8 @@ rtl8169_init_board(struct pci_dev *pdev,
}
}

+ dev->features |= NETIF_F_MEMALLOC;
+
pci_set_master(pdev);

/* ioremap MMIO region */
@@ -1909,14 +1911,15 @@ static inline void rtl8169_map_to_asic(s
rtl8169_mark_to_asic(desc, rx_buf_sz);
}

-static int rtl8169_alloc_rx_skb(struct pci_dev *pdev, struct sk_buff **sk_buff,
+static int rtl8169_alloc_rx_skb(struct net_device *dev, struct pci_dev *pdev,
+ struct sk_buff **sk_buff,
struct RxDesc *desc, int rx_buf_sz)
{
struct sk_buff *skb;
dma_addr_t mapping;
int ret = 0;

- skb = dev_alloc_skb(rx_buf_sz + NET_IP_ALIGN);
+ skb = netdev_alloc_skb(dev, rx_buf_sz + NET_IP_ALIGN);
if (!skb)
goto err_out;

@@ -1960,7 +1963,7 @@ static u32 rtl8169_rx_fill(struct rtl816
if (tp->Rx_skbuff[i])
continue;

- ret = rtl8169_alloc_rx_skb(tp->pci_dev, tp->Rx_skbuff + i,
+ ret = rtl8169_alloc_rx_skb(dev, tp->pci_dev, tp->Rx_skbuff + i,
tp->RxDescArray + i, tp->rx_buf_sz);
if (ret < 0)
break;
@@ -2371,7 +2374,8 @@ static inline void rtl8169_rx_csum(struc
skb->ip_summed = CHECKSUM_NONE;
}

-static inline int rtl8169_try_rx_copy(struct sk_buff **sk_buff, int pkt_size,
+static inline int rtl8169_try_rx_copy(struct net_device *dev,
+ struct sk_buff **sk_buff, int pkt_size,
struct RxDesc *desc, int rx_buf_sz)
{
int ret = -1;
@@ -2379,7 +2383,7 @@ static inline int rtl8169_try_rx_copy(st
if (pkt_size < rx_copybreak) {
struct sk_buff *skb;

- skb = dev_alloc_skb(pkt_size + NET_IP_ALIGN);
+ skb = netdev_alloc_skb(dev, pkt_size + NET_IP_ALIGN);
if (skb) {
skb_reserve(skb, NET_IP_ALIGN);
eth_copy_and_sum(skb, sk_buff[0]->data, pkt_size, 0);
@@ -2452,7 +2456,7 @@ rtl8169_rx_interrupt(struct net_device *
le64_to_cpu(desc->addr), tp->rx_buf_sz,
PCI_DMA_FROMDEVICE);

- if (rtl8169_try_rx_copy(&skb, pkt_size, desc,
+ if (rtl8169_try_rx_copy(dev, &skb, pkt_size, desc,
tp->rx_buf_sz)) {
pci_action = pci_unmap_single;
tp->Rx_skbuff[entry] = NULL;

2006-08-08 19:36:52

by Peter Zijlstra

Subject: [RFC][PATCH 6/9] tg3 driver conversion


Update the driver to make use of the NETIF_F_MEMALLOC feature.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
drivers/net/tg3.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6/drivers/net/tg3.c
===================================================================
--- linux-2.6.orig/drivers/net/tg3.c
+++ linux-2.6/drivers/net/tg3.c
@@ -11643,6 +11643,8 @@ static int __devinit tg3_init_one(struct
*/
pci_save_state(tp->pdev);

+ dev->features |= NETIF_F_MEMALLOC;
+
err = register_netdev(dev);
if (err) {
printk(KERN_ERR PFX "Cannot register net device, "

2006-08-08 20:14:56

by Kok, Auke

Subject: Re: [RFC][PATCH 4/9] e100 driver conversion

Peter Zijlstra wrote:
> Update the driver to make use of the netdev_alloc_skb() API and the
> NETIF_F_MEMALLOC feature.

This should be done in two separate patches. I should take care of the netdev_alloc_skb()
part for e100 too (which I've already queued internally), also since ixgb still needs it.

do you have any plans to visit ixgb for this change too?

Cheers,

Auke


> Signed-off-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Daniel Phillips <[email protected]>
>
> ---
> drivers/net/e100.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/drivers/net/e100.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/e100.c
> +++ linux-2.6/drivers/net/e100.c
> @@ -1763,7 +1763,7 @@ static inline void e100_start_receiver(s
> #define RFD_BUF_LEN (sizeof(struct rfd) + VLAN_ETH_FRAME_LEN)
> static int e100_rx_alloc_skb(struct nic *nic, struct rx *rx)
> {
> - if(!(rx->skb = dev_alloc_skb(RFD_BUF_LEN + NET_IP_ALIGN)))
> + if(!(rx->skb = netdev_alloc_skb(nic->netdev, RFD_BUF_LEN + NET_IP_ALIGN)))
> return -ENOMEM;
>
> /* Align, init, and map the RFD. */
> @@ -2143,7 +2143,7 @@ static int e100_loopback_test(struct nic
>
> e100_start_receiver(nic, NULL);
>
> - if(!(skb = dev_alloc_skb(ETH_DATA_LEN))) {
> + if(!(skb = netdev_alloc_skb(nic->netdev, ETH_DATA_LEN))) {
> err = -ENOMEM;
> goto err_loopback_none;
> }
> @@ -2573,6 +2573,7 @@ static int __devinit e100_probe(struct p
> #ifdef CONFIG_NET_POLL_CONTROLLER
> netdev->poll_controller = e100_netpoll;
> #endif
> + netdev->features |= NETIF_F_MEMALLOC;
> strcpy(netdev->name, pci_name(pdev));
>
> nic = netdev_priv(netdev);

2006-08-08 20:18:46

by Peter Zijlstra

Subject: Re: [RFC][PATCH 4/9] e100 driver conversion

On Tue, 2006-08-08 at 13:13 -0700, Auke Kok wrote:
> Peter Zijlstra wrote:
> > Update the driver to make use of the netdev_alloc_skb() API and the
> > NETIF_F_MEMALLOC feature.
>
> this should be done in two separate patches. I should take care of the netdev_alloc_skb()
> part too for e100 (which I've already queued internally), also since ixgb still needs it.
>
> do you have any plans to visit ixgb for this change too?

Well, all drivers are queued, these were just the ones I have hardware
for in running systems (except wireless).

Since this patch-set is essentially an RFC, your patch will likely hit
mainline ere this one, at which point I'll rebase.

For future patches I'll split them up in two if people are so inclined.


2006-08-08 20:51:43

by Kok, Auke

Subject: Re: [RFC][PATCH 3/9] e1000 driver conversion

Peter Zijlstra wrote:
> Update the driver to make use of the NETIF_F_MEMALLOC feature.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Daniel Phillips <[email protected]>
>
> ---
> drivers/net/e1000/e1000_main.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> Index: linux-2.6/drivers/net/e1000/e1000_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/e1000/e1000_main.c
> +++ linux-2.6/drivers/net/e1000/e1000_main.c
> @@ -4020,8 +4020,6 @@ e1000_alloc_rx_buffers(struct e1000_adap
> */
> skb_reserve(skb, NET_IP_ALIGN);
>
> - skb->dev = netdev;
> -
> buffer_info->skb = skb;
> buffer_info->length = adapter->rx_buffer_len;
> map_skb:
> @@ -4135,8 +4136,6 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
> */
> skb_reserve(skb, NET_IP_ALIGN);
>
> - skb->dev = netdev;
> -
> buffer_info->skb = skb;
> buffer_info->length = adapter->rx_ps_bsize0;
> buffer_info->dma = pci_map_single(pdev, skb->data,
> -

can we really delete these??

Cheers,

Auke

2006-08-08 20:57:31

by Stephen Hemminger

Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Tue, 08 Aug 2006 21:33:45 +0200
Peter Zijlstra <[email protected]> wrote:

>
> The core of the VM deadlock avoidance framework.
>
> From the 'user' side of things it provides a function to mark a 'struct sock'
> as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on
> the receive side.
>
> From the net_device side of things, the extra 'struct net_device *' argument
> to {,__}netdev_alloc_skb() is used to attribute/account the memalloc usage.
> Converted drivers will make use of this new API and will set NETIF_F_MEMALLOC
> to indicate the driver fully supports this feature.
>
> When a SOCK_MEMALLOC socket is marked, the device is checked for this feature
> and tries to increase the memalloc pool; if both succeed, the device is marked
> with IFF_MEMALLOC, indicating to {,__}netdev_alloc_skb() that it is OK to dip
> into the memalloc pool.
>
> Memalloc sk_buff allocations are not done from the SLAB but are done using
> alloc_pages(). sk_buff::memalloc records this exception so that kfree_skbmem()
> can do the right thing.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Daniel Phillips <[email protected]>
>

How much of this is just building special case support for large allocations
for jumbo frames? Wouldn't it make more sense to just fix those drivers to
do scatter and add the support hooks for that?

2006-08-08 21:06:05

by Peter Zijlstra

Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Tue, 2006-08-08 at 13:57 -0700, Stephen Hemminger wrote:
> On Tue, 08 Aug 2006 21:33:45 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> >
> > The core of the VM deadlock avoidance framework.
> >
> > From the 'user' side of things it provides a function to mark a 'struct sock'
> > as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on
> > the receive side.
> >
> > From the net_device side of things, the extra 'struct net_device *' argument
> > to {,__}netdev_alloc_skb() is used to attribute/account the memalloc usage.
> > Converted drivers will make use of this new API and will set NETIF_F_MEMALLOC
> > to indicate the driver fully supports this feature.
> >
> > When a SOCK_MEMALLOC socket is marked, the device is checked for this feature
> > and tries to increase the memalloc pool; if both succeed, the device is marked
> > with IFF_MEMALLOC, indicating to {,__}netdev_alloc_skb() that it is OK to dip
> > into the memalloc pool.
> >
> > Memalloc sk_buff allocations are not done from the SLAB but are done using
> > alloc_pages(). sk_buff::memalloc records this exception so that kfree_skbmem()
> > can do the right thing.
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Daniel Phillips <[email protected]>
> >
>
> How much of this is just building special case support for large allocations
> for jumbo frames? Wouldn't it make more sense to just fix those drivers to
> do scatter and add the support hooks for that?

Only some of the horrors in __alloc_skb(), especially those related to the
order argument. OTOH, yes, I would very much like all the jumbo-capable
drivers to do proper scatter/gather on fragments; alas, drivers are not my
strong point.

If someone (preferably the maintainers) would contribute patches....

2006-08-08 21:15:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/9] e1000 driver conversion

On Tue, 2006-08-08 at 13:50 -0700, Auke Kok wrote:
> Peter Zijlstra wrote:
> > Update the driver to make use of the NETIF_F_MEMALLOC feature.
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Daniel Phillips <[email protected]>
> >
> > ---
> > drivers/net/e1000/e1000_main.c | 11 +++++------
> > 1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > Index: linux-2.6/drivers/net/e1000/e1000_main.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/net/e1000/e1000_main.c
> > +++ linux-2.6/drivers/net/e1000/e1000_main.c
> > @@ -4020,8 +4020,6 @@ e1000_alloc_rx_buffers(struct e1000_adap
> > */
> > skb_reserve(skb, NET_IP_ALIGN);
> >
> > - skb->dev = netdev;
> > -
> > buffer_info->skb = skb;
> > buffer_info->length = adapter->rx_buffer_len;
> > map_skb:
> > @@ -4135,8 +4136,6 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
> > */
> > skb_reserve(skb, NET_IP_ALIGN);
> >
> > - skb->dev = netdev;
> > -
> > buffer_info->skb = skb;
> > buffer_info->length = adapter->rx_ps_bsize0;
> > buffer_info->dma = pci_map_single(pdev, skb->data,
> > -
>
> can we really delete these??

The new {,__}netdev_alloc_skb() will set it when the allocation
succeeds.
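
In condensed form, the shape of the helper (a sketch distilled from the
patch hunk quoted above; the memalloc branches are elided):

    struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
                                       unsigned length, gfp_t gfp_mask)
    {
            struct sk_buff *skb = ___netdev_alloc_skb(dev, length, gfp_mask);
            if (skb)
                    skb->dev = dev; /* set centrally, not in every driver */
            return skb;
    }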

2006-08-08 21:17:13

by Thomas Graf

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

* Peter Zijlstra <[email protected]> 2006-08-08 21:33
> +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
> + unsigned length, gfp_t gfp_mask)
> +{
> + struct sk_buff *skb;
> +
> + if (dev && (dev->flags & IFF_MEMALLOC)) {
> + WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
> + gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
> +
> + if ((skb = ___netdev_alloc_skb(dev, length,
> + gfp_mask | __GFP_NOMEMALLOC)))
> + goto done;
> + if (dev_reserve_used(dev) >= dev->rx_reserve)
> + goto out;
> + if (!(skb = ___netdev_alloc_skb(dev, length,
> + gfp_mask | __GFP_MEMALLOC)))
> + goto out;
> + atomic_inc(&dev->rx_reserve_used);
> + } else
> + if (!(skb = ___netdev_alloc_skb(dev, length, gfp_mask)))
> + goto out;
> +
> +done:
> + skb->dev = dev;
> +out:
> + return skb;
> +}
> +

> void __kfree_skb(struct sk_buff *skb)
> {
> + struct net_device *dev = skb->dev;
> +
> dst_release(skb->dst);
> #ifdef CONFIG_XFRM
> secpath_put(skb->sp);
> @@ -389,6 +480,8 @@ void __kfree_skb(struct sk_buff *skb)
> #endif
>
> kfree_skbmem(skb);
> + if (dev && (dev->flags & IFF_MEMALLOC))
> + dev_unreserve_skb(dev);
> }

skb->dev is not guaranteed to still point to the "allocating" device
once the skb is freed again so reserve/unreserve isn't symmetric.
You'd need skb->alloc_dev or something.
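
A sketch of the symmetric accounting being hinted at, using a hypothetical
sk_buff::alloc_dev field that is not in the posted patch (and ignoring the
device refcounting this would require):

    /* at allocation: remember which device charged the reserve */
    skb->alloc_dev = dev;

    /* at free: unreserve against the allocating device, regardless
     * of where skb->dev points by now */
    if (skb->alloc_dev && (skb->alloc_dev->flags & IFF_MEMALLOC))
            dev_unreserve_skb(skb->alloc_dev);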

2006-08-08 22:10:17

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core


I think the new atomic operation that will seemingly occur on every
device SKB free is unacceptable.

You also cannot modify netdev->flags in the lockless manner in which
you do, it must be done with the appropriate locking, such as holding
the RTNL semaphore.

2006-08-08 22:32:07

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/9] e1000 driver conversion

From: Auke Kok <[email protected]>
Date: Tue, 08 Aug 2006 13:50:33 -0700

> can we really delete these??

netdev_alloc_skb() does it for you

2006-08-08 22:43:25

by Kok, Auke

[permalink] [raw]
Subject: Re: [RFC][PATCH 3/9] e1000 driver conversion

David Miller wrote:
> From: Auke Kok <[email protected]>
> Date: Tue, 08 Aug 2006 13:50:33 -0700
>
>> can we really delete these??
>
> netdev_alloc_skb() does it for you

yeah I didn't spot that patch #2 in that series introduces that code - my bad.
Thanks.

Auke

2006-08-08 23:07:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

Peter Zijlstra wrote:
> Update the driver to make use of the netdev_alloc_skb() API and the
> NETIF_F_MEMALLOC feature.

NETIF_F_MEMALLOC does not exist in the upstream tree... nor should it, IMO.

netdev_alloc_skb() is in the tree, and that's fine.

Jeff



2006-08-09 00:25:50

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Indan Zupancic wrote:
> Hello,
> Saw the patch on lkml, and wondered about some things.
> On Tue, August 8, 2006 21:33, Peter Zijlstra said:
>
>>+static inline void dev_unreserve_skb(struct net_device *dev)
>>+{
>>+ if (atomic_dec_return(&dev->rx_reserve_used) < 0)
>>+ atomic_inc(&dev->rx_reserve_used);
>>+}
>>+
>
> This looks strange. Is it possible for rx_reserve_used to become negative?
> A quick look at the code says no, except in the one case where there isn't a
> "if (unlikely(dev_reserve_used(skb->dev)))" check:

Yes, you can see what I'm trying to do there. I was short an atomic op to
do it, and my ugly solution may well be racy... probably is. Let's rewrite
it with something better. We want the atomic op that some people call
"monus": decrement unless zero.

>>@@ -389,6 +480,8 @@ void __kfree_skb(struct sk_buff *skb)
>> #endif
>>
>> kfree_skbmem(skb);
>>+ if (dev && (dev->flags & IFF_MEMALLOC))
>>+ dev_unreserve_skb(dev);
>> }
>
> So it seems that that < 0 check in dev_unreserve_skb() was only added to handle
> this case (though there seems to be a race between those two atomic ops).
> Isn't it better to remove that check and just do:
>
> if (dev && (dev->flags & IFF_MEMALLOC) && dev_reserve_used(dev))

Seems to me that unreserve is also called from each protocol handler.
Agreed, the < 0 check is fairly sickening.

> The use of atomic seems a bit dubious. Either it's necessary, in which case
> changes depending on tests seem unsafe as they're not atomic and something
> crucial could change between the read and the check, or normal reads and writes
> combined with barriers would be sufficient. All in all it seems better to
> move that "if (unlikely(dev_reserve_used(skb->dev)))" check into
> dev_unreserve_skb(), and make the whole thing atomic if necessary. Then let
> dev_unreserve_skb() return whether rx_reserve_used was positive.
> Getting rid of dev_reserve_used() and using the atomic_read directly might be
> better, as it is set with the bare atomic instructions too and rarely used without
> dev_unreserve_skb().

Barriers should work for this reserve accounting, but is that better than
an atomic op? I don't know, let's let the barrier mavens opine.

IMHO the cleanest thing to do is code up "monus", in fact I dimly recall
somebody already added something similar.

Side note: please don't be shy, just reply-all in future so the discussion
stays public. Part of what we do is try to share our development process
so people see not only what we have done, but why we did it. (And sometimes
so they can see what dumb mistakes we make, but I won't get into that...)

I beg for forgiveness in advance having taken the liberty of CCing this
reply to lkml.

Regards,

Daniel

2006-08-09 01:34:16

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Stephen Hemminger wrote:
> How much of this is just building special case support for large allocations
> for jumbo frames? Wouldn't it make more sense to just fix those drivers to
> do scatter and add the support hooks for that?

Short answer: none of it is. If it happens to handle jumbo frames nicely
that is mainly a lucky accident, and we do need to check that they actually
work.

Minor rant: the whole skb_alloc family has degenerated into an unholy
mess and could use some rethinking. I believe the current patch gets as
far as three _'s at the beginning of a function, which shows it is high
time to reroll the API.

Regards,

Daniel

2006-08-09 01:34:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Thomas Graf wrote:
> skb->dev is not guaranteed to still point to the "allocating" device
> once the skb is freed again so reserve/unreserve isn't symmetric.
> You'd need skb->alloc_dev or something.

Can you please characterize the conditions under which skb->dev changes
after the alloc? Are there writings on this subtlety?

Regards,

Daniel

2006-08-09 01:36:05

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Hi Dave,

David Miller wrote:
> I think the new atomic operation that will seemingly occur on every
> device SKB free is unacceptable.

Alternate suggestion?

> You also cannot modify netdev->flags in the lockless manner in which
> you do, it must be done with the appropriate locking, such as holding
> the RTNL semaphore.

Thanks for the catch.

Regards,

Daniel



2006-08-09 01:38:36

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Daniel Phillips <[email protected]>
Date: Tue, 08 Aug 2006 18:33:20 -0700

> Minor rant: the whole skb_alloc family has degenerated into an unholy
> mess and could use some rethinking. I believe the current patch gets as
> far as three _'s at the beginning of a function, which shows it is high
> time to reroll the API.

I think it is merely an expression of how dynamic the operations that
people want to perform on SKBs are, and how important it is for
performance to implement COW semantics for the data.

2006-08-09 01:39:55

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Daniel Phillips <[email protected]>
Date: Tue, 08 Aug 2006 18:34:43 -0700

> Can you please characterize the conditions under which skb->dev changes
> after the alloc? Are there writings on this subtlety?

The packet scheduler and classifier can redirect packets to different
devices, and so can the netfilter layer.

The setting of skb->dev is wholly transient and you cannot rely upon
it to be the same as when you set it on allocation.

Even simple things like the bonding device change skb->dev on every
receive.

I think you need to study the networking stack a little more before
you continue to play in this delicate area :-)

2006-08-09 01:42:29

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Daniel Phillips <[email protected]>
Date: Tue, 08 Aug 2006 18:35:42 -0700

> David Miller wrote:
> > I think the new atomic operation that will seemingly occur on every
> > device SKB free is unacceptable.
>
> Alternate suggestion?

Sorry, I have none. But you're unlikely to get your changes
considered seriously unless you can avoid any new overhead your patch
has which is of this level.

We're busy trying to make these data structures smaller, and eliminate
atomic operations, as much as possible. Therefore anything which adds
new data structure elements and new atomic operations will be met with
fierce resistance unless it results in an equal or greater shrink of
data structures elsewhere or removes atomic operations elsewhere in
the critical path.

2006-08-09 05:44:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

David Miller wrote:
>From: Daniel Phillips <[email protected]>
>>David Miller wrote:
>>>I think the new atomic operation that will seemingly occur on every
>>>device SKB free is unacceptable.
>>
>>Alternate suggestion?
>
> Sorry, I have none. But you're unlikely to get your changes
> considered seriously unless you can avoid any new overhead your patch
> has which is of this level.

We just skip anything new unless the socket is actively carrying block
IO traffic, in which case we pay a minuscule price to avoid severe
performance artifacts or, in the worst case, deadlock. So in this design
the new atomic operation does not occur on every device SKB free.

All atomic ops sit behind the cheap test:

(dev->flags & IFF_MEMALLOC)

or if any have escaped that is just an oversight. Peter?

> We're busy trying to make these data structures smaller, and eliminate
> atomic operations, as much as possible. Therefore anything which adds
> new data structure elements and new atomic operations will be met with
> fierce resistance unless it results in an equal or greater shrink of
> data structures elsewhere or removes atomic operations elsewhere in
> the critical path.

Right now we have a problem because our network stack cannot support
block IO reliably. Without that, Linux is no enterprise storage
platform.

Regards,

Daniel

2006-08-09 05:47:39

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra ([email protected]) wrote:
> http://lwn.net/Articles/144273/
> "Kernel Summit 2005: Convergence of network and storage paths"
>
> We believe that an approach very much like today's patch set is
> necessary for NBD, iSCSI, AoE or the like ever to work reliably.
> We further believe that a properly working version of at least one of
> these subsystems is critical to the viability of Linux as a modern
> storage platform.

There is another approach for that - do not use the slab allocator for
network dataflow at all. It automatically has all your pros and if
implemented correctly can have a lot of additional useful and
high-performance features like full zero-copy and total fragmentation
avoidance.

--
Evgeniy Polyakov

2006-08-09 05:47:30

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

David Miller wrote:
> From: Daniel Phillips <[email protected]>
>>Can you please characterize the conditions under which skb->dev changes
>>after the alloc? Are there writings on this subtlety?
>
> The packet scheduler and classifier can redirect packets to different
> devices, and so can the netfilter layer.
>
> The setting of skb->dev is wholly transient and you cannot rely upon
> it to be the same as when you set it on allocation.
>
> Even simple things like the bonding device change skb->dev on every
> receive.

Thank you, this is easily fixed.

> I think you need to study the networking stack a little more before
> you continue to play in this delicate area :-)

The VM deadlock is also delicate. Perhaps we can work together.

Regards,

Daniel

2006-08-09 05:51:39

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

Jeff Garzik wrote:
> Peter Zijlstra wrote:
>> Update the driver to make use of the netdev_alloc_skb() API and the
>> NETIF_F_MEMALLOC feature.
>
> NETIF_F_MEMALLOC does not exist in the upstream tree... nor should it,
> IMO.

Elaborate please. Do you think that all drivers should be updated to
fix the broken blockdev semantics, making NETIF_F_MEMALLOC redundant?
If so, I trust you will help audit for it?

> netdev_alloc_skb() is in the tree, and that's fine.

2006-08-09 05:52:53

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
> On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra ([email protected]) wrote:
>
>> http://lwn.net/Articles/144273/
>> "Kernel Summit 2005: Convergence of network and storage paths"
>>
>>We believe that an approach very much like today's patch set is
>>necessary for NBD, iSCSI, AoE or the like ever to work reliably.
>>We further believe that a properly working version of at least one of
>>these subsystems is critical to the viability of Linux as a modern
>>storage platform.
>
> There is another approach for that - do not use the slab allocator for
> network dataflow at all. It automatically has all your pros and if
> implemented correctly can have a lot of additional useful and
> high-performance features like full zero-copy and total fragmentation
> avoidance.

Agreed. But probably more intrusive than davem would be happy with
at this point.

Regards,

Daniel

2006-08-09 05:53:55

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

From: Evgeniy Polyakov <[email protected]>
Date: Wed, 9 Aug 2006 09:46:48 +0400

> > There is another approach for that - do not use the slab allocator for
> > network dataflow at all. It automatically has all your pros and if
> > implemented correctly can have a lot of additional useful and
> high-performance features like full zero-copy and total fragmentation
> avoidance.

Free advertisement for your network tree allocator Evgeniy? :-)

2006-08-09 05:55:38

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

From: Daniel Phillips <[email protected]>
Date: Tue, 08 Aug 2006 22:51:20 -0700

> Elaborate please. Do you think that all drivers should be updated to
> fix the broken blockdev semantics, making NETIF_F_MEMALLOC redundant?
> If so, I trust you will help audit for it?

I think he's saying that he doesn't think your code is yet a
reasonable way to solve the problem, and therefore doesn't belong
upstream.

2006-08-09 05:55:56

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Tue, Aug 08, 2006 at 10:53:55PM -0700, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Wed, 9 Aug 2006 09:46:48 +0400
>
> > > There is another approach for that - do not use the slab allocator for
> > > network dataflow at all. It automatically has all your pros and if
> > > implemented correctly can have a lot of additional useful and
> > high-performance features like full zero-copy and total fragmentation
> > avoidance.
>
> Free advertisement for your network tree allocator Evgeniy? :-)

He-he, some kind of :)

--
Evgeniy Polyakov

2006-08-09 05:56:53

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

From: Daniel Phillips <[email protected]>
Date: Tue, 08 Aug 2006 22:52:34 -0700

> Agreed. But probably more intrusive than davem would be happy with
> at this point.

I'm much more happy with Evgeniy's network tree allocator, which has a
real design and well thought out long term consequences, than your
work.

2006-08-09 06:30:12

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

David Miller wrote:
> From: Daniel Phillips <[email protected]>
> Date: Tue, 08 Aug 2006 22:51:20 -0700
>
>> Elaborate please. Do you think that all drivers should be updated to
>> fix the broken blockdev semantics, making NETIF_F_MEMALLOC redundant?
>> If so, I trust you will help audit for it?
>
> I think he's saying that he doesn't think your code is yet a
> reasonable way to solve the problem, and therefore doesn't belong
> upstream.

Pretty much. It is completely non-sensical to add NETIF_F_MEMALLOC,
when it should be blindingly obvious that every net driver will be
allocating memory, and every net driver could potentially be used with
NBD and similar situations.

Jeff


2006-08-09 07:01:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Tue, 2006-08-08 at 22:44 -0700, Daniel Phillips wrote:
> David Miller wrote:
> >From: Daniel Phillips <[email protected]>
> >>David Miller wrote:
> >>>I think the new atomic operation that will seemingly occur on every
> >>>device SKB free is unacceptable.
> >>
> >>Alternate suggestion?
> >
> > Sorry, I have none. But you're unlikely to get your changes
> > considered seriously unless you can avoid any new overhead your patch
> > has which is of this level.
>
> We just skip anything new unless the socket is actively carrying block
> IO traffic, in which case we pay a minuscule price to avoid severe
> performance artifacts or, in the worst case, deadlock. So in this design
> the new atomic operation does not occur on every device SKB free.
>
> All atomic ops sit behind the cheap test:
>
> (dev->flags & IFF_MEMALLOC)
>
> or if any have escaped that is just an oversight. Peter?

That should be so indeed. Except on the allocation path of course, where
it only occurs when the first allocation fails.

> > We're busy trying to make these data structures smaller, and eliminate
> > atomic operations, as much as possible. Therefore anything which adds
> > new data structure elements and new atomic operations will be met with
> > fierce resistance unless it results in an equal or greater shrink of
> > data structures elsewhere or removes atomic operations elsewhere in
> > the critical path.
>
> Right now we have a problem because our network stack cannot support
> block IO reliably. Without that, Linux is no enterprise storage
> platform.

Indeed, surely not all wanted new features come with zero cost. If it's a
hard condition that all new features remove data and operations, progress
is going to be challenging.

2006-08-09 07:04:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

On Wed, 2006-08-09 at 02:30 -0400, Jeff Garzik wrote:
> David Miller wrote:
> > From: Daniel Phillips <[email protected]>
> > Date: Tue, 08 Aug 2006 22:51:20 -0700
> >
> >> Elaborate please. Do you think that all drivers should be updated to
> >> fix the broken blockdev semantics, making NETIF_F_MEMALLOC redundant?
> >> If so, I trust you will help audit for it?
> >
> > I think he's saying that he doesn't think your code is yet a
> > reasonable way to solve the problem, and therefore doesn't belong
> > upstream.
>
> Pretty much. It is completely non-sensical to add NETIF_F_MEMALLOC,
> when it should be blindingly obvious that every net driver will be
> allocating memory, and every net driver could potentially be used with
> NBD and similar situations.

Sure, but until every single driver is converted I'd like to warn people
about the fact that their setups are not up to expectations. Iff all
drivers are converted I'll be the first to submit a patch that removes
the feature flag.

2006-08-09 07:20:34

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

Peter Zijlstra wrote:
> On Wed, 2006-08-09 at 02:30 -0400, Jeff Garzik wrote:
>> David Miller wrote:
>>> From: Daniel Phillips <[email protected]>
>>> Date: Tue, 08 Aug 2006 22:51:20 -0700
>>>
>>>> Elaborate please. Do you think that all drivers should be updated to
>>>> fix the broken blockdev semantics, making NETIF_F_MEMALLOC redundant?
>>>> If so, I trust you will help audit for it?
>>> I think he's saying that he doesn't think your code is yet a
>>> reasonable way to solve the problem, and therefore doesn't belong
>>> upstream.
>> Pretty much. It is completely non-sensical to add NETIF_F_MEMALLOC,
>> when it should be blindingly obvious that every net driver will be
>> allocating memory, and every net driver could potentially be used with
>> NBD and similar situations.
>
> Sure, but until every single driver is converted I'd like to warn people
> about the fact that their setups are not up to expectations. Iff all
> drivers are converted I'll be the first to submit a patch that removes
> the feature flag.

A temporary-for-years flag is not a good approach. The flag is not
_needed_ for technical reasons, but for supposed user expectation reasons.

Rather, just go ahead and convert drivers to netdev_alloc_skb() where
people care. If someone suddenly gets a burr up their ass about the
sunlance or epic100 driver deadlocking on NBD, then they can convert it
or complain loudly themselves.

Overall, a good solution needs to be uniform across all net drivers.
NETIF_F_MEMALLOC is just _encouraging_ people to be slackers and delay
converting other drivers, creating two classes of drivers, the "haves"
and the "have nots".

Just make a big netdev_alloc_skb() patch that converts most users.
netdev_alloc_skb() is a good thing to use, because it builds an
association with struct net_device and the allocation.

Jeff



P.S. Since netdev_alloc_skb() calls skb_reserve(), you need to take
that into account. That's a bug in current patches.
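
Concretely, if the helper already reserves the headroom as described, a
converted receive-buffer loop would drop its own reserve call; a sketch
(buffer length taken from the e1000 hunk above; the headroom behaviour is
Jeff's claim, not verified here):

    skb = netdev_alloc_skb(netdev, adapter->rx_buffer_len);
    if (!skb)
            break;
    /* no skb_reserve(skb, NET_IP_ALIGN) here: if the helper already
     * reserved the headroom, reserving again would double it */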

2006-08-09 12:03:20

by Indan Zupancic

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, August 9, 2006 2:25, Daniel Phillips said:
> Indan Zupancic wrote:
>> Hello,
>> Saw the patch on lkml, and wondered about some things.
>> On Tue, August 8, 2006 21:33, Peter Zijlstra said:
>>
>>>+static inline void dev_unreserve_skb(struct net_device *dev)
>>>+{
>>>+ if (atomic_dec_return(&dev->rx_reserve_used) < 0)
>>>+ atomic_inc(&dev->rx_reserve_used);
>>>+}
>>>+
>>
>> This looks strange. Is it possible for rx_reserve_used to become negative?
>> A quick look at the code says no, except in the one case where there isn't a
>> "if (unlikely(dev_reserve_used(skb->dev)))" check:
>
> Yes, you can see what I'm trying to do there. I was short an atomic op to
> do it, and my ugly solution may well be racy... probably is. Let's rewrite
> it with something better. We want the atomic op that some people call
> "monus": decrement unless zero.

Currently atomic_inc_not_zero(), atomic_add_unless() and atomic_cmpxchg()
exist, so making an atomic_dec_not_zero() should be easy.
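
For instance, a sketch of such a helper on top of atomic_add_unless()
(no helper by this name exists in the tree at this point):

    /* "monus": decrement v unless it is already zero; returns
     * non-zero iff the decrement actually happened */
    static inline int atomic_dec_not_zero(atomic_t *v)
    {
            return atomic_add_unless(v, -1, 0);
    }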

>>>@@ -389,6 +480,8 @@ void __kfree_skb(struct sk_buff *skb)
>>> #endif
>>>
>>> kfree_skbmem(skb);
>>>+ if (dev && (dev->flags & IFF_MEMALLOC))
>>>+ dev_unreserve_skb(dev);
>>> }
>>
>> So it seems that that < 0 check in dev_unreserve_skb() was only added to handle
>> this case (though there seems to be a race between those two atomic ops).
>> Isn't it better to remove that check and just do:
>>
>> if (dev && (dev->flags & IFF_MEMALLOC) && dev_reserve_used(dev))
>
> Seems to me that unreserve is also called from each protocol handler.
> Agreed, the < 0 check is fairly sickening.

Yes, but those all do that (racy) "if (unlikely(dev_reserve_used(skb->dev)))"
check. Adding another racy check doesn't improve the situation.


>> The use of atomic seems a bit dubious. Either it's necessary, in which case
>> changes depending on tests seem unsafe as they're not atomic and something
>> crucial could change between the read and the check, or normal reads and writes
>> combined with barriers would be sufficient. All in all it seems better to
>> move that "if (unlikely(dev_reserve_used(skb->dev)))" check into
>> dev_unreserve_skb(), and make the whole thing atomic if necessary. Then let
>> dev_unreserve_skb() return whether rx_reserve_used was positive.
>> Getting rid of dev_reserve_used() and using the atomic_read directly might be
>> better, as it is set with the bare atomic instructions too and rarely used without
>> dev_unreserve_skb().
>
> Barriers should work for this reserve accounting, but is that better than
> an atomic op? I don't know, let's let the barrier mavens opine.

The atomic ops are fine, but if you do two atomic ops then that as a whole
isn't atomic any more, and is often racy. That's what seems to be the case
here.

> IMHO the cleanest thing to do is code up "monus", in fact I dimly recall
> somebody already added something similar.

I'm not sure, to me it looks like dev_unreserve_skb() is always called
without really knowing if it is justified or not, or else there wouldn't
be a chance that the counter became negative. So avoiding the negative
reserve usage seems like papering over bad accounting.

The assumption made seems to be that if there's reserve used, then it must
be us using it, and it's unreserved. So it appears that either that
assumption is wrong, and we can unreserve for others while we never
reserved for ourselves, or it is correct, in which case it probably makes
more sense to check for the IFF_MEMALLOC flag.

All in all it seems like a per skb flag which tells us if this skb was the
one reserving anything is missing. Or rx_reserve_used must be updated for
all in flight skbs whenever the IFF_MEMALLOC flag changes, so that we can
be sure that the accounting works correctly. Oh wait, isn't that what the
memalloc flag is for? So shouldn't it be sufficient to check only with
sk_is_memalloc()? That avoids lots of checks and should guarantee that the
accounting is correct, except in the case when the IFF_MEMALLOC flag is
cleared and the counter is set to zero manually. Can't that be avoided and
just let it decrease to zero naturally? So checking IFF_MEMALLOC for new
skbs and using sk_is_memalloc() for existing ones seems workable, if I'm not
missing anything (I think I am).

> Side note: please don't be shy, just reply-all in future so the discussion
> stays public. Part of what we do is try to share our development process
> so people see not only what we have done, but why we did it. (And sometimes
> so they can see what dumb mistakes we make, but I won't get into that...)

Yes, I know, but as I don't have much kernel programming experience I
didn't want to add unnecessary noise.

> I beg for forgiveness in advance having taken the liberty of CCing this
> reply to lkml.

No problem.

Greetings,

Indan


2006-08-09 12:41:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Wed, 2006-08-09 at 09:46 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra ([email protected]) wrote:
> > http://lwn.net/Articles/144273/
> > "Kernel Summit 2005: Convergence of network and storage paths"
> >
> > We believe that an approach very much like today's patch set is
> > necessary for NBD, iSCSI, AoE or the like ever to work reliably.
> > We further believe that a properly working version of at least one of
> > these subsystems is critical to the viability of Linux as a modern
> > storage platform.
>
> There is another approach for that - do not use the slab allocator for
> network dataflow at all. It automatically has all your pros and if
> implemented correctly can have a lot of additional useful and
> high-performance features like full zero-copy and total fragmentation
> avoidance.

On your site where you explain the Network Tree Allocator:

http://tservice.net.ru/~s0mbre/blog/devel/networking/nta/index.html

You only test the fragmentation scenario with the full scale of sizes.
Fragmentation will look different if you use a limited number of sizes
that share no factors (other than the block size); try 19, 37 and 79
blocks with 1:1:1 ratio.

Also, I have yet to see how you will do full zero-copy receives; full
zero-copy would mean getting the data from driver DMA to user-space
without a single copy. The to-user-space part almost requires that each
packet live on its own page.

As for the VM deadlock avoidance; I see no zero overhead allocation path -
you do not want to deadlock your allocator. I see no critical resource
isolation (our SOCK_MEMALLOC). Without these things your allocator might
improve the status quo but it will not aid in avoiding the deadlock we try
to tackle here.



2006-08-09 12:58:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 14:02 +0200, Indan Zupancic wrote:
> On Wed, August 9, 2006 2:25, Daniel Phillips said:
> > .... We want the atomic op that some people call
> > "monus": decrement unless zero.
>
> Currently atomic_inc_not_zero(), atomic_add_unless() and atomic_cmpxchg()
> exist, so making an atomic_dec_not_zero() should be easy.

atomic_add_unless() - that will do nicely, thanks.

> I'm not sure, to me it looks like dev_unreserve_skb() is always called
> without really knowing if it is justified or not, or else there wouldn't
> be a chance that the counter became negative. So avoiding the negative
> reserve usage seems like papering over bad accounting.

It was indeed called too often it seems: once when deciding to drop the
skb and again when actually freeing it.

> The assumption made seems to be that if there's reserve used, then it must
> be us using it, and it's unreserved. So it appears that either that
> assumption is wrong, and we can unreserve for others while we never
> reserved for ourselves, or it is correct, in which case it probably makes
> more sense to check for the IFF_MEMALLOC flag.

Changed it to only dec_not_zero on free for IFF_MEMALLOC devices.
I'm thinking of making kfree_skbmem -> skb_release_data return whether
they released the actual data and also depend on that.

> All in all it seems like a per skb flag which tells us if this skb was the
> one reserving anything is missing.

struct sk_buff::memalloc

However the idea is that freeing non-memalloc skbs also returns memory
(albeit to the slab and not the free page list).

> Or rx_reserve_used must be updated for
> all in flight skbs whenever the IFF_MEMALLOC flag changes, so that we can
> be sure that the accounting works correctly.

Yes

> Oh wait, isn't that what the
> memalloc flag is for? So shouldn't it be sufficient to check only with
> sk_is_memalloc()?

See previous comment.

> That avoids lots of checks and should guarantee that the
> accounting is correct, except in the case when the IFF_MEMALLOC flag is
> cleared and the counter is set to zero manually. Can't that be avoided and
> just let it decrease to zero naturally?

That would put the atomic op on the free path unconditionally, I think
davem gets nightmares from that.

Thanks

2006-08-09 13:08:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Wed, Aug 09, 2006 at 02:37:20PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Wed, 2006-08-09 at 09:46 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > http://lwn.net/Articles/144273/
> > > "Kernel Summit 2005: Convergence of network and storage paths"
> > >
> > > We believe that an approach very much like today's patch set is
> > > necessary for NBD, iSCSI, AoE or the like ever to work reliably.
> > > We further believe that a properly working version of at least one of
> > > these subsystems is critical to the viability of Linux as a modern
> > > storage platform.
> >
> > There is another approach for that - do not use the slab allocator for
> > network dataflow at all. It automatically has all your pros and if
> > implemented correctly can have a lot of additional useful and
> > high-performance features like full zero-copy and total fragmentation
> > avoidance.
>
> On your site where you explain the Network Tree Allocator:
>
> http://tservice.net.ru/~s0mbre/blog/devel/networking/nta/index.html
>
> You only test the fragmentation scenario with the full scale of sizes.
> Fragmentation will look different if you use a limited number of sizes
> that share no factors (other than the block size); try 19, 37 and 79
> blocks with 1:1:1 ratio.

19, 37 and 79 will be rounded by SLAB to 32, 64 and 128 bytes, with NTA it
will be 32, 64 and 96 bytes. NTA wins in each allocation which is not
power-of-two (I use 32-byte alignment, as the smallest one which SLAB
uses). And as you saw in the blog, the network tree allocator is faster
than the SLAB one, although it can have different side effects which are not
yet 100% discovered.

> Also, I have yet to see how you will do full zero-copy receives; full
> zero-copy would mean getting the data from driver DMA to user-space
> without
> a single copy. The to user-space part almost requires that each packet
> live
> on its own page.

Each page can easily have several packets inside.

> As for the VM deadlock avoidance; I see no zero overhead allocation path
> - you do not want to deadlock your allocator. I see no critical resource
> isolation (our SOCK_MEMALLOC). Without these things your allocator might
> improve the status quo but it will not aid in avoiding the deadlock we
> try to tackle here.

Because such reservation is not needed at all.
SLAB OOM can be handled by reserving a pool using SOCK_MEMALLOC and
similar hacks, and a different allocator, which obviously works with its own
pool of pages, cannot suffer from SLAB problems.

You say "critical resource isolation", but it is not the case - consider
NFS over UDP - the remote side will not stop sending just because the
receiving socket code drops data due to OOM, or IPsec or compression, which
can require reallocation. There is no "critical resource isolation", since
the reserved pool _must_ be used by everyone in the kernel network stack.

And as you saw, fragmentation issues are handled very well in NTA; just
consider a usual data packet with 1500 MTU - 500 bytes are wasted.
If you use jumbo frames... it is possible to end up with a 32k allocation
for a 9k jumbo frame with some hardware.

--
Evgeniy Polyakov

2006-08-09 13:19:23

by Thomas Graf

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

* Daniel Phillips <[email protected]> 2006-08-08 22:47
> David Miller wrote:
> >From: Daniel Phillips <[email protected]>
> >>Can you please characterize the conditions under which skb->dev changes
> >>after the alloc? Are there writings on this subtlety?
> >
> >The packet scheduler and classifier can redirect packets to different
> >devices, and so can the netfilter layer.
> >
> >The setting of skb->dev is wholly transient and you cannot rely upon
> >it to be the same as when you set it on allocation.
> >
> >Even simple things like the bonding device change skb->dev on every
> >receive.
>
> Thank you, this is easily fixed.

It's not that simple. Even just fixing the most obvious case -
packet forwarding, where skb->dev changes its meaning from the
device the packet is coming from to the device the packet will be leaving
on - is difficult.

You can't unreserve at that point so you need to keep the original
skb->dev. Since the packet is most likely queued before freeing
you will lose the refcnt on the original skb->dev. Keeping a
refcnt just for this memalloc stuff is out of the question. Even keeping
the ifindex on a best-effort basis is unlikely to be an option; sk_buff is
way overweight already.

2006-08-09 13:37:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Wed, 2006-08-09 at 17:07 +0400, Evgeniy Polyakov wrote:
> On Wed, Aug 09, 2006 at 02:37:20PM +0200, Peter Zijlstra ([email protected]) wrote:
> > On Wed, 2006-08-09 at 09:46 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > > http://lwn.net/Articles/144273/
> > > > "Kernel Summit 2005: Convergence of network and storage paths"
> > > >
> > > > We believe that an approach very much like today's patch set is
> > > > necessary for NBD, iSCSI, AoE or the like ever to work reliably.
> > > > We further believe that a properly working version of at least one of
> > > > these subsystems is critical to the viability of Linux as a modern
> > > > storage platform.
> > >
> > > There is another approach for that - do not use the slab allocator for
> > > network dataflow at all. It automatically has all your pros and if
> > > implemented correctly can have a lot of additional useful and
> > > high-performance features like full zero-copy and total fragmentation
> > > avoidance.
> >
> > On your site where you explain the Network Tree Allocator:
> >
> > http://tservice.net.ru/~s0mbre/blog/devel/networking/nta/index.html
> >
> > You only test the fragmentation scenario with the full scale of sizes.
> > Fragmentation will look different if you use a limited number of sizes
> > that share no factors (other than the block size); try 19, 37 and 79
> > blocks with 1:1:1 ratio.
^^^^^^

> 19, 37 and 79 will be rounded by SLAB to 32, 64 and 128 bytes, with NTA it
> will be 32, 64 and 96 bytes. NTA wins in each allocation which is not
> power-of-two (I use 32-byte alignment, as the smallest one which SLAB
> uses). And as you saw in the blog, the network tree allocator is faster
> than the SLAB one, although it can have different side effects which are not
> yet 100% discovered.

So that would end up being 19*32 = 608 bytes, etc.
As for speed, sure.
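
Spelled out: 19, 37 and 79 blocks of 32 bytes are 608, 1184 and 2528
bytes. A throwaway userspace illustration of the rounding difference under
discussion, with SLAB modelled as next-power-of-two rounding and NTA as
32-byte granularity (both are simplifications):

    #include <stdio.h>

    /* simplified SLAB-style rounding: next power of two, minimum 32 */
    static unsigned slab_round(unsigned n)
    {
            unsigned r = 32;
            while (r < n)
                    r <<= 1;
            return r;
    }

    int main(void)
    {
            unsigned blocks[] = { 19, 37, 79 };
            int i;

            for (i = 0; i < 3; i++) {
                    unsigned bytes = blocks[i] * 32;
                    printf("%4u bytes: slab %4u, 32-byte chunks %4u\n",
                           bytes, slab_round(bytes), (bytes + 31) & ~31u);
            }
            return 0;
    }

This prints 608 -> 1024 vs 608, 1184 -> 2048 vs 1184, and 2528 -> 4096 vs
2528, which is the waste pattern the chosen sizes are meant to expose.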

> > Also, I have yet to see how you will do full zero-copy receives; full
> > zero-copy would mean getting the data from driver DMA to user-space
> > without
> > a single copy. The to user-space part almost requires that each packet
> > live
> > on its own page.
>
> Each page can easily have several packets inside.

For sure, the problem is: do you know for which user-space process a
packet is going to be before you receive it?

> > As for the VM deadlock avoidance; I see no zero overhead allocation path
> > - you do not want to deadlock your allocator. I see no critical resource
> > isolation (our SOCK_MEMALLOC). Without these things your allocator might
> > improve the status quo but it will not aid in avoiding the deadlock we
> > try to tackle here.
>
> Because such reservation is not needed at all.
> SLAB OOM can be handled by reserving pool using SOCK_MEMALLOC and
> similar hacks, and different allocator, which obviously work with own
> pool of pages, can not suffer from SLAB problems.
>
> You say "critical resource isolation", but it is not the case - consider
> NFS over UDP - the remote side will not stop sending just because the
> receiving socket code drops data due to OOM, or IPsec or compression, which
> can require reallocation. There is no "critical resource isolation", since
> the reserved pool _must_ be used by everyone in the kernel network stack.

The idea is to drop all !NFS packets (or, even more specifically, only keep
those NFS packets that belong to the critical mount), and everybody doing
critical IO over layered networks like IPSec or other tunnel constructs
asks for trouble - just DON'T do that.

Dropping these non-essential packets makes sure the reserve memory doesn't
get stuck in some random blocked user-space process, hence you can make
progress.
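
In terms of the patch, that dropping is a check of roughly this shape at
the top of a protocol's receive path (a sketch using the patch's own
helpers; the exact hook points per protocol are in the unquoted hunks):

    /* device is eating into its reserve: only packets destined for a
     * SOCK_MEMALLOC socket may pass, everything else is dropped */
    if (unlikely(dev_reserve_used(skb->dev)) && !sk_is_memalloc(sk))
            goto drop;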

> And as you saw, fragmentation issues are handled very well in NTA; just
> consider a usual data packet with 1500 MTU - 500 bytes are wasted.
> If you use jumbo frames... it is possible to end up with a 32k allocation
> for a 9k jumbo frame with some hardware.

Sure, SLAB does suck at some things, and I don't argue that NTA will not
improve things. It's just that 'total fragmentation avoidance' is too
strong a claim, and this deadlock avoidance needs more.

2006-08-09 13:49:16

by Indan Zupancic

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, August 9, 2006 14:54, Peter Zijlstra said:
> On Wed, 2006-08-09 at 14:02 +0200, Indan Zupancic wrote:
>> That avoids lots of checks and should guarantee that the
>> accounting is correct, except in the case when the IFF_MEMALLOC flag is
>> cleared and the counter is set to zero manually. Can't that be avoided and
>> just let it decrease to zero naturally?
>
> That would put the atomic op on the free path unconditionally, I think
> davem gets nightmares from that.

I confused SOCK_MEMALLOC with sk_buff::memalloc, sorry. What I meant was
to decrement the reserved usage on the free path whenever memalloc is true,
with no other checks. That way all skbs that increased the reserve also
decrease it, and the counter should never go below zero.

Also as far as I can see it should be possible to replace all atomic
"if (unlikely(dev_reserve_used(skb->dev)))" checks witha check if
memalloc is set. That should make davem happy, as there aren't any
atomic instructions left in hot paths.

If IFF_MEMALLOC is set new skbs set memalloc and increase the reserve.
When the skb is being destroyed it doesn't matter if IFF_MEMALLOC is set
or not, only if that skb used reserves and thus only the memalloc flag
needs to be checked. This means that changing IFF_MEMALLOC doesn't
affect in-flight skbs but only newly created ones, and there's no need to
update in-flight skbs whenever the flag is changed; all should go well.
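
A sketch of the free-path half of that scheme, using the patch's names
(note it still leans on skb->dev being the allocating device, the problem
Thomas Graf raised earlier in the thread):

    /* in __kfree_skb(): the skb's own flag, not dev->flags, decides
     * whether this free returns reserve memory */
    if (skb->memalloc)
            atomic_dec(&skb->dev->rx_reserve_used);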

+int sk_set_memalloc(struct sock *sk)
+{
+ struct inet_sock *inet = inet_sk(sk);
+ struct net_device *dev = ip_dev_find(inet->rcv_saddr);
+ int err = 0;
+
+ if (!dev)
+ return -ENODEV;
+
+ if (!(dev->features & NETIF_F_MEMALLOC)) {
+ err = -EPERM;
+ goto out;
+ }
+
+ if (atomic_read(&dev->memalloc_socks) == 0) {
+ spin_lock(&dev->memalloc_lock);
+ if (atomic_read(&dev->memalloc_socks) == 0) {
+ dev->memalloc_reserve =
+ dev->rx_reserve * skb_pages(dev->mtu);
+ err = adjust_memalloc_reserve(dev->memalloc_reserve);
+ if (err) {
+ spin_unlock(&dev->memalloc_lock);
+ printk(KERN_WARNING
+ "%s: Unable to allocate RX reserve, error: %d\n",
+ dev->name, err);
+ goto out;
+ }
+ sock_set_flag(sk, SOCK_MEMALLOC);
+ dev->flags |= IFF_MEMALLOC;
+ }
+ atomic_inc(&dev->memalloc_socks);
+ spin_unlock(&dev->memalloc_lock);
+ } else
+ atomic_inc(&dev->memalloc_socks);
+
+out:
+ dev_put(dev);
+ return err;
+}

It seems that here SOCK_MEMALLOC is only set on the first socket.
Shouldn't it be set on all sockets instead?
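
A sketch of one possible fix, hoisting the flag out of the first-socket
branch (locking elided, and atomic_inc_return() used for brevity):

    /* every caller's socket gets the flag ... */
    sock_set_flag(sk, SOCK_MEMALLOC);

    /* ... while only the first socket per device grows the reserve */
    if (atomic_inc_return(&dev->memalloc_socks) == 1) {
            dev->flags |= IFF_MEMALLOC;
            err = adjust_memalloc_reserve(dev->memalloc_reserve);
    }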

Greetings,

Indan


2006-08-09 14:04:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 15:48 +0200, Indan Zupancic wrote:
> On Wed, August 9, 2006 14:54, Peter Zijlstra said:
> > On Wed, 2006-08-09 at 14:02 +0200, Indan Zupancic wrote:
> >> That avoids lots of checks and should guarantee that the
> >> accounting is correct, except in the case when the IFF_MEMALLOC flag is
> >> cleared and the counter is set to zero manually. Can't that be avoided and
> >> just let it decrease to zero naturally?
> >
> > That would put the atomic op on the free path unconditionally, I think
> > davem gets nightmares from that.
>
> I confused SOCK_MEMALLOC with sk_buff::memalloc, sorry. What I meant was
> to unconditionally decrement the reserved usage only when memalloc is true
> on the free path. That way all skbs that increased the reserve also decrease
> it, and the counter should never go below zero.

OK, so far so good, except we lose the notion of getting memory back
from regular skbs.

> Also as far as I can see it should be possible to replace all atomic
> "if (unlikely(dev_reserve_used(skb->dev)))" checks witha check if
> memalloc is set. That should make davem happy, as there aren't any
> atomic instructions left in hot paths.

dev_reserve_used() uses atomic_read() which isn't actually a LOCK'ed
instruction, so that should not matter.

> If IFF_MEMALLOC is set new skbs set memalloc and increase the reserve.

Not quite: if IFF_MEMALLOC is set, new skbs _could_ get memalloc set. We
only fall back to alloc_pages() if the regular path fails to alloc. If the
skb is backed by a page (as opposed to kmem_cache fluff) sk_buff::memalloc
is set.

> When the skb is being destroyed it doesn't matter if IFF_MEMALLOC is set
> or not, only if that skb used reserves and thus only the memalloc flag
> needs to be checked. This means that changing the IFF_MEMALLOC doesn't
> affect in-flight skbs but only newly created ones, and there's no need to
> update in-flight skbs whenever the flag is changed as all should go well.

Your reasoning is sound, except for these two point above...

<snip code>

> It seems that here SOCK_MEMALLOC is only set on the first socket.
> Shouldn't it be set on all sockets instead?

Ouch, where was my brain... Thanks!

Also, I've been thinking (more pain): should I not up the reserve for
each SOCK_MEMALLOC socket?

2006-08-09 14:11:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 15:19 +0200, Thomas Graf wrote:
> * Daniel Phillips <[email protected]> 2006-08-08 22:47
> > David Miller wrote:
> > >From: Daniel Phillips <[email protected]>
> > >>Can you please characterize the conditions under which skb->dev changes
> > >>after the alloc? Are there writings on this subtlety?
> > >
> > >The packet scheduler and classifier can redirect packets to different
> > >devices, and so can the netfilter layer.
> > >
> > >The setting of skb->dev is wholly transient and you cannot rely upon
> > >it to be the same as when you set it on allocation.
> > >
> > >Even simple things like the bonding device change skb->dev on every
> > >receive.
> >
> > Thank you, this is easily fixed.
>
> It's not that simple. Even just fixing the most obvious case -
> packet forwarding, where skb->dev changes its meaning from the
> device the packet is coming from to the device the packet will be leaving
> on - is difficult.
>
> You can't unreserve at that point so you need to keep the original
> skb->dev. Since the packet is most likely queued before freeing
> you will lose the refcnt on the original skb->dev. Keeping a
> refcnt just for this memalloc stuff is out of the question. Even keeping
> the ifindex on a best-effort basis is unlikely to be an option; sk_buff is
> way overweight already.

I think Daniel was thinking of adding a struct net_device *
sk_buff::alloc_dev; I know I was after reading the first few mails.
However, if adding a field there is a strict no-no....

/me takes a look at struct sk_buff

Hmm, what does sk_buff::input_dev do? That seems to store the initial
device?


2006-08-09 16:10:21

by Peter Zijlstra

[permalink] [raw]
Subject: -v2 [RFC][PATCH 2/9] deadlock prevention core


The core of the VM deadlock avoidance framework.

From the 'user' side of things it provides a function to mark a 'struct sock'
as SOCK_MEMALLOC, meaning this socket may dip into the memalloc reserves on
the receive side.

From the net_device side of things, the extra 'struct net_device *' argument
to {,__}netdev_alloc_skb() is used to attribute/account the memalloc usage.
When netdev_alloc_skb() finds it cannot allocate a struct sk_buff the regular
way it will grab some memory from the memalloc reserve.

Drivers that have been converted to the netdev_alloc_skb() family will
automatically receive this feature.

Network paths will drop !SOCK_MEMALLOC packets ASAP when reserve is being used.

Memalloc sk_buff allocations are not done from the SLAB but are done using
alloc_pages(). sk_buff::memalloc records this exception so that kfree_skbmem()
can do the right thing. NOTE this does not play very nice with skb_clone()


Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daniel Phillips <[email protected]>

---
include/linux/gfp.h | 3 -
include/linux/mmzone.h | 1
include/linux/netdevice.h | 41 +++++++++++-----
include/linux/skbuff.h | 3 -
include/net/sock.h | 8 +++
mm/page_alloc.c | 38 ++++++++++++++-
net/core/dev.c | 40 ++++++++++++++++
net/core/skbuff.c | 112 +++++++++++++++++++++++++++++++++++++++++++---
net/core/sock.c | 18 +++++++
net/ethernet/eth.c | 1
net/ipv4/af_inet.c | 4 +
net/ipv4/icmp.c | 3 +
net/ipv4/tcp_ipv4.c | 3 +
net/ipv4/udp.c | 8 ++-
14 files changed, 258 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_MEMALLOC ((__force gfp_t)0x40000u) /* Use emergency reserves */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_MEMALLOC)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h
+++ linux-2.6/include/linux/mmzone.h
@@ -420,6 +420,7 @@ int percpu_pagelist_fraction_sysctl_hand
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
struct file *, void __user *, size_t *, loff_t *);
+int adjust_memalloc_reserve(int bytes);

#include <linux/topology.h>
/* Returns the number of the current Node. */
Index: linux-2.6/include/linux/netdevice.h
===================================================================
--- linux-2.6.orig/include/linux/netdevice.h
+++ linux-2.6/include/linux/netdevice.h
@@ -298,18 +298,20 @@ struct net_device

/* Net device features */
unsigned long features;
-#define NETIF_F_SG 1 /* Scatter/gather IO. */
-#define NETIF_F_IP_CSUM 2 /* Can checksum only TCP/UDP over IPv4. */
-#define NETIF_F_NO_CSUM 4 /* Does not require checksum. F.e. loopack. */
-#define NETIF_F_HW_CSUM 8 /* Can checksum all the packets. */
-#define NETIF_F_HIGHDMA 32 /* Can DMA to high memory. */
-#define NETIF_F_FRAGLIST 64 /* Scatter/gather IO. */
-#define NETIF_F_HW_VLAN_TX 128 /* Transmit VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
-#define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
-#define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
-#define NETIF_F_GSO 2048 /* Enable software GSO. */
-#define NETIF_F_LLTX 4096 /* LockLess TX */
+#define NETIF_F_SG 0x0001 /* Scatter/gather IO. */
+#define NETIF_F_IP_CSUM 0x0002 /* Can checksum only TCP/UDP over IPv4. */
+#define NETIF_F_NO_CSUM 0x0004 /* Does not require checksum. F.e. loopack. */
+#define NETIF_F_HW_CSUM 0x0008 /* Can checksum all the packets. */
+
+#define NETIF_F_HIGHDMA 0x0010 /* Can DMA to high memory. */
+#define NETIF_F_FRAGLIST 0x0020 /* Scatter/gather IO. */
+#define NETIF_F_HW_VLAN_TX 0x0040 /* Transmit VLAN hw acceleration */
+#define NETIF_F_HW_VLAN_RX 0x0080 /* Receive VLAN hw acceleration */
+
+#define NETIF_F_HW_VLAN_FILTER 0x0100 /* Receive filtering on VLAN */
+#define NETIF_F_VLAN_CHALLENGED 0x0200 /* Device cannot handle VLAN packets */
+#define NETIF_F_GSO 0x0400 /* Enable software GSO. */
+#define NETIF_F_LLTX 0x0800 /* LockLess TX */

/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
@@ -409,6 +411,12 @@ struct net_device
struct Qdisc *qdisc_sleeping;
struct list_head qdisc_list;
unsigned long tx_queue_len; /* Max frames per queue allowed */
+ int rx_reserve;
+ atomic_t rx_reserve_used;
+
+ int memalloc_socks;
+ unsigned long memalloc_reserve;
+ spinlock_t memalloc_lock; /* could use any odd spinlock? */

/* Partially transmitted GSO packet. */
struct sk_buff *gso_skb;
@@ -576,6 +584,7 @@ extern struct net_device *__dev_get_by_n
extern int dev_alloc_name(struct net_device *dev, const char *name);
extern int dev_open(struct net_device *dev);
extern int dev_close(struct net_device *dev);
+extern int dev_adjust_memalloc(struct net_device *dev, int a);
extern int dev_queue_xmit(struct sk_buff *skb);
extern int register_netdevice(struct net_device *dev);
extern int unregister_netdevice(struct net_device *dev);
@@ -686,6 +695,14 @@ static inline void dev_kfree_skb_irq(str
*/
extern void dev_kfree_skb_any(struct sk_buff *skb);

+/*
+ * Support for critical network IO under low memory conditions
+ */
+static inline int dev_reserve_used(struct net_device *dev)
+{
+ return atomic_read(&dev->rx_reserve_used);
+}
+
#define HAVE_NETIF_RX 1
extern int netif_rx(struct sk_buff *skb);
extern int netif_rx_ni(struct sk_buff *skb);
Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -282,7 +282,8 @@ struct sk_buff {
nfctinfo:3;
__u8 pkt_type:3,
fclone:2,
- ipvs_property:1;
+ ipvs_property:1,
+ memalloc:1;
__be16 protocol;

void (*destructor)(struct sk_buff *skb);
Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -391,6 +391,7 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* protocol can use memalloc reserve */
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -413,6 +414,13 @@ static inline int sock_flag(struct sock
return test_bit(flag, &sk->sk_flags);
}

+static inline int sk_is_memalloc(struct sock *sk)
+{
+ return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+extern int sk_set_memalloc(struct sock *sk);
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -82,6 +82,7 @@ EXPORT_SYMBOL(zone_table);

static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
int min_free_kbytes = 1024;
+int var_free_kbytes;

unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
@@ -970,8 +971,8 @@ restart:

/* This allocation should allow future memory freeing. */

- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
+ if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
@@ -2196,7 +2197,8 @@ static void setup_per_zone_lowmem_reserv
*/
void setup_per_zone_pages_min(void)
{
- unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+ unsigned pages_min = (min_free_kbytes + var_free_kbytes)
+ >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -2248,6 +2250,36 @@ void setup_per_zone_pages_min(void)
calculate_totalreserve_pages();
}

+int adjust_memalloc_reserve(int pages)
+{
+ static DEFINE_SPINLOCK(var_free_lock);
+ unsigned long flags;
+ int kbytes;
+ int err = 0;
+
+ spin_lock_irqsave(&var_free_lock, flags);
+
+ kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
+ if (kbytes < 0) {
+ err = -EINVAL;
+ goto unlock;
+ }
+ var_free_kbytes = kbytes;
+ setup_per_zone_pages_min();
+ if (pages > 0) {
+ struct zone *zone;
+ for_each_zone(zone)
+ wakeup_kswapd(zone, 0);
+ }
+ printk(KERN_DEBUG "RX reserve: %d\n", var_free_kbytes);
+
+unlock:
+ spin_unlock_irqrestore(&var_free_lock, flags);
+ return err;
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
/*
* Initialise min_free_kbytes.
*
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -43,6 +43,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
#include <linux/interrupt.h>
#include <linux/in.h>
#include <linux/inet.h>
@@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
*
*/

+#define ceiling_log2(x) fls((x) - 1)
+
/**
* __alloc_skb - allocate a network buffer
* @size: size to allocate
@@ -147,6 +150,49 @@ struct sk_buff *__alloc_skb(unsigned int
struct sk_buff *skb;
u8 *data;

+ size = SKB_DATA_ALIGN(size);
+
+ if (gfp_mask & __GFP_MEMALLOC) {
+ /*
+ * We have to do higher order allocations for icky jumbo
+ * frame drivers :-(
+ * They really should be migrated to scatter/gather DMA
+ * and use skb fragments.
+ */
+ unsigned int data_offset =
+ sizeof(struct sk_buff) + sizeof(unsigned int);
+ unsigned long length = size + data_offset +
+ sizeof(struct skb_shared_info);
+ unsigned int pages;
+ unsigned int order;
+ struct page *page;
+ void *kaddr;
+
+ /*
+ * force fclone alloc in order to work around a shortcoming in skb_clone().
+ */
+ fclone = 1;
+ if (fclone) {
+ data_offset += sizeof(struct sk_buff) + sizeof(atomic_t);
+ length += sizeof(struct sk_buff) + sizeof(atomic_t);
+ }
+ pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ order = ceiling_log2(pages);
+
+ skb = NULL;
+ if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
+ goto out;
+
+ kaddr = pfn_to_kaddr(page_to_pfn(page));
+ skb = (struct sk_buff *)kaddr;
+
+ *((unsigned int *)(kaddr + data_offset -
+ sizeof(unsigned int))) = order;
+ data = (u8 *)(kaddr + data_offset);
+
+ goto allocated;
+ }
+
cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

/* Get the HEAD */
@@ -155,12 +201,13 @@ struct sk_buff *__alloc_skb(unsigned int
goto out;

/* Get the DATA. Size must match skb_add_mtu(). */
- size = SKB_DATA_ALIGN(size);
data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;

+allocated:
memset(skb, 0, offsetof(struct sk_buff, truesize));
+ skb->memalloc = !!(gfp_mask & __GFP_MEMALLOC);
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -185,6 +232,7 @@ struct sk_buff *__alloc_skb(unsigned int
atomic_set(fclone_ref, 1);

child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->memalloc = skb->memalloc;
}
out:
return skb;
@@ -250,7 +298,7 @@ nodata:
}

/**
- * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
+ * ___netdev_alloc_skb - allocate an skbuff for rx on a specific device
* @dev: network device to receive on
* @length: length to allocate
* @gfp_mask: get_free_pages mask, passed to alloc_skb
@@ -262,7 +310,7 @@ nodata:
*
* %NULL is returned if there is no free memory.
*/
-struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+static struct sk_buff *___netdev_alloc_skb(struct net_device *dev,
unsigned int length, gfp_t gfp_mask)
{
struct sk_buff *skb;
@@ -273,6 +321,31 @@ struct sk_buff *__netdev_alloc_skb(struc
return skb;
}

+struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+ unsigned length, gfp_t gfp_mask)
+{
+ struct sk_buff *skb;
+
+ WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
+ gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
+
+ if ((skb = ___netdev_alloc_skb(dev, length,
+ gfp_mask | __GFP_NOMEMALLOC)))
+ goto done;
+
+ if (dev_reserve_used(dev) >= dev->rx_reserve * dev->memalloc_socks)
+ goto out;
+ if (!(skb = ___netdev_alloc_skb(dev, length,
+ gfp_mask | __GFP_MEMALLOC)))
+ goto out;
+ atomic_inc(&dev->rx_reserve_used);
+
+done:
+ skb->input_dev = skb->dev = dev;
+out:
+ return skb;
+}
+
static void skb_drop_list(struct sk_buff **listp)
{
struct sk_buff *list = *listp;
@@ -313,10 +386,23 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);

- kfree(skb->head);
+ if (!skb->memalloc)
+ kfree(skb->head);
+ skb->head = NULL;
}
}

+static void free_skb_pages(struct kmem_cache *cache, void *objp)
+{
+ struct sk_buff *skb = (struct sk_buff *)objp;
+ struct net_device *dev = skb->input_dev;
+ unsigned int order =
+ *(unsigned int *)(skb->head - sizeof(unsigned int));
+ if (!skb->head)
+ atomic_dec(&dev->rx_reserve_used);
+ free_pages((unsigned long)skb, order);
+}
+
/*
* Free an skbuff by memory without cleaning the state.
*/
@@ -324,17 +410,21 @@ void kfree_skbmem(struct sk_buff *skb)
{
struct sk_buff *other;
atomic_t *fclone_ref;
+ void (*free_skb)(struct kmem_cache *, void *);

skb_release_data(skb);
+
+ free_skb = skb->memalloc ? free_skb_pages : kmem_cache_free;
+
switch (skb->fclone) {
case SKB_FCLONE_UNAVAILABLE:
- kmem_cache_free(skbuff_head_cache, skb);
+ free_skb(skbuff_head_cache, skb);
break;

case SKB_FCLONE_ORIG:
fclone_ref = (atomic_t *) (skb + 2);
if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, skb);
+ free_skb(skbuff_fclone_cache, skb);
break;

case SKB_FCLONE_CLONE:
@@ -347,7 +437,7 @@ void kfree_skbmem(struct sk_buff *skb)
skb->fclone = SKB_FCLONE_UNAVAILABLE;

if (atomic_dec_and_test(fclone_ref))
- kmem_cache_free(skbuff_fclone_cache, other);
+ free_skb(skbuff_fclone_cache, other);
break;
};
}
@@ -434,6 +524,12 @@ struct sk_buff *skb_clone(struct sk_buff
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ /*
+ * should we special-case skb->memalloc cloning?
+ * for now fudge it by forcing fast-clone alloc.
+ */
+ BUG_ON(skb->memalloc);
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
@@ -686,6 +782,8 @@ int pskb_expand_head(struct sk_buff *skb
if (skb_shared(skb))
BUG();

+ BUG_ON(skb->memalloc);
+
size = SKB_DATA_ALIGN(size);

data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
Index: linux-2.6/net/ethernet/eth.c
===================================================================
--- linux-2.6.orig/net/ethernet/eth.c
+++ linux-2.6/net/ethernet/eth.c
@@ -275,6 +275,7 @@ void ether_setup(struct net_device *dev)
dev->mtu = ETH_DATA_LEN;
dev->addr_len = ETH_ALEN;
dev->tx_queue_len = 1000; /* Ethernet wants good queues */
+ dev->rx_reserve = 384;
dev->flags = IFF_BROADCAST|IFF_MULTICAST;

memset(dev->broadcast,0xFF, ETH_ALEN);
Index: linux-2.6/net/ipv4/icmp.c
===================================================================
--- linux-2.6.orig/net/ipv4/icmp.c
+++ linux-2.6/net/ipv4/icmp.c
@@ -938,6 +938,9 @@ int icmp_rcv(struct sk_buff *skb)
goto error;
}

+ if (unlikely(dev_reserve_used(skb->input_dev)))
+ goto drop;
+
if (!pskb_pull(skb, sizeof(struct icmphdr)))
goto error;

Index: linux-2.6/net/ipv4/tcp_ipv4.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp_ipv4.c
+++ linux-2.6/net/ipv4/tcp_ipv4.c
@@ -1093,6 +1093,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;

+ if (unlikely(dev_reserve_used(skb->input_dev) && !sk_is_memalloc(sk)))
+ goto discard_and_relse;
+
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c
+++ linux-2.6/net/ipv4/udp.c
@@ -1136,7 +1136,12 @@ int udp_rcv(struct sk_buff *skb)
sk = udp_v4_lookup(saddr, uh->source, daddr, uh->dest, skb->dev->ifindex);

if (sk != NULL) {
- int ret = udp_queue_rcv_skb(sk, skb);
+ int ret;
+
+ if (unlikely(dev_reserve_used(skb->input_dev) && !sk_is_memalloc(sk)))
+ goto drop_noncritical;
+
+ ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);

/* a return value > 0 means to resubmit the input, but
@@ -1147,6 +1152,7 @@ int udp_rcv(struct sk_buff *skb)
return 0;
}

+drop_noncritical:
if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
goto drop;
nf_reset(skb);
Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -938,6 +938,45 @@ int dev_close(struct net_device *dev)
return 0;
}

+#define ceiling_log2(x) fls((x) - 1)
+
+static inline unsigned int skb_pages(unsigned int mtu)
+{
+ unsigned int pages = (mtu + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ unsigned int order = ceiling_log2(pages);
+ pages = 1 << order;
+ if (pages > 1) ++pages;
+
+ return pages;
+}
+
+int dev_adjust_memalloc(struct net_device *dev, int a)
+{
+ unsigned long flags;
+ unsigned long reserve;
+ int err;
+
+ spin_lock_irqsave(&dev->memalloc_lock, flags);
+
+ dev->memalloc_socks += a;
+ BUG_ON(dev->memalloc_socks < 0);
+
+ reserve = dev->memalloc_socks * dev->rx_reserve * skb_pages(dev->mtu);
+ err = adjust_memalloc_reserve(reserve - dev->memalloc_reserve);
+ if (err) {
+ printk(KERN_WARNING
+ "%s: Unable to change RX reserve to: %lu, error: %d\n",
+ dev->name, reserve, err);
+ goto unlock;
+ }
+ dev->memalloc_reserve = reserve;
+
+unlock:
+ spin_unlock_irqrestore(&dev->memalloc_lock, flags);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(dev_adjust_memalloc);

/*
* Device change register/unregister. These are not inline or static
@@ -2900,6 +2939,7 @@ int register_netdevice(struct net_device
#ifdef CONFIG_NET_CLS_ACT
spin_lock_init(&dev->ingress_lock);
#endif
+ spin_lock_init(&dev->memalloc_lock);

ret = alloc_divert_blk(dev);
if (ret)
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c
+++ linux-2.6/net/ipv4/af_inet.c
@@ -131,6 +131,10 @@ static DEFINE_SPINLOCK(inetsw_lock);
void inet_sock_destruct(struct sock *sk)
{
struct inet_sock *inet = inet_sk(sk);
+ struct net_device *dev = ip_dev_find(inet->rcv_saddr);
+
+ if (dev && sk_is_memalloc(sk))
+ dev_adjust_memalloc(dev, -1);

__skb_queue_purge(&sk->sk_receive_queue);
__skb_queue_purge(&sk->sk_error_queue);
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -111,6 +111,7 @@
#include <linux/poll.h>
#include <linux/tcp.h>
#include <linux/init.h>
+#include <linux/inetdevice.h>

#include <asm/uaccess.h>
#include <asm/system.h>
@@ -195,6 +196,23 @@ __u32 sysctl_rmem_default = SK_RMEM_MAX;
/* Maximal space eaten by iovec or ancilliary data plus some space */
int sysctl_optmem_max = sizeof(unsigned long)*(2*UIO_MAXIOV + 512);

+int sk_set_memalloc(struct sock *sk)
+{
+ struct inet_sock *inet = inet_sk(sk);
+ struct net_device *dev = ip_dev_find(inet->rcv_saddr);
+ int err = 0;
+
+ if (!dev)
+ return -ENODEV;
+
+ if (!(err = dev_adjust_memalloc(dev, 1)))
+ sock_set_flag(sk, SOCK_MEMALLOC);
+
+ dev_put(dev);
+ return err;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
{
struct timeval tv;


2006-08-09 16:17:57

by Thomas Graf

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

* Peter Zijlstra <[email protected]> 2006-08-09 16:07
> I think Daniel was thinking of adding struct net_device *
> sk_buff::alloc_dev, I know I was after reading the first few mails.
> However, if adding a field there is a strict no-no....
>
> /me takes a look at struct sk_buff
>
> Hmm, what does sk_buff::input_dev do? That seems to store the initial
> device?

No, skb->input_dev is used when redirecting packets around in the
stack and may change. Even if it would keep its value the reference
to the netdevice is not valid anymore when you free the skb as the
skb was queued and the refcnt acquired in __netif_rx_schedule()
has been released again thus making it possible for the netdevice
to disappear.

2006-08-09 16:24:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 18:18 +0200, Thomas Graf wrote:
> * Peter Zijlstra <[email protected]> 2006-08-09 16:07
> > I think Daniel was thinking of adding struct net_device *
> > sk_buff::alloc_dev, I know I was after reading the first few mails.
> > However, if adding a field there is a strict no-no....
> >
> > /me takes a look at struct sk_buff
> >
> > Hmm, what does sk_buff::input_dev do? That seems to store the initial
> > device?
>
> No, skb->input_dev is used when redirecting packets around in the
> stack and may change. Even if it would keep its value the reference
> to the netdevice is not valid anymore when you free the skb as the
> skb was queued and the refcnt acquired in __netif_rx_schedule()
> has been released again thus making it possible for the netdevice
> to disappear.

Bah, tricky stuff that.

disregard this part from -v2 then :-(


2006-08-09 18:34:18

by Indan Zupancic

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, August 9, 2006 16:00, Peter Zijlstra said:
> On Wed, 2006-08-09 at 15:48 +0200, Indan Zupancic wrote:
>> On Wed, August 9, 2006 14:54, Peter Zijlstra said:
>> > On Wed, 2006-08-09 at 14:02 +0200, Indan Zupancic wrote:
>> >> That avoids lots of checks and should guarantee that the
>> >> accounting is correct, except in the case when the IFF_MEMALLOC flag is
>> >> cleared and the counter is set to zero manually. Can't that be avoided and
>> >> just let it decrease to zero naturally?
>> >
>> > That would put the atomic op on the free path unconditionally, I think
>> > davem gets nightmares from that.
>>
>> I confused SOCK_MEMALLOC with sk_buff::memalloc, sorry. What I meant was
>> to unconditionally decrement the reserved usage only when memalloc is true
>> on the free path. That way all skbs that increased the reserve also decrease
>> it, and the counter should never go below zero.
>
> OK, so far so good, except we lose the notion of getting memory back
> from regular skbs.

I don't understand this, regular skbs don't have anything to do with
rx_reserve_used as far as I can see. I'm only talking about keeping
that field up to date and correct. rx_reserve_used is only increased
by a skb when memalloc is set to true on that skb, so only if that field
is set rx_reserve_used needs to be reduced when the skb is freed.

Why is it needed for the protocol specific code to call dev_unreserve_skb?

Only problem is if the device can change. rx_reserve_used should probably
be updated when that happens, as a skb can't use reserved memory on a device
it was moved away from. (right?)

>> Also as far as I can see it should be possible to replace all atomic
>> "if (unlikely(dev_reserve_used(skb->dev)))" checks witha check if
>> memalloc is set. That should make davem happy, as there aren't any
>> atomic instructions left in hot paths.
>
> dev_reserve_used() uses atomic_read() which isn't actually a LOCK'ed
> instruction, so that should not matter.

Perhaps, but the main reason to check memalloc instead of using
dev_reserve_used is because the latter doesn't tell which skb did the
reservation.

>> If IFF_MEMALLOC is set new skbs set memalloc and increase the reserve.
>
> Not quite, if IFF_MEMALLOC is set new skbs _could_ get memalloc set. We
> only fall back to alloc_pages() if the regular path fails to alloc. If the
> skb is backed by a page (as opposed to kmem_cache fluff) sk_buff::memalloc
> is set.

Yes, true. But it doesn't matter for the rx_reserve_used accounting, as long as
memalloc being set means that it did increase rx_reserve_used.

> Also, I've been thinking (more pain), should I not up the reserve for
> each SOCK_MEMALLOC socket.

Up rx_reserve_used or the total amount of reserved memory? Probably 'no' for
both though, as it's either device specific or skb dependent.

I'm slowly getting a clearer image of the big picture, I'll take another look
when you post the updated code.

Greetings,

Indan


2006-08-09 19:30:44

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Wed, Aug 09, 2006 at 03:32:33PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > > > http://lwn.net/Articles/144273/
> > > > > "Kernel Summit 2005: Convergence of network and storage paths"
> > > > >
> > > > > We believe that an approach very much like today's patch set is
> > > > > necessary for NBD, iSCSI, AoE or the like ever to work reliably.
> > > > > We further believe that a properly working version of at least one of
> > > > > these subsystems is critical to the viability of Linux as a modern
> > > > > storage platform.
> > > >
> > > > There is another approach for that - do not use slab allocator for
> > > > network dataflow at all. It automatically has all your pros and if
> > > > implemented correctly can have a lot of additional useful and
> > > > high-performance features like full zero-copy and total fragmentation
> > > > avoidance.
> > >
> > > On your site where you explain the Network Tree Allocator:
> > >
> > > http://tservice.net.ru/~s0mbre/blog/devel/networking/nta/index.html
> > >
> > > You only test the fragmentation scenario with the full scale of sizes.
> > > Fragmentation will look different if you use a limited number of sizes
> > > that share no factors (other than the block size); try 19, 37 and 79
> > > blocks with 1:1:1 ratio.
> ^^^^^^
>
> > 19, 37 and 79 will be rounded by SLAB to 32, 64 and 128 bytes, with NTA it
> > will be 32, 64 and 96 bytes. NTA wins in each allocation which is not
> > power-of-two (I use 32 bytes alignemnt, as the smallest one which SLAB
> > uses). And as you saw in the blog, network tree allocator is faster
> > than SLAB one, although it can have different side effects which are not
> > yet 100% discovered.
>
> So that would end up being 19*32 = 608 bytes, etc..
> As for speed, sure.

NTA aligns data to 32 bytes, SLAB to powers of two, so the overhead becomes
extremely large for unaligned allocations - which is almost every size.

> > > Also, I have yet to see how you will do full zero-copy receives; full
> > > zero-copy would mean getting the data from driver DMA to user-space
> > > without a single copy. The to user-space part almost requires that each
> > > packet live on its own page.
> >
> > Each page can easily have several packets inside.
>
> For sure, the problem is: do you know for which user-space process a
> packet is going to be before you receive it?

That is not a problem for a sniffer. If some process wants that data it
can be copied, or it can be checked whether the neighbouring packet is for
the same socket and the copy skipped in that case - there are a lot of
things which can be done once the data _is_ in place. NTA allows zero-copy
access to the whole network traffic; how the system works with it is
another task.

> > > As for the VM deadlock avoidance; I see no zero overhead allocation path
> > > - you do not want to deadlock your allocator. I see no critical resource
> > > isolation (our SOCK_MEMALLOC). Without these things your allocator might
> > > improve the status quo but it will not aid in avoiding the deadlock we
> > > try to tackle here.
> >
> > Because such reservation is not needed at all.
> > SLAB OOM can be handled by reserving pool using SOCK_MEMALLOC and
> > similar hacks, and different allocator, which obviously work with own
> > pool of pages, can not suffer from SLAB problems.
> >
> > You say "critical resource isolation", but it is not the case - consider
> > NFS over UDP - remote side will not stop sending just because receiving
> > socket code drops data due to OOM, or IPsec or compression, which can
> > requires reallocation. There is no "critical resource isolation", since
> > reserved pool _must_ be used by everyone in the kernel network stack.
>
> The idea is to drop all !NFS packets (or even more specific only keep
> those NFS packets that belong to the critical mount), and everybody doing
> critical IO over layered networks like IPSec or other tunnel constructs
> asks for trouble - Just DON'T do that.

Well, such an approach does not scale well - either it must be generic
enough to fully solve the problem, or at least solve it in most cases;
but when you turn anything off, that is not what is expected, I suppose.

> Dropping these non-essential packets makes sure the reserve memory
> doesn't get stuck in some random blocked user-space process, hence you
> can make progress.

It still requires receiving and analyzing the packet before deciding to
drop it. A different allocator is just the next step in the idea of a reserved pool.

> > And as you saw, fragmentation issues are handled very well in NTA; just
> > consider a usual data packet with a 1500-byte MTU - 500 bytes are wasted.
> > If you use jumbo frames... it is possible to end up with a 32k allocation
> > for a 9k jumbo frame with some hardware.
>
> Sure, SLAB does suck at some things, and I don't argue that NTA will not
> improve. It's just that 'total fragmentation avoidance' is too strong and
> this deadlock avoidance needs more.

Well, your approach seems to solve the problem.
Its extrapolation ends up in a different allocator, which solves the
problem too.
So... Different people - different opinions.

--
Evgeniy Polyakov

2006-08-09 20:01:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 20:34 +0200, Indan Zupancic wrote:
> On Wed, August 9, 2006 16:00, Peter Zijlstra said:
> > On Wed, 2006-08-09 at 15:48 +0200, Indan Zupancic wrote:
> >> On Wed, August 9, 2006 14:54, Peter Zijlstra said:
> >> > On Wed, 2006-08-09 at 14:02 +0200, Indan Zupancic wrote:
> >> >> That avoids lots of checks and should guarantee that the
> >> >> accounting is correct, except in the case when the IFF_MEMALLOC flag is
> >> >> cleared and the counter is set to zero manually. Can't that be avoided and
> >> >> just let it decrease to zero naturally?
> >> >
> >> > That would put the atomic op on the free path unconditionally, I think
> >> > davem gets nightmares from that.
> >>
> >> I confused SOCK_MEMALLOC with sk_buff::memalloc, sorry. What I meant was
> >> to unconditionally decrement the reserved usage only when memalloc is true
> >> on the free path. That way all skbs that increased the reserve also decrease
> >> it, and the counter should never go below zero.
> >
> > OK, so far so good, except we lose the notion of getting memory back
> > from regular skbs.
>
> I don't understand this, regular skbs don't have anything to do with
> rx_reserve_used as far as I can see. I'm only talking about keeping
> that field up to date and correct. rx_reserve_used is only increased
> by a skb when memalloc is set to true on that skb, so only if that field
> is set rx_reserve_used needs to be reduced when the skb is freed.

I know what you meant, and if you've looked at -v2 you'll see that I've
done this, basically because it's easier. However, the thought behind the
other semantics is that any skb freed will reduce memory pressure.

> Why is it needed for the protocol specific code to call dev_unreserve_skb?

It uses this to get an indication of memory pressure; if we have
memalloc'ed skbs memory pressure must be high, hence we must drop all
non-critical packets. But you are right in that this is a problematic
area; the mapping from skb to device is non-trivial.

Your suggestion of testing skb->memalloc might work just as well; indeed
if we have regressed into the fallback allocator we know we have
pressure.

> Only problem is if the device can change. rx_reserve_used should probably
> be updated when that happens, as a skb can't use reserved memory on a device
> it was moved away from. (right?)

Well yes, this is a problem, only today have I understood how volatile
the mapping actually is. I think you are right in that transferring the
accounting from the old to the new device is correct solution.

However this brings us the problem of limiting the fallback allocator;
currently this is done in __netdev_alloc_skb where rx_reserve_used is
compared against rx_reserve. If we transfer accounting away this will
not work anymore. I'll have to think about this case, perhaps we already
have a problem here.

> >> Also as far as I can see it should be possible to replace all atomic
> >> "if (unlikely(dev_reserve_used(skb->dev)))" checks witha check if
> >> memalloc is set. That should make davem happy, as there aren't any
> >> atomic instructions left in hot paths.
> >
> > dev_reserve_used() uses atomic_read() which isn't actually a LOCK'ed
> > instruction, so that should not matter.
>
> Perhaps, but the main reason to check memalloc instead of using
> dev_reserve_used is because the latter doesn't tell which skb did the
> reservation.

Very good point indeed.

> >> If IFF_MEMALLOC is set new skbs set memalloc and increase the reserve.
> >
> > Not quite, if IFF_MEMALLOC is set new skbs _could_ get memalloc set. We
> > only fall back to alloc_pages() if the regular path fails to alloc. If the
> > skb is backed by a page (as opposed to kmem_cache fluff) sk_buff::memalloc
> > is set.
>
> Yes, true. But it doesn't matter for the rx_reserve_used accounting, as long as
> memalloc being set means that it did increase rx_reserve_used.
>
> > Also, I've been thinking (more pain), should I not up the reserve for
> > each SOCK_MEMALLOC socket.
>
> Up rx_reserve_used or the total amount of reserved memory? Probably 'no' for
> both though, as it's either device specific or skb dependent.

I came up with yes, if for each socket you gain a request queue, the
number of in-flight pages is proportional to the number of sockets.

2006-08-09 20:19:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 21:45 +0200, Peter Zijlstra wrote:
> On Wed, 2006-08-09 at 20:34 +0200, Indan Zupancic wrote:

> > Why is it needed for the protocol specific code to call dev_unreserve_skb?
>
> It uses this to get an indication of memory pressure; if we have
> memalloc'ed skbs memory pressure must be high, hence we must drop all
> non-critical packets. But you are right in that this is a problematic
> area; the mapping from skb to device is non-trivial.
>
> Your suggestion of testing skb->memalloc might work just as well; indeed
> if we have regressed into the fallback allocator we know we have
> pressure.
>
> > Only problem is if the device can change. rx_reserve_used should probably
> > be updated when that happens, as a skb can't use reserved memory on a device
> > it was moved away from. (right?)
>
> Well yes, this is a problem, only today have I understood how volatile
> the mapping actually is. I think you are right in that transferring the
> accounting from the old to the new device is correct solution.
>
> However this brings us the problem of limiting the fallback allocator;
> currently this is done in __netdev_alloc_skb where rx_reserve_used is
> compared against rx_reserve. If we transfer accounting away this will
> not work anymore. I'll have to think about this case, perhaps we already
> have a problem here.

Humm, if we do not use dev_reserve_used in the protocols anymore, the
only place that still uses it is the fallback limit. If we could solve
that in another way we might be able to get rid of it altogether. That
would save the pain of managing the accounting transfer on skb::dev
assignments too.

Daniel, any ideas? I'm fighting to stay awake... ;-)

2006-08-09 23:54:50

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

From: Peter Zijlstra <[email protected]>
Date: Wed, 09 Aug 2006 15:32:33 +0200

> The idea is to drop all !NFS packets (or even more specific only
> keep those NFS packets that belong to the critical mount), and
> everybody doing critical IO over layered networks like IPSec or
> other tunnel constructs asks for trouble - Just DON'T do that.

People are doing I/O over IP exactly for its ubiquity and
flexibility. It seems a major limitation of the design if you cancel
out major components of this flexibility.

I really can't take this work seriously when I see things like this
being said.

2006-08-09 23:58:56

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Peter Zijlstra <[email protected]>
Date: Wed, 09 Aug 2006 16:07:20 +0200

> Hmm, what does sk_buff::input_dev do? That seems to store the initial
> device?

You can run grep on the tree just as easily as I can which is what I
did to answer this question. It only takes a few seconds of your
time to grep the source tree for things like "skb->input_dev", so
would you please do that before asking more questions like this?

It does store the initial device, but as Thomas tried so hard to
explain to you guys these device pointers in the skb are transient and
you cannot refer to them outside of packet receive processing.

The reason is that there is no refcounting performed on these devices
when they are attached to the skb, for performance reasons, and thus
the device can be downed, the module for it removed, etc. long before
the skb is freed up.

2006-08-10 00:01:35

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Peter Zijlstra <[email protected]>
Date: Wed, 09 Aug 2006 18:19:54 +0200

> disregard this part from -v2 then :-(

And please don't do arbitrary cleanups in your patches like
how you reformatted all of the NETIF_F_* macro values.

Do things like that as a separate change in your set of
patches so it's easier for people to review your work.

2006-08-10 01:23:27

by Indan Zupancic

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, August 9, 2006 21:45, Peter Zijlstra said:
> On Wed, 2006-08-09 at 20:34 +0200, Indan Zupancic wrote:
>> Why is it needed for the protocol specific code to call dev_unreserve_skb?
>
> It uses this to get an indication of memory pressure; if we have
> memalloc'ed skbs memory pressure must be high, hence we must drop all
> non-critical packets. But you are right in that this is a problematic
> area; the mapping from skb to device is non-trivial.
>
> Your suggestion of testing skb->memalloc might work just as well; indeed
> if we have regressed into the fallback allocator we know we have
> pressure.

You seem to have explained dev_reserve_used usage, not the dev_unreserve_skb calls.
But I've just found -v2 and see that they're gone now, great. -v2 looks much better.

>> Only problem is if the device can change. rx_reserve_used should probably
>> be updated when that happens, as a skb can't use reserved memory on a device
>> it was moved away from. (right?)
>
> Well yes, this is a problem, only today have I understood how volatile
> the mapping actually is. I think you are right in that transferring the
> accounting from the old to the new device is correct solution.
>
> However this brings us the problem of limiting the fallback allocator;
> currently this is done in __netdev_alloc_skb where rx_reserve_used is
> compared against rx_reserve. If we transfer accounting away this will
> not work anymore. I'll have to think about this case, perhaps we already
> have a problem here.

The point of the reservations is to avoid deadlocks, and they're always big
enough to hold all in-flight skbs, right? So what about solving the whole
device problem by using a global counter and limit instead of per device?

The question is whether traffic on one device can starve traffic on other
devices or not, and how big a problem that is. It probably can; tricky stuff.
Though getting rid of the per-device stuff would simplify a lot...

>> > Also, I've been thinking (more pain), should I not up the reserve for
>> > each SOCK_MEMALLOC socket.
>>
>> Up rx_reserve_used or the total amount of reserved memory? Probably 'no' for
>> both though, as it's either device specific or skb dependent.
>
> I came up with yes, if for each socket you gain a request queue, the
> number of in-flight pages is proportional to the number of sockets.

Yes, seems so.

Good night,

Indan


2006-08-10 04:25:03

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Wed, 09 Aug 2006 16:07:20 +0200
>
>
>> Hmm, what does sk_buff::input_dev do? That seems to store the initial
>> device?
>>
>
> You can run grep on the tree just as easily as I can which is what I
> did to answer this question. It only takes a few seconds of your
> time to grep the source tree for things like "skb->input_dev", so
> would you please do that before asking more questions like this?
>
C'mon, cscope is your friend for this.

2006-08-10 06:11:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Wed, 2006-08-09 at 16:54 -0700, David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Wed, 09 Aug 2006 15:32:33 +0200
>
> > The idea is to drop all !NFS packets (or even more specific only
> > keep those NFS packets that belong to the critical mount), and
> > everybody doing critical IO over layered networks like IPSec or
> > other tunnel constructs asks for trouble - Just DON'T do that.
>
> People are doing I/O over IP exactly for its ubiquity and
> flexibility. It seems a major limitation of the design if you cancel
> out major components of this flexibility.

We're not, that was a bit of my own frustration leaking out; I think
this whole push to IP based storage is a bit silly. I'm just not going
to help the admin whose server just hangs because his VPN key expired.

Running critical resources remotely like this is tricky, and every
hop/layer you put in between increases the risk of something going bad.
The only setup I think even remotely sane is a dedicated network in the
very same room - not unlike FC but cheaper (which I think is the whole
push behind this, eth is cheap).


2006-08-10 06:30:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 2006-08-09 at 16:58 -0700, David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Wed, 09 Aug 2006 16:07:20 +0200
>
> > Hmm, what does sk_buff::input_dev do? That seems to store the initial
> > device?
>
> You can run grep on the tree just as easily as I can which is what I
> did to answer this question. It only takes a few seconds of your
> time to grep the source tree for things like "skb->input_dev", so
> would you please do that before asking more questions like this?

That is exactly what I did, but I wanted a bit of confirmation. Sorry if
it offends you, but I'm a bit new to this network thing.

> It does store the initial device, but as Thomas tried so hard to
> explain to you guys these device pointers in the skb are transient and
> you cannot refer to them outside of packet receive processing.

Yes, I understood that after Thomas' last mail.

> The reason is that there is no refcounting performed on these devices
> when they are attached to the skb, for performance reasons, and thus
> the device can be downed, the module for it removed, etc. long before
> the skb is freed up.

I understood that, thanks.

2006-08-11 02:37:12

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Thomas Graf wrote:

> skb->dev is not guaranteed to still point to the "allocating" device
> once the skb is freed again so reserve/unreserve isn't symmetric.
> You'd need skb->alloc_dev or something.

There's another consequence of this property of the network
stack.

Every network interface must be able to fall back to these
MEMALLOC allocations, because the memory critical socket
could be on another network interface. Hence, we cannot
know which network interfaces should (not) be marked MEMALLOC.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2006-08-12 03:43:08

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Peter Zijlstra wrote:

>> You say "critical resource isolation", but it is not the case - consider
>> NFS over UDP - remote side will not stop sending just because receiving
>> socket code drops data due to OOM, or IPsec or compression, which can
>> requires reallocation. There is no "critical resource isolation", since
>> reserved pool _must_ be used by everyone in the kernel network stack.

> The idea is to drop all !NFS packets (or even more specific only keep
> those NFS packets that belong to the critical mount), and everybody
> doing critical IO over layered networks like IPSec or other tunnel
> constructs asks for trouble - Just DON'T do that.

The only problem with things like IPSec is renegotiation, which
can take up memory right at the time you don't have any extra
memory available.

Decrypting individual IPSec packets during normal operation and
then dropping the ones for non-critical sockets should work just
fine.

The problem is layered networks over TCP, where you have to
process the packets in-order and may have no choice but to hold
onto data for non-critical sockets, at least for a while.

> Dropping these non-essential packets makes sure the reserve memory
> doesn't get stuck in some random blocked user-space process, hence
> you can make progress.

In short:
- every incoming packet needs to be received at the packet level
- when memory is low, we only deliver data to memory critical sockets
- packets to other sockets get dropped, so the memory can be reused
for receiving other packets, including the packets needed for the
memory critical sockets to make progress

Forwarding packets while in low memory mode should not be a problem
at all, since forwarded packets get freed quickly.

The memory pool for receiving packets does not need much accounting
of any kind, since every packet will end up coming from that pool
when normal allocations start failing. Maybe Evgeniy's allocator
can do something smarter internally, and mark skbuffs as MEMALLOC
when the number of available skbuffs is getting low?

Part (most?) of the problem space is explained here:

http://linux-mm.org/NetworkStorageDeadlock

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2006-08-12 08:47:48

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Fri, Aug 11, 2006 at 11:42:50PM -0400, Rik van Riel ([email protected]) wrote:
> >Dropping these non-essential packets makes sure the reserve memory
> >doesn't get stuck in some random blocked user-space process, hence
> >you can make progress.
>
> In short:
> - every incoming packet needs to be received at the packet level
> - when memory is low, we only deliver data to memory critical sockets
> - packets to other sockets get dropped, so the memory can be reused
> for receiving other packets, including the packets needed for the
> memory critical sockets to make progress
>
> Forwarding packets while in low memory mode should not be a problem
> at all, since forwarded packets get freed quickly.
>
> The memory pool for receiving packets does not need much accounting
> of any kind, since every packet will end up coming from that pool
> when normal allocations start failing. Maybe Evgeniy's allocator
> can do something smarter internally, and mark skbuffs as MEMALLOC
> when the number of available skbuffs is getting low?

As you described above, memory for each packet must be allocated (either
from SLAB or from reserve), so the network needs a special allocator in OOM
conditions, and that allocator should be separated from SLAB's one which
got OOM, so my purpose is just to use that different allocator (with
additional features) for networking always. Since every piece of
networking is limited (socket queues, socket numbers, hardware queues,
hardware wire speeds and so on) there is always a maximum amount of
memory it can consume and can never exceed, so if the network allocator
gets that amount of memory at the beginning, it will never meet OOM,
so it will _always_ work and thus can allow slow progress for
OOM-capable things like block devices and swap issues.
There is no special reserve and no need to switch to/from it and
no possibility to have OOM by design.

--
Evgeniy Polyakov

2006-08-12 09:20:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, 2006-08-12 at 12:47 +0400, Evgeniy Polyakov wrote:
> On Fri, Aug 11, 2006 at 11:42:50PM -0400, Rik van Riel ([email protected]) wrote:
> > >Dropping these non-essential packets makes sure the reserve memory
> > >doesn't get stuck in some random blocked user-space process, hence
> > >you can make progress.
> >
> > In short:
> > - every incoming packet needs to be received at the packet level
> > - when memory is low, we only deliver data to memory critical sockets
> > - packets to other sockets get dropped, so the memory can be reused
> > for receiving other packets, including the packets needed for the
> > memory critical sockets to make progress
> >
> > Forwarding packets while in low memory mode should not be a problem
> > at all, since forwarded packets get freed quickly.
> >
> > The memory pool for receiving packets does not need much accounting
> > of any kind, since every packet will end up coming from that pool
> > when normal allocations start failing. Maybe Evgeniy's allocator
> > can do something smarter internally, and mark skbuffs as MEMALLOC
> > when the number of available skbuffs is getting low?
>
> As you described above, memory for each packet must be allocated (either
> from SLAB or from reserve), so the network needs a special allocator in OOM
> conditions, and that allocator should be separated from SLAB's one which
> got OOM, so my purpose is just to use that different allocator (with
> additional features) for networking always. Since every piece of
> networking is limited (socket queues, socket numbers, hardware queues,
> hardware wire speeds and so on) there is always a maximum amount of
> memory it can consume and can never exceed, so if the network allocator
> gets that amount of memory at the beginning, it will never meet OOM,
> so it will _always_ work and thus can allow slow progress for
> OOM-capable things like block devices and swap issues.
> There is no special reserve and no need to switch to/from it and
> no possibility to have OOM by design.

I'm not sure if the network stack is bounded as you say; for instance
imagine taking a lot of packets for blocked user-space processes,
these will just accumulate in the network stack and go nowhere. In that
case memory usage is very much unbounded.

Even if blocked sockets would only accept a limited amount of packets,
it would then become a function of the number of open sockets, which is
again unbounded.

In any scheme you need to bound the amount of memory, and in low memory
situations it is very useful to return memory as soon as possible.

Hence this approach: it uses the global memory reserve, and a page-based
allocator that does indeed return memory instantly (yes, the one I have
shown so far is ghastly; I've since written a nice one).

2006-08-12 09:37:44

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, Aug 12, 2006 at 11:19:49AM +0200, Peter Zijlstra ([email protected]) wrote:
> > As you described above, memory for each packet must be allocated (either
> > from SLAB or from reserve), so the network needs a special allocator in OOM
> > conditions, and that allocator should be separated from SLAB's one which
> > got OOM, so my purpose is just to use that different allocator (with
> > additional features) for networking always. Since every piece of
> > networking is limited (socket queues, socket numbers, hardware queues,
> > hardware wire speeds and so on) there is always a maximum amount of
> > memory it can consume and can never exceed, so if the network allocator
> > gets that amount of memory at the beginning, it will never meet OOM,
> > so it will _always_ work and thus can allow slow progress for
> > OOM-capable things like block devices and swap issues.
> > There is no special reserve and no need to switch to/from it and
> > no possibility to have OOM by design.
>
> I'm not sure if the network stack is bounded as you say; for instance
> imagine taking a lot of packets for blocked user-space processes,
> these will just accumulate in the network stack and go nowhere. In that
> case memory usage is very much unbounded.

No it is not. There are socket queues and they are limited. Things like
TCP behave even better.

> Even if blocked sockets would only accept a limited amount of packets,
> it would then become a function of the number of open sockets, which is
> again unbounded.

Does it? I thought it is possible to only have 64k of working sockets per
device in TCP.

> In any scheme you need to bound the amount of memory, and in low memory
> situations it is very useful to return memory as soon as possible.

Feel free to drop packets as soon as it is found that they belong to
something that you do not want to get data for right now.
It is an additional step, not a requirement.
All robust systems are built on top of privilege separation and layered
access, so one compromised component does not affect the others; this has
been proven in a lot of CS theory. In the case of a reserve which is based
on the main allocator, the system still uses SLAB for both types of data
flows - network and block data - and there is always a possibility of that
reserve being empty or not refilled when OOM happens, so the problem is not
solved; the system's painful death is only slightly postponed (and maybe it
will solve the problem for the exact condition, but its roots are still there).
If the system is limited enough to provide enough memory for the network
tree allocator, it is possible to create its own drop condition inside NTA,
but it must be separated from the weakest chain element in those
conditions - SLAB OOM.

--
Evgeniy Polyakov

2006-08-12 10:19:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, 2006-08-12 at 13:37 +0400, Evgeniy Polyakov wrote:
> On Sat, Aug 12, 2006 at 11:19:49AM +0200, Peter Zijlstra ([email protected]) wrote:
> > > As you described above, memory for each packet must be allocated (either
> > > from SLAB or from reserve), so the network needs a special allocator in OOM
> > > conditions, and that allocator should be separated from SLAB's one which
> > > got OOM, so my purpose is just to use that different allocator (with
> > > additional features) for networking always. Since every piece of
> > > networking is limited (socket queues, socket numbers, hardware queues,
> > > hardware wire speeds and so on) there is always a maximum amount of
> > > memory it can consume and can never exceed, so if the network allocator
> > > gets that amount of memory at the beginning, it will never meet OOM,
> > > so it will _always_ work and thus can allow slow progress for
> > > OOM-capable things like block devices and swap issues.
> > > There is no special reserve and no need to switch to/from it and
> > > no possibility to have OOM by design.
> >
> > I'm not sure if the network stack is bounded as you say; for instance
> > imagine taking a lot of packets for blocked user-space processes,
> > these will just accumulate in the network stack and go nowhere. In that
> > case memory usage is very much unbounded.
>
> No it is not. There are socket queues and they are limited. Things like
> TCP behave even better.
>
> > Even if blocked sockets would only accept a limited amount of packets,
> > it would then become a function of the number of open sockets, which is
> > again unbounded.
>
> Does it? I thought it is possible to only have 64k of working sockets per
> device in TCP.

65535 sockets * 128 packets * 16384 bytes/packet =
2^16 * 2^7 * 2^14 = 2^(16+7+14) = 2^37 = 128G of memory per IP

And systems with a lot of IP numbers are not unthinkable.

I wonder what kind of system you have to feel that that is not a
problem. (I'm not sure on the 128 packets per socket, and the 16k per
packet is considering jumbo frames without scatter/gather receive)

> If the system is limited enough to provide enough memory for the network
> tree allocator, it is possible to create its own drop condition inside NTA,
> but it must be separated from the weakest chain element in those
> conditions - SLAB OOM.

Hence the alternative allocator to use under tight memory conditions.

2006-08-12 10:42:57

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, Aug 12, 2006 at 12:18:07PM +0200, Peter Zijlstra ([email protected]) wrote:
> > Does it? I thought it is possible to only have 64k of working sockets per
> > device in TCP.
>
> 65535 sockets * 128 packets * 16384 bytes/packet =
> 2^16 * 2^7 * 2^14 = 2^(16+7+14) = 2^37 = 128G of memory per IP
>
> And systems with a lot of IP numbers are not unthinkable.
>
> I wonder what kind of system you have to feel that that is not a
> problem. (I'm not sure on the 128 packets per socket, and the 16k per
> packet is considering jumbo frames without scatter/gather receive)

:) have you ever tried to get those numbers in practice?

There are a couple of problems with that:
1. You only have a 1gbit/sec link (someone has 10gbit/sec).
2. Depending on packet size those limits can be lower (and they _are_ lower
for real-world traffic where you need to send ACKs).
3. If the transferred data is supposed to be processed somehow, you need to
add those costs too.
4. When we are talking about OOM conditions, the whole system is slowed
down hard, so again you cannot achieve those limits.

And there are a lot of other factors which can slow down your network
(like netfilter, ipsec and others).

> > If system is limited enough to provide enough memory for network tree
> > allocator, it is possible to create it's own drop condition inside NTA,
> > but it must be saparated from the weakest chain element in that
> > conditions - SLAB OOM.
>
> Hence the alternative allocator to use under tight memory conditions.

You seem not to understand the situation.
You want to make progress in a SLAB OOM condition which is bound to the
network, and since progress requires allocation, it can deadlock.
That is the problem - the SLAB allocator got into an OOM condition.
The only solution for that (except adding more memory) is to make the
underlying system robust (in the discussed case it is the network). Robust
means not depending on the condition which leads to SLAB OOM.
When network uses the same allocator, it depends on it, and thus it is
possible to have (cut by you) a situation when reserve (which depends on
SLAB and its OOM too) is not filled or even does not exist.
The underlying system can have its own problems, millions of problems, but
that does not matter when we are talking about higher-layer conditions -
forward progress in this scenario does not depend on the weakest chain
element in the described conditions - it does not depend on SLAB OOM.

If transferred to your implementation, then just steal some pages from
SLAB when a new network device is added and use them when OOM happens.
It is much simpler and can help in most situations.

--
Evgeniy Polyakov

2006-08-12 10:52:06

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, Aug 12, 2006 at 02:42:26PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> > Hence the alternative allocator to use on tight memory conditions.
>
> If transferred to your implementation, then just steal some pages from
> SLAB when a new network device is added and use them when OOM happens.
> It is much simpler and can help in most situations.

And just to make things clear - I do not mean to insult your implementation
in any way; it can be 100% correct and behave perfectly.
I'm just saying that there are other methods to solve the problem which
seem to me more appropriate.

--
Evgeniy Polyakov

2006-08-12 11:41:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, 2006-08-12 at 14:42 +0400, Evgeniy Polyakov wrote:

> When network uses the same allocator, it depends on it, and thus it is
> possible to have (cut by you) a situation when reserve (which depends on
> SLAB and its OOM too) is not filled or even does not exist.

No, the reserve does not depend on SLAB, and I totally short-circuit the
SLAB allocator for skbs and related objects under memory pressure.

The memalloc reserve is on the page allocator level and is only
accessible for PF_MEMALLOC processes or __GFP_MEMALLOC (new in my
patches) allocations. (arguably there could be some more deadlocks wrt.
PF_MEMALLOC where the code under PF_MEMALLOC is not properly bounded,
those would be bugs and should be fixed if present/found)

> If transferred to your implementation, then just steal some pages from
> > SLAB when a new network device is added and use them when OOM happens.
> > It is much simpler and can help in most situations.

SLAB reclaim is painful and has been tried by the time you OOM.

2006-08-12 11:54:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, Aug 12, 2006 at 01:40:29PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Sat, 2006-08-12 at 14:42 +0400, Evgeniy Polyakov wrote:
>
> > When network uses the same allocator, it depends on it, and thus it is
> > possible to have (cut by you) a situation when reserve (which depends on
> > SLAB and its OOM too) is not filled or even does not exist.
>
> No, the reserve does not depend on SLAB, and I totally short-circuit the
> SLAB allocator for skbs and related objects under memory pressure.
>
> The memalloc reserve is on the page allocator level and is only
> accessible for PF_MEMALLOC processes or __GFP_MEMALLOC (new in my
> patches) allocations. (arguably there could be some more deadlocks wrt.
> PF_MEMALLOC where the code under PF_MEMALLOC is not properly bounded,
> those would be bugs and should be fixed if present/found)

By SLAB I always mean the allocator which is used to get the memory.
In your design the main allocator is used to reserve the memory, which then
can be used by your own allocator.

> > If transferred to your implementation, then just steal some pages from
> > SLAB when a new network device is added and use them when OOM happens.
> > It is much simpler and can help in most situations.
>
> SLAB reclaim is painful and has been tried by the time you OOM.

Just never return the reserved pages, provide a kernel parameter for how
large your reserve is, and get pages from there when you detect OOM.
SLAB (the main allocator) can do anything it wants, but your reserve will
never be touched by it.
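
A minimal sketch of that scheme, with every name here invented for
illustration and locking omitted:

#include <linux/init.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/moduleparam.h>

static int net_reserve_pages = 256;	/* hypothetical boot parameter */
module_param(net_reserve_pages, int, 0444);

static struct page **net_reserve;
static int net_reserve_top;

static int __init net_reserve_init(void)
{
	int i;

	net_reserve = kmalloc(net_reserve_pages * sizeof(*net_reserve),
			      GFP_KERNEL);
	if (!net_reserve)
		return -ENOMEM;
	for (i = 0; i < net_reserve_pages; i++) {
		net_reserve[i] = alloc_page(GFP_KERNEL);
		if (!net_reserve[i])
			return -ENOMEM;
	}
	net_reserve_top = net_reserve_pages;
	return 0;
}

/* Called only after alloc_page(GFP_ATOMIC) has already failed. */
static struct page *net_reserve_get(void)
{
	return net_reserve_top ? net_reserve[--net_reserve_top] : NULL;
}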

--
Evgeniy Polyakov

2006-08-12 14:40:38

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
> On Sat, Aug 12, 2006 at 11:19:49AM +0200, Peter Zijlstra ([email protected]) wrote:
>>> As you described above, memory for each packet must be allocated (either
>>> from SLAB or from reserve), so network needs special allocator in OOM
>>> condition, and that allocator should be separated from SLAB's one which
>>> got OOM, so my purpose is just to use that different allocator (with
>>> additional features) for networking always.
>
> No it is not. There are socket queues and they are limited. Things like
> TCP behave even better.

Ahhh, but there are two allocators in play here.

The first one allocates the memory for receiving packets.
This can be one pool, as long as it is isolated from
other things in the system it is fine.

The second allocator allocates more memory for socket
buffers. The memory critical sockets should get their
memory from a mempool, once normal socket memory
allocations start failing.

This means our allocation differentiation only needs
to happen at the socket stage.

Or am I overlooking something?
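
For the second pool, a sketch using the existing mempool API; the pool
size, the 2k skb size and the function names are made up for
illustration:

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/skbuff.h>

static mempool_t *crit_skb_pool;

static void *crit_skb_alloc(gfp_t gfp_mask, void *pool_data)
{
	return alloc_skb((unsigned long)pool_data, gfp_mask);
}

static void crit_skb_free(void *element, void *pool_data)
{
	kfree_skb((struct sk_buff *)element);
}

static int __init crit_skb_pool_init(void)
{
	/* Keep 64 ready-made skbs around for memory-critical sockets. */
	crit_skb_pool = mempool_create(64, crit_skb_alloc, crit_skb_free,
				       (void *)2048);
	return crit_skb_pool ? 0 : -ENOMEM;
}

Memory-critical sockets would then fall back to
mempool_alloc(crit_skb_pool, GFP_ATOMIC) once plain alloc_skb() starts
failing; everything else fails as before.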

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2006-08-12 14:50:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, Aug 12, 2006 at 10:40:23AM -0400, Rik van Riel ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >On Sat, Aug 12, 2006 at 11:19:49AM +0200, Peter Zijlstra
> >([email protected]) wrote:
> >>>As you described above, memory for each packet must be allocated (either
> >>>from SLAB or from reserve), so network needs special allocator in OOM
> >>>condition, and that allocator should be separated from SLAB's one which
> >>>got OOM, so my purpose is just to use that different allocator (with
> >>>additional features) for networking always.
> >
> >No it is not. There are socket queues and they are limited. Things like
> >TCP behave even better.
>
> Ahhh, but there are two allocators in play here.
>
> The first one allocates the memory for receiving packets.
> This can be one pool, as long as it is isolated from
> other things in the system it is fine.
>
> The second allocator allocates more memory for socket
> buffers. The memory critical sockets should get their
> memory from a mempool, once normal socket memory
> allocations start failing.
>
> This means our allocation differentiation only needs
> to happen at the socket stage.
>
> Or am I overlooking something?

Yep. Socket allocations end up with alloc_skb(), which is essentially the
same as what is being done for receive-path skbs.
If you really want to separate critical from non-critical sockets, it is
much better not to play with alloc_skb() but directly forbid it in the
appropriate socket allocation function like sock_alloc_send_skb().

What I suggested in the previous e-mail is to separate networking
allocations from other system allocations, so a problem in the main
allocator and its OOM would never affect the network path.

--
Evgeniy Polyakov

2006-08-12 14:56:44

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
> On Sat, Aug 12, 2006 at 10:40:23AM -0400, Rik van Riel ([email protected]) wrote:
>> Evgeniy Polyakov wrote:
>>> On Sat, Aug 12, 2006 at 11:19:49AM +0200, Peter Zijlstra
>>> ([email protected]) wrote:
>>>>> As you described above, memory for each packet must be allocated (either
>>>> >from SLAB or from reserve), so network needs special allocator in OOM
>>>>> condition, and that allocator should be separated from SLAB's one which
>>>>> got OOM, so my purpose is just to use that different allocator (with
>>>>> additional features) for networking always.
>>> No it is not. There are socket queues and they are limited. Things like
>>> TCP behave even better.
>> Ahhh, but there are two allocators in play here.
>>
>> The first one allocates the memory for receiving packets.
>> This can be one pool, as long as it is isolated from
>> other things in the system it is fine.
>>
>> The second allocator allocates more memory for socket
>> buffers. The memory critical sockets should get their
>> memory from a mempool, once normal socket memory
>> allocations start failing.
>>
>> This means our allocation differentiation only needs
>> to happen at the socket stage.
>>
>> Or am I overlooking something?
>
> Yep. Socket allocations end up with alloc_skb(), which is essentially the
> same as what is being done for receive-path skbs.
> If you really want to separate critical from non-critical sockets, it is
> much better not to play with alloc_skb() but directly forbid it in the
> appropriate socket allocation function like sock_alloc_send_skb().

The problem is the RECEIVE side.

> What I suggested in the previous e-mail is to separate networking
> allocations from other system allocations, so a problem in the main
> allocator and its OOM would never affect the network path.

That solves half of the problem. We still need to make sure we
do not allocate memory to non-critical sockets when the system
is almost out of memory.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2006-08-12 15:09:11

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, Aug 12, 2006 at 10:56:31AM -0400, Rik van Riel ([email protected]) wrote:
> >Yep. Socket allocations end up with alloc_skb(), which is essentially the
> >same as what is being done for receive-path skbs.
> >If you really want to separate critical from non-critical sockets, it is
> >much better not to play with alloc_skb() but directly forbid it in the
> >appropriate socket allocation function like sock_alloc_send_skb().
>
> The problem is the RECEIVE side.
>
> >What I suggested in the previous e-mail is to separate networking
> >allocations from other system allocations, so a problem in the main
> >allocator and its OOM would never affect the network path.
>
> That solves half of the problem. We still need to make sure we
> do not allocate memory to non-critical sockets when the system
> is almost out of memory.

One must receive a packet to determine if that packet must be dropped,
unless tricky hardware with header-split capabilities or MMIO copying is
used. Peter uses a special pool to get data from when the system is in
OOM (at least in his latest patchset), so allocations are separated and
thus the network code is not affected by the OOM condition, which allows
it to make forward progress.
A critical flag can be set up through setsockopt() and checked in
tcp_v4_rcv().
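
The userspace half of that idea might look like the sketch below;
SO_CRITICAL and its value are invented here, so this shows the proposed
interface rather than anything a current kernel accepts, and the kernel
half would test the resulting socket flag in tcp_v4_rcv():

#include <sys/socket.h>

#ifndef SO_CRITICAL
#define SO_CRITICAL 60		/* hypothetical option number */
#endif

/* Mark a block-IO transport socket as memory-critical. */
static int mark_critical(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_CRITICAL, &one, sizeof(one));
}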

--
Evgeniy Polyakov

2006-08-12 15:23:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, 2006-08-12 at 19:08 +0400, Evgeniy Polyakov wrote:

> One must receive a packet to determine if that packet must be dropped,
> unless tricky hardware with header-split capabilities or MMIO copying is
> used.

True, that is done, but we then discard this packet at the very first
moment we know it's not for a special socket. This way we know this
piece of memory will not get stuck waiting on some unimportant blocked
process. So even though we allocate the packet we do not lose the
memory.
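
In rough code the rule is something like this, where skb->emergency is a
stand-in for however the patches tag reserve-backed packets:

#include <net/sock.h>
#include <linux/skbuff.h>

static inline int skb_keep_for_sock(struct sk_buff *skb, struct sock *sk)
{
	/* Packets backed by normal memory are always delivered. */
	if (!skb->emergency)		/* illustrative field name */
		return 1;

	/* Reserve-backed packets are delivered only to sockets that
	 * promise to consume them promptly; anything else is freed at
	 * once, returning the memory to the reserve. */
	return sock_flag(sk, SOCK_MEMALLOC);
}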

> Peter uses a special pool to get data from when the system is in OOM
> (at least in his latest patchset), so allocations are separated and
> thus the network code is not affected by the OOM condition, which
> allows it to make forward progress.

I've done that throughout the patches, in various forms of brokenness.
Only with this full allocator could I implement all the semantics needed
for all skb operations though.

Previous attempts had some horrors built on alloc_pages() in there.

> A critical flag can be set up through setsockopt() and checked in
> tcp_v4_rcv().

I have looked at setsockopt(), but since I'm not sure I want to expose
this to userspace I chose not to do that.


2006-08-13 00:45:49

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

From: Evgeniy Polyakov <[email protected]>
Date: Sat, 12 Aug 2006 13:37:06 +0400

> Does it? I thought it was possible to have only 64k working sockets per
> device in TCP.

Where does this limit come from?

You think there is something magic about 64K local ports,
but if remote IP addresses in the TCP socket IDs are all
different, number of possible TCP sockets is only limited
by "number of client IPs * 64K" and ram :-)

2006-08-13 00:46:35

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

From: Peter Zijlstra <[email protected]>
Date: Sat, 12 Aug 2006 12:18:07 +0200

> 65535 sockets * 128 packets * 16384 bytes/packet =
> 2^16 * 2^7 * 2^14 = 2^(16+7+14) = 2^37 = 128G of memory per IP
>
> And systems with a lot of IP numbers are not unthinkable.

TCP restricts the amount of global memory that may be consumed
by all TCP sockets via the tcp_mem[] sysctl.

Otherwise several forms of DoS attacks would be possible.

2006-08-13 01:11:37

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Sat, 12 Aug 2006 12:18:07 +0200
>
>> 65535 sockets * 128 packets * 16384 bytes/packet =
>> 2^16 * 2^7 * 2^14 = 2^(16+7+14) = 2^37 = 128G of memory per IP
>>
>> And systems with a lot of IP numbers are not unthinkable.
>
> TCP restricts the amount of global memory that may be consumed
> by all TCP sockets via the tcp_mem[] sysctl.

This is exactly why we need to be careful which sockets
we allocate memory for, when the system is about to run
out of memory.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2006-08-13 09:07:11

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sat, Aug 12, 2006 at 05:46:07PM -0700, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Sat, 12 Aug 2006 13:37:06 +0400
>
> > Does it? I thought it was possible to have only 64k working sockets per
> > device in TCP.
>
> Where does this limit come from?
>
> You think there is something magic about 64K local ports,
> but if remote IP addresses in the TCP socket IDs are all
> different, number of possible TCP sockets is only limited
> by "number of client IPs * 64K" and ram :-)

I talked about working sockets, but not about how many of them the
system can have at all :)

--
Evgeniy Polyakov

2006-08-13 09:52:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sun, Aug 13, 2006 at 01:06:21PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> On Sat, Aug 12, 2006 at 05:46:07PM -0700, David Miller ([email protected]) wrote:
> > From: Evgeniy Polyakov <[email protected]>
> > Date: Sat, 12 Aug 2006 13:37:06 +0400
> >
> > > Does it? I thought it was possible to have only 64k working sockets per
> > > device in TCP.
> >
> > Where does this limit come from?
> >
> > You think there is something magic about 64K local ports,
> > but if remote IP addresses in the TCP socket IDs are all
> > different, number of possible TCP sockets is only limited
> > by "number of client IPs * 64K" and ram :-)
>
> I talked about working sockets, but not about how many of them the
> system can have at all :)

working -> bound.

--
Evgeniy Polyakov

2006-08-13 19:39:41

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

David Miller wrote:
> I think he's saying that he doesn't think your code is yet a
> reasonable way to solve the problem, and therefore doesn't belong
> upstream.

That is why it has not yet been submitted upstream. Respectfully, I
do not think that jgarzik has yet put in the work to know if this
anti-deadlock technique is reasonable or not, and he was only commenting
on some superficial blemish. I still don't get his point, if there
was one. He seems to be arguing in favor of a jump-off-the-cliff
approach to driver conversion. If he wants to do the work and take
the blame when some driver inevitably breaks because of being edited
in a hurry then he is welcome to submit the necessary additional
patches. Until then, there are about 3 nics that actually matter to
network storage at the moment, all of them GigE.

The layer 2 blemishes can be fixed easily, including avoiding the
atomic-op stall and the ->dev volatility. Thank you for pointing
those out.

Regards,

Daniel

2006-08-13 19:53:58

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/9] 3c59x driver conversion

Daniel Phillips wrote:
> That is why it has not yet been submitted upstream. Respectfully, I
> do not think that jgarzik has yet put in the work to know if this
> anti-deadlock technique is reasonable or not, and he was only commenting
> on some superficial blemish. I still don't get his point, if there
> was one. He seems to be arguing in favor of a jump-off-the-cliff
> approach to driver conversion. If he wants to do the work and take
> the blame when some driver inevitably breaks because of being edited
> in a hurry then he is welcome to submit the necessary additional
> patches. Until then, there are about 3 nics that actually matter to
> network storage at the moment, all of them GigE.

Quit whining and blaming the reviewer for a poor approach.

A "this driver is sane, VM-wise" flag is just plain stupid, and clearly
fragments drivers.

In Linux, "temporary flags"... aren't.

Jeff



2006-08-13 20:17:43

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Peter Zijlstra wrote:
> On Wed, 2006-08-09 at 16:54 -0700, David Miller wrote:
>>People are doing I/O over IP exactly for its ubiquity and
>>flexibility. It seems a major limitation of the design if you cancel
>>out major components of this flexibility.
>
> We're not, that was a bit of my own frustration leaking out; I think
> this whole push to IP based storage is a bit silly. I'm just not going
> to help the admin whose server just hangs because his VPN key expired.
>
> Running critical resources remotely like this is tricky, and every
> hop/layer you put in between increases the risk of something going bad.
> The only setup I think even remotely sane is a dedicated network in the
> very same room - not unlike FC but cheaper (which I think is the whole
> push behind this, eth is cheap).

Indeed. The rest of the corner cases like netfilter, layered protocol and
so on need to be handled, however they do not need to be handled right now
in order to make remote storage on a lan work properly. The sane thing for
the immediate future is to flag each socket as safe for remote block IO or
not, then gradually widen the scope of what is safe. We need to set up an
opt in strategy for network block IO that views such network subsystems as
ipfilter as not safe by default, until somebody puts in the work to make
them safe.

But really, if you expect to run reliable block IO to Zanzibar over an ssh
tunnel through a firewall, then you might also consider taking up bungee
jumping with the cord tied to your neck.

Regards,

Daniel

2006-08-13 21:22:50

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

David Miller wrote:
> From: Peter Zijlstra <[email protected]>
>>Hmm, what does sk_buff::input_dev do? That seems to store the initial
>>device?
>
> You can run grep on the tree just as easily as I can which is what I
> did to answer this question. It only takes a few seconds of your
> time to grep the source tree for things like "skb->input_dev", so
> would you please do that before asking more questions like this?
>
> It does store the initial device, but as Thomas tried so hard to
> explain to you guys these device pointers in the skb are transient and
> you cannot refer to them outside of packet receive processing.

Thomas did a great job of explaining and without any flaming or ad
hominem attacks.

We have now formed a decent plan for doing the accounting in a stable
way without adding new fields to sk_buff, thank you for the catch.

> The reason is that there is no refcounting performed on these devices
> when they are attached to the skb, for performance reasons, and thus
> the device can be downed, the module for it removed, etc. long before
> the skb is freed up.

The virtual block device can refcount the network device on virtual
device create and un-refcount on virtual device delete. We need to
add that to the core patch and maybe package it nicely so the memalloc
reserve/unreserve happens at the same time, in a tidy little library
function to share with other virtual devices like iSCSI that also need
some anti-deadlock lovin.

Regards,

Daniel

2006-08-13 22:05:55

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Rik van Riel wrote:
> Thomas Graf wrote:
>> skb->dev is not guaranteed to still point to the "allocating" device
>> once the skb is freed again so reserve/unreserve isn't symmetric.
>> You'd need skb->alloc_dev or something.
>
> There's another consequence of this property of the network
> stack.
>
> Every network interface must be able to fall back to these
> MEMALLOC allocations, because the memory critical socket
> could be on another network interface. Hence, we cannot
> know which network interfaces should (not) be marked MEMALLOC.

Good point. We do however know which interfaces should be marked
capable of carrying block IO traffic: the ones that have been fixed,
audited and tested. We might then allow the network block device to
specify which interface(s) will actually carry the traffic.

The advantage of being specific about which devices are carrying at
least one block io socket is, we can skip the reserve accounting for
the other interfaces. But is the extra layer of configuration gack a
better idea than just doing the accounting for every device, provided
the system is in reclaim?

By the way, another way to avoid impact on the normal case is an
experimental option such as CONFIG_PREVENT_NETWORK_BLOCKIO_DEADLOCK.

Regards,

Daniel

2006-08-13 23:49:14

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Daniel Phillips <[email protected]>
Date: Sun, 13 Aug 2006 14:22:31 -0700

> David Miller wrote:
> > The reason is that there is no refcounting performed on these devices
> > when they are attached to the skb, for performance reasons, and thus
> > the device can be downed, the module for it removed, etc. long before
> > the skb is freed up.
>
> The virtual block device can refcount the network device on virtual
> device create and un-refcount on virtual device delete.

What if the packet is originally received on the device in question,
and then gets redirected to another device by a packet scheduler
traffic classifier action or a netfilter rule?

It is necessary to handle the case where the device changes on the
skb, and the skb gets freed up in a context and association different
from when the skb was allocated (for example, different from the
device attached to the virtual block device).

2006-08-13 23:55:19

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Daniel Phillips <[email protected]>
Date: Sun, 13 Aug 2006 15:05:25 -0700

> By the way, another way to avoid impact on the normal case is an
> experimental option such as CONFIG_PREVENT_NETWORK_BLOCKIO_DEADLOCK.

That would just make the solution look more like a hack, and "bolted
on" rather than designed.

I think there is more profitability from a solution that really does
something about "network memory", and doesn't try to say "these
devices are special" or "these sockets are special". Special cases
generally suck.

We already limit and control TCP socket memory globally in the system.
If we do this for all socket and anonymous network buffer allocations,
which is sort of implicit in Evgeniy's network tree allocator design,
we can solve this problem in a more reasonable way.

And here's the kick, there are other unrelated highly positive
consequences to using Evgeniy's network tree allocator.

It doesn't just solve the _one_ problem it was built for, it solves
several problems. And that is the hallmark signature of good design.

2006-08-14 00:56:26

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
> One must receive a packet to determine if that packet must be dropped,
> unless tricky hardware with header-split capabilities or MMIO copying is
> used. Peter uses a special pool to get data from when the system is in
> OOM (at least in his latest patchset), so allocations are separated and
> thus the network code is not affected by the OOM condition, which allows
> it to make forward progress.

Nice executive summary. Crucial point: you want to say "in reclaim"
not "in OOM".

Yes, right from the beginning the patch set got its sk_buff memory
from a special pool when the system is in reclaim, however the exact
nature of the pool and how/where it is accounted has evolved... mostly
forward.

Regards,

Daniel

2006-08-14 01:15:46

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

David Miller wrote:
>From: Daniel Phillips <[email protected]>
>>David Miller wrote:
>>
>>>The reason is that there is no refcounting performed on these devices
>>>when they are attached to the skb, for performance reasons, and thus
>>>the device can be downed, the module for it removed, etc. long before
>>>the skb is freed up.
>>
>>The virtual block device can refcount the network device on virtual
>>device create and un-refcount on virtual device delete.
>
> What if the packet is originally received on the device in question,
> and then gets redirected to another device by a packet scheduler
> traffic classifier action or a netfilter rule?
>
> It is necessary to handle the case where the device changes on the
> skb, and the skb gets freed up in a context and assosciation different
> from when the skb was allocated (for example, different from the
> device attached to the virtual block device).

This aspect of the patch became moot because of the change to a single
reserve for all layer 2 delivery in Peter's more recent revisions.

*However* maybe it is worth mentioning that I intended to provide a
pointer from each sk_buff to a common accounting struct. This pointer
is set by the device driver. If the driver knows nothing about memory
accounting then the pointer is null, no accounting is done, and block
IO over the interface will be dangerous. Otherwise, if the system is
in reclaim (which we currently detect crudely when a normal allocation
fails) then atomic ops will be done on the accounting structure.

I planned to use the (struct sock *)sk_buff->sk field for this, which
is unused during layer 2 delivery as far as I can see. The accounting
struct can look somewhat like a struct sock if we like, size doesn't
really matter, and it might make the code more robust. Or the field
could become a union.

Anyway, this bit doesn't matter any more, the single global packet
delivery reserve is better and simpler.

Regards,

Daniel

2006-08-14 01:31:45

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

David Miller wrote:
> I think there is more profitability from a solution that really does
> something about "network memory", and doesn't try to say "these
> devices are special" or "these sockets are special". Special cases
> generally suck.
>
> We already limit and control TCP socket memory globally in the system.
> If we do this for all socket and anonymous network buffer allocations,
> which is sort of implicit in Evgeniy's network tree allocator design,
> we can solve this problem in a more reasonable way.

This does sound promising, but...

It is not possible to solve this problem entirely in the network
layer. At minimum, throttling is needed in the nbd driver, or even
better, in the submit_bio path as we have it now. We also need a way
of letting the virtual block device declare its resource needs, which
we also have in some form. We also need a mechanism to guarantee
layer 2 delivery so we can drop unrelated packets if necessary, which
exists in the current patch set.

So Evgeniy's work might well be a part of the solution, but it is not
the whole solution.

> And here's the kick, there are other unrelated highly positive
> consequences to using Evgeniy's network tree allocator.

Also good. But is it ready to use today? We need to actually fix
the out of memory deadlock/performance bug right now so that remote
storage works properly.

> It doesn't just solve the _one_ problem it was built for, it solves
> several problems. And that is the hallmark signature of good design.

Great. But to solve the whole problem, the block IO system and the
network layer absolutely must cooperate.

Regards,

Daniel

2006-08-14 01:53:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Sun, 13 Aug 2006 18:31:14 -0700
Daniel Phillips <[email protected]> wrote:

> But to solve the whole problem

What problem? Has anyone come up with a testcase which others can
reproduce?

2006-08-14 04:42:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Sun, 2006-08-13 at 18:53 -0700, Andrew Morton wrote:
> On Sun, 13 Aug 2006 18:31:14 -0700
> Daniel Phillips <[email protected]> wrote:
>
> > But to solve the whole problem
>
> What problem? Has anyone come up with a testcase which others can
> reproduce?

Problem:

Networked Block devices (NBD, iSCSI, AoE) can deadlock in the following
manner:
deplete normal memory because of memory pressure; deplete the reserves by
writeout over the network (pageout happens under PF_MEMALLOC), leaving
little to no memory for receiving those now-crucial ACK packets.
A few packets could still fit in memory, but they are quickly gobbled up
by non-crucial sockets and are left waiting on blocked user-space
processes. All memory is depleted and progress stalls forever.

(This affects swap and shared mmap)

Our Solution:

Mark some sockets with SOCK_MEMALLOC, which is essentially a promise
never to block. When under memory pressure, deliver packets only to these
sockets; memory will still be used but never lost waiting on a blocked
user-space process.

Also make sure the reserve is large enough so that writeout will never
be able to completely deplete it.
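
In rough code the promise reads like this; the helper is illustrative,
and a driver such as nbd would call it on its transport socket:

#include <net/sock.h>

static void mark_sock_memalloc(struct sock *sk)
{
	/* Promise: data queued on this socket is always consumed
	 * promptly, so reserve memory spent on it is never lost to a
	 * blocked user-space process. */
	sock_set_flag(sk, SOCK_MEMALLOC);
}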

(It is here I still do not see Evgeniy's Network Tree Allocator work;
where is the guarantee that you do not end up with all memory lost
waiting on blocked sockets?)

Testcase:

Mount an NBD device as sole swap device and mmap > physical RAM, then
loop through touching pages only once.

My normal test setup is a p3-550 with 192M of ram with a 100Mbit card
and remote machine with a regular 7200 RPM pata drive.

I'm sure there is an iSCSI equivalent scenario, playing with iSCSI is
next on my list of things.
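
For reference, a minimal userspace loop for the touching step, assuming
the NBD device has already been set up as the sole swap device (mkswap
plus swapon) and that the 256M figure exceeds physical RAM, as on the
box above:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 256UL << 20;	/* > physical RAM on the test box */
	size_t i;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty every page exactly once; each store forces earlier
	 * pages out to swap over the network. */
	for (i = 0; i < len; i += 4096)
		p[i] = 1;

	printf("touched %zu pages\n", len / 4096);
	return 0;
}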

2006-08-14 05:05:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Sun, 2006-08-13 at 21:58 -0700, Andrew Morton wrote:
> On Mon, 14 Aug 2006 06:40:53 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > Testcase:
> >
> > Mount an NBD device as sole swap device and mmap > physical RAM, then
> > loop through touching pages only once.
>
> Fix: don't try to swap over the network. Yes, there may be some scenarios
> where people have no local storage, but it's reasonable to expect anyone
> who is using Linux as an "enterprise storage platform" to stick a local
> disk on the thing for swap.

I wish you were right, however there seems to be a large demand to go
diskless and swap over iSCSI because disks seem to be the nr. 1 failing
piece of hardware in systems these days.

> That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> that, will it not?

It will make it less likely. One can still have memory pressure; the
remaining bits of memory can still get stuck in socket queues for
blocked processes.

2006-08-14 05:05:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Mon, 14 Aug 2006 06:40:53 +0200
Peter Zijlstra <[email protected]> wrote:

> Testcase:
>
> Mount an NBD device as sole swap device and mmap > physical RAM, then
> loop through touching pages only once.

Fix: don't try to swap over the network. Yes, there may be some scenarios
where people have no local storage, but it's reasonable to expect anyone
who is using Linux as an "enterprise storage platform" to stick a local
disk on the thing for swap.

That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
that, will it not?

2006-08-14 05:14:00

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Sun, Aug 13, 2006 at 01:16:15PM -0700, Daniel Phillips ([email protected]) wrote:
> Indeed. The rest of the corner cases like netfilter, layered protocol and
> so on need to be handled, however they do not need to be handled right now
> in order to make remote storage on a lan work properly. The sane thing for
> the immediate future is to flag each socket as safe for remote block IO or
> not, then gradually widen the scope of what is safe. We need to set up an
> opt in strategy for network block IO that views such network subsystems as
> ipfilter as not safe by default, until somebody puts in the work to make
> them safe.

Just for clarification - it will be completely impossible to log in to
the machine using openssh or some other privilege-separation protocol,
due to the nature of unix sockets. So you will be unable to manage your
storage system just because it is in OOM - that is not what is expected
from a reliable system.

> But really, if you expect to run reliable block IO to Zanzibar over an ssh
> tunnel through a firewall, then you might also consider taking up bungee
> jumping with the cord tied to your neck.

Just pure openssh for control connection (admin should be able to
login).

> Regards,
>
> Daniel

--
Evgeniy Polyakov

2006-08-14 05:29:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Mon, 14 Aug 2006 07:03:55 +0200
Peter Zijlstra <[email protected]> wrote:

> On Sun, 2006-08-13 at 21:58 -0700, Andrew Morton wrote:
> > On Mon, 14 Aug 2006 06:40:53 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > Testcase:
> > >
> > > Mount an NBD device as sole swap device and mmap > physical RAM, then
> > > loop through touching pages only once.
> >
> > Fix: don't try to swap over the network. Yes, there may be some scenarios
> > where people have no local storage, but it's reasonable to expect anyone
> > who is using Linux as an "enterprise storage platform" to stick a local
> > disk on the thing for swap.
>
> I wish you were right, however there seems to be a large demand to go
> diskless and swap over iSCSI because disks seem to be the nr. 1 failing
> piece of hardware in systems these days.

We could track dirty anonymous memory and throttle.

Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a
machine is no longer deadlockable with any of these tricks. Do we know
what level that is?

> > That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> > that, will it not?
>
> It will make it less likely. One can still have memory pressure; the
> remaining bits of memory can still get stuck in socket queues for
> blocked processes.

But there's lots of reclaimable pagecache around and kswapd will free it
up?

2006-08-14 06:46:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Mon, 2006-08-14 at 09:13 +0400, Evgeniy Polyakov wrote:
> On Sun, Aug 13, 2006 at 01:16:15PM -0700, Daniel Phillips ([email protected]) wrote:
> > Indeed. The rest of the corner cases like netfilter, layered protocol and
> > so on need to be handled, however they do not need to be handled right now
> > in order to make remote storage on a lan work properly. The sane thing for
> > the immediate future is to flag each socket as safe for remote block IO or
> > not, then gradually widen the scope of what is safe. We need to set up an
> > opt in strategy for network block IO that views such network subsystems as
> > ipfilter as not safe by default, until somebody puts in the work to make
> > them safe.
>
> Just for clarification - it will be completely impossible to log in to
> the machine using openssh or some other privilege-separation protocol,
> due to the nature of unix sockets. So you will be unable to manage your
> storage system just because it is in OOM - that is not what is expected
> from a reliable system.
>
> > But really, if you expect to run reliable block IO to Zanzibar over an ssh
> > tunnel through a firewall, then you might also consider taking up bungee
> > jumping with the cord tied to your neck.
>
> Just pure openssh for control connection (admin should be able to
> login).

These periods of degraded functionality should be short and
infrequent, albeit critical for machine recovery. Would you rather have a
slower ssh login (the machine will recover) or drive/fly to Zanzibar to
physically reboot the machine?

2006-08-14 06:46:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Sun, 2006-08-13 at 22:22 -0700, Andrew Morton wrote:
> On Mon, 14 Aug 2006 07:03:55 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Sun, 2006-08-13 at 21:58 -0700, Andrew Morton wrote:
> > > On Mon, 14 Aug 2006 06:40:53 +0200
> > > Peter Zijlstra <[email protected]> wrote:
> > >
> > > > Testcase:
> > > >
> > > > Mount an NBD device as sole swap device and mmap > physical RAM, then
> > > > loop through touching pages only once.
> > >
> > > Fix: don't try to swap over the network. Yes, there may be some scenarios
> > > where people have no local storage, but it's reasonable to expect anyone
> > > who is using Linux as an "enterprise storage platform" to stick a local
> > > disk on the thing for swap.
> >
> > I wish you were right, however there seems to be a large demand to go
> > diskless and swap over iSCSI because disks seem to be the nr. 1 failing
> > piece of hardware in systems these days.
>
> We could track dirty anonymous memory and throttle.
>
> Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a
> machine is no longer deadlockable with any of these tricks. Do we know
> what level that is?

Not sure; the theoretical max amount of memory one can 'lose' in socket
wait queues is well over the amount of physical memory we have in
machines today (even for SGI). This, combined with the fact that we limit
the memory in some way to avoid DoS attacks, could make for all memory
to be stuck in wait queues. Of course this becomes rather more unlikely
for ever larger amounts of memory. But unlikely is never a guarantee.

>
> > > That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> > > that, will it not?
> >
> > It will make it less likely. One can still have memory pressure; the
> > remaining bits of memory can still get stuck in socket queues for
> > blocked processes.
>
> But there's lots of reclaimable pagecache around and kswapd will free it
> up?

Yes, however it is possible for kswapd and direct reclaim to block on
get_request_wait() for the nbd/iscsi request queue by sheer misfortune.
In that case there will be no more reclaim; of course, the more active
processes we have, the unlikelier this will be. Still, with the sheer
amount of CPU time invested in Linux, it's not a gamble we're likely to
never lose.

2006-08-14 06:55:22

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Mon, Aug 14, 2006 at 08:45:43AM +0200, Peter Zijlstra ([email protected]) wrote:
> > Just for clarification - it will be completely impossible to log in to
> > the machine using openssh or some other privilege-separation protocol,
> > due to the nature of unix sockets. So you will be unable to manage your
> > storage system just because it is in OOM - that is not what is expected
> > from a reliable system.
> >
> > > But really, if you expect to run reliable block IO to Zanzibar over an ssh
> > > tunnel through a firewall, then you might also consider taking up bungee
> > > jumping with the cord tied to your neck.
> >
> > Just pure openssh for control connection (admin should be able to
> > login).
>
> These periods of degraded functionality should be short and
> infrequent, albeit critical for machine recovery. Would you rather have a
> slower ssh login (the machine will recover) or drive/fly to Zanzibar to
> physically reboot the machine?

It will not work, since you cannot mark openssh sockets as ones which
are able to get memory from the reserved pool. So the admin is unable to
check the system status and do anything to keep the system alive.

--
Evgeniy Polyakov

2006-08-14 07:14:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Mon, 14 Aug 2006 08:45:40 +0200
Peter Zijlstra <[email protected]> wrote:

> On Sun, 2006-08-13 at 22:22 -0700, Andrew Morton wrote:
> > On Mon, 14 Aug 2006 07:03:55 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > On Sun, 2006-08-13 at 21:58 -0700, Andrew Morton wrote:
> > > > On Mon, 14 Aug 2006 06:40:53 +0200
> > > > Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > > Testcase:
> > > > >
> > > > > Mount an NBD device as sole swap device and mmap > physical RAM, then
> > > > > loop through touching pages only once.
> > > >
> > > > Fix: don't try to swap over the network. Yes, there may be some scenarios
> > > > where people have no local storage, but it's reasonable to expect anyone
> > > > who is using Linux as an "enterprise storage platform" to stick a local
> > > > disk on the thing for swap.
> > >
> > > I wish you were right, however there seems to be a large demand to go
> > > diskless and swap over iSCSI because disks seem to be the nr. 1 failing
> > > piece of hardware in systems these days.
> >
> > We could track dirty anonymous memory and throttle.
> >
> > Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a
> > machine is no longer deadlockable with any of these tricks. Do we know
> > what level that is?
>
> Not sure, the theoretical max amount of memory one can 'lose' in socket
> wait queues is well over the amount of physical memory we have in
> machines today (even for SGI); this combined with the fact that we limit
> the memory in some way to avoid DoS attacks, could make for all memory
> to be stuck in wait queues. Of course this becomes rather more unlikely
> for ever larger amounts of memory. But unlikely is never a guarantee.

What is a "socket wait queue" and how/why can it consume so much memory?

Can it be prevented from doing that?

If this refers to the socket buffers, they're mostly allocated with
at least __GFP_WAIT, aren't they?

> >
> > > > That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> > > > that, will it not?
> > >
> > > It will make it less likely. One can still have memory pressure; the
> > > remaining bits of memory can still get stuck in socket queues for
> > > blocked processes.
> >
> > But there's lots of reclaimable pagecache around and kswapd will free it
> > up?
>
> Yes, however it is possible for kswapd and direct reclaim to block on
> get_request_wait() for the nbd/iscsi request queue by sheer misfortune.

Possibly there are some situations where kswapd will get stuck on request
queues. But as long as the block layer is correctly calling
set_queue_congested(), these are easily avoidable via
bdi_write_congested().

> In that case there will be no more reclaim; of course, the more active
> processes we have, the unlikelier this will be. Still, with the sheer
> amount of CPU time invested in Linux, it's not a gamble we're likely to
> never lose.

I suspect that with mm-tracking-shared-dirty-pages.patch, a bit of tuning
and perhaps some bugfixing we can make this problem go away for all
practical purposes. Particularly if we're prepared to require local
storage for swap (the paranoid can use RAID, no?).

Seems to me that more investigation of these options is needed before we can
justify adding lots of hard-to-test complexity to networking?

2006-08-14 07:17:50

by NeilBrown

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Monday August 14, [email protected] wrote:
> On Sun, 2006-08-13 at 22:22 -0700, Andrew Morton wrote:
> >
> > We could track dirty anonymous memory and throttle.
> >
> > Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a
> > machine is no longer deadlockable with any of these tricks. Do we know
> > what level that is?
>
> Not sure; the theoretical max amount of memory one can 'lose' in socket
> wait queues is well over the amount of physical memory we have in
> machines today (even for SGI). This, combined with the fact that we limit
> the memory in some way to avoid DoS attacks, could make for all memory
> to be stuck in wait queues. Of course this becomes rather more unlikely
> for ever larger amounts of memory. But unlikely is never a guarantee.

What is the minimum amount of memory we need to reserve for each
socket? 1K? 1 page? Call it X.
Suppose that whenever a socket is created (or bound or connected or
whatever is right) we first allocate that much to a recv pool.
If any socket has less than X queued, then it is allowed to allocate
up to a total of X from the reserve pool. After that it can only
receive when memory can be allocated from elsewhere. Then we will
never block on recv.

Note that X doesn't need to be the biggest possible incoming message.
It only needs to be enough to get an 'ack' over any possible network
storage protocol with any possible layering. I suspect that it well
within one page.

Would it be too much waste to reserve one page for every idle socket?

Does this have some fatal flaw?

NeilBrown
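
For concreteness, the scheme in rough code, with invented names and the
bookkeeping reduced to its core (no locking shown):

#include <net/sock.h>

#define RECV_RESERVE_X	4096	/* "well within one page" */

/* Grown by X for every socket created, shrunk again on close. */
static atomic_t recv_reserve_left = ATOMIC_INIT(0);

static int sock_may_use_reserve(struct sock *sk, unsigned int size)
{
	/* Only a socket with less than X queued qualifies... */
	if (atomic_read(&sk->sk_rmem_alloc) >= RECV_RESERVE_X)
		return 0;

	/* ...and only while the shared pool it pre-paid into has room. */
	return atomic_read(&recv_reserve_left) >= (int)size;
}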

2006-08-14 07:39:52

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Mon, Aug 14, 2006 at 05:17:29PM +1000, Neil Brown ([email protected]) wrote:
> Would it be too much waste to reserve one page for every idle socket?
>
> Does this have some fatal flaw?

Yep, in some cases the number of sockets is unlimited, but the total
memory they can eat is already limited, as David mentioned, by tcp_?mem[].

> NeilBrown

--
Evgeniy Polyakov

2006-08-14 08:17:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Mon, 2006-08-14 at 00:07 -0700, Andrew Morton wrote:
> On Mon, 14 Aug 2006 08:45:40 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Sun, 2006-08-13 at 22:22 -0700, Andrew Morton wrote:
> > > On Mon, 14 Aug 2006 07:03:55 +0200
> > > Peter Zijlstra <[email protected]> wrote:
> > >
> > > > On Sun, 2006-08-13 at 21:58 -0700, Andrew Morton wrote:
> > > > > On Mon, 14 Aug 2006 06:40:53 +0200
> > > > > Peter Zijlstra <[email protected]> wrote:
> > > > >
> > > > > > Testcase:
> > > > > >
> > > > > > Mount an NBD device as sole swap device and mmap > physical RAM, then
> > > > > > loop through touching pages only once.
> > > > >
> > > > > Fix: don't try to swap over the network. Yes, there may be some scenarios
> > > > > where people have no local storage, but it's reasonable to expect anyone
> > > > > who is using Linux as an "enterprise storage platform" to stick a local
> > > > > disk on the thing for swap.
> > > >
> > > > I wish you were right, however there seems to be a large demand to go
> > > > diskless and swap over iSCSI because disks seem to be the nr. 1 failing
> > > > piece of hardware in systems these days.
> > >
> > > We could track dirty anonymous memory and throttle.
> > >
> > > Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a
> > > machine is no longer deadlockable with any of these tricks. Do we know
> > > what level that is?
> >
> > Not sure; the theoretical max amount of memory one can 'lose' in socket
> > wait queues is well over the amount of physical memory we have in
> > machines today (even for SGI). This, combined with the fact that we limit
> > the memory in some way to avoid DoS attacks, could make for all memory
> > to be stuck in wait queues. Of course this becomes rather more unlikely
> > for ever larger amounts of memory. But unlikely is never a guarantee.
>
> What is a "socket wait queue" and how/why can it consume so much memory?
>
> Can it be prevented from doing that?
>
> If this refers to the socket buffers, they're mostly allocated with
> at least __GFP_WAIT, aren't they?

Wherever it is that packets go if the local end is tied up and cannot
accept them instantly. The simple but probably wrong calculation I made
for Evgeniy is: suppose we have 64k sockets, each socket can buffer up to
128 packets, and each packet can be up to 16k (rounded up for jumbo
frames) large; that makes for 128G of memory. This calculation is wrong
on several points (we can have >64k sockets, and I have no idea on the
128) but the order of things doesn't get better.

> > > > > That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> > > > > that, will it not?
> > > >
> > > > It will make it less likely. One can still have memory pressure; the
> > > > remaining bits of memory can still get stuck in socket queues for
> > > > blocked processes.
> > >
> > > But there's lots of reclaimable pagecache around and kswapd will free it
> > > up?
> >
> > Yes, however it is possible for kswapd and direct reclaim to block on
> > get_request_wait() for the nbd/iscsi request queue by sheer misfortune.
>
> Possibly there are some situations where kswapd will get stuck on request
> queues. But as long as the block layer is correctly calling
> set_queue_congested(), these are easily avoidable via
> bdi_write_congested().

Right, and this might, regardless of what we're going to end up doing,
be a good thing to do.

> > In that case there will be no more reclaim; of course, the more active
> > processes we have, the unlikelier this will be. Still, with the sheer
> > amount of CPU time invested in Linux, it's not a gamble we're likely to
> > never lose.
>
> I suspect that with mm-tracking-shared-dirty-pages.patch, a bit of tuning
> and perhaps some bugfixing we can make this problem go away for all
> practical purposes. Particularly if we're prepared to require local
> storage for swap (the paranoid can use RAID, no?).
>
> Seem to me that more investigation of these options is needed before we can
> justify adding lots of hard-to-test complexity to networking?

Well, my aim here, as disgusting as you might think it is, is to get
swap over network working. I sympathise with your stance of: don't do
that; but I have been set this task and shall try to get something that
does not offend people.

As for hard to test, I can supply some patches that would make SROG
(I still find the name horrid) the default network allocator so one could
more easily test the code paths. As for the dropping of packets, I could
supply a debug control to switch it on/off regardless of memory
pressure.

As for overall complexity, a simple fallback allocator that kicks in
when the normal allocation path fails, and some simple checks to drop
packets allocated in this fashion when not bound for critical sockets
doesn't seem like a lot of complexity to me.


2006-08-14 08:33:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Mon, Aug 14, 2006 at 10:15:52AM +0200, Peter Zijlstra ([email protected]) wrote:
> > If this refers to the socket buffers, they're mostly allocated with
> > at least __GFP_WAIT, aren't they?
>
> Wherever it is that packets go if the local end is tied up and cannot
> accept them instantly. The simple but probably wrong calculation I made
> for Evgeniy is: suppose we have 64k sockets, each socket can buffer up
> to 128 packets, and each packet can be up to 16k (rounded up for jumbo
> frames) large; that makes for 128G of memory. This calculation is wrong
> on several points (we can have >64k sockets, and I have no idea on the
> 128) but the order of things doesn't get better.

TCP memory is limited for all sockets - via the tcp_*mem parameters.
The tcp_mem max on my amd64 with 1gb of ram is 768 kb for _all_ sockets.

--
Evgeniy Polyakov

2006-08-14 08:33:43

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

From: Andrew Morton <[email protected]>
Date: Mon, 14 Aug 2006 00:07:36 -0700

> What is a "socket wait queue" and how/why can it consume so much memory?
>
> Can it be prevented from doing that?
>
> If this refers to the socket buffers, they're mostly allocated with
> at least __GFP_WAIT, aren't they?

He's talking about the fact that, once we've tied a receive buffer to
a socket, we can't liberate that memory in any way until the user
reads in the data (which can be whenever it likes) especially if we've
ACK'd the data for protocols such as TCP.

Receive buffers are allocated by the device, usually in interrupt or
software interrupt context, to refill its RX ring using GFP_ATOMIC or
similar.

Send buffers are usually allocated with GFP_KERNEL, but that can be
modified via the sk->sk_allocation socket member. This is used by
things like sunrpc and other cases which need to allocate socket
buffers with GFP_ATOMIC or with GFP_NOFS for NFS's sake.
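
For illustration, the sort of thing sunrpc does with that member; a
sketch, not a quote from net/sunrpc:

#include <net/sock.h>

static void make_sends_atomic(struct sock *sk)
{
	/* Send-buffer allocations for this socket will now use
	 * GFP_ATOMIC rather than the default GFP_KERNEL, so they never
	 * sleep; NFS sets GFP_NOFS here instead to avoid fs recursion. */
	sk->sk_allocation = GFP_ATOMIC;
}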

2006-08-14 08:36:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Mon, 2006-08-14 at 12:25 +0400, Evgeniy Polyakov wrote:
> On Mon, Aug 14, 2006 at 10:15:52AM +0200, Peter Zijlstra ([email protected]) wrote:
> > > If this refers to the socket buffers, they're mostly allocated with
> > > at least __GFP_WAIT, aren't they?
> >
> > Wherever it is that packets go if the local end is tied up and cannot
> > accept them instantly. The simple but probably wrong calculation I made
> > for Evgeniy is: suppose we have 64k sockets, each socket can buffer up
> > to 128 packets, and each packet can be up to 16k (rounded up for jumbo
> > frames) large; that makes for 128G of memory. This calculation is wrong
> > on several points (we can have >64k sockets, and I have no idea on the
> > 128) but the order of things doesn't get better.
>
> TCP memory is limited for all sockets - via the tcp_*mem parameters.
> The tcp_mem max on my amd64 with 1gb of ram is 768 kb for _all_ sockets.

Yes, I've said as much a few emails back, but that does not make the
theoretical limit any lower.


2006-08-15 19:20:33

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Hi!

> Recently, Peter Zijlstra and I have been busily collaborating on a
> solution to the memory deadlock problem described here:
>
> http://lwn.net/Articles/144273/
> "Kernel Summit 2005: Convergence of network and storage paths"
>
> We believe that an approach very much like today's patch set is
> necessary for NBD, iSCSI, AoE or the like ever to work reliably.
> We further believe that a properly working version of at least one of
> these subsystems is critical to the viability of Linux as a modern
> storage platform.
...
> Unfortunately, a particularly nasty form of memory deadlock arises from
> the fact that receive side of the network stack is also a sort of

What about transmit side? I believe you need to reply to ARPs or you
will be unable to communicate over ethernet...
--
Thanks for all the (sleeping) penguins.

2006-08-17 04:03:23

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Andrew Morton wrote:
> Peter Zijlstra <[email protected]> wrote:
>>Testcase:
>>
>>Mount an NBD device as sole swap device and mmap > physical RAM, then
>>loop through touching pages only once.
>
> Fix: don't try to swap over the network. Yes, there may be some scenarios
> where people have no local storage, but it's reasonable to expect anyone
> who is using Linux as an "enterprise storage platform" to stick a local
> disk on the thing for swap.
>
> That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> that, will it not?

Hi Andrew,

What happened to the case where we just fill memory full of dirty file
pages backed by a remote disk?

Regards,

Daniel

2006-08-17 04:31:09

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Andrew Morton wrote:
> What is a "socket wait queue" and how/why can it consume so much memory?

Two things:

1) sk_buffs in flight between device receive interrupt and layer 3
protocol/socket identification.

2) sk_buffs queued onto a particular socket waiting for some task to
come along and pull them off via read or equivalent.

Case (1) probably can't consume an unbounded amount of memory, but I
would not swear to that with my current reading knowledge of the network
stack. The upper bound here is obscured by clever SMP device processing,
netfilter options, softirq scheduling questions, probably other things.
This needs a considered explanation from a network guru, or perhaps a
pointer to documentation.

Case (2) is the elephant under the rug. Some form of TCP memory
throttling exists, but there is no organized way to correlate that with
actual memory conditions, and it appears to be exposed to user control.
Memory throttling seems to be entirely absent for non-TCP protocols,
e.g., UDP.

> Can it be prevented from doing that?

This patch set does that, and also provides an emergency reserve for
network devices in order to prevent atomic allocation failures while
trying to refill NIC DMA buffer rings. It is the moral equivalent of
our bio reservation scheme, but with additional twists specific to the
network stack.
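
The driver-side half of that reserve, in rough code;
dev_alloc_skb_reserve() is a hypothetical helper standing in for
whatever the patch set actually provides:

#include <linux/skbuff.h>

static struct sk_buff *rx_refill_skb(unsigned int size)
{
	/* Normal atomic allocation first... */
	struct sk_buff *skb = dev_alloc_skb(size);

	/* ...then fall back to the emergency reserve, so the RX ring
	 * never runs dry while the box is paging out over the network. */
	if (!skb)
		skb = dev_alloc_skb_reserve(size);	/* hypothetical */

	return skb;
}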

Regards,

Daniel

2006-08-17 04:48:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
> On Sun, Aug 13, 2006 at 01:16:15PM -0700, Daniel Phillips ([email protected]) wrote:
>>Indeed. The rest of the corner cases like netfilter, layered protocol and
>>so on need to be handled, however they do not need to be handled right now
>>in order to make remote storage on a lan work properly. The sane thing for
>>the immediate future is to flag each socket as safe for remote block IO or
>>not, then gradually widen the scope of what is safe. We need to set up an
>>opt in strategy for network block IO that views such network subsystems as
>>ipfilter as not safe by default, until somebody puts in the work to make
>>them safe.
>
> Just for clarification - it will be completely impossible to log in to
> the machine using openssh or some other privilege-separation protocol,
> due to the nature of unix sockets. So you will be unable to manage your
> storage system just because it is in OOM - that is not what is expected
> from a reliable system.

The system is not OOM, it is in reclaim, a transient condition that will be
resolved in normal course by IO progress. However you raise an excellent
point: if there is any remote management that we absolutely require to be
available while remote IO is interrupted - manual failover for example -
then we must supply a means of carrying out such remote administration, that
is guaranteed not to deadlock on a normal mode memory request. This ends up
as a new network stack feature I think, and probably a theoretical one for
the time being since we don't actually know of any such mandatory login
that must be carried out while remote disk IO is suspended.

>>But really, if you expect to run reliable block IO to Zanzibar over an ssh
>>tunnel through a firewall, then you might also consider taking up bungee
>>jumping with the cord tied to your neck.
>
> Just pure openssh for control connection (admin should be able to
> log in).

And the admin will be able to, but in the cluster stack itself we don't
bless such stupidity as emailing an admin to ask for a login in order to
break a tie over which node should take charge of DLM recovery.

Regards,

Daniel

2006-08-17 04:51:08

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
> On Mon, Aug 14, 2006 at 08:45:43AM +0200, Peter Zijlstra ([email protected]) wrote:
>>>Just pure openssh for control connection (admin should be able to
>>>log in).
>>
>>These periods of degraded functionality should be short and
>>infrequent albeit critical for machine recovery. Would you rather have a
>>slower ssh login (the machine will recover) or drive/fly to Zanzibar to
>>physically reboot the machine?
>
> It will not work, since you cannot mark openssh's sockets as ones that
> are able to get memory from the reserved pool. So the admin is unable to
> check system status or do anything to bring the system back to life.

This is incorrect, please see my previous email.

Regards,

Daniel

2006-08-17 05:36:51

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Wed, Aug 16, 2006 at 09:48:37PM -0700, Daniel Phillips ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >On Sun, Aug 13, 2006 at 01:16:15PM -0700, Daniel Phillips
> >([email protected]) wrote:
> >>Indeed. The rest of the corner cases like netfilter, layered protocol and
> >>so on need to be handled, however they do not need to be handled right now
> >>in order to make remote storage on a lan work properly. The sane thing
> >>for
> >>the immediate future is to flag each socket as safe for remote block IO or
> >>not, then gradually widen the scope of what is safe. We need to set up an
> >>opt-in strategy for network block IO that views such network subsystems as
> >>ipfilter as not safe by default, until somebody puts in the work to make
> >>them safe.
> >
> >Just for clarification - it will be completely impossible to log in using
> >openssh or some other privilege separation protocol to the machine due
> >to the nature of unix sockets. So you will be unable to manage your
> >storage system just because it is in OOM - it is not what is expected
> >from a reliable system.
>
> The system is not OOM, it is in reclaim, a transient condition that will be
> resolved in normal course by IO progress. However you raise an excellent
> point: if there is any remote management that we absolutely require to be
> available while remote IO is interrupted - manual failover for example -
> then we must supply a means of carrying out such remote administration that
> is guaranteed not to deadlock on a normal mode memory request. This ends up
> as a new network stack feature, I think, and probably a theoretical one for
> the time being, since we don't actually know of any such mandatory login
> that must be carried out while remote disk IO is suspended.

That is why you are not allowed to depend on the main system allocator's
problems. That is why the network can have its own allocator.

> >>But really, if you expect to run reliable block IO to Zanzibar over an ssh
> >>tunnel through a firewall, then you might also consider taking up bungee
> >>jumping with the cord tied to your neck.
> >
> >Just pure openssh for control connection (admin should be able to
> >log in).
>
> And the admin will be able to, but in the cluster stack itself we don't
> bless such stupidity as emailing an admin to ask for a login in order to
> break a tie over which node should take charge of DLM recovery.

No, you can't, since openssh and any other privilege separation
mechanisms use additional sockets to transfer data between their parts;
unix sockets frequently require page-sized allocations, which will end
up as 8k allocations in slab.
You will not be able to log in using openssh.

> Regards,
>
> Daniel

--
Evgeniy Polyakov

2006-08-17 06:04:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Wed, 16 Aug 2006 20:58:28 -0700
Daniel Phillips <[email protected]> wrote:

> Andrew Morton wrote:
> > Peter Zijlstra <[email protected]> wrote:
> >>Testcase:
> >>
> >>Mount an NBD device as sole swap device and mmap > physical RAM, then
> >>loop through touching pages only once.
> >
> > Fix: don't try to swap over the network. Yes, there may be some scenarios
> > where people have no local storage, but it's reasonable to expect anyone
> > who is using Linux as an "enterprise storage platform" to stick a local
> > disk on the thing for swap.
> >
> > That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
> > that, will it not?
>
> Hi Andrew,
>
> What happened to the case where we just fill memory full of dirty file
> pages backed by a remote disk?
>

Processes which are dirtying those pages throttle at
/proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill"
memory with dirty pages. If the amount of physical memory which is dirty
exceeds 40%: bug.
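
Schematically (a simplified sketch of the balance_dirty_pages() logic,
not the exact mm/page-writeback.c code; the helpers are illustrative):

#include <linux/writeback.h>    /* declares vm_dirty_ratio */

extern unsigned long count_dirty_pages(void);   /* illustrative */
extern void write_back_some_pages(void);        /* illustrative */

/* A task caught dirtying pages past the threshold must do writeback
 * itself before it may continue dirtying. */
static void balance_dirty_sketch(unsigned long total_pages)
{
        unsigned long limit = total_pages * vm_dirty_ratio / 100;

        while (count_dirty_pages() > limit)
                write_back_some_pages();
}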

2006-08-17 18:02:13

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
>>>Just for clarification - it will be completely impossible to log in using
>>>openssh or some other privilege separation protocol to the machine due
>>>to the nature of unix sockets. So you will be unable to manage your
>>>storage system just because it is in OOM - it is not what is expected
>>>from a reliable system.
>>
>>The system is not OOM, it is in reclaim, a transient condition that will be
>>resolved in normal course by IO progress. However you raise an excellent
>>point: if there is any remote management that we absolutely require to be
>>available while remote IO is interrupted - manual failover for example -
>>then we must supply a means of carrying out such remote administration that
>>is guaranteed not to deadlock on a normal mode memory request. This ends up
>>as a new network stack feature, I think, and probably a theoretical one for
>>the time being, since we don't actually know of any such mandatory login
>>that must be carried out while remote disk IO is suspended.
>
> That is why you are not allowed to depend on the main system allocator's
> problems. That is why the network can have its own allocator.

Please check your assumptions. Do you understand how PF_MEMALLOC works,
and where it gets its reserve memory from?

>>>>But really, if you expect to run reliable block IO to Zanzibar over an ssh
>>>>tunnel through a firewall, then you might also consider taking up bungee
>>>>jumping with the cord tied to your neck.
>>>Just pure openssh for control connection (admin should be able to
>>>log in).
>>
>>And the admin will be able to, but in the cluster stack itself we don't
>>bless such stupidity as emailing an admin to ask for a login in order to
>>break a tie over which node should take charge of DLM recovery.
>
> No, you can't, since openssh and any other privilege separation
> mechanisms use additional sockets to transfer data between their parts;
> unix sockets frequently require page-sized allocations, which will end
> up as 8k allocations in slab.
> You will not be able to log in using openssh.

*** The system is not OOM, it is in reclaim, a transient condition ***

Which Peter already told you! Please, let's get our facts straight,
then we can look intelligently at whether your work is appropriate to
solve the block IO network starvation problem that Peter and I are
working on.

Regards,

Daniel

2006-08-17 18:56:10

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Thu, Aug 17, 2006 at 11:01:52AM -0700, Daniel Phillips ([email protected]) wrote:
> *** The system is not OOM, it is in reclaim, a transient condition ***

It does not matter what you call a condition in which not every user
can get memory. And actually no one can know in advance how long it
will last.

> Which Peter already told you! Please, let's get our facts straight,
> then we can look intelligently at whether your work is appropriate to
> solve the block IO network starvation problem that Peter and I are
> working on.

I gave openssh as an example of a situation where the system does not
know in advance which sockets must be marked as critical.
OpenSSH works with network and unix sockets in parallel, so you need to
hack the openssh code to allow it to use the reserve when there is
not enough memory. Actually all sockets must be able to get data, since
the kernel cannot differentiate between telnetd and a process which will
receive an ack for a swapped page or other significant information.
So the network must behave separately from the main allocator during that
period; but since the reserve may be unfilled, too small, or otherwise
unavailable, it must be preallocated far in advance and should be quite
big, and then why should the network use it at all, when being separated
from main allocations solves the problem?

I do not argue that your approach is bad or does not solve the problem,
I'm just trying to show that further evolution of that idea eventually
ends up as a separate allocator (most robust systems separate such
operations), which can improve things in a lot of other areas too.


> Regards,
>
> Daniel

--
Evgeniy Polyakov

2006-08-17 19:17:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Thu, 2006-08-17 at 22:42 +0400, Evgeniy Polyakov wrote:
> On Thu, Aug 17, 2006 at 11:01:52AM -0700, Daniel Phillips ([email protected]) wrote:
> > *** The system is not OOM, it is in reclaim, a transient condition ***
>
> It does not matter what you call a condition in which not every user
> can get memory. And actually no one can know in advance how long it
> will last.

Well, we have a direct influence here, we're working on code-paths that
are called from reclaim. If we stall, the box is dead.

> > Which Peter already told you! Please, let's get our facts straight,
> > then we can look intelligently at whether your work is appropriate to
> > solve the block IO network starvation problem that Peter and I are
> > working on.
>
> I gave openssh as an example of a situation where the system does not
> know in advance which sockets must be marked as critical.
> OpenSSH works with network and unix sockets in parallel, so you need to
> hack the openssh code to allow it to use the reserve when there is
> not enough memory.

OpenSSH or any other user-space program will never ever have the problem
we are trying to solve. Nor could it be fixed the way we are solving it;
user-space programs can be stalled indefinitely. We are concerned with
kernel services, and the continued well-being of the kernel, not
user-space. (well, therefore indirectly also user-space, of course)

> Actually all sockets must be able to get data, since
> the kernel cannot differentiate between telnetd and a process which will
> receive an ack for a swapped page or other significant information.

Oh, but it does, the kernel itself controls those sockets: NBD / iSCSI
and AoE are all kernel services, not user-space. And it is the core of
our work to provide this information to the kernel; to distinguish these
few critical sockets.

> So the network must behave separately from the main allocator during that
> period; but since the reserve may be unfilled, too small, or otherwise
> unavailable, it must be preallocated far in advance and should be quite
> big, and then why should the network use it at all, when being separated
> from main allocations solves the problem?

You still need to guarantee data-paths to these services, and you need
to make absolutely sure that your last bit of memory is used to service
these critical sockets, not some random blocked user-space process.

You cannot pre-allocate enough memory _ever_ to satisfy the total
capacity of the network stack. You _can_ allot a certain amount of
memory to the network stack (avoiding DoS), and drop packets once you
exceed that. But still, you need to make sure these few critical
_kernel_ services get their data.
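
The policy itself is tiny. (Illustrative sketch: sk_is_memalloc()
stands in for the patch set's per-socket test, it is not a real API.)

#include <net/sock.h>

extern int sk_is_memalloc(struct sock *sk);     /* hypothetical */

/* In normal times accept everything up to the usual limits; under
 * memory pressure, spend the reserve only on VM-critical sockets. */
static int should_accept_skb(struct sock *sk, int under_pressure)
{
        if (!under_pressure)
                return 1;
        return sk_is_memalloc(sk);
}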

> I do not argue that your approach is bad or does not solve the problem,
> I'm just trying to show that further evolution of that idea eventually
> ends up as a separate allocator (most robust systems separate such
> operations), which can improve things in a lot of other areas too.

Not a separate allocator per se, but a separate socket group: they are
serviced by the kernel, they will never refuse to process data, and it
is critical for the continued well-being of your kernel that they get
their data.

Also, I do not think people would like it if say 100M of their 1G system
just disappears, never to be used again for e.g. page-cache, in periods
of low network traffic.



2006-08-17 19:53:28

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Thu, Aug 17, 2006 at 09:15:14PM +0200, Peter Zijlstra ([email protected]) wrote:
> > I gave openssh as an example of a situation where the system does not
> > know in advance which sockets must be marked as critical.
> > OpenSSH works with network and unix sockets in parallel, so you need to
> > hack the openssh code to allow it to use the reserve when there is
> > not enough memory.
>
> OpenSSH or any other user-space program will never ever have the problem
> we are trying to solve. Nor could it be fixed the way we are solving it;
> user-space programs can be stalled indefinitely. We are concerned with
> kernel services, and the continued well-being of the kernel, not
> user-space. (well, therefore indirectly also user-space, of course)

You limit your system here - it is possible that userspace must send
some command when a kernel agent requires some negotiation.
And even for those sockets it is possible that ARP requests and/or ICMP
processing will be required.

> > Actually all sockets must be able to get data, since
> > the kernel cannot differentiate between telnetd and a process which will
> > receive an ack for a swapped page or other significant information.
>
> Oh, but it does, the kernel itself controls those sockets: NBD / iSCSI
> and AoE are all kernel services, not user-space. And it is the core of
> our work to provide this information to the kernel; to distinguish these
> few critical sockets.

As I stated above, it is not enough.
And even if you cover all the kernel-only network allocations used by
your selected datapath, the problem is still there - the admin is
unable to connect, although that can be a critical connection too.

> > So the network must behave separately from the main allocator during that
> > period; but since the reserve may be unfilled, too small, or otherwise
> > unavailable, it must be preallocated far in advance and should be quite
> > big, and then why should the network use it at all, when being separated
> > from main allocations solves the problem?
>
> You still need to guarantee data-paths to these services, and you need
> to make absolutely sure that your last bit of memory is used to service
> these critical sockets, not some random blocked user-space process.
>
> You cannot pre-allocate enough memory _ever_ to satisfy the total
> capacity of the network stack. You _can_ allot a certain amount of
> memory to the network stack (avoiding DoS), and drop packets once you
> exceed that. But still, you need to make sure these few critical
> _kernel_ services get their data.

Feel free to implement any receiving policy inside a _separate_ allocator
to meet your needs, but if the allocator depends on the main system's
memory conditions it is always possible that it will fail to make
forward progress.

> >I do not argue that your approach is bad or does not solve the problem,
> >I'm just trying to show that further evolution of that idea eventually
> >ends up as a separate allocator (most robust systems separate such
> >operations), which can improve things in a lot of other areas too.
>
> >>Not a separate allocator per se, but a separate socket group: they are
> serviced by the kernel, they will never refuse to process data, and it
> is critical for the continued well-being of your kernel that they get
> their data.

You do not know in advance which sockets must be separated (only in
the simplest situation, as with NBD, is it a kernelspace-only
matter);
you cannot solve the problem of ARP/ICMP/route changes and other control
messages, netfilter, IPsec and compression, which can still happen in your
setup;
if something goes wrong and receiving requires an additional
allocation from the network datapath, the system is dead;
and these strict conditions do not allow flexible control over possible
connections, nor the creation of additional connections.

As far as I understand, it is OK to solve the problem for your exact
case, where the setup avoids all or most of the above issues, but in
general some other mechanism should be used, one which does not suffer
from those issues or allows them to be controlled.

>>Also, I do not think people would like it if say 100M of their 1G system
>>just disappears, never to be used again for e.g. page-cache, in periods
>>of low network traffic.

Just for clarification: the network tree allocator starts with 512kb
and then increases its cache size when required. The default value can
of course be changed.

--
Evgeniy Polyakov

2006-08-17 23:26:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

Evgeniy Polyakov wrote:
> On Thu, Aug 17, 2006 at 09:15:14PM +0200, Peter Zijlstra ([email protected]) wrote:
>
>>>I gave openssh as an example of a situation where the system does not
>>>know in advance which sockets must be marked as critical.
>>>OpenSSH works with network and unix sockets in parallel, so you need to
>>>hack the openssh code to allow it to use the reserve when there is
>>>not enough memory.
>>
>>OpenSSH or any other user-space program will never ever have the problem
>>we are trying to solve. Nor could it be fixed the way we are solving it;
>>user-space programs can be stalled indefinitely. We are concerned with
>>kernel services, and the continued well-being of the kernel, not
>>user-space. (well, therefore indirectly also user-space, of course)
>
>
> You limit your system here - it is possible that userspace must send
> some command when a kernel agent requires some negotiation.
> And even for those sockets it is possible that ARP requests and/or ICMP
> processing will be required.
>
>
>>> Actually all sockets must be able to get data, since
>>>the kernel cannot differentiate between telnetd and a process which will
>>>receive an ack for a swapped page or other significant information.
>>
>>Oh, but it does, the kernel itself controls those sockets: NBD / iSCSI
>>and AoE are all kernel services, not user-space. And it is the core of
>>our work to provide this information to the kernel; to distinguish these
>>few critical sockets.
>
>
> As I stated above, it is not enough.
> And even if you cover all the kernel-only network allocations used by
> your selected datapath, the problem is still there - the admin is
> unable to connect, although that can be a critical connection too.
>
>
>>>So the network must behave separately from the main allocator during that
>>>period; but since the reserve may be unfilled, too small, or otherwise
>>>unavailable, it must be preallocated far in advance and should be quite
>>>big, and then why should the network use it at all, when being separated
>>>from main allocations solves the problem?
>>
>>You still need to guarantee data-paths to these services, and you need
>>to make absolutely sure that your last bit of memory is used to service
>>these critical sockets, not some random blocked user-space process.
>>
>>You cannot pre-allocate enough memory _ever_ to satisfy the total
>>capacity of the network stack. You _can_ allot a certain amount of
>>memory to the network stack (avoiding DoS), and drop packets once you
>>exceed that. But still, you need to make sure these few critical
>>_kernel_ services get their data.
>
> Feel free to implement any receiving policy inside a _separate_ allocator
> to meet your needs, but if the allocator depends on the main system's
> memory conditions it is always possible that it will fail to make
> forward progress.

Wrong. Our main allocator has a special reserve that can be accessed
only by a task that has its PF_MEMALLOC flag set. This reserve exists
in order to guarantee forward progress in just such situations as the
network runs into when it is trying to receive responses from a remote
disk. Anything else is a bug.
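
The usage pattern is a couple of lines around the allocation (a minimal
sketch of how a task dips into the reserve; compare the PF_MEMALLOC
handling in mm/page_alloc.c):

#include <linux/sched.h>
#include <linux/slab.h>

/* While PF_MEMALLOC is set, the page allocator may satisfy this
 * task's requests from the emergency reserve rather than recursing
 * into reclaim. */
static void *alloc_from_reserve(size_t size, gfp_t gfp)
{
        unsigned long pflags = current->flags;
        void *p;

        current->flags |= PF_MEMALLOC;  /* unlock the emergency reserve */
        p = kmalloc(size, gfp);
        if (!(pflags & PF_MEMALLOC))    /* restore the caller's state */
                current->flags &= ~PF_MEMALLOC;
        return p;
}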

>>>I do not argue that your approach is bad or does not solve the problem,
>>>I'm just trying to show that further evolution of that idea eventually
>>>ends up as a separate allocator (most robust systems separate such
>>>operations), which can improve things in a lot of other areas too.
>>
>>Not a separate allocator per se, but a separate socket group: they are
>>serviced by the kernel, they will never refuse to process data, and it
>>is critical for the continued well-being of your kernel that they get
>>their data.

The memalloc reserve is indeed a separate reserve, however it is a
reserve that already exists, and you are busy creating another separate
reserve to further partition memory. Partitioning memory is not the
direction we should be going, we should be trying to unify our reserves
wherever possible, and find ways such as Andrew and others propose to
implement "reserve on demand", or in other words, take advantage of
"easily freeable" pages.

If your allocation code is so much more efficient than slab then why
don't you fix slab instead of replicating functionality that already
exists elsewhere in the system, and has since day one?

> You do not know in advance which sockets must be separated (only in
> the simplest situation, as with NBD, is it a kernelspace-only
> matter);

Yes we do, they are exactly those sockets that lie in the block IO path.
The VM cannot deadlock on any others.
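
Schematically, the flagging is a one-time act at socket setup. (Sketch
only: sk_set_memalloc() is a stand-in name for the patch set's
per-socket flag, not the exact interface.)

#include <net/sock.h>

extern void sk_set_memalloc(struct sock *sk);   /* hypothetical */

/* A kernel block IO service (NBD, iSCSI, AoE) marks its transport
 * socket so the receive path knows it may draw on the reserve. */
static void mark_blockio_socket(struct sock *sk)
{
        sk_set_memalloc(sk);
}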

> you cannot solve the problem of ARP/ICMP/route changes and other control
> messages, netfilter, IPsec and compression, which can still happen in your
> setup;

If you bothered to read the patches you would know that ICMP is indeed
handled. ARP I don't think so, we may need to do that since ARP can
believably be required for remote disk interface failover. Anybody
who designs ssh into remote disk failover is an idiot. Ssh for
configuration, monitoring and administration is fine. Ssh for fencing
or whatever is just plain stupid, unless the nodes running both server
and client are not allowed to mount the remote disk.

> if something goes wrong and receiving requires an additional
> allocation from the network datapath, the system is dead;
> and these strict conditions do not allow flexible control over possible
> connections, nor the creation of additional connections.

You know that how?

>>Also, I do not think people would like it if say 100M of their 1G system
>>just disappears, never to be used again for e.g. page-cache, in periods
>>of low network traffic.
>
> Just for clarification: the network tree allocator starts with 512kb
> and then increases its cache size when required. The default value can
> of course be changed.

Great. Now why does the network layer need its own, invented-in-netland
allocator? Why can't everybody use your allocator if it is better?

Also, please don't get the idea that your allocator by itself solves the
block IO receive starvation problem. At the very least you need to do
something about network traffic that is unrelated to forward progress of
memory writeout, yet can starve the memory writeout. Oh wait, our patch
set already does that.

Regards,

Daniel

2006-08-17 23:56:50

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Andrew Morton wrote:
> Daniel Phillips <[email protected]> wrote:
>>What happened to the case where we just fill memory full of dirty file
>>pages backed by a remote disk?
>
> Processes which are dirtying those pages throttle at
> /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill"
> memory with dirty pages. If the amount of physical memory which is dirty
> exceeds 40%: bug.

Hi Andrew,

So we make 400 MB of a 1 GB system unavailable for write caching just to
get around the network receive starvation issue?

What happens if some in-kernel user grabs 68% of kernel memory to do some
very important thing, does this starvation avoidance scheme still work?

Regards,

Daniel

2006-08-18 00:27:00

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Daniel Phillips wrote:
> Andrew Morton wrote:
>> Daniel Phillips <[email protected]> wrote:
>>> What happened to the case where we just fill memory full of dirty file
>>> pages backed by a remote disk?
>>
>> Processes which are dirtying those pages throttle at
>> /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to
>> "fill"
>> memory with dirty pages. If the amount of physical memory which is dirty
>> exceeds 40%: bug.

> So we make 400 MB of a 1 GB system unavailable for write caching just to
> get around the network receive starvation issue?
>
> What happens if some in-kernel user grabs 68% of kernel memory to do some
> very important thing, does this starvation avoidance scheme still work?

Also think about eg. scientific calculations, or anonymous memory.

People want to be able to use a larger percentage of their memory
for dirty data, without swapping...

--
What is important? What you want to be true, or what is true?

2006-08-18 00:39:09

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Daniel Phillips wrote:
> Andrew Morton wrote:
>> Processes which are dirtying those pages throttle at
>> /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill"
>> memory with dirty pages. If the amount of physical memory which is dirty
>> exceeds 40%: bug.
>
> So we make 400 MB of a 1 GB system unavailable for write caching just to
> get around the network receive starvation issue?

Excuse me, 600 MB of a 1 GB system :-/

Regards,

Daniel

2006-08-18 01:14:48

by NeilBrown

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Thursday August 17, [email protected] wrote:
> Andrew Morton wrote:
> > Daniel Phillips <[email protected]> wrote:
> >>What happened to the case where we just fill memory full of dirty file
> >>pages backed by a remote disk?
> >
> > Processes which are dirtying those pages throttle at
> > /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill"
> > memory with dirty pages. If the amount of physical memory which is dirty
> > exceeds 40%: bug.
>
> Hi Andrew,
>
> So we make 400 MB of a 1 GB system unavailable for write caching just to
> get around the network receive starvation issue?

No. We make it unavailable for write caching so it is available for
other important things like cached clean pages and executables etc.
You have to start throttling somewhere or you get very bad
behaviour. 40% seems a good number, but it is tunable.
The fact that it helps with receive starvation is just a bonus.

>
> What happens if some in-kernel user grabs 68% of kernel memory to do some
> very important thing, does this starvation avoidance scheme still work?

That is a very good question. Is memlocked memory throttled against
dirty pages, and does it decrease the space available to the 40%
calculation? I don't know. I guess I could look at the code...

get_dirty_limits in page_writeback caps 'dirty_ratio' (40 by default,
hence the 40%) at unmapped_ratio/2.
So yes. If 68% is mapped and locked (I assume that is the situation
you are referring to) that leaves 32% unlocked, so the 40% above is
reduced to 16% and you should still have 160Meg of breathing space.
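
In code terms the clamp is roughly (a simplified sketch of the
get_dirty_limits() arithmetic, not the exact mm/page-writeback.c code):

#include <linux/writeback.h>    /* declares vm_dirty_ratio */

static int effective_dirty_ratio(unsigned long total_pages,
                                 unsigned long mapped_pages)
{
        int unmapped_ratio = 100 * (total_pages - mapped_pages) / total_pages;
        int ratio = vm_dirty_ratio;     /* default 40 */

        if (ratio > unmapped_ratio / 2)
                ratio = unmapped_ratio / 2;
        return ratio;   /* 68% mapped -> 32% unmapped -> 16% */
}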

NeilBrown

2006-08-18 06:12:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Thu, 17 Aug 2006 16:53:01 -0700
Daniel Phillips <[email protected]> wrote:

> Andrew Morton wrote:
> > Daniel Phillips <[email protected]> wrote:
> >>What happened to the case where we just fill memory full of dirty file
> >>pages backed by a remote disk?
> >
> > Processes which are dirtying those pages throttle at
> > /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill"
> > memory with dirty pages. If the amount of physical memory which is dirty
> > exceeds 40%: bug.
>
> Hi Andrew,
>
> So we make 400 MB of a 1 GB system

by default - it's runtime configurable.

> unavailable for write caching just to
> get around the network receive starvation issue?

No, it's mainly to avoid latency: to prevent tasks which want to allocate
pages from getting stuck behind writeback.

> What happens if some in-kernel user grabs 68% of kernel memory to do some
> very important thing, does this starvation avoidance scheme still work?

Well something has to give way. The process might get swapped out a bit,
or it might stall in the page allocator because of all the dirty memory.

2006-08-18 07:19:26

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

On Thu, Aug 17, 2006 at 04:24:26PM -0700, Daniel Phillips ([email protected]) wrote:
> >Feel free to implement any receiving policy inside a _separate_ allocator
> >to meet your needs, but if the allocator depends on the main system's
> >memory conditions it is always possible that it will fail to make
> >forward progress.
>
> Wrong. Our main allocator has a special reserve that can be accessed
> only by a task that has its PF_MEMALLOC flag set. This reserve exists
> in order to guarantee forward progress in just such situations as the
> network runs into when it is trying to receive responses from a remote
> disk. Anything else is a bug.

Ok, I see your point.
You have created a special fix for a special configuration.
In general it does not work.

> >>>I do not argue that your approach is bad or does not solve the problem,
> >>>I'm just trying to show that further evolution of that idea eventually
> >>>ends up as a separate allocator (most robust systems separate such
> >>>operations), which can improve things in a lot of other areas too.
> >>
> >>Not a separate allocator per se, but a separate socket group: they are
> >>serviced by the kernel, they will never refuse to process data, and it
> >>is critical for the continued well-being of your kernel that they get
> >>their data.
>
> The memalloc reserve is indeed a separate reserve, however it is a
> reserve that already exists, and you are busy creating another separate
> reserve to further partition memory. Partitioning memory is not the
> direction we should be going, we should be trying to unify our reserves
> wherever possible, and find ways such as Andrew and others propose to
> implement "reserve on demand", or in other words, take advantage of
> "easily freeable" pages.

Such an approach does not fix the problem.
Why does no one complain that there is privilege separation, while "we
can fix all existing applications"?
It is possible that there will not be any "easily freeable" pages, and
your special reserve will not be filled.

> If your allocation code is so much more efficient than slab then why
> don't you fix slab instead of replicating functionality that already
> exists elsewhere in the system, and has since day one?

No one needs an excuse to rewrite something.

> >You do not know in advance which sockets must be separated (only in
> >the simplest situation, as with NBD, is it a kernelspace-only
> >matter);
>
> Yes we do, they are exactly those sockets that lie in the block IO path.
> The VM cannot deadlock on any others.

There are some pieces of the world beyond NBD and iSCSI.

> >you cannot solve the problem of ARP/ICMP/route changes and other control
> >messages, netfilter, IPsec and compression, which can still happen in your
> >setup;
>
> If you bothered to read the patches you would know that ICMP is indeed
> handled. ARP I don't think so, we may need to do that since ARP can
> believably be required for remote disk interface failover. Anybody
> who designs ssh into remote disk failover is an idiot. Ssh for
> configuration, monitoring and administration is fine. Ssh for fencing
> or whatever is just plain stupid, unless the nodes running both server
> and client are not allowed to mount the remote disk.

I only remember that sockets with sk_memalloc set are handled - no ICMP,
no ARP. What about other control messages in a bonding or failover setup?

> >if something goes wrong and receiving requires an additional
> >allocation from the network datapath, the system is dead;
> >and these strict conditions do not allow flexible control over possible
> >connections, nor the creation of additional connections.
>
> You know that how?
>
> >>Also, I do not think people would like it if say 100M of their 1G system
> >>just disappears, never to be used again for e.g. page-cache, in periods
> >>of low network traffic.
> >
> >Just for clarification: the network tree allocator starts with 512kb
> >and then increases its cache size when required. The default value can
> >of course be changed.
>
> Great. Now why does the network layer need its own, invented-in-netland
> allocator? Why can't everybody use your allocator if it is better?

As far as I recall, I have already said several times that there is no
problem using that allocator in other places; MMU-less systems will
especially benefit from it (since it was designed for them too).

> Also, please don't get the idea that your allocator by itself solves the
> block IO receive starvation problem. At the very least you need to do
> something about network traffic that is unrelated to forward progress of
> memory writeout, yet can starve the memory writeout. Oh wait, our patch
> set already does that.

It is not my allocator as such that solves the problem, but the
situation in which the pools are separated!
And you are slowly going in that direction too (a global reserve instead
of per-socket reserves is the first step).

The problem is not solved when critical allocations depend on a reserve
which depends on the main system's conditions.

> Regards,
>
> Daniel

--
Evgeniy Polyakov

2006-08-18 21:25:52

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Andrew Morton wrote:
>Daniel Phillips wrote:
>>Andrew Morton wrote:
>
> ...it's runtime configurable.

So we default to "less than the best" because we are too lazy to fix the
network starvation issue properly? Maybe we don't really need a mempool for
struct bio either, isn't that one rather like the reserve pool we propose
for sk_buffs? (And for that matter, isn't a bio rather more like an sk_buff
than one would think from the code we have written?)

>>unavailable for write caching just to get around the network receive
>> starvation issue?
>
> No, it's mainly to avoid latency: to prevent tasks which want to allocate
> pages from getting stuck behind writeback.

This seems like a great help for some common loads, but a regression for
other not-so-rare loads. Regardless of whether or not marking 60% of memory
as unusable for write caching is a great idea, it is not a compelling reason
to fall for the hammer/nail trap.

>>What happens if some in-kernel user grabs 68% of kernel memory to do some
>>very important thing, does this starvation avoidance scheme still work?
>
> Well something has to give way. The process might get swapped out a bit,
> or it might stall in the page allocator because of all the dirty memory.

It is that "something has to give way" that makes me doubt the efficacy of
our global write throttling scheme as a solution for network receive memory
starvation. As it is, the thing that gives way may well be the network
stack. Avoiding this hazard via a little bit of simple code is the whole
point of our patch set.

Yes, we did waltz off into some not-so-simple code at one point, but after
discussing it, Peter and I agree that the simple approach actually works
well enough, so we just need to trim down the patch set accordingly and
thus prove that our approach really is nice, light and tight.

Specifically, we do not actually need any fancy sk_buff sub-allocator
(either ours or the one proposed by davem). We already got rid of the per
device reserve accounting in order to finesse away the sk_buff->dev oddity
and shorten the patch. So the next rev will be rather lighter than the
original.

Regards,

Daniel

2006-08-18 22:41:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Fri, 18 Aug 2006 14:22:07 -0700
Daniel Phillips <[email protected]> wrote:

> Andrew Morton wrote:
> >Daniel Phillips wrote:
> >>Andrew Morton wrote:
> >
> > ...it's runtime configurable.
>
> So we default to "less than the best" because we are too lazy to fix the
> network starvation issue properly? Maybe we don't really need a mempool for
> struct bio either, isn't that one rather like the reserve pool we propose
> for sk_buffs? (And for that matter, isn't a bio rather more like an sk_buff
> than one would think from the code we have written?)
>
> >>unavailable for write caching just to get around the network receive
> >> starvation issue?
> >
> > No, it's mainly to avoid latency: to prevent tasks which want to allocate
> > pages from getting stuck behind writeback.
>
> This seems like a great help for some common loads, but a regression for
> other not-so-rare loads. Regardless of whether or not marking 60% of memory
> as unusable for write caching is a great idea, it is not a compelling reason
> to fall for the hammer/nail trap.
>
> >>What happens if some in-kernel user grabs 68% of kernel memory to do some
> >>very important thing, does this starvation avoidance scheme still work?
> >
> > Well something has to give way. The process might get swapped out a bit,
> > or it might stall in the page allocator because of all the dirty memory.
>
> It is that "something has to give way" that makes me doubt the efficacy of
> our global write throttling scheme as a solution for network receive memory
> starvation. As it is, the thing that gives way may well be the network
> stack. Avoiding this hazard via a little bit of simple code is the whole
> point of our patch set.
>
> Yes, we did waltz off into some not-so-simple code at one point, but after
> discussing it, Peter and I agree that the simple approach actually works
> well enough, so we just need to trim down the patch set accordingly and
> thus prove that our approach really is nice, light and tight.
>
> Specifically, we do not actually need any fancy sk_buff sub-allocator
> (either ours or the one proposed by davem). We already got rid of the per
> device reserve accounting in order to finesse away the sk_buff->dev oddity
> and shorten the patch. So the next rev will be rather lighter than the
> original.
>

Sigh. Daniel, in my earlier emails I asked a number of questions regarding
whether existing facilities, queued patches or further slight kernel
changes could provide a sufficient solution to these problems. The answer
may well be "no". But diligence requires that we be able to prove that,
and handwaving doesn't provide that proof, does it?

2006-08-18 23:47:35

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Andrew Morton wrote:
> ...in my earlier emails I asked a number of questions regarding
> whether existing facilities, queued patches or further slight kernel
> changes could provide a sufficient solution to these problems. The answer
> may well be "no". But diligence requires that we be able to prove that,
> and handwaving doesn't provide that proof, does it?

Hi Andrew,

I missed those questions about queued patches. So far the closest we
have seen to anything relevant is Evgeniy's proposed network sub-allocator,
which just doesn't address the problem, as Peter and I have explained in
some detail, and the dirty page throttling machinery which is obviously
flawed in that it leaves a lot of room for corner cases that have to be
handled with more special purpose logic. For example, what about kernel
users that just go ahead and use lots of slab without getting involved
in the dirty/mapped page accounting? I don't know, maybe you will handle
that too, but then there will be another case that isn't handled, and so
on. The dirty page throttling approach is just plain fragile by nature.

We already know we need to do reserve accounting for struct bio, so what
is the conceptual difficulty in extending that reasoning to struct
sk_buff, which is very nearly the same kind of beastie, performing very
nearly the same kind of function, and suffering very nearly the same kind
of problems we had in the block layer before mingo's mempool machinery
arrived?
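
For comparison, the insurance a bio enjoys today amounts to a few lines
of standard mempool setup (illustrative sketch; the cache name and
sizes are made up, not the actual fs/bio.c code):

#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *example_cache;
static mempool_t *example_pool;

/* Keep a minimum number of objects in reserve so allocation can
 * always make forward progress, even when the page allocator has
 * nothing left to give. */
static int example_pool_init(void)
{
        example_cache = kmem_cache_create("example", 256, 0, 0, NULL, NULL);
        if (!example_cache)
                return -ENOMEM;
        example_pool = mempool_create(16, mempool_alloc_slab,
                                      mempool_free_slab, example_cache);
        return example_pool ? 0 : -ENOMEM;
}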

Did I miss some other queued patch that bears on this? And I am not sure
what handwaving you are referring to.

Note that we have not yet submitted this patch to the queue, just put
it up for comment. This patch set is actually getting more elegant as
various individual artists get their flames in. At some point it will
acquire a before-and-after test case as well, then we can realistically
compare it to the best of the alternate proposals.

Regards,

Daniel

2006-08-19 02:51:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Fri, 18 Aug 2006 16:44:01 -0700
Daniel Phillips <[email protected]> wrote:

> handwaving

- The mmap(MAP_SHARED)-the-whole-world scenario should be fixed by
mm-tracking-shared-dirty-pages.patch. Please test it and if you are
still able to demonstrate deadlocks, describe how, and why they
are occurring.

- We expect that the lots-of-dirty-anon-memory-over-swap-over-network
scenario might still cause deadlocks.

I assert that this can be solved by putting swap on local disks. Peter
asserts that this isn't acceptable due to disk unreliability. I point
out that local disk reliability can be increased via MD, all goes quiet.

A good exposition which helps us to understand whether and why a
significant proportion of the target user base still wishes to do
swap-over-network would be useful.

- Assuming that exposition is convincing, I would ask that you determine
at what level of /proc/sys/vm/min_free_kbytes the swap-over-network
deadlock is no longer demonstrable.

If there are any remaining demonstrable deadlock scenarios, please describe
how they were created and, if possible, perform some analysis of them for
us. sysrq-T and sysrq-M traces would be helpful.

Once we have this information in hand, we will be in a better position to
work out how to solve these problems.

Thanks.

2006-08-19 04:17:48

by Daniel Phillips

[permalink] [raw]
Subject: Re: Network receive stall avoidance (was [PATCH 2/9] deadlock prevention core)

Andrew Morton wrote:
>>handwaving
>
> - The mmap(MAP_SHARED)-the-whole-world scenario should be fixed by
> mm-tracking-shared-dirty-pages.patch. Please test it and if you are
> still able to demonstrate deadlocks, describe how, and why they
> are occurring.

OK, but please see "atomic 0 order allocation failure" below.

> - We expect that the lots-of-dirty-anon-memory-over-swap-over-network
> scenario might still cause deadlocks.
>
> I assert that this can be solved by putting swap on local disks. Peter
> asserts that this isn't acceptable due to disk unreliability. I point
> out that local disk reliability can be increased via MD, all goes quiet.
>
> A good exposition which helps us to understand whether and why a
> significant proportion of the target user base still wishes to do
> swap-over-network would be useful.

I don't care much about swap-over-network, Peter does. I care about
reliable remote disks in general, and reliable nbd in particular.

> - Assuming that exposition is convincing, I would ask that you determine
> at what level of /proc/sys/vm/min_free_kbytes the swap-over-network
> deadlock is no longer demonstrable.

Earlier, I wrote: "we need to actually fix the out of memory
deadlock/performance bug right now so that remote storage works
properly." It is actually far more likely that memory reclaim stuffups
will cause TCP to drop block IO packets here and there than that we
will hit some endless retransmit deadlock. But dropping packets and
thus causing TCP stalls during block writeout is a very bad thing too,
I do not think you could call a system that exhibits such behavior a
particularly reliable one.

So rather than just the word deadlock, let us add "or atomic 0 order
alloc failure during TCP receive" to the challenge. Fair?

On the client send side, Peter already showed an easily reproducible
deadlock which we avoid by using elevator-noop. The bug is still
there, it is just under a different rock now.

Regards,

Daniel

2006-08-19 07:35:41

by Andrew Morton

[permalink] [raw]
Subject: Re: Network receive stall avoidance (was [PATCH 2/9] deadlock prevention core)

On Fri, 18 Aug 2006 21:14:09 -0700
Daniel Phillips <[email protected]> wrote:

> So rather than just the word deadlock, let us add "or atomic 0 order
> alloc failure during TCP receive" to the challenge. Fair?

If it's significantly performance-affecting in any way which is at all likely to
affect anyone, sure.

You can get those warnings now with regular networking using e1000, due to
a combination of excessive default rx ringsize and incorrect VM tuning.

2006-08-19 15:08:32

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Andrew Morton wrote:

> - We expect that the lots-of-dirty-anon-memory-over-swap-over-network
> scenario might still cause deadlocks.
>
> I assert that this can be solved by putting swap on local disks. Peter
> asserts that this isn't acceptable due to disk unreliability. I point
> out that local disk reliability can be increased via MD, all goes quiet.
>
> A good exposition which helps us to understand whether and why a
> significant proportion of the target user base still wishes to do
> swap-over-network would be useful.

You cannot put disks in many models of blade servers.

At all.

--
What is important? What you want to be true, or what is true?

2006-08-19 16:53:58

by Ray Lee

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On 8/18/06, Andrew Morton <[email protected]> wrote:
> I assert that this can be solved by putting swap on local disks. Peter
> asserts that this isn't acceptable due to disk unreliability. I point
> out that local disk reliability can be increased via MD, all goes quiet.
>
> A good exposition which helps us to understand whether and why a
> significant proportion of the target user base still wishes to do
> swap-over-network would be useful.

Adding a hard drive adds $low per system, another failure point, and
more importantly ~3-10 Watts which then has to be paid for twice (once
to power it, again to cool it). For a hundred seats, that's
significant. For 500, it's ranging toward fully painful.

I'm in the process of designing the next upgrade for a VoIP call
center, and we want to go entirely diskless in the agent systems. We'd
also rather not swap over the network, but 'swap is as swap does.'

That said, it in no way invalidates using /proc/sys/vm/min_free_kbytes...

Ray

2006-08-20 01:33:39

by Andre Tomt

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Rik van Riel wrote:
> Andrew Morton wrote:
>
>> - We expect that the lots-of-dirty-anon-memory-over-swap-over-network
>> scenario might still cause deadlocks.
>> I assert that this can be solved by putting swap on local disks.
>> Peter
>> asserts that this isn't acceptable due to disk unreliability. I point
>> out that local disk reliability can be increased via MD, all goes
>> quiet.
>>
>> A good exposition which helps us to understand whether and why a
>> significant proportion of the target user base still wishes to do
>> swap-over-network would be useful.
>
> You cannot put disks in many models of blade servers.
>
> At all.

Or many thin clients in general. They are used in quite a few schools
over here, running Linux.
Some of them do in fact have space for disks, but disks add costs
(heat, power, replacing failed drives).

2006-08-21 13:27:55

by Philip R Auld

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Hi Andrew,

Rumor has it that on Fri, Aug 18, 2006 at 07:44:35PM -0700 Andrew Morton said:
> On Fri, 18 Aug 2006 16:44:01 -0700
> Daniel Phillips <[email protected]> wrote:
>
> - We expect that the lots-of-dirty-anon-memory-over-swap-over-network
> scenario might still cause deadlocks.
>
> I assert that this can be solved by putting swap on local disks. Peter
> asserts that this isn't acceptable due to disk unreliability. I point
> out that local disk reliability can be increased via MD, all goes quiet.

Putting swap on local disks really messes up the concept of stateless
servers. I suppose you can do some sort of swap encryption, but
otherwise you need to scrub the swap partition on boot if you
re-purpose the hardware. You also then need to do hardware
configuration to make sure the local disks are all set up the
same way across all server platforms so the common images can
boot.

Please don't require a hardware solution to a software problem.

>
> A good exposition which helps us to understand whether and why a
> significant proportion of the target user base still wishes to do
> swap-over-network would be useful.
>

I can't claim to represent a significant proportion of the target
user base. However, stateless hardware is a powerful and useful
model.


Cheers,

Phil

--
Philip R. Auld, Ph.D. Egenera, Inc.
Software Architect 165 Forest St.
(508) 858-2628 Marlboro, MA 01752

2006-08-21 13:36:33

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

On Fri, Aug 18 2006, Daniel Phillips wrote:
> nearly the same kind of function, and suffering very nearly the same kind
> of problems we had in the block layer before mingo's mempool machinery
> arrived?

Correction, the block layer wasn't buggy (eg deadlock prone) before
mempool; mempool was merely an abstraction that allowed us to move this
code out of bio.c, since it was apparent that it had other possible
users as well.

--
Jens Axboe

2006-08-25 10:47:57

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/9] deadlock prevention core

Hi!

> > - We expect that the lots-of-dirty-anon-memory-over-swap-over-network
> > scenario might still cause deadlocks.
> >
> > I assert that this can be solved by putting swap on local disks. Peter
> > asserts that this isn't acceptable due to disk unreliability. I point
> > out that local disk reliability can be increased via MD, all goes quiet.
>
> Putting swap on local disks really messes up the concept of stateless
> servers. I suppose you can do some sort of swap encryption, but
> otherwise you need to scrub the swap partition on boot if you
> re-purpose the hardware. You also then need to do hardware
> configuration to make sure the local disks are all setup the
> same way across all server platforms so the common images can
> boot.

We should really encrypt swap with a random key generated at boot, for
the whole machine. I believe it is possible (with some non-trivial
setup) today, but it would be nice to do it automagically.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html