2014-01-14 00:46:39

by Dan Williams

Subject: [PATCH v3 0/4] net_dma removal, and dma debug extension

Follow-up patches to commit 77873803363c ("net_dma: mark broken") to
remove the net_dma bits and provide debug infrastructure to flag other
get_user_pages() vs DMA instances that might violate the DMA API.
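
For illustration only (nothing below is part of this series and the
function name is made up): the idea behind the assertion added in patch
4 is that any path where the cpu touches a page that may still be
DMA-mapped can call debug_dma_assert_idle(), which is a no-op unless
CONFIG_DMA_API_DEBUG is enabled:

	#include <linux/dma-debug.h>
	#include <linux/mm.h>
	#include <linux/string.h>

	/* hypothetical example: cpu write to a page that a DMA engine may
	 * still own; warns if the page is still in the dma-active set
	 */
	static void example_cpu_write(struct page *page, void *vaddr, size_t len)
	{
		debug_dma_assert_idle(page);
		memset(vaddr, 0, len);
	}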

Will take this through the dmaengine tree once acked. Just looking for
an ack on patch 2 at this point.

Changes since v2 [1]:

1/ Keep the 'tcp_' prefix on cleanup_rbuf() in patch 3

2/ Fix up patch 4 per Andrew's comments: add documentation and drop
CONFIG_DMA_VS_CPU_DEBUG. [2]

Changes since v1 [3]:

1/ The net_dma removal patch has been expanded to revert other
net_dma-induced changes.

2/ Updated the debug_dma_assert_idle() API to be gated on
CONFIG_DMA_VS_CPU_DEBUG

[1]: http://marc.info/?l=linux-netdev&m=138929837129496&w=2
[2]: http://marc.info/?l=linux-netdev&m=138931431901890&w=2
[3]: http://marc.info/?l=linux-netdev&m=138732574814049&w=2

---

Dan Williams (4):
net_dma: simple removal
net_dma: revert 'copied_early'
net: make tcp_cleanup_rbuf private
dma debug: introduce debug_dma_assert_idle()


Documentation/ABI/removed/net_dma | 8 +
Documentation/networking/ip-sysctl.txt | 6 -
drivers/dma/Kconfig | 12 -
drivers/dma/Makefile | 1
drivers/dma/dmaengine.c | 104 ------------
drivers/dma/ioat/dma.c | 1
drivers/dma/ioat/dma.h | 7 -
drivers/dma/ioat/dma_v2.c | 1
drivers/dma/ioat/dma_v3.c | 1
drivers/dma/iovlock.c | 280 --------------------------------
include/linux/dma-debug.h | 6 +
include/linux/dmaengine.h | 22 ---
include/linux/skbuff.h | 8 -
include/linux/tcp.h | 8 -
include/net/netdma.h | 32 ----
include/net/sock.h | 19 --
include/net/tcp.h | 9 -
kernel/sysctl_binary.c | 1
lib/Kconfig.debug | 12 +
lib/dma-debug.c | 169 ++++++++++++++++++-
mm/memory.c | 3
net/core/Makefile | 1
net/core/dev.c | 10 -
net/core/sock.c | 6 -
net/core/user_dma.c | 131 ---------------
net/dccp/proto.c | 4
net/ipv4/sysctl_net_ipv4.c | 9 -
net/ipv4/tcp.c | 149 ++---------------
net/ipv4/tcp_input.c | 83 +--------
net/ipv4/tcp_ipv4.c | 18 --
net/ipv6/tcp_ipv6.c | 13 -
net/llc/af_llc.c | 10 +
32 files changed, 219 insertions(+), 925 deletions(-)
create mode 100644 Documentation/ABI/removed/net_dma
delete mode 100644 drivers/dma/iovlock.c
delete mode 100644 include/net/netdma.h
delete mode 100644 net/core/user_dma.c


2014-01-14 00:46:52

by Dan Williams

Subject: [PATCH v3 1/4] net_dma: simple removal

Per commit 77873803363c ("net_dma: mark broken"), net_dma is no longer
used and there is no plan to fix it.

This is the mechanical removal of the code inside CONFIG_NET_DMA ifdef
guards. Reverting the remainder of the net_dma-induced changes is
deferred to subsequent patches.

Cc: Dave Jiang <[email protected]>
Cc: Vinod Koul <[email protected]>
Cc: David Whipple <[email protected]>
Cc: Alexander Duyck <[email protected]>
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---

No changes since v2


Documentation/ABI/removed/net_dma | 8 +
Documentation/networking/ip-sysctl.txt | 6 -
drivers/dma/Kconfig | 12 -
drivers/dma/Makefile | 1
drivers/dma/dmaengine.c | 104 ------------
drivers/dma/ioat/dma.c | 1
drivers/dma/ioat/dma.h | 7 -
drivers/dma/ioat/dma_v2.c | 1
drivers/dma/ioat/dma_v3.c | 1
drivers/dma/iovlock.c | 280 --------------------------------
include/linux/dmaengine.h | 22 ---
include/linux/skbuff.h | 8 -
include/linux/tcp.h | 8 -
include/net/netdma.h | 32 ----
include/net/sock.h | 19 --
include/net/tcp.h | 8 -
kernel/sysctl_binary.c | 1
net/core/Makefile | 1
net/core/dev.c | 10 -
net/core/sock.c | 6 -
net/core/user_dma.c | 131 ---------------
net/dccp/proto.c | 4
net/ipv4/sysctl_net_ipv4.c | 9 -
net/ipv4/tcp.c | 147 ++---------------
net/ipv4/tcp_input.c | 61 -------
net/ipv4/tcp_ipv4.c | 18 --
net/ipv6/tcp_ipv6.c | 13 -
net/llc/af_llc.c | 10 +
28 files changed, 35 insertions(+), 894 deletions(-)
create mode 100644 Documentation/ABI/removed/net_dma
delete mode 100644 drivers/dma/iovlock.c
delete mode 100644 include/net/netdma.h
delete mode 100644 net/core/user_dma.c

diff --git a/Documentation/ABI/removed/net_dma b/Documentation/ABI/removed/net_dma
new file mode 100644
index 000000000000..a173aecc2f18
--- /dev/null
+++ b/Documentation/ABI/removed/net_dma
@@ -0,0 +1,8 @@
+What: tcp_dma_copybreak sysctl
+Date: Removed in kernel v3.13
+Contact: Dan Williams <[email protected]>
+Description:
+ Formerly the lower limit, in bytes, of the size of socket reads
+ that will be offloaded to a DMA copy engine. Removed due to
+ coherency issues of the cpu potentially touching the buffers
+ while dma is in flight.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 3c12d9a7ed00..bdd8a67f0be2 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -538,12 +538,6 @@ tcp_workaround_signed_windows - BOOLEAN
not receive a window scaling option from them.
Default: 0

-tcp_dma_copybreak - INTEGER
- Lower limit, in bytes, of the size of socket reads that will be
- offloaded to a DMA copy engine, if one is present in the system
- and CONFIG_NET_DMA is enabled.
- Default: 4096
-
tcp_thin_linear_timeouts - BOOLEAN
Enable dynamic triggering of linear timeouts for thin streams.
If set, a check is performed upon retransmission by timeout to
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index c823daaf9043..b24f13195272 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -351,18 +351,6 @@ config DMA_OF
comment "DMA Clients"
depends on DMA_ENGINE

-config NET_DMA
- bool "Network: TCP receive copy offload"
- depends on DMA_ENGINE && NET
- default (INTEL_IOATDMA || FSL_DMA)
- depends on BROKEN
- help
- This enables the use of DMA engines in the network stack to
- offload receive copy-to-user operations, freeing CPU cycles.
-
- Say Y here if you enabled INTEL_IOATDMA or FSL_DMA, otherwise
- say N.
-
config ASYNC_TX_DMA
bool "Async_tx: Offload support for the async_tx api"
depends on DMA_ENGINE
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 0ce2da97e429..024b008a25de 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -6,7 +6,6 @@ obj-$(CONFIG_DMA_VIRTUAL_CHANNELS) += virt-dma.o
obj-$(CONFIG_DMA_ACPI) += acpi-dma.o
obj-$(CONFIG_DMA_OF) += of-dma.o

-obj-$(CONFIG_NET_DMA) += iovlock.o
obj-$(CONFIG_INTEL_MID_DMAC) += intel_mid_dma.o
obj-$(CONFIG_DMATEST) += dmatest.o
obj-$(CONFIG_INTEL_IOATDMA) += ioat/
diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index ef63b9058f3c..d7f4f4e0d71f 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -1029,110 +1029,6 @@ dmaengine_get_unmap_data(struct device *dev, int nr, gfp_t flags)
}
EXPORT_SYMBOL(dmaengine_get_unmap_data);

-/**
- * dma_async_memcpy_pg_to_pg - offloaded copy from page to page
- * @chan: DMA channel to offload copy to
- * @dest_pg: destination page
- * @dest_off: offset in page to copy to
- * @src_pg: source page
- * @src_off: offset in page to copy from
- * @len: length
- *
- * Both @dest_page/@dest_off and @src_page/@src_off must be mappable to a bus
- * address according to the DMA mapping API rules for streaming mappings.
- * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
- * (kernel memory or locked user space pages).
- */
-dma_cookie_t
-dma_async_memcpy_pg_to_pg(struct dma_chan *chan, struct page *dest_pg,
- unsigned int dest_off, struct page *src_pg, unsigned int src_off,
- size_t len)
-{
- struct dma_device *dev = chan->device;
- struct dma_async_tx_descriptor *tx;
- struct dmaengine_unmap_data *unmap;
- dma_cookie_t cookie;
- unsigned long flags;
-
- unmap = dmaengine_get_unmap_data(dev->dev, 2, GFP_NOWAIT);
- if (!unmap)
- return -ENOMEM;
-
- unmap->to_cnt = 1;
- unmap->from_cnt = 1;
- unmap->addr[0] = dma_map_page(dev->dev, src_pg, src_off, len,
- DMA_TO_DEVICE);
- unmap->addr[1] = dma_map_page(dev->dev, dest_pg, dest_off, len,
- DMA_FROM_DEVICE);
- unmap->len = len;
- flags = DMA_CTRL_ACK;
- tx = dev->device_prep_dma_memcpy(chan, unmap->addr[1], unmap->addr[0],
- len, flags);
-
- if (!tx) {
- dmaengine_unmap_put(unmap);
- return -ENOMEM;
- }
-
- dma_set_unmap(tx, unmap);
- cookie = tx->tx_submit(tx);
- dmaengine_unmap_put(unmap);
-
- preempt_disable();
- __this_cpu_add(chan->local->bytes_transferred, len);
- __this_cpu_inc(chan->local->memcpy_count);
- preempt_enable();
-
- return cookie;
-}
-EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
-
-/**
- * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
- * @chan: DMA channel to offload copy to
- * @dest: destination address (virtual)
- * @src: source address (virtual)
- * @len: length
- *
- * Both @dest and @src must be mappable to a bus address according to the
- * DMA mapping API rules for streaming mappings.
- * Both @dest and @src must stay memory resident (kernel memory or locked
- * user space pages).
- */
-dma_cookie_t
-dma_async_memcpy_buf_to_buf(struct dma_chan *chan, void *dest,
- void *src, size_t len)
-{
- return dma_async_memcpy_pg_to_pg(chan, virt_to_page(dest),
- (unsigned long) dest & ~PAGE_MASK,
- virt_to_page(src),
- (unsigned long) src & ~PAGE_MASK, len);
-}
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
-
-/**
- * dma_async_memcpy_buf_to_pg - offloaded copy from address to page
- * @chan: DMA channel to offload copy to
- * @page: destination page
- * @offset: offset in page to copy to
- * @kdata: source address (virtual)
- * @len: length
- *
- * Both @page/@offset and @kdata must be mappable to a bus address according
- * to the DMA mapping API rules for streaming mappings.
- * Both @page/@offset and @kdata must stay memory resident (kernel memory or
- * locked user space pages)
- */
-dma_cookie_t
-dma_async_memcpy_buf_to_pg(struct dma_chan *chan, struct page *page,
- unsigned int offset, void *kdata, size_t len)
-{
- return dma_async_memcpy_pg_to_pg(chan, page, offset,
- virt_to_page(kdata),
- (unsigned long) kdata & ~PAGE_MASK, len);
-}
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
-
void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
struct dma_chan *chan)
{
diff --git a/drivers/dma/ioat/dma.c b/drivers/dma/ioat/dma.c
index 1a49c777607c..97fa394ca855 100644
--- a/drivers/dma/ioat/dma.c
+++ b/drivers/dma/ioat/dma.c
@@ -1175,7 +1175,6 @@ int ioat1_dma_probe(struct ioatdma_device *device, int dca)
err = ioat_probe(device);
if (err)
return err;
- ioat_set_tcp_copy_break(4096);
err = ioat_register(device);
if (err)
return err;
diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
index 11fb877ddca9..664ec9cbd651 100644
--- a/drivers/dma/ioat/dma.h
+++ b/drivers/dma/ioat/dma.h
@@ -214,13 +214,6 @@ __dump_desc_dbg(struct ioat_chan_common *chan, struct ioat_dma_descriptor *hw,
#define dump_desc_dbg(c, d) \
({ if (d) __dump_desc_dbg(&c->base, d->hw, &d->txd, desc_id(d)); 0; })

-static inline void ioat_set_tcp_copy_break(unsigned long copybreak)
-{
- #ifdef CONFIG_NET_DMA
- sysctl_tcp_dma_copybreak = copybreak;
- #endif
-}
-
static inline struct ioat_chan_common *
ioat_chan_by_index(struct ioatdma_device *device, int index)
{
diff --git a/drivers/dma/ioat/dma_v2.c b/drivers/dma/ioat/dma_v2.c
index 5d3affe7e976..31e8098e444f 100644
--- a/drivers/dma/ioat/dma_v2.c
+++ b/drivers/dma/ioat/dma_v2.c
@@ -900,7 +900,6 @@ int ioat2_dma_probe(struct ioatdma_device *device, int dca)
err = ioat_probe(device);
if (err)
return err;
- ioat_set_tcp_copy_break(2048);

list_for_each_entry(c, &dma->channels, device_node) {
chan = to_chan_common(c);
diff --git a/drivers/dma/ioat/dma_v3.c b/drivers/dma/ioat/dma_v3.c
index 820817e97e62..4bb81346bee2 100644
--- a/drivers/dma/ioat/dma_v3.c
+++ b/drivers/dma/ioat/dma_v3.c
@@ -1652,7 +1652,6 @@ int ioat3_dma_probe(struct ioatdma_device *device, int dca)
err = ioat_probe(device);
if (err)
return err;
- ioat_set_tcp_copy_break(262144);

list_for_each_entry(c, &dma->channels, device_node) {
chan = to_chan_common(c);
diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
deleted file mode 100644
index bb48a57c2fc1..000000000000
--- a/drivers/dma/iovlock.c
+++ /dev/null
@@ -1,280 +0,0 @@
-/*
- * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
- * Portions based on net/core/datagram.c and copyrighted by their authors.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or (at your option)
- * any later version.
- *
- * This program is distributed in the hope that it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc., 59
- * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
- *
- * The full GNU General Public License is included in this distribution in the
- * file called COPYING.
- */
-
-/*
- * This code allows the net stack to make use of a DMA engine for
- * skb to iovec copies.
- */
-
-#include <linux/dmaengine.h>
-#include <linux/pagemap.h>
-#include <linux/slab.h>
-#include <net/tcp.h> /* for memcpy_toiovec */
-#include <asm/io.h>
-#include <asm/uaccess.h>
-
-static int num_pages_spanned(struct iovec *iov)
-{
- return
- ((PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
- ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT);
-}
-
-/*
- * Pin down all the iovec pages needed for len bytes.
- * Return a struct dma_pinned_list to keep track of pages pinned down.
- *
- * We are allocating a single chunk of memory, and then carving it up into
- * 3 sections, the latter 2 whose size depends on the number of iovecs and the
- * total number of pages, respectively.
- */
-struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
-{
- struct dma_pinned_list *local_list;
- struct page **pages;
- int i;
- int ret;
- int nr_iovecs = 0;
- int iovec_len_used = 0;
- int iovec_pages_used = 0;
-
- /* don't pin down non-user-based iovecs */
- if (segment_eq(get_fs(), KERNEL_DS))
- return NULL;
-
- /* determine how many iovecs/pages there are, up front */
- do {
- iovec_len_used += iov[nr_iovecs].iov_len;
- iovec_pages_used += num_pages_spanned(&iov[nr_iovecs]);
- nr_iovecs++;
- } while (iovec_len_used < len);
-
- /* single kmalloc for pinned list, page_list[], and the page arrays */
- local_list = kmalloc(sizeof(*local_list)
- + (nr_iovecs * sizeof (struct dma_page_list))
- + (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL);
- if (!local_list)
- goto out;
-
- /* list of pages starts right after the page list array */
- pages = (struct page **) &local_list->page_list[nr_iovecs];
-
- local_list->nr_iovecs = 0;
-
- for (i = 0; i < nr_iovecs; i++) {
- struct dma_page_list *page_list = &local_list->page_list[i];
-
- len -= iov[i].iov_len;
-
- if (!access_ok(VERIFY_WRITE, iov[i].iov_base, iov[i].iov_len))
- goto unpin;
-
- page_list->nr_pages = num_pages_spanned(&iov[i]);
- page_list->base_address = iov[i].iov_base;
-
- page_list->pages = pages;
- pages += page_list->nr_pages;
-
- /* pin pages down */
- down_read(&current->mm->mmap_sem);
- ret = get_user_pages(
- current,
- current->mm,
- (unsigned long) iov[i].iov_base,
- page_list->nr_pages,
- 1, /* write */
- 0, /* force */
- page_list->pages,
- NULL);
- up_read(&current->mm->mmap_sem);
-
- if (ret != page_list->nr_pages)
- goto unpin;
-
- local_list->nr_iovecs = i + 1;
- }
-
- return local_list;
-
-unpin:
- dma_unpin_iovec_pages(local_list);
-out:
- return NULL;
-}
-
-void dma_unpin_iovec_pages(struct dma_pinned_list *pinned_list)
-{
- int i, j;
-
- if (!pinned_list)
- return;
-
- for (i = 0; i < pinned_list->nr_iovecs; i++) {
- struct dma_page_list *page_list = &pinned_list->page_list[i];
- for (j = 0; j < page_list->nr_pages; j++) {
- set_page_dirty_lock(page_list->pages[j]);
- page_cache_release(page_list->pages[j]);
- }
- }
-
- kfree(pinned_list);
-}
-
-
-/*
- * We have already pinned down the pages we will be using in the iovecs.
- * Each entry in iov array has corresponding entry in pinned_list->page_list.
- * Using array indexing to keep iov[] and page_list[] in sync.
- * Initial elements in iov array's iov->iov_len will be 0 if already copied into
- * by another call.
- * iov array length remaining guaranteed to be bigger than len.
- */
-dma_cookie_t dma_memcpy_to_iovec(struct dma_chan *chan, struct iovec *iov,
- struct dma_pinned_list *pinned_list, unsigned char *kdata, size_t len)
-{
- int iov_byte_offset;
- int copy;
- dma_cookie_t dma_cookie = 0;
- int iovec_idx;
- int page_idx;
-
- if (!chan)
- return memcpy_toiovec(iov, kdata, len);
-
- iovec_idx = 0;
- while (iovec_idx < pinned_list->nr_iovecs) {
- struct dma_page_list *page_list;
-
- /* skip already used-up iovecs */
- while (!iov[iovec_idx].iov_len)
- iovec_idx++;
-
- page_list = &pinned_list->page_list[iovec_idx];
-
- iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
- page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
- - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
-
- /* break up copies to not cross page boundary */
- while (iov[iovec_idx].iov_len) {
- copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
- copy = min_t(int, copy, iov[iovec_idx].iov_len);
-
- dma_cookie = dma_async_memcpy_buf_to_pg(chan,
- page_list->pages[page_idx],
- iov_byte_offset,
- kdata,
- copy);
- /* poll for a descriptor slot */
- if (unlikely(dma_cookie < 0)) {
- dma_async_issue_pending(chan);
- continue;
- }
-
- len -= copy;
- iov[iovec_idx].iov_len -= copy;
- iov[iovec_idx].iov_base += copy;
-
- if (!len)
- return dma_cookie;
-
- kdata += copy;
- iov_byte_offset = 0;
- page_idx++;
- }
- iovec_idx++;
- }
-
- /* really bad if we ever run out of iovecs */
- BUG();
- return -EFAULT;
-}
-
-dma_cookie_t dma_memcpy_pg_to_iovec(struct dma_chan *chan, struct iovec *iov,
- struct dma_pinned_list *pinned_list, struct page *page,
- unsigned int offset, size_t len)
-{
- int iov_byte_offset;
- int copy;
- dma_cookie_t dma_cookie = 0;
- int iovec_idx;
- int page_idx;
- int err;
-
- /* this needs as-yet-unimplemented buf-to-buff, so punt. */
- /* TODO: use dma for this */
- if (!chan || !pinned_list) {
- u8 *vaddr = kmap(page);
- err = memcpy_toiovec(iov, vaddr + offset, len);
- kunmap(page);
- return err;
- }
-
- iovec_idx = 0;
- while (iovec_idx < pinned_list->nr_iovecs) {
- struct dma_page_list *page_list;
-
- /* skip already used-up iovecs */
- while (!iov[iovec_idx].iov_len)
- iovec_idx++;
-
- page_list = &pinned_list->page_list[iovec_idx];
-
- iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
- page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
- - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
-
- /* break up copies to not cross page boundary */
- while (iov[iovec_idx].iov_len) {
- copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
- copy = min_t(int, copy, iov[iovec_idx].iov_len);
-
- dma_cookie = dma_async_memcpy_pg_to_pg(chan,
- page_list->pages[page_idx],
- iov_byte_offset,
- page,
- offset,
- copy);
- /* poll for a descriptor slot */
- if (unlikely(dma_cookie < 0)) {
- dma_async_issue_pending(chan);
- continue;
- }
-
- len -= copy;
- iov[iovec_idx].iov_len -= copy;
- iov[iovec_idx].iov_base += copy;
-
- if (!len)
- return dma_cookie;
-
- offset += copy;
- iov_byte_offset = 0;
- page_idx++;
- }
- iovec_idx++;
- }
-
- /* really bad if we ever run out of iovecs */
- BUG();
- return -EFAULT;
-}
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 41cf0c399288..890545871af0 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -875,18 +875,6 @@ static inline void dmaengine_put(void)
}
#endif

-#ifdef CONFIG_NET_DMA
-#define net_dmaengine_get() dmaengine_get()
-#define net_dmaengine_put() dmaengine_put()
-#else
-static inline void net_dmaengine_get(void)
-{
-}
-static inline void net_dmaengine_put(void)
-{
-}
-#endif
-
#ifdef CONFIG_ASYNC_TX_DMA
#define async_dmaengine_get() dmaengine_get()
#define async_dmaengine_put() dmaengine_put()
@@ -908,16 +896,8 @@ async_dma_find_channel(enum dma_transaction_type type)
return NULL;
}
#endif /* CONFIG_ASYNC_TX_DMA */
-
-dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
- void *dest, void *src, size_t len);
-dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
- struct page *page, unsigned int offset, void *kdata, size_t len);
-dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
- struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
- unsigned int src_off, size_t len);
void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
- struct dma_chan *chan);
+ struct dma_chan *chan);

static inline void async_tx_ack(struct dma_async_tx_descriptor *tx)
{
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bec1cc7d5e3c..ac4f84dfa84b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -28,7 +28,6 @@
#include <linux/textsearch.h>
#include <net/checksum.h>
#include <linux/rcupdate.h>
-#include <linux/dmaengine.h>
#include <linux/hrtimer.h>
#include <linux/dma-mapping.h>
#include <linux/netdev_features.h>
@@ -496,11 +495,8 @@ struct sk_buff {
/* 6/8 bit hole (depending on ndisc_nodetype presence) */
kmemcheck_bitfield_end(flags2);

-#if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
- union {
- unsigned int napi_id;
- dma_cookie_t dma_cookie;
- };
+#ifdef CONFIG_NET_RX_BUSY_POLL
+ unsigned int napi_id;
#endif
#ifdef CONFIG_NETWORK_SECMARK
__u32 secmark;
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d68633452d9b..26f16021ce1d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -19,7 +19,6 @@


#include <linux/skbuff.h>
-#include <linux/dmaengine.h>
#include <net/sock.h>
#include <net/inet_connection_sock.h>
#include <net/inet_timewait_sock.h>
@@ -169,13 +168,6 @@ struct tcp_sock {
struct iovec *iov;
int memory;
int len;
-#ifdef CONFIG_NET_DMA
- /* members for async copy */
- struct dma_chan *dma_chan;
- int wakeup;
- struct dma_pinned_list *pinned_list;
- dma_cookie_t dma_cookie;
-#endif
} ucopy;

u32 snd_wl1; /* Sequence for window update */
diff --git a/include/net/netdma.h b/include/net/netdma.h
deleted file mode 100644
index 8ba8ce284eeb..000000000000
--- a/include/net/netdma.h
+++ /dev/null
@@ -1,32 +0,0 @@
-/*
- * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or (at your option)
- * any later version.
- *
- * This program is distributed in the hope that it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc., 59
- * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
- *
- * The full GNU General Public License is included in this distribution in the
- * file called COPYING.
- */
-#ifndef NETDMA_H
-#define NETDMA_H
-#ifdef CONFIG_NET_DMA
-#include <linux/dmaengine.h>
-#include <linux/skbuff.h>
-
-int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
- struct sk_buff *skb, int offset, struct iovec *to,
- size_t len, struct dma_pinned_list *pinned_list);
-
-#endif /* CONFIG_NET_DMA */
-#endif /* NETDMA_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index e3a18ff0c38b..9d5f716e921e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -231,7 +231,6 @@ struct cg_proto;
* @sk_receive_queue: incoming packets
* @sk_wmem_alloc: transmit queue bytes committed
* @sk_write_queue: Packet sending queue
- * @sk_async_wait_queue: DMA copied packets
* @sk_omem_alloc: "o" is "option" or "other"
* @sk_wmem_queued: persistent queue size
* @sk_forward_alloc: space allocated forward
@@ -354,10 +353,6 @@ struct sock {
struct sk_filter __rcu *sk_filter;
struct socket_wq __rcu *sk_wq;

-#ifdef CONFIG_NET_DMA
- struct sk_buff_head sk_async_wait_queue;
-#endif
-
#ifdef CONFIG_XFRM
struct xfrm_policy *sk_policy[2];
#endif
@@ -2200,27 +2195,15 @@ void sock_tx_timestamp(struct sock *sk, __u8 *tx_flags);
* sk_eat_skb - Release a skb if it is no longer needed
* @sk: socket to eat this skb from
* @skb: socket buffer to eat
- * @copied_early: flag indicating whether DMA operations copied this data early
*
* This routine must be called with interrupts disabled or with the socket
* locked so that the sk_buff queue operation is ok.
*/
-#ifdef CONFIG_NET_DMA
-static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
-{
- __skb_unlink(skb, &sk->sk_receive_queue);
- if (!copied_early)
- __kfree_skb(skb);
- else
- __skb_queue_tail(&sk->sk_async_wait_queue, skb);
-}
-#else
-static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
+static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
{
__skb_unlink(skb, &sk->sk_receive_queue);
__kfree_skb(skb);
}
-#endif

static inline
struct net *sock_net(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70e55d200610..084c163e9d40 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -27,7 +27,6 @@
#include <linux/cache.h>
#include <linux/percpu.h>
#include <linux/skbuff.h>
-#include <linux/dmaengine.h>
#include <linux/crypto.h>
#include <linux/cryptohash.h>
#include <linux/kref.h>
@@ -267,7 +266,6 @@ extern int sysctl_tcp_adv_win_scale;
extern int sysctl_tcp_tw_reuse;
extern int sysctl_tcp_frto;
extern int sysctl_tcp_low_latency;
-extern int sysctl_tcp_dma_copybreak;
extern int sysctl_tcp_nometrics_save;
extern int sysctl_tcp_moderate_rcvbuf;
extern int sysctl_tcp_tso_win_divisor;
@@ -1032,12 +1030,6 @@ static inline void tcp_prequeue_init(struct tcp_sock *tp)
tp->ucopy.len = 0;
tp->ucopy.memory = 0;
skb_queue_head_init(&tp->ucopy.prequeue);
-#ifdef CONFIG_NET_DMA
- tp->ucopy.dma_chan = NULL;
- tp->ucopy.wakeup = 0;
- tp->ucopy.pinned_list = NULL;
- tp->ucopy.dma_cookie = 0;
-#endif
}

bool tcp_prequeue(struct sock *sk, struct sk_buff *skb);
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 653cbbd9e7ad..d457005acedf 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -390,7 +390,6 @@ static const struct bin_table bin_net_ipv4_table[] = {
{ CTL_INT, NET_TCP_MTU_PROBING, "tcp_mtu_probing" },
{ CTL_INT, NET_TCP_BASE_MSS, "tcp_base_mss" },
{ CTL_INT, NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS, "tcp_workaround_signed_windows" },
- { CTL_INT, NET_TCP_DMA_COPYBREAK, "tcp_dma_copybreak" },
{ CTL_INT, NET_TCP_SLOW_START_AFTER_IDLE, "tcp_slow_start_after_idle" },
{ CTL_INT, NET_CIPSOV4_CACHE_ENABLE, "cipso_cache_enable" },
{ CTL_INT, NET_CIPSOV4_CACHE_BUCKET_SIZE, "cipso_cache_bucket_size" },
diff --git a/net/core/Makefile b/net/core/Makefile
index b33b996f5dd6..5f98e5983bd3 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -16,7 +16,6 @@ obj-y += net-sysfs.o
obj-$(CONFIG_PROC_FS) += net-procfs.o
obj-$(CONFIG_NET_PKTGEN) += pktgen.o
obj-$(CONFIG_NETPOLL) += netpoll.o
-obj-$(CONFIG_NET_DMA) += user_dma.o
obj-$(CONFIG_FIB_RULES) += fib_rules.o
obj-$(CONFIG_TRACEPOINTS) += net-traces.o
obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea5ebb3..677a5a4dcca7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1262,7 +1262,6 @@ static int __dev_open(struct net_device *dev)
clear_bit(__LINK_STATE_START, &dev->state);
else {
dev->flags |= IFF_UP;
- net_dmaengine_get();
dev_set_rx_mode(dev);
dev_activate(dev);
add_device_randomness(dev->dev_addr, dev->addr_len);
@@ -1338,7 +1337,6 @@ static int __dev_close_many(struct list_head *head)
ops->ndo_stop(dev);

dev->flags &= ~IFF_UP;
- net_dmaengine_put();
}

return 0;
@@ -4362,14 +4360,6 @@ static void net_rx_action(struct softirq_action *h)
out:
net_rps_action_and_irq_enable(sd);

-#ifdef CONFIG_NET_DMA
- /*
- * There may not be any more sk_buffs coming right now, so push
- * any pending DMA copies to hardware
- */
- dma_issue_pending_all();
-#endif
-
return;

softnet_break:
diff --git a/net/core/sock.c b/net/core/sock.c
index ab20ed9b0f31..411dab3a5726 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1461,9 +1461,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
atomic_set(&newsk->sk_omem_alloc, 0);
skb_queue_head_init(&newsk->sk_receive_queue);
skb_queue_head_init(&newsk->sk_write_queue);
-#ifdef CONFIG_NET_DMA
- skb_queue_head_init(&newsk->sk_async_wait_queue);
-#endif

spin_lock_init(&newsk->sk_dst_lock);
rwlock_init(&newsk->sk_callback_lock);
@@ -2290,9 +2287,6 @@ void sock_init_data(struct socket *sock, struct sock *sk)
skb_queue_head_init(&sk->sk_receive_queue);
skb_queue_head_init(&sk->sk_write_queue);
skb_queue_head_init(&sk->sk_error_queue);
-#ifdef CONFIG_NET_DMA
- skb_queue_head_init(&sk->sk_async_wait_queue);
-#endif

sk->sk_send_head = NULL;

diff --git a/net/core/user_dma.c b/net/core/user_dma.c
deleted file mode 100644
index 1b5fefdb8198..000000000000
--- a/net/core/user_dma.c
+++ /dev/null
@@ -1,131 +0,0 @@
-/*
- * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
- * Portions based on net/core/datagram.c and copyrighted by their authors.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the Free
- * Software Foundation; either version 2 of the License, or (at your option)
- * any later version.
- *
- * This program is distributed in the hope that it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc., 59
- * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
- *
- * The full GNU General Public License is included in this distribution in the
- * file called COPYING.
- */
-
-/*
- * This code allows the net stack to make use of a DMA engine for
- * skb to iovec copies.
- */
-
-#include <linux/dmaengine.h>
-#include <linux/socket.h>
-#include <linux/export.h>
-#include <net/tcp.h>
-#include <net/netdma.h>
-
-#define NET_DMA_DEFAULT_COPYBREAK 4096
-
-int sysctl_tcp_dma_copybreak = NET_DMA_DEFAULT_COPYBREAK;
-EXPORT_SYMBOL(sysctl_tcp_dma_copybreak);
-
-/**
- * dma_skb_copy_datagram_iovec - Copy a datagram to an iovec.
- * @skb - buffer to copy
- * @offset - offset in the buffer to start copying from
- * @iovec - io vector to copy to
- * @len - amount of data to copy from buffer to iovec
- * @pinned_list - locked iovec buffer data
- *
- * Note: the iovec is modified during the copy.
- */
-int dma_skb_copy_datagram_iovec(struct dma_chan *chan,
- struct sk_buff *skb, int offset, struct iovec *to,
- size_t len, struct dma_pinned_list *pinned_list)
-{
- int start = skb_headlen(skb);
- int i, copy = start - offset;
- struct sk_buff *frag_iter;
- dma_cookie_t cookie = 0;
-
- /* Copy header. */
- if (copy > 0) {
- if (copy > len)
- copy = len;
- cookie = dma_memcpy_to_iovec(chan, to, pinned_list,
- skb->data + offset, copy);
- if (cookie < 0)
- goto fault;
- len -= copy;
- if (len == 0)
- goto end;
- offset += copy;
- }
-
- /* Copy paged appendix. Hmm... why does this look so complicated? */
- for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
- int end;
- const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
- WARN_ON(start > offset + len);
-
- end = start + skb_frag_size(frag);
- copy = end - offset;
- if (copy > 0) {
- struct page *page = skb_frag_page(frag);
-
- if (copy > len)
- copy = len;
-
- cookie = dma_memcpy_pg_to_iovec(chan, to, pinned_list, page,
- frag->page_offset + offset - start, copy);
- if (cookie < 0)
- goto fault;
- len -= copy;
- if (len == 0)
- goto end;
- offset += copy;
- }
- start = end;
- }
-
- skb_walk_frags(skb, frag_iter) {
- int end;
-
- WARN_ON(start > offset + len);
-
- end = start + frag_iter->len;
- copy = end - offset;
- if (copy > 0) {
- if (copy > len)
- copy = len;
- cookie = dma_skb_copy_datagram_iovec(chan, frag_iter,
- offset - start,
- to, copy,
- pinned_list);
- if (cookie < 0)
- goto fault;
- len -= copy;
- if (len == 0)
- goto end;
- offset += copy;
- }
- start = end;
- }
-
-end:
- if (!len) {
- skb->dma_cookie = cookie;
- return cookie;
- }
-
-fault:
- return -EFAULT;
-}
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index eb892b4f4814..f9076f295b13 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -848,7 +848,7 @@ int dccp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
default:
dccp_pr_debug("packet_type=%s\n",
dccp_packet_name(dh->dccph_type));
- sk_eat_skb(sk, skb, false);
+ sk_eat_skb(sk, skb);
}
verify_sock_status:
if (sock_flag(sk, SOCK_DONE)) {
@@ -905,7 +905,7 @@ verify_sock_status:
len = skb->len;
found_fin_ok:
if (!(flags & MSG_PEEK))
- sk_eat_skb(sk, skb, false);
+ sk_eat_skb(sk, skb);
break;
} while (1);
out:
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 3d69ec8dac57..79a90b92e12d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -642,15 +642,6 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
-#ifdef CONFIG_NET_DMA
- {
- .procname = "tcp_dma_copybreak",
- .data = &sysctl_tcp_dma_copybreak,
- .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec
- },
-#endif
{
.procname = "tcp_slow_start_after_idle",
.data = &sysctl_tcp_slow_start_after_idle,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c4638e6f0238..8dc913dfbaef 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -274,7 +274,6 @@
#include <net/tcp.h>
#include <net/xfrm.h>
#include <net/ip.h>
-#include <net/netdma.h>
#include <net/sock.h>

#include <asm/uaccess.h>
@@ -1409,39 +1408,6 @@ static void tcp_prequeue_process(struct sock *sk)
tp->ucopy.memory = 0;
}

-#ifdef CONFIG_NET_DMA
-static void tcp_service_net_dma(struct sock *sk, bool wait)
-{
- dma_cookie_t done, used;
- dma_cookie_t last_issued;
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (!tp->ucopy.dma_chan)
- return;
-
- last_issued = tp->ucopy.dma_cookie;
- dma_async_issue_pending(tp->ucopy.dma_chan);
-
- do {
- if (dma_async_is_tx_complete(tp->ucopy.dma_chan,
- last_issued, &done,
- &used) == DMA_COMPLETE) {
- /* Safe to free early-copied skbs now */
- __skb_queue_purge(&sk->sk_async_wait_queue);
- break;
- } else {
- struct sk_buff *skb;
- while ((skb = skb_peek(&sk->sk_async_wait_queue)) &&
- (dma_async_is_complete(skb->dma_cookie, done,
- used) == DMA_COMPLETE)) {
- __skb_dequeue(&sk->sk_async_wait_queue);
- kfree_skb(skb);
- }
- }
- } while (wait);
-}
-#endif
-
static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
{
struct sk_buff *skb;
@@ -1459,7 +1425,7 @@ static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
* splitted a fat GRO packet, while we released socket lock
* in skb_splice_bits()
*/
- sk_eat_skb(sk, skb, false);
+ sk_eat_skb(sk, skb);
}
return NULL;
}
@@ -1525,11 +1491,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
continue;
}
if (tcp_hdr(skb)->fin) {
- sk_eat_skb(sk, skb, false);
+ sk_eat_skb(sk, skb);
++seq;
break;
}
- sk_eat_skb(sk, skb, false);
+ sk_eat_skb(sk, skb);
if (!desc->count)
break;
tp->copied_seq = seq;
@@ -1567,7 +1533,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
int target; /* Read at least this many bytes */
long timeo;
struct task_struct *user_recv = NULL;
- bool copied_early = false;
struct sk_buff *skb;
u32 urg_hole = 0;

@@ -1610,28 +1575,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,

target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

-#ifdef CONFIG_NET_DMA
- tp->ucopy.dma_chan = NULL;
- preempt_disable();
- skb = skb_peek_tail(&sk->sk_receive_queue);
- {
- int available = 0;
-
- if (skb)
- available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
- if ((available < target) &&
- (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
- !sysctl_tcp_low_latency &&
- net_dma_find_channel()) {
- preempt_enable_no_resched();
- tp->ucopy.pinned_list =
- dma_pin_iovec_pages(msg->msg_iov, len);
- } else {
- preempt_enable_no_resched();
- }
- }
-#endif
-
do {
u32 offset;

@@ -1762,16 +1705,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
/* __ Set realtime policy in scheduler __ */
}

-#ifdef CONFIG_NET_DMA
- if (tp->ucopy.dma_chan) {
- if (tp->rcv_wnd == 0 &&
- !skb_queue_empty(&sk->sk_async_wait_queue)) {
- tcp_service_net_dma(sk, true);
- tcp_cleanup_rbuf(sk, copied);
- } else
- dma_async_issue_pending(tp->ucopy.dma_chan);
- }
-#endif
if (copied >= target) {
/* Do not sleep, just process backlog. */
release_sock(sk);
@@ -1779,11 +1712,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
} else
sk_wait_data(sk, &timeo);

-#ifdef CONFIG_NET_DMA
- tcp_service_net_dma(sk, false); /* Don't block */
- tp->ucopy.wakeup = 0;
-#endif
-
if (user_recv) {
int chunk;

@@ -1841,43 +1769,13 @@ do_prequeue:
}

if (!(flags & MSG_TRUNC)) {
-#ifdef CONFIG_NET_DMA
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
-
- if (tp->ucopy.dma_chan) {
- tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
- tp->ucopy.dma_chan, skb, offset,
- msg->msg_iov, used,
- tp->ucopy.pinned_list);
-
- if (tp->ucopy.dma_cookie < 0) {
-
- pr_alert("%s: dma_cookie < 0\n",
- __func__);
-
- /* Exception. Bailout! */
- if (!copied)
- copied = -EFAULT;
- break;
- }
-
- dma_async_issue_pending(tp->ucopy.dma_chan);
-
- if ((offset + used) == skb->len)
- copied_early = true;
-
- } else
-#endif
- {
- err = skb_copy_datagram_iovec(skb, offset,
- msg->msg_iov, used);
- if (err) {
- /* Exception. Bailout! */
- if (!copied)
- copied = -EFAULT;
- break;
- }
+ err = skb_copy_datagram_iovec(skb, offset,
+ msg->msg_iov, used);
+ if (err) {
+ /* Exception. Bailout! */
+ if (!copied)
+ copied = -EFAULT;
+ break;
}
}

@@ -1897,19 +1795,15 @@ skip_copy:

if (tcp_hdr(skb)->fin)
goto found_fin_ok;
- if (!(flags & MSG_PEEK)) {
- sk_eat_skb(sk, skb, copied_early);
- copied_early = false;
- }
+ if (!(flags & MSG_PEEK))
+ sk_eat_skb(sk, skb);
continue;

found_fin_ok:
/* Process the FIN. */
++*seq;
- if (!(flags & MSG_PEEK)) {
- sk_eat_skb(sk, skb, copied_early);
- copied_early = false;
- }
+ if (!(flags & MSG_PEEK))
+ sk_eat_skb(sk, skb);
break;
} while (len > 0);

@@ -1932,16 +1826,6 @@ skip_copy:
tp->ucopy.len = 0;
}

-#ifdef CONFIG_NET_DMA
- tcp_service_net_dma(sk, true); /* Wait for queue to drain */
- tp->ucopy.dma_chan = NULL;
-
- if (tp->ucopy.pinned_list) {
- dma_unpin_iovec_pages(tp->ucopy.pinned_list);
- tp->ucopy.pinned_list = NULL;
- }
-#endif
-
/* According to UNIX98, msg_name/msg_namelen are ignored
* on connected socket. I was just happy when found this 8) --ANK
*/
@@ -2285,9 +2169,6 @@ int tcp_disconnect(struct sock *sk, int flags)
__skb_queue_purge(&sk->sk_receive_queue);
tcp_write_queue_purge(sk);
__skb_queue_purge(&tp->out_of_order_queue);
-#ifdef CONFIG_NET_DMA
- __skb_queue_purge(&sk->sk_async_wait_queue);
-#endif

inet->inet_dport = 0;

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c53b7f35c51d..33ef18e550c5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -73,7 +73,6 @@
#include <net/inet_common.h>
#include <linux/ipsec.h>
#include <asm/unaligned.h>
-#include <net/netdma.h>

int sysctl_tcp_timestamps __read_mostly = 1;
int sysctl_tcp_window_scaling __read_mostly = 1;
@@ -4967,53 +4966,6 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
__tcp_checksum_complete_user(sk, skb);
}

-#ifdef CONFIG_NET_DMA
-static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
- int hlen)
-{
- struct tcp_sock *tp = tcp_sk(sk);
- int chunk = skb->len - hlen;
- int dma_cookie;
- bool copied_early = false;
-
- if (tp->ucopy.wakeup)
- return false;
-
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
-
- if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) {
-
- dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
- skb, hlen,
- tp->ucopy.iov, chunk,
- tp->ucopy.pinned_list);
-
- if (dma_cookie < 0)
- goto out;
-
- tp->ucopy.dma_cookie = dma_cookie;
- copied_early = true;
-
- tp->ucopy.len -= chunk;
- tp->copied_seq += chunk;
- tcp_rcv_space_adjust(sk);
-
- if ((tp->ucopy.len == 0) ||
- (tcp_flag_word(tcp_hdr(skb)) & TCP_FLAG_PSH) ||
- (atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1))) {
- tp->ucopy.wakeup = 1;
- sk->sk_data_ready(sk, 0);
- }
- } else if (chunk > 0) {
- tp->ucopy.wakeup = 1;
- sk->sk_data_ready(sk, 0);
- }
-out:
- return copied_early;
-}
-#endif /* CONFIG_NET_DMA */
-
/* Does PAWS and seqno based validation of an incoming segment, flags will
* play significant role here.
*/
@@ -5198,14 +5150,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,

if (tp->copied_seq == tp->rcv_nxt &&
len - tcp_header_len <= tp->ucopy.len) {
-#ifdef CONFIG_NET_DMA
- if (tp->ucopy.task == current &&
- sock_owned_by_user(sk) &&
- tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
- copied_early = 1;
- eaten = 1;
- }
-#endif
if (tp->ucopy.task == current &&
sock_owned_by_user(sk) && !copied_early) {
__set_current_state(TASK_RUNNING);
@@ -5271,11 +5215,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
if (!copied_early || tp->rcv_nxt != tp->rcv_wup)
__tcp_ack_snd_check(sk, 0);
no_ack:
-#ifdef CONFIG_NET_DMA
- if (copied_early)
- __skb_queue_tail(&sk->sk_async_wait_queue, skb);
- else
-#endif
if (eaten)
kfree_skb_partial(skb, fragstolen);
sk->sk_data_ready(sk, 0);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 59a6f8b90cd9..dc92ba9d0350 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -72,7 +72,6 @@
#include <net/inet_common.h>
#include <net/timewait_sock.h>
#include <net/xfrm.h>
-#include <net/netdma.h>
#include <net/secure_seq.h>
#include <net/tcp_memcontrol.h>
#include <net/busy_poll.h>
@@ -2000,18 +1999,8 @@ process:
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
-#ifdef CONFIG_NET_DMA
- struct tcp_sock *tp = tcp_sk(sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
+ if (!tcp_prequeue(sk, skb))
ret = tcp_v4_do_rcv(sk, skb);
- else
-#endif
- {
- if (!tcp_prequeue(sk, skb))
- ret = tcp_v4_do_rcv(sk, skb);
- }
} else if (unlikely(sk_add_backlog(sk, skb,
sk->sk_rcvbuf + sk->sk_sndbuf))) {
bh_unlock_sock(sk);
@@ -2170,11 +2159,6 @@ void tcp_v4_destroy_sock(struct sock *sk)
}
#endif

-#ifdef CONFIG_NET_DMA
- /* Cleans up our sk_async_wait_queue */
- __skb_queue_purge(&sk->sk_async_wait_queue);
-#endif
-
/* Clean prequeue, it must be empty really */
__skb_queue_purge(&tp->ucopy.prequeue);

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0740f93a114a..e27972590379 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -59,7 +59,6 @@
#include <net/snmp.h>
#include <net/dsfield.h>
#include <net/timewait_sock.h>
-#include <net/netdma.h>
#include <net/inet_common.h>
#include <net/secure_seq.h>
#include <net/tcp_memcontrol.h>
@@ -1504,18 +1503,8 @@ process:
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
-#ifdef CONFIG_NET_DMA
- struct tcp_sock *tp = tcp_sk(sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
+ if (!tcp_prequeue(sk, skb))
ret = tcp_v6_do_rcv(sk, skb);
- else
-#endif
- {
- if (!tcp_prequeue(sk, skb))
- ret = tcp_v6_do_rcv(sk, skb);
- }
} else if (unlikely(sk_add_backlog(sk, skb,
sk->sk_rcvbuf + sk->sk_sndbuf))) {
bh_unlock_sock(sk);
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 7b01b9f5846c..e1b46709f8d6 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -838,7 +838,7 @@ static int llc_ui_recvmsg(struct kiocb *iocb, struct socket *sock,

if (!(flags & MSG_PEEK)) {
spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
- sk_eat_skb(sk, skb, false);
+ sk_eat_skb(sk, skb);
spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
*seq = 0;
}
@@ -860,10 +860,10 @@ copy_uaddr:
llc_cmsg_rcv(msg, skb);

if (!(flags & MSG_PEEK)) {
- spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
- sk_eat_skb(sk, skb, false);
- spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
- *seq = 0;
+ spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
+ sk_eat_skb(sk, skb);
+ spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
+ *seq = 0;
}

goto out;

2014-01-14 00:47:20

by Dan Williams

Subject: [PATCH v3 2/4] net_dma: revert 'copied_early'

Now that tcp_dma_try_early_copy() is gone, nothing ever sets
copied_early.

Also revert commit 53240c208776 ("tcp: Fix possible double-ack w/ user
dma"), since it is no longer necessary.

Cc: Ali Saidi <[email protected]>
Cc: James Morris <[email protected]>
Cc: Patrick McHardy <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Cc: Neal Cardwell <[email protected]>
Reported-by: Dave Jones <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---

Looking for an ack for this one; no changes since v2.

net/ipv4/tcp_input.c | 22 ++++++++--------------
1 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 33ef18e550c5..15911a280485 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5145,19 +5145,15 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
}
} else {
int eaten = 0;
- int copied_early = 0;
bool fragstolen = false;

- if (tp->copied_seq == tp->rcv_nxt &&
- len - tcp_header_len <= tp->ucopy.len) {
- if (tp->ucopy.task == current &&
- sock_owned_by_user(sk) && !copied_early) {
- __set_current_state(TASK_RUNNING);
+ if (tp->ucopy.task == current &&
+ tp->copied_seq == tp->rcv_nxt &&
+ len - tcp_header_len <= tp->ucopy.len &&
+ sock_owned_by_user(sk)) {
+ __set_current_state(TASK_RUNNING);

- if (!tcp_copy_to_iovec(sk, skb, tcp_header_len))
- eaten = 1;
- }
- if (eaten) {
+ if (!tcp_copy_to_iovec(sk, skb, tcp_header_len)) {
/* Predicted packet is in window by definition.
* seq == rcv_nxt and rcv_wup <= rcv_nxt.
* Hence, check seq<=rcv_wup reduces to:
@@ -5173,9 +5169,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
__skb_pull(skb, tcp_header_len);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
+ eaten = 1;
}
- if (copied_early)
- tcp_cleanup_rbuf(sk, skb->len);
}
if (!eaten) {
if (tcp_checksum_complete_user(sk, skb))
@@ -5212,8 +5207,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
goto no_ack;
}

- if (!copied_early || tp->rcv_nxt != tp->rcv_wup)
- __tcp_ack_snd_check(sk, 0);
+ __tcp_ack_snd_check(sk, 0);
no_ack:
if (eaten)
kfree_skb_partial(skb, fragstolen);

2014-01-14 00:48:06

by Dan Williams

Subject: [PATCH v3 3/4] net: make tcp_cleanup_rbuf private

net_dma was the only external user, so this can become local to tcp.c
again.

Cc: James Morris <[email protected]>
Cc: Patrick McHardy <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---

Since v1: keep the tcp_ prefix.

include/net/tcp.h | 1 -
net/ipv4/tcp.c | 2 +-
2 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 084c163e9d40..571036b3bead 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -370,7 +370,6 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
const struct tcphdr *th, unsigned int len);
void tcp_rcv_space_adjust(struct sock *sk);
-void tcp_cleanup_rbuf(struct sock *sk, int copied);
int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp);
void tcp_twsk_destructor(struct sock *sk);
ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8dc913dfbaef..10dda80fccc9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1332,7 +1332,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
* calculation of whether or not we must ACK for the sake of
* a window update.
*/
-void tcp_cleanup_rbuf(struct sock *sk, int copied)
+static void tcp_cleanup_rbuf(struct sock *sk, int copied)
{
struct tcp_sock *tp = tcp_sk(sk);
bool time_to_ack = false;

2014-01-14 00:49:00

by Dan Williams

Subject: [PATCH v3 4/4] dma debug: introduce debug_dma_assert_idle()

Record actively mapped pages and provide an API for asserting that a
given page is DMA-inactive before execution proceeds. Placing
debug_dma_assert_idle() in cow_user_page() flagged the DMA-API
violation in the NET_DMA implementation (see commit 77873803363c
"net_dma: mark broken").

Cc: Joerg Roedel <[email protected]>
Cc: Vinod Koul <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Russell King <[email protected]>
Cc: James Bottomley <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---

Since v2: added documentation, dropped CONFIG_DMA_VS_CPU_DEBUG

include/linux/dma-debug.h | 6 ++
lib/Kconfig.debug | 12 +++
lib/dma-debug.c | 169 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 3 +
4 files changed, 175 insertions(+), 15 deletions(-)

diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fc0e34ce038f..fe8cb610deac 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -85,6 +85,8 @@ extern void debug_dma_sync_sg_for_device(struct device *dev,

extern void debug_dma_dump_mappings(struct device *dev);

+extern void debug_dma_assert_idle(struct page *page);
+
#else /* CONFIG_DMA_API_DEBUG */

static inline void dma_debug_add_bus(struct bus_type *bus)
@@ -183,6 +185,10 @@ static inline void debug_dma_dump_mappings(struct device *dev)
{
}

+static inline void debug_dma_assert_idle(struct page *page)
+{
+}
+
#endif /* CONFIG_DMA_API_DEBUG */

#endif /* __DMA_DEBUG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index db25707aa41b..df3e41819fad 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1575,8 +1575,16 @@ config DMA_API_DEBUG
With this option you will be able to detect common bugs in device
drivers like double-freeing of DMA mappings or freeing mappings that
were never allocated.
- This option causes a performance degredation. Use only if you want
- to debug device drivers. If unsure, say N.
+
+ This also attempts to catch cases where a page owned by DMA is
+ accessed by the cpu in a way that could cause data corruption. For
+ example, this enables cow_user_page() to check that the source page is
+ not undergoing DMA.
+
+ This option causes a performance degradation. Use only if you want to
+ debug device drivers and dma interactions.
+
+ If unsure, say N.

source "samples/Kconfig"

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index d87a17a819d0..c5264d3cb142 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -53,11 +53,26 @@ enum map_err_types {

#define DMA_DEBUG_STACKTRACE_ENTRIES 5

+/**
+ * struct dma_debug_entry - track a dma_map* or dma_alloc_coherent mapping
+ * @list: node on pre-allocated free_entries list
+ * @dev: 'dev' argument to dma_map_{page|single|sg} or dma_alloc_coherent
+ * @type: single, page, sg, coherent
+ * @pfn: page frame of the start address
+ * @offset: offset of mapping relative to pfn
+ * @size: length of the mapping
+ * @direction: enum dma_data_direction
+ * @sg_call_ents: 'nents' from dma_map_sg
+ * @sg_mapped_ents: 'mapped_ents' from dma_map_sg
+ * @map_err_type: track whether dma_mapping_error() was checked
+ * @stacktrace: support backtraces when a violation is detected
+ */
struct dma_debug_entry {
struct list_head list;
struct device *dev;
int type;
- phys_addr_t paddr;
+ unsigned long pfn;
+ size_t offset;
u64 dev_addr;
u64 size;
int direction;
@@ -372,6 +387,11 @@ static void hash_bucket_del(struct dma_debug_entry *entry)
list_del(&entry->list);
}

+static unsigned long long phys_addr(struct dma_debug_entry *entry)
+{
+ return page_to_phys(pfn_to_page(entry->pfn)) + entry->offset;
+}
+
/*
* Dump mapping entries for debugging purposes
*/
@@ -389,9 +409,9 @@ void debug_dma_dump_mappings(struct device *dev)
list_for_each_entry(entry, &bucket->list, list) {
if (!dev || dev == entry->dev) {
dev_info(entry->dev,
- "%s idx %d P=%Lx D=%Lx L=%Lx %s %s\n",
+ "%s idx %d P=%Lx N=%lx D=%Lx L=%Lx %s %s\n",
type2name[entry->type], idx,
- (unsigned long long)entry->paddr,
+ phys_addr(entry), entry->pfn,
entry->dev_addr, entry->size,
dir2name[entry->direction],
maperr2str[entry->map_err_type]);
@@ -404,6 +424,108 @@ void debug_dma_dump_mappings(struct device *dev)
EXPORT_SYMBOL(debug_dma_dump_mappings);

/*
+ * For each page mapped (initial page in the case of
+ * dma_alloc_coherent/dma_map_{single|page}, or each page in a
+ * scatterlist) insert into this tree using the pfn as the key. At
+ * dma_unmap_{single|sg|page} or dma_free_coherent delete the entry. If
+ * the pfn already exists at insertion time add a tag as a reference
+ * count for the overlapping mappings. For now, the overlap tracking
+ * just ensures that 'unmaps' balance 'maps' before marking the pfn
+ * idle, but we should also be flagging overlaps as an API violation.
+ *
+ * Memory usage is mostly constrained by the maximum number of available
+ * dma-debug entries in that we need a free dma_debug_entry before
+ * inserting into the tree. In the case of dma_map_{single|page} and
+ * dma_alloc_coherent there is only one dma_debug_entry and one pfn to
+ * track per event. dma_map_sg(), on the other hand,
+ * consumes a single dma_debug_entry, but inserts 'nents' entries into
+ * the tree.
+ *
+ * At any time debug_dma_assert_idle() can be called to trigger a
+ * warning if the given page is in the active set.
+ */
+static RADIX_TREE(dma_active_pfn, GFP_NOWAIT);
+static DEFINE_SPINLOCK(radix_lock);
+
+static void __active_pfn_inc_overlap(struct dma_debug_entry *entry)
+{
+ unsigned long pfn = entry->pfn;
+ int i;
+
+ for (i = 0; i < RADIX_TREE_MAX_TAGS; i++)
+ if (radix_tree_tag_get(&dma_active_pfn, pfn, i) == 0) {
+ radix_tree_tag_set(&dma_active_pfn, pfn, i);
+ return;
+ }
+ pr_debug("DMA-API: max overlap count (%d) reached for pfn 0x%lx\n",
+ RADIX_TREE_MAX_TAGS, pfn);
+}
+
+static void __active_pfn_dec_overlap(struct dma_debug_entry *entry)
+{
+ unsigned long pfn = entry->pfn;
+ int i;
+
+ for (i = RADIX_TREE_MAX_TAGS - 1; i >= 0; i--)
+ if (radix_tree_tag_get(&dma_active_pfn, pfn, i)) {
+ radix_tree_tag_clear(&dma_active_pfn, pfn, i);
+ return;
+ }
+ radix_tree_delete(&dma_active_pfn, pfn);
+}
+
+static int active_pfn_insert(struct dma_debug_entry *entry)
+{
+ unsigned long flags;
+ int rc;
+
+ spin_lock_irqsave(&radix_lock, flags);
+ rc = radix_tree_insert(&dma_active_pfn, entry->pfn, entry);
+ if (rc == -EEXIST)
+ __active_pfn_inc_overlap(entry);
+ spin_unlock_irqrestore(&radix_lock, flags);
+
+ return rc;
+}
+
+static void active_pfn_remove(struct dma_debug_entry *entry)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&radix_lock, flags);
+ __active_pfn_dec_overlap(entry);
+ spin_unlock_irqrestore(&radix_lock, flags);
+}
+
+/**
+ * debug_dma_assert_idle() - assert that a page is not undergoing dma
+ * @page: page to lookup in the dma_active_pfn tree
+ *
+ * Place a call to this routine in cases where the cpu touching the page
+ * before the dma completes (page is dma_unmapped) will lead to data
+ * corruption.
+ */
+void debug_dma_assert_idle(struct page *page)
+{
+ unsigned long flags;
+ struct dma_debug_entry *entry;
+
+ if (!page)
+ return;
+
+ spin_lock_irqsave(&radix_lock, flags);
+ entry = radix_tree_lookup(&dma_active_pfn, page_to_pfn(page));
+ spin_unlock_irqrestore(&radix_lock, flags);
+
+ if (!entry)
+ return;
+
+ err_printk(entry->dev, entry,
+ "DMA-API: cpu touching an active dma mapped page "
+ "[pfn=0x%lx]\n", entry->pfn);
+}
+
+/*
* Wrapper function for adding an entry to the hash.
* This function takes care of locking itself.
*/
@@ -411,10 +533,22 @@ static void add_dma_entry(struct dma_debug_entry *entry)
{
struct hash_bucket *bucket;
unsigned long flags;
+ int rc;

bucket = get_hash_bucket(entry, &flags);
hash_bucket_add(bucket, entry);
put_hash_bucket(bucket, &flags);
+
+ rc = active_pfn_insert(entry);
+ if (rc == -ENOMEM) {
+ pr_err("DMA-API: pfn tracking out of memory - "
+ "disabling dma-debug\n");
+ global_disable = true;
+ }
+
+ /* TODO: report -EEXIST errors as overlapping mappings are not
+ * supported by the DMA API
+ */
}

static struct dma_debug_entry *__dma_entry_alloc(void)
@@ -469,6 +603,8 @@ static void dma_entry_free(struct dma_debug_entry *entry)
{
unsigned long flags;

+ active_pfn_remove(entry);
+
/*
* add to beginning of the list - this way the entries are
* more likely cache hot when they are reallocated.
@@ -895,15 +1031,15 @@ static void check_unmap(struct dma_debug_entry *ref)
ref->dev_addr, ref->size,
type2name[entry->type], type2name[ref->type]);
} else if ((entry->type == dma_debug_coherent) &&
- (ref->paddr != entry->paddr)) {
+ (phys_addr(ref) != phys_addr(entry))) {
err_printk(ref->dev, entry, "DMA-API: device driver frees "
"DMA memory with different CPU address "
"[device address=0x%016llx] [size=%llu bytes] "
"[cpu alloc address=0x%016llx] "
"[cpu free address=0x%016llx]",
ref->dev_addr, ref->size,
- (unsigned long long)entry->paddr,
- (unsigned long long)ref->paddr);
+ phys_addr(entry),
+ phys_addr(ref));
}

if (ref->sg_call_ents && ref->type == dma_debug_sg &&
@@ -1052,7 +1188,8 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,

entry->dev = dev;
entry->type = dma_debug_page;
- entry->paddr = page_to_phys(page) + offset;
+ entry->pfn = page_to_pfn(page);
+ entry->offset = offset,
entry->dev_addr = dma_addr;
entry->size = size;
entry->direction = direction;
@@ -1148,7 +1285,8 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,

entry->type = dma_debug_sg;
entry->dev = dev;
- entry->paddr = sg_phys(s);
+ entry->pfn = page_to_pfn(sg_page(s));
+ entry->offset = s->offset,
entry->size = sg_dma_len(s);
entry->dev_addr = sg_dma_address(s);
entry->direction = direction;
@@ -1198,7 +1336,8 @@ void debug_dma_unmap_sg(struct device *dev, struct scatterlist *sglist,
struct dma_debug_entry ref = {
.type = dma_debug_sg,
.dev = dev,
- .paddr = sg_phys(s),
+ .pfn = page_to_pfn(sg_page(s)),
+ .offset = s->offset,
.dev_addr = sg_dma_address(s),
.size = sg_dma_len(s),
.direction = dir,
@@ -1233,7 +1372,8 @@ void debug_dma_alloc_coherent(struct device *dev, size_t size,

entry->type = dma_debug_coherent;
entry->dev = dev;
- entry->paddr = virt_to_phys(virt);
+ entry->pfn = page_to_pfn(virt_to_page(virt));
+ entry->offset = (size_t) virt & PAGE_MASK;
entry->size = size;
entry->dev_addr = dma_addr;
entry->direction = DMA_BIDIRECTIONAL;
@@ -1248,7 +1388,8 @@ void debug_dma_free_coherent(struct device *dev, size_t size,
struct dma_debug_entry ref = {
.type = dma_debug_coherent,
.dev = dev,
- .paddr = virt_to_phys(virt),
+ .pfn = page_to_pfn(virt_to_page(virt)),
+ .offset = (size_t) virt & PAGE_MASK,
.dev_addr = addr,
.size = size,
.direction = DMA_BIDIRECTIONAL,
@@ -1356,7 +1497,8 @@ void debug_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
struct dma_debug_entry ref = {
.type = dma_debug_sg,
.dev = dev,
- .paddr = sg_phys(s),
+ .pfn = page_to_pfn(sg_page(s)),
+ .offset = s->offset,
.dev_addr = sg_dma_address(s),
.size = sg_dma_len(s),
.direction = direction,
@@ -1388,7 +1530,8 @@ void debug_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
struct dma_debug_entry ref = {
.type = dma_debug_sg,
.dev = dev,
- .paddr = sg_phys(s),
+ .pfn = page_to_pfn(sg_page(s)),
+ .offset = s->offset,
.dev_addr = sg_dma_address(s),
.size = sg_dma_len(s),
.direction = direction,
diff --git a/mm/memory.c b/mm/memory.c
index 5d9025f3b3e1..c89788436f81 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/dma-debug.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -2559,6 +2560,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,

static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
{
+ debug_dma_assert_idle(src);
+
/*
* If the source page was a PFN mapping, we don't have
* a "struct page" for it. We do a best-effort copy by

2014-01-14 01:14:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] dma debug: introduce debug_dma_assert_idle()

On Mon, 13 Jan 2014 16:48:47 -0800 Dan Williams <[email protected]> wrote:

> Record actively mapped pages and provide an api for asserting a given
> page is dma inactive before execution proceeds. Placing
> debug_dma_assert_idle() in cow_user_page() flagged the violation of the
> dma-api in the NET_DMA implementation (see commit 77873803363c "net_dma:
> mark broken").

Some discussion of the overlap counter thing would be useful.

> --- a/include/linux/dma-debug.h
> +++ b/include/linux/dma-debug.h
>
> ...
>
> +static void __active_pfn_inc_overlap(struct dma_debug_entry *entry)
> +{
> + unsigned long pfn = entry->pfn;
> + int i;
> +
> + for (i = 0; i < RADIX_TREE_MAX_TAGS; i++)
> + if (radix_tree_tag_get(&dma_active_pfn, pfn, i) == 0) {
> + radix_tree_tag_set(&dma_active_pfn, pfn, i);
> + return;
> + }
> + pr_debug("DMA-API: max overlap count (%d) reached for pfn 0x%lx\n",
> + RADIX_TREE_MAX_TAGS, pfn);
> +}
> +
> +static void __active_pfn_dec_overlap(struct dma_debug_entry *entry)
> +{
> + unsigned long pfn = entry->pfn;
> + int i;
> +
> + for (i = RADIX_TREE_MAX_TAGS - 1; i >= 0; i--)
> + if (radix_tree_tag_get(&dma_active_pfn, pfn, i)) {
> + radix_tree_tag_clear(&dma_active_pfn, pfn, i);
> + return;
> + }
> + radix_tree_delete(&dma_active_pfn, pfn);
> +}
> +
> +static int active_pfn_insert(struct dma_debug_entry *entry)
> +{
> + unsigned long flags;
> + int rc;
> +
> + spin_lock_irqsave(&radix_lock, flags);
> + rc = radix_tree_insert(&dma_active_pfn, entry->pfn, entry);
> + if (rc == -EEXIST)
> + __active_pfn_inc_overlap(entry);
> + spin_unlock_irqrestore(&radix_lock, flags);
> +
> + return rc;
> +}
> +
> +static void active_pfn_remove(struct dma_debug_entry *entry)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&radix_lock, flags);
> + __active_pfn_dec_overlap(entry);
> + spin_unlock_irqrestore(&radix_lock, flags);
> +}

OK, I think I see what's happening. The tags thing acts as a crude
counter and if the map/unmap count ends up imbalanced, we deliberately
leak an entry in the radix-tree so it can later be reported via undescribed
means. Thoughts:

- RADIX_TREE_MAX_TAGS=3 so the code could count to 7, with a bit of
futzing around.

- from a style/readability point of view it is unexpected that
__active_pfn_dec_overlap() actually removes radix-tree items. It
would be better to do:

spin_lock_irqsave(&radix_lock, flags);
if (__active_pfn_dec_overlap(entry) == something) {
/*
* Nice comment goes here
*/
radix_tree_delete(...);
}
spin_unlock_irqrestore(&radix_lock, flags);

2014-01-14 02:40:19

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] dma debug: introduce debug_dma_assert_idle()

On Mon, Jan 13, 2014 at 5:14 PM, Andrew Morton
<[email protected]> wrote:
> On Mon, 13 Jan 2014 16:48:47 -0800 Dan Williams <[email protected]> wrote:
>
>> Record actively mapped pages and provide an api for asserting a given
>> page is dma inactive before execution proceeds. Placing
>> debug_dma_assert_idle() in cow_user_page() flagged the violation of the
>> dma-api in the NET_DMA implementation (see commit 77873803363c "net_dma:
>> mark broken").
>
> Some discussion of the overlap counter thing would be useful.

Ok, will add:

"The implementation also has the ability to count repeat mappings of
the same page without an intervening unmap. This counter is limited
to the few bits of tag space in a radix tree. This mechanism is added
to mitigate false negative cases where, for example, a page is dma
mapped twice and debug_dma_assert_idle() is called after the page is
un-mapped once."

>> --- a/include/linux/dma-debug.h
>> +++ b/include/linux/dma-debug.h
>>
>> ...
>>
>> +static void __active_pfn_inc_overlap(struct dma_debug_entry *entry)
>> +{
>> + unsigned long pfn = entry->pfn;
>> + int i;
>> +
>> + for (i = 0; i < RADIX_TREE_MAX_TAGS; i++)
>> + if (radix_tree_tag_get(&dma_active_pfn, pfn, i) == 0) {
>> + radix_tree_tag_set(&dma_active_pfn, pfn, i);
>> + return;
>> + }
>> + pr_debug("DMA-API: max overlap count (%d) reached for pfn 0x%lx\n",
>> + RADIX_TREE_MAX_TAGS, pfn);
>> +}
>> +
>> +static void __active_pfn_dec_overlap(struct dma_debug_entry *entry)
>> +{
>> + unsigned long pfn = entry->pfn;
>> + int i;
>> +
>> + for (i = RADIX_TREE_MAX_TAGS - 1; i >= 0; i--)
>> + if (radix_tree_tag_get(&dma_active_pfn, pfn, i)) {
>> + radix_tree_tag_clear(&dma_active_pfn, pfn, i);
>> + return;
>> + }
>> + radix_tree_delete(&dma_active_pfn, pfn);
>> +}
>> +
>> +static int active_pfn_insert(struct dma_debug_entry *entry)
>> +{
>> + unsigned long flags;
>> + int rc;
>> +
>> + spin_lock_irqsave(&radix_lock, flags);
>> + rc = radix_tree_insert(&dma_active_pfn, entry->pfn, entry);
>> + if (rc == -EEXIST)
>> + __active_pfn_inc_overlap(entry);
>> + spin_unlock_irqrestore(&radix_lock, flags);
>> +
>> + return rc;
>> +}
>> +
>> +static void active_pfn_remove(struct dma_debug_entry *entry)
>> +{
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&radix_lock, flags);
>> + __active_pfn_dec_overlap(entry);
>> + spin_unlock_irqrestore(&radix_lock, flags);
>> +}
>
> OK, I think I see what's happening. The tags thing acts as a crude
> counter and if the map/unmap count ends up imbalanced, we deliberately
> leak an entry in the radix-tree so it can later be reported via undescribed
> means. Thoughts:

Certainly the leak will be noticed by debug_dma_assert_idle(), but
there's no guarantee that we trigger that check at the time of the
leak. Hmm, dma_debug_entries would also leak in that case...

> - RADIX_TREE_MAX_TAGS=3 so the code could count to 7, with a bit of
> futzing around.

Yes, if we are going to count, we might as well leverage the full
number space to help debug implementations that overlap severely. I
should also flesh out the error reporting to note that
debug_dma_assert_idle() may give false positives in the case where the
overlap counter overflows.
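
For example (again just a sketch, reusing the illustrative helpers
above rather than the patch code), the increment could saturate at the
top of the tag number space and warn once instead of wrapping:

        /* saturate at (1 << RADIX_TREE_MAX_TAGS) - 1; past this point
         * the busy/idle state of the pfn can no longer be tracked
         * reliably, so complain once and stop counting
         */
        static void pfn_overlap_inc(unsigned long pfn)
        {
                int count = pfn_overlap_read(pfn);

                if (count < (1 << RADIX_TREE_MAX_TAGS) - 1)
                        pfn_overlap_write(pfn, count + 1);
                else
                        WARN_ONCE(1, "DMA-API: overlap counter saturated for pfn 0x%lx\n",
                                  pfn);
        }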

> - from a style/readability point of view it is unexpected that
> __active_pfn_dec_overlap() actually removes radix-tree items. It
> would be better to do:
>
> spin_lock_irqsave(&radix_lock, flags);
> if (__active_pfn_dec_overlap(entry) == something) {
> /*
> * Nice comment goes here
> */
> radix_tree_delete(...);
> }
> spin_unlock_irqrestore(&radix_lock, flags);
>

Yes, I should have noticed the asymmetry with the insert case, will fix.

2014-01-14 05:16:23

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v3 2/4] net_dma: revert 'copied_early'

From: Dan Williams <[email protected]>
Date: Mon, 13 Jan 2014 16:47:14 -0800

> Now that tcp_dma_try_early_copy() is gone nothing ever sets
> copied_early.
>
> Also reverts "53240c208776 tcp: Fix possible double-ack w/ user dma"
> since it is no longer necessary.
>
> Cc: Ali Saidi <[email protected]>
> Cc: James Morris <[email protected]>
> Cc: Patrick McHardy <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: David S. Miller <[email protected]>
> Cc: Alexey Kuznetsov <[email protected]>
> Cc: Hideaki YOSHIFUJI <[email protected]>
> Cc: Neal Cardwell <[email protected]>
> Reported-by: Dave Jones <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Acked-by: David S. Miller <[email protected]>

2014-01-14 06:04:23

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 2/4] net_dma: revert 'copied_early'

On Mon, Jan 13, 2014 at 9:16 PM, David Miller <[email protected]> wrote:
> From: Dan Williams <[email protected]>
> Date: Mon, 13 Jan 2014 16:47:14 -0800
>
>> Now that tcp_dma_try_early_copy() is gone nothing ever sets
>> copied_early.
>>
>> Also reverts "53240c208776 tcp: Fix possible double-ack w/ user dma"
>> since it is no longer necessary.
>>
>> Cc: Ali Saidi <[email protected]>
>> Cc: James Morris <[email protected]>
>> Cc: Patrick McHardy <[email protected]>
>> Cc: Eric Dumazet <[email protected]>
>> Cc: David S. Miller <[email protected]>
>> Cc: Alexey Kuznetsov <[email protected]>
>> Cc: Hideaki YOSHIFUJI <[email protected]>
>> Cc: Neal Cardwell <[email protected]>
>> Reported-by: Dave Jones <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>
> Acked-by: David S. Miller <[email protected]>

Thank you sir.

2014-01-14 22:04:36

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] dma debug: introduce debug_dma_assert_idle()

On Mon, 2014-01-13 at 17:14 -0800, Andrew Morton wrote:
> On Mon, 13 Jan 2014 16:48:47 -0800 Dan Williams <[email protected]> wrote:
>
> > Record actively mapped pages and provide an api for asserting a given
> > page is dma inactive before execution proceeds. Placing
> > debug_dma_assert_idle() in cow_user_page() flagged the violation of the
> > dma-api in the NET_DMA implementation (see commit 77873803363c "net_dma:
> > mark broken").
>
> Some discussion of the overlap counter thing would be useful.
>
[..]
> OK, I think I see what's happening. The tags thing acts as a crude
> counter and if the map/unmap count ends up imbalanced, we deliberately
> leak an entry in the radix-tree so it can later be reported via undescribed
> means. Thoughts:
>
> - RADIX_TREE_MAX_TAGS=3 so the code could count to 7, with a bit of
> futzing around.
>
> - from a style/readability point of view it is unexpected that
> __active_pfn_dec_overlap() actually removes radix-tree items. It
> would be better to do:
>
> spin_lock_irqsave(&radix_lock, flags);
> if (__active_pfn_dec_overlap(entry) == something) {
> /*
> * Nice comment goes here
> */
> radix_tree_delete(...);
> }
> spin_unlock_irqrestore(&radix_lock, flags);
>
>

Ok, here is v4, let me know if you prefer a new mail or if the
'scissors' are sufficient:

>8-----------------
From: Dan Williams <[email protected]>
Date: Tue, 17 Dec 2013 12:31:34 -0800
Subject: [PATCH v4] dma debug: introduce debug_dma_assert_idle()

Record actively mapped pages and provide an api for asserting a given
page is dma inactive before execution proceeds. Placing
debug_dma_assert_idle() in cow_user_page() flagged the violation of the
dma-api in the NET_DMA implementation (see commit 77873803363c "net_dma:
mark broken").

The implementation includes the capability to count, in a limited way,
repeat mappings of the same page that occur without an intervening
unmap. This 'overlap' counter is limited to the few bits of tag space
in a radix tree. This mechanism is added to mitigate false negative
cases where, for example, a page is dma mapped twice and
debug_dma_assert_idle() is called after the page is un-mapped once.

Cc: Joerg Roedel <[email protected]>
Cc: Vinod Koul <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Russell King <[email protected]>
Cc: James Bottomley <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/dma-debug.h | 6 ++
lib/Kconfig.debug | 12 +++-
lib/dma-debug.c | 193 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 3 +
4 files changed, 199 insertions(+), 15 deletions(-)

diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fc0e34ce038f..fe8cb610deac 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -85,6 +85,8 @@ extern void debug_dma_sync_sg_for_device(struct device *dev,

extern void debug_dma_dump_mappings(struct device *dev);

+extern void debug_dma_assert_idle(struct page *page);
+
#else /* CONFIG_DMA_API_DEBUG */

static inline void dma_debug_add_bus(struct bus_type *bus)
@@ -183,6 +185,10 @@ static inline void debug_dma_dump_mappings(struct device *dev)
{
}

+static inline void debug_dma_assert_idle(struct page *page)
+{
+}
+
#endif /* CONFIG_DMA_API_DEBUG */

#endif /* __DMA_DEBUG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index db25707aa41b..20073e7156e4 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1575,8 +1575,16 @@ config DMA_API_DEBUG
With this option you will be able to detect common bugs in device
drivers like double-freeing of DMA mappings or freeing mappings that
were never allocated.
- This option causes a performance degredation. Use only if you want
- to debug device drivers. If unsure, say N.
+
+ This also attempts to catch cases where a page owned by DMA is
+ accessed by the cpu in a way that could cause data corruption. For
+ example, this enables cow_user_page() to check that the source page is
+ not undergoing DMA.
+
+ This option causes a performance degradation. Use only if you want to
+ debug device drivers and dma interactions.
+
+ If unsure, say N.

source "samples/Kconfig"

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index d87a17a819d0..c38083871f11 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -53,11 +53,26 @@ enum map_err_types {

#define DMA_DEBUG_STACKTRACE_ENTRIES 5

+/**
+ * struct dma_debug_entry - track a dma_map* or dma_alloc_coherent mapping
+ * @list: node on pre-allocated free_entries list
+ * @dev: 'dev' argument to dma_map_{page|single|sg} or dma_alloc_coherent
+ * @type: single, page, sg, coherent
+ * @pfn: page frame of the start address
+ * @offset: offset of mapping relative to pfn
+ * @size: length of the mapping
+ * @direction: enum dma_data_direction
+ * @sg_call_ents: 'nents' from dma_map_sg
+ * @sg_mapped_ents: 'mapped_ents' from dma_map_sg
+ * @map_err_type: track whether dma_mapping_error() was checked
+ * @stacktrace: support backtraces when a violation is detected
+ */
struct dma_debug_entry {
struct list_head list;
struct device *dev;
int type;
- phys_addr_t paddr;
+ unsigned long pfn;
+ size_t offset;
u64 dev_addr;
u64 size;
int direction;
@@ -372,6 +387,11 @@ static void hash_bucket_del(struct dma_debug_entry *entry)
list_del(&entry->list);
}

+static unsigned long long phys_addr(struct dma_debug_entry *entry)
+{
+ return page_to_phys(pfn_to_page(entry->pfn)) + entry->offset;
+}
+
/*
* Dump mapping entries for debugging purposes
*/
@@ -389,9 +409,9 @@ void debug_dma_dump_mappings(struct device *dev)
list_for_each_entry(entry, &bucket->list, list) {
if (!dev || dev == entry->dev) {
dev_info(entry->dev,
- "%s idx %d P=%Lx D=%Lx L=%Lx %s %s\n",
+ "%s idx %d P=%Lx N=%lx D=%Lx L=%Lx %s %s\n",
type2name[entry->type], idx,
- (unsigned long long)entry->paddr,
+ phys_addr(entry), entry->pfn,
entry->dev_addr, entry->size,
dir2name[entry->direction],
maperr2str[entry->map_err_type]);
@@ -404,6 +424,133 @@ void debug_dma_dump_mappings(struct device *dev)
EXPORT_SYMBOL(debug_dma_dump_mappings);

/*
+ * For each page mapped (initial page in the case of
+ * dma_alloc_coherent/dma_map_{single|page}, or each page in a
+ * scatterlist) insert into this tree using the pfn as the key. At
+ * dma_unmap_{single|sg|page} or dma_free_coherent delete the entry. If
+ * the pfn already exists at insertion time add a tag as a reference
+ * count for the overlapping mappings. For now, the overlap tracking
+ * just ensures that 'unmaps' balance 'maps' before marking the pfn
+ * idle, but we should also be flagging overlaps as an API violation.
+ *
+ * Memory usage is mostly constrained by the maximum number of available
+ * dma-debug entries in that we need a free dma_debug_entry before
+ * inserting into the tree. In the case of dma_map_{single|page} and
+ * dma_alloc_coherent there is only one dma_debug_entry and one pfn to
+ * track per event. dma_map_sg(), on the other hand,
+ * consumes a single dma_debug_entry, but inserts 'nents' entries into
+ * the tree.
+ *
+ * At any time debug_dma_assert_idle() can be called to trigger a
+ * warning if the given page is in the active set.
+ */
+static RADIX_TREE(dma_active_pfn, GFP_NOWAIT);
+static DEFINE_SPINLOCK(radix_lock);
+#define ACTIVE_PFN_MAX_OVERLAP ((1 << RADIX_TREE_MAX_TAGS) - 1)
+
+static int active_pfn_read_overlap(unsigned long pfn)
+{
+ int overlap = 0, i;
+
+ for (i = RADIX_TREE_MAX_TAGS - 1; i >= 0; i--)
+ if (radix_tree_tag_get(&dma_active_pfn, pfn, i))
+ overlap |= 1 << i;
+ return overlap;
+}
+
+static int active_pfn_set_overlap(unsigned long pfn, int overlap)
+{
+ int i;
+
+ if (overlap > ACTIVE_PFN_MAX_OVERLAP || overlap < 0)
+ return 0;
+
+ for (i = RADIX_TREE_MAX_TAGS - 1; i >= 0; i--)
+ if (overlap & 1 << i)
+ radix_tree_tag_set(&dma_active_pfn, pfn, i);
+ else
+ radix_tree_tag_clear(&dma_active_pfn, pfn, i);
+
+ return overlap;
+}
+
+static void active_pfn_inc_overlap(unsigned long pfn)
+{
+ int overlap = active_pfn_read_overlap(pfn);
+
+ overlap = active_pfn_set_overlap(pfn, ++overlap);
+
+ /* If we overflowed the overlap counter then we're potentially
+ * leaking dma-mappings. Otherwise, if maps and unmaps are
+ * balanced then this overflow may cause false negatives in
+ * debug_dma_assert_idle() as the pfn may be marked idle
+ * prematurely.
+ */
+ WARN_ONCE(overlap == 0,
+ "DMA-API: exceeded %d overlapping mappings of pfn %lx\n",
+ ACTIVE_PFN_MAX_OVERLAP, pfn);
+}
+
+static int active_pfn_dec_overlap(unsigned long pfn)
+{
+ int overlap = active_pfn_read_overlap(pfn);
+
+ return active_pfn_set_overlap(pfn, --overlap);
+}
+
+static int active_pfn_insert(struct dma_debug_entry *entry)
+{
+ unsigned long flags;
+ int rc;
+
+ spin_lock_irqsave(&radix_lock, flags);
+ rc = radix_tree_insert(&dma_active_pfn, entry->pfn, entry);
+ if (rc == -EEXIST)
+ active_pfn_inc_overlap(entry->pfn);
+ spin_unlock_irqrestore(&radix_lock, flags);
+
+ return rc;
+}
+
+static void active_pfn_remove(struct dma_debug_entry *entry)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&radix_lock, flags);
+ if (active_pfn_dec_overlap(entry->pfn) == 0)
+ radix_tree_delete(&dma_active_pfn, entry->pfn);
+ spin_unlock_irqrestore(&radix_lock, flags);
+}
+
+/**
+ * debug_dma_assert_idle() - assert that a page is not undergoing dma
+ * @page: page to lookup in the dma_active_pfn tree
+ *
+ * Place a call to this routine in cases where the cpu touching the page
+ * before the dma completes (page is dma_unmapped) will lead to data
+ * corruption.
+ */
+void debug_dma_assert_idle(struct page *page)
+{
+ unsigned long flags;
+ struct dma_debug_entry *entry;
+
+ if (!page)
+ return;
+
+ spin_lock_irqsave(&radix_lock, flags);
+ entry = radix_tree_lookup(&dma_active_pfn, page_to_pfn(page));
+ spin_unlock_irqrestore(&radix_lock, flags);
+
+ if (!entry)
+ return;
+
+ err_printk(entry->dev, entry,
+ "DMA-API: cpu touching an active dma mapped page "
+ "[pfn=0x%lx]\n", entry->pfn);
+}
+
+/*
* Wrapper function for adding an entry to the hash.
* This function takes care of locking itself.
*/
@@ -411,10 +558,21 @@ static void add_dma_entry(struct dma_debug_entry *entry)
{
struct hash_bucket *bucket;
unsigned long flags;
+ int rc;

bucket = get_hash_bucket(entry, &flags);
hash_bucket_add(bucket, entry);
put_hash_bucket(bucket, &flags);
+
+ rc = active_pfn_insert(entry);
+ if (rc == -ENOMEM) {
+ pr_err("DMA-API: pfn tracking ENOMEM, dma-debug disabled\n");
+ global_disable = true;
+ }
+
+ /* TODO: report -EEXIST errors here as overlapping mappings are
+ * not supported by the DMA API
+ */
}

static struct dma_debug_entry *__dma_entry_alloc(void)
@@ -469,6 +627,8 @@ static void dma_entry_free(struct dma_debug_entry *entry)
{
unsigned long flags;

+ active_pfn_remove(entry);
+
/*
* add to beginning of the list - this way the entries are
* more likely cache hot when they are reallocated.
@@ -895,15 +1055,15 @@ static void check_unmap(struct dma_debug_entry *ref)
ref->dev_addr, ref->size,
type2name[entry->type], type2name[ref->type]);
} else if ((entry->type == dma_debug_coherent) &&
- (ref->paddr != entry->paddr)) {
+ (phys_addr(ref) != phys_addr(entry))) {
err_printk(ref->dev, entry, "DMA-API: device driver frees "
"DMA memory with different CPU address "
"[device address=0x%016llx] [size=%llu bytes] "
"[cpu alloc address=0x%016llx] "
"[cpu free address=0x%016llx]",
ref->dev_addr, ref->size,
- (unsigned long long)entry->paddr,
- (unsigned long long)ref->paddr);
+ phys_addr(entry),
+ phys_addr(ref));
}

if (ref->sg_call_ents && ref->type == dma_debug_sg &&
@@ -1052,7 +1212,8 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,

entry->dev = dev;
entry->type = dma_debug_page;
- entry->paddr = page_to_phys(page) + offset;
+ entry->pfn = page_to_pfn(page);
+ entry->offset = offset,
entry->dev_addr = dma_addr;
entry->size = size;
entry->direction = direction;
@@ -1148,7 +1309,8 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,

entry->type = dma_debug_sg;
entry->dev = dev;
- entry->paddr = sg_phys(s);
+ entry->pfn = page_to_pfn(sg_page(s));
+ entry->offset = s->offset,
entry->size = sg_dma_len(s);
entry->dev_addr = sg_dma_address(s);
entry->direction = direction;
@@ -1198,7 +1360,8 @@ void debug_dma_unmap_sg(struct device *dev, struct scatterlist *sglist,
struct dma_debug_entry ref = {
.type = dma_debug_sg,
.dev = dev,
- .paddr = sg_phys(s),
+ .pfn = page_to_pfn(sg_page(s)),
+ .offset = s->offset,
.dev_addr = sg_dma_address(s),
.size = sg_dma_len(s),
.direction = dir,
@@ -1233,7 +1396,8 @@ void debug_dma_alloc_coherent(struct device *dev, size_t size,

entry->type = dma_debug_coherent;
entry->dev = dev;
- entry->paddr = virt_to_phys(virt);
+ entry->pfn = page_to_pfn(virt_to_page(virt));
+ entry->offset = (size_t) virt & PAGE_MASK;
entry->size = size;
entry->dev_addr = dma_addr;
entry->direction = DMA_BIDIRECTIONAL;
@@ -1248,7 +1412,8 @@ void debug_dma_free_coherent(struct device *dev, size_t size,
struct dma_debug_entry ref = {
.type = dma_debug_coherent,
.dev = dev,
- .paddr = virt_to_phys(virt),
+ .pfn = page_to_pfn(virt_to_page(virt)),
+ .offset = (size_t) virt & PAGE_MASK,
.dev_addr = addr,
.size = size,
.direction = DMA_BIDIRECTIONAL,
@@ -1356,7 +1521,8 @@ void debug_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
struct dma_debug_entry ref = {
.type = dma_debug_sg,
.dev = dev,
- .paddr = sg_phys(s),
+ .pfn = page_to_pfn(sg_page(s)),
+ .offset = s->offset,
.dev_addr = sg_dma_address(s),
.size = sg_dma_len(s),
.direction = direction,
@@ -1388,7 +1554,8 @@ void debug_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
struct dma_debug_entry ref = {
.type = dma_debug_sg,
.dev = dev,
- .paddr = sg_phys(s),
+ .pfn = page_to_pfn(sg_page(s)),
+ .offset = s->offset,
.dev_addr = sg_dma_address(s),
.size = sg_dma_len(s),
.direction = direction,
diff --git a/mm/memory.c b/mm/memory.c
index 5d9025f3b3e1..c89788436f81 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/dma-debug.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -2559,6 +2560,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,

static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
{
+ debug_dma_assert_idle(src);
+
/*
* If the source page was a PFN mapping, we don't have
* a "struct page" for it. We do a best-effort copy by
--
1.7.7.6


2014-01-15 21:20:32

by saeed bishara

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] net_dma: simple removal

Hi Dan,

I'm using net_dma on my system and I see a meaningful performance
boost when running iperf receive tests.

As far as I know, net_dma is used by many embedded systems out
there, and removing it might affect their performance.
Can you please elaborate on the exact scenario that causes the memory corruption?

Is the scenario mentioned here triggered by a "real life" application,
or is it more of a theoretical issue found through manual testing? I
was trying to find the thread describing the failing scenario and
couldn't find it; any pointer would be appreciated.

Thanks

On Tue, Jan 14, 2014 at 2:46 AM, Dan Williams <[email protected]> wrote:
> Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
> and there is no plan to fix it.
>
> This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
> Reverting the remainder of the net_dma induced changes is deferred to
> subsequent patches.
>
> Cc: Dave Jiang <[email protected]>
> Cc: Vinod Koul <[email protected]>
> Cc: David Whipple <[email protected]>
> Cc: Alexander Duyck <[email protected]>
> Acked-by: David S. Miller <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
>
> No changes since v2
>
>
> Documentation/ABI/removed/net_dma | 8 +
> Documentation/networking/ip-sysctl.txt | 6 -
> drivers/dma/Kconfig | 12 -
> drivers/dma/Makefile | 1
> drivers/dma/dmaengine.c | 104 ------------
> drivers/dma/ioat/dma.c | 1
> drivers/dma/ioat/dma.h | 7 -
> drivers/dma/ioat/dma_v2.c | 1
> drivers/dma/ioat/dma_v3.c | 1
> drivers/dma/iovlock.c | 280 --------------------------------
> include/linux/dmaengine.h | 22 ---
> include/linux/skbuff.h | 8 -
> include/linux/tcp.h | 8 -
> include/net/netdma.h | 32 ----
> include/net/sock.h | 19 --
> include/net/tcp.h | 8 -
> kernel/sysctl_binary.c | 1
> net/core/Makefile | 1
> net/core/dev.c | 10 -
> net/core/sock.c | 6 -
> net/core/user_dma.c | 131 ---------------
> net/dccp/proto.c | 4
> net/ipv4/sysctl_net_ipv4.c | 9 -
> net/ipv4/tcp.c | 147 ++---------------
> net/ipv4/tcp_input.c | 61 -------
> net/ipv4/tcp_ipv4.c | 18 --
> net/ipv6/tcp_ipv6.c | 13 -
> net/llc/af_llc.c | 10 +
> 28 files changed, 35 insertions(+), 894 deletions(-)
> create mode 100644 Documentation/ABI/removed/net_dma
> delete mode 100644 drivers/dma/iovlock.c
> delete mode 100644 include/net/netdma.h
> delete mode 100644 net/core/user_dma.c
>
> diff --git a/Documentation/ABI/removed/net_dma b/Documentation/ABI/removed/net_dma
> new file mode 100644
> index 000000000000..a173aecc2f18
> --- /dev/null
> +++ b/Documentation/ABI/removed/net_dma
> @@ -0,0 +1,8 @@
> +What: tcp_dma_copybreak sysctl
> +Date: Removed in kernel v3.13
> +Contact: Dan Williams <[email protected]>
> +Description:
> + Formerly the lower limit, in bytes, of the size of socket reads
> + that will be offloaded to a DMA copy engine. Removed due to
> + coherency issues of the cpu potentially touching the buffers
> + while dma is in flight.
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 3c12d9a7ed00..bdd8a67f0be2 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -538,12 +538,6 @@ tcp_workaround_signed_windows - BOOLEAN
> not receive a window scaling option from them.
> Default: 0
>
> -tcp_dma_copybreak - INTEGER
> - Lower limit, in bytes, of the size of socket reads that will be
> - offloaded to a DMA copy engine, if one is present in the system
> - and CONFIG_NET_DMA is enabled.
> - Default: 4096
> -
> tcp_thin_linear_timeouts - BOOLEAN
> Enable dynamic triggering of linear timeouts for thin streams.
> If set, a check is performed upon retransmission by timeout to
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index c823daaf9043..b24f13195272 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -351,18 +351,6 @@ config DMA_OF
> comment "DMA Clients"
> depends on DMA_ENGINE
>
> -config NET_DMA
> - bool "Network: TCP receive copy offload"
> - depends on DMA_ENGINE && NET
> - default (INTEL_IOATDMA || FSL_DMA)
> - depends on BROKEN
> - help
> - This enables the use of DMA engines in the network stack to
> - offload receive copy-to-user operations, freeing CPU cycles.
> -
> - Say Y here if you enabled INTEL_IOATDMA or FSL_DMA, otherwise
> - say N.
> -
> config ASYNC_TX_DMA
> bool "Async_tx: Offload support for the async_tx api"
> depends on DMA_ENGINE
> diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
> index 0ce2da97e429..024b008a25de 100644
> --- a/drivers/dma/Makefile
> +++ b/drivers/dma/Makefile
> @@ -6,7 +6,6 @@ obj-$(CONFIG_DMA_VIRTUAL_CHANNELS) += virt-dma.o
> obj-$(CONFIG_DMA_ACPI) += acpi-dma.o
> obj-$(CONFIG_DMA_OF) += of-dma.o
>
> -obj-$(CONFIG_NET_DMA) += iovlock.o
> obj-$(CONFIG_INTEL_MID_DMAC) += intel_mid_dma.o
> obj-$(CONFIG_DMATEST) += dmatest.o
> obj-$(CONFIG_INTEL_IOATDMA) += ioat/
> diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
> index ef63b9058f3c..d7f4f4e0d71f 100644
> --- a/drivers/dma/dmaengine.c
> +++ b/drivers/dma/dmaengine.c
> @@ -1029,110 +1029,6 @@ dmaengine_get_unmap_data(struct device *dev, int nr, gfp_t flags)
> }
> EXPORT_SYMBOL(dmaengine_get_unmap_data);
>
> -/**
> - * dma_async_memcpy_pg_to_pg - offloaded copy from page to page
> - * @chan: DMA channel to offload copy to
> - * @dest_pg: destination page
> - * @dest_off: offset in page to copy to
> - * @src_pg: source page
> - * @src_off: offset in page to copy from
> - * @len: length
> - *
> - * Both @dest_page/@dest_off and @src_page/@src_off must be mappable to a bus
> - * address according to the DMA mapping API rules for streaming mappings.
> - * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
> - * (kernel memory or locked user space pages).
> - */
> -dma_cookie_t
> -dma_async_memcpy_pg_to_pg(struct dma_chan *chan, struct page *dest_pg,
> - unsigned int dest_off, struct page *src_pg, unsigned int src_off,
> - size_t len)
> -{
> - struct dma_device *dev = chan->device;
> - struct dma_async_tx_descriptor *tx;
> - struct dmaengine_unmap_data *unmap;
> - dma_cookie_t cookie;
> - unsigned long flags;
> -
> - unmap = dmaengine_get_unmap_data(dev->dev, 2, GFP_NOWAIT);
> - if (!unmap)
> - return -ENOMEM;
> -
> - unmap->to_cnt = 1;
> - unmap->from_cnt = 1;
> - unmap->addr[0] = dma_map_page(dev->dev, src_pg, src_off, len,
> - DMA_TO_DEVICE);
> - unmap->addr[1] = dma_map_page(dev->dev, dest_pg, dest_off, len,
> - DMA_FROM_DEVICE);
> - unmap->len = len;
> - flags = DMA_CTRL_ACK;
> - tx = dev->device_prep_dma_memcpy(chan, unmap->addr[1], unmap->addr[0],
> - len, flags);
> -
> - if (!tx) {
> - dmaengine_unmap_put(unmap);
> - return -ENOMEM;
> - }
> -
> - dma_set_unmap(tx, unmap);
> - cookie = tx->tx_submit(tx);
> - dmaengine_unmap_put(unmap);
> -
> - preempt_disable();
> - __this_cpu_add(chan->local->bytes_transferred, len);
> - __this_cpu_inc(chan->local->memcpy_count);
> - preempt_enable();
> -
> - return cookie;
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
> -
> -/**
> - * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
> - * @chan: DMA channel to offload copy to
> - * @dest: destination address (virtual)
> - * @src: source address (virtual)
> - * @len: length
> - *
> - * Both @dest and @src must be mappable to a bus address according to the
> - * DMA mapping API rules for streaming mappings.
> - * Both @dest and @src must stay memory resident (kernel memory or locked
> - * user space pages).
> - */
> -dma_cookie_t
> -dma_async_memcpy_buf_to_buf(struct dma_chan *chan, void *dest,
> - void *src, size_t len)
> -{
> - return dma_async_memcpy_pg_to_pg(chan, virt_to_page(dest),
> - (unsigned long) dest & ~PAGE_MASK,
> - virt_to_page(src),
> - (unsigned long) src & ~PAGE_MASK, len);
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
> -
> -/**
> - * dma_async_memcpy_buf_to_pg - offloaded copy from address to page
> - * @chan: DMA channel to offload copy to
> - * @page: destination page
> - * @offset: offset in page to copy to
> - * @kdata: source address (virtual)
> - * @len: length
> - *
> - * Both @page/@offset and @kdata must be mappable to a bus address according
> - * to the DMA mapping API rules for streaming mappings.
> - * Both @page/@offset and @kdata must stay memory resident (kernel memory or
> - * locked user space pages)
> - */
> -dma_cookie_t
> -dma_async_memcpy_buf_to_pg(struct dma_chan *chan, struct page *page,
> - unsigned int offset, void *kdata, size_t len)
> -{
> - return dma_async_memcpy_pg_to_pg(chan, page, offset,
> - virt_to_page(kdata),
> - (unsigned long) kdata & ~PAGE_MASK, len);
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
> -
> void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
> struct dma_chan *chan)
> {
> diff --git a/drivers/dma/ioat/dma.c b/drivers/dma/ioat/dma.c
> index 1a49c777607c..97fa394ca855 100644
> --- a/drivers/dma/ioat/dma.c
> +++ b/drivers/dma/ioat/dma.c
> @@ -1175,7 +1175,6 @@ int ioat1_dma_probe(struct ioatdma_device *device, int dca)
> err = ioat_probe(device);
> if (err)
> return err;
> - ioat_set_tcp_copy_break(4096);
> err = ioat_register(device);
> if (err)
> return err;
> diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
> index 11fb877ddca9..664ec9cbd651 100644
> --- a/drivers/dma/ioat/dma.h
> +++ b/drivers/dma/ioat/dma.h
> @@ -214,13 +214,6 @@ __dump_desc_dbg(struct ioat_chan_common *chan, struct ioat_dma_descriptor *hw,
> #define dump_desc_dbg(c, d) \
> ({ if (d) __dump_desc_dbg(&c->base, d->hw, &d->txd, desc_id(d)); 0; })
>
> -static inline void ioat_set_tcp_copy_break(unsigned long copybreak)
> -{
> - #ifdef CONFIG_NET_DMA
> - sysctl_tcp_dma_copybreak = copybreak;
> - #endif
> -}
> -
> static inline struct ioat_chan_common *
> ioat_chan_by_index(struct ioatdma_device *device, int index)
> {
> diff --git a/drivers/dma/ioat/dma_v2.c b/drivers/dma/ioat/dma_v2.c
> index 5d3affe7e976..31e8098e444f 100644
> --- a/drivers/dma/ioat/dma_v2.c
> +++ b/drivers/dma/ioat/dma_v2.c
> @@ -900,7 +900,6 @@ int ioat2_dma_probe(struct ioatdma_device *device, int dca)
> err = ioat_probe(device);
> if (err)
> return err;
> - ioat_set_tcp_copy_break(2048);
>
> list_for_each_entry(c, &dma->channels, device_node) {
> chan = to_chan_common(c);
> diff --git a/drivers/dma/ioat/dma_v3.c b/drivers/dma/ioat/dma_v3.c
> index 820817e97e62..4bb81346bee2 100644
> --- a/drivers/dma/ioat/dma_v3.c
> +++ b/drivers/dma/ioat/dma_v3.c
> @@ -1652,7 +1652,6 @@ int ioat3_dma_probe(struct ioatdma_device *device, int dca)
> err = ioat_probe(device);
> if (err)
> return err;
> - ioat_set_tcp_copy_break(262144);
>
> list_for_each_entry(c, &dma->channels, device_node) {
> chan = to_chan_common(c);
> diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
> deleted file mode 100644
> index bb48a57c2fc1..000000000000
> --- a/drivers/dma/iovlock.c
> +++ /dev/null
> @@ -1,280 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - * Portions based on net/core/datagram.c and copyrighted by their authors.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -
> -/*
> - * This code allows the net stack to make use of a DMA engine for
> - * skb to iovec copies.
> - */
> -
> -#include <linux/dmaengine.h>
> -#include <linux/pagemap.h>
> -#include <linux/slab.h>
> -#include <net/tcp.h> /* for memcpy_toiovec */
> -#include <asm/io.h>
> -#include <asm/uaccess.h>
> -
> -static int num_pages_spanned(struct iovec *iov)
> -{
> - return
> - ((PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
> - ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT);
> -}
> -
> -/*
> - * Pin down all the iovec pages needed for len bytes.
> - * Return a struct dma_pinned_list to keep track of pages pinned down.
> - *
> - * We are allocating a single chunk of memory, and then carving it up into
> - * 3 sections, the latter 2 whose size depends on the number of iovecs and the
> - * total number of pages, respectively.
> - */
> -struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
> -{
> - struct dma_pinned_list *local_list;
> - struct page **pages;
> - int i;
> - int ret;
> - int nr_iovecs = 0;
> - int iovec_len_used = 0;
> - int iovec_pages_used = 0;
> -
> - /* don't pin down non-user-based iovecs */
> - if (segment_eq(get_fs(), KERNEL_DS))
> - return NULL;
> -
> - /* determine how many iovecs/pages there are, up front */
> - do {
> - iovec_len_used += iov[nr_iovecs].iov_len;
> - iovec_pages_used += num_pages_spanned(&iov[nr_iovecs]);
> - nr_iovecs++;
> - } while (iovec_len_used < len);
> -
> - /* single kmalloc for pinned list, page_list[], and the page arrays */
> - local_list = kmalloc(sizeof(*local_list)
> - + (nr_iovecs * sizeof (struct dma_page_list))
> - + (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL);
> - if (!local_list)
> - goto out;
> -
> - /* list of pages starts right after the page list array */
> - pages = (struct page **) &local_list->page_list[nr_iovecs];
> -
> - local_list->nr_iovecs = 0;
> -
> - for (i = 0; i < nr_iovecs; i++) {
> - struct dma_page_list *page_list = &local_list->page_list[i];
> -
> - len -= iov[i].iov_len;
> -
> - if (!access_ok(VERIFY_WRITE, iov[i].iov_base, iov[i].iov_len))
> - goto unpin;
> -
> - page_list->nr_pages = num_pages_spanned(&iov[i]);
> - page_list->base_address = iov[i].iov_base;
> -
> - page_list->pages = pages;
> - pages += page_list->nr_pages;
> -
> - /* pin pages down */
> - down_read(&current->mm->mmap_sem);
> - ret = get_user_pages(
> - current,
> - current->mm,
> - (unsigned long) iov[i].iov_base,
> - page_list->nr_pages,
> - 1, /* write */
> - 0, /* force */
> - page_list->pages,
> - NULL);
> - up_read(&current->mm->mmap_sem);
> -
> - if (ret != page_list->nr_pages)
> - goto unpin;
> -
> - local_list->nr_iovecs = i + 1;
> - }
> -
> - return local_list;
> -
> -unpin:
> - dma_unpin_iovec_pages(local_list);
> -out:
> - return NULL;
> -}
> -
> -void dma_unpin_iovec_pages(struct dma_pinned_list *pinned_list)
> -{
> - int i, j;
> -
> - if (!pinned_list)
> - return;
> -
> - for (i = 0; i < pinned_list->nr_iovecs; i++) {
> - struct dma_page_list *page_list = &pinned_list->page_list[i];
> - for (j = 0; j < page_list->nr_pages; j++) {
> - set_page_dirty_lock(page_list->pages[j]);
> - page_cache_release(page_list->pages[j]);
> - }
> - }
> -
> - kfree(pinned_list);
> -}
> -
> -
> -/*
> - * We have already pinned down the pages we will be using in the iovecs.
> - * Each entry in iov array has corresponding entry in pinned_list->page_list.
> - * Using array indexing to keep iov[] and page_list[] in sync.
> - * Initial elements in iov array's iov->iov_len will be 0 if already copied into
> - * by another call.
> - * iov array length remaining guaranteed to be bigger than len.
> - */
> -dma_cookie_t dma_memcpy_to_iovec(struct dma_chan *chan, struct iovec *iov,
> - struct dma_pinned_list *pinned_list, unsigned char *kdata, size_t len)
> -{
> - int iov_byte_offset;
> - int copy;
> - dma_cookie_t dma_cookie = 0;
> - int iovec_idx;
> - int page_idx;
> -
> - if (!chan)
> - return memcpy_toiovec(iov, kdata, len);
> -
> - iovec_idx = 0;
> - while (iovec_idx < pinned_list->nr_iovecs) {
> - struct dma_page_list *page_list;
> -
> - /* skip already used-up iovecs */
> - while (!iov[iovec_idx].iov_len)
> - iovec_idx++;
> -
> - page_list = &pinned_list->page_list[iovec_idx];
> -
> - iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
> - page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
> - - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
> -
> - /* break up copies to not cross page boundary */
> - while (iov[iovec_idx].iov_len) {
> - copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
> - copy = min_t(int, copy, iov[iovec_idx].iov_len);
> -
> - dma_cookie = dma_async_memcpy_buf_to_pg(chan,
> - page_list->pages[page_idx],
> - iov_byte_offset,
> - kdata,
> - copy);
> - /* poll for a descriptor slot */
> - if (unlikely(dma_cookie < 0)) {
> - dma_async_issue_pending(chan);
> - continue;
> - }
> -
> - len -= copy;
> - iov[iovec_idx].iov_len -= copy;
> - iov[iovec_idx].iov_base += copy;
> -
> - if (!len)
> - return dma_cookie;
> -
> - kdata += copy;
> - iov_byte_offset = 0;
> - page_idx++;
> - }
> - iovec_idx++;
> - }
> -
> - /* really bad if we ever run out of iovecs */
> - BUG();
> - return -EFAULT;
> -}
> -
> -dma_cookie_t dma_memcpy_pg_to_iovec(struct dma_chan *chan, struct iovec *iov,
> - struct dma_pinned_list *pinned_list, struct page *page,
> - unsigned int offset, size_t len)
> -{
> - int iov_byte_offset;
> - int copy;
> - dma_cookie_t dma_cookie = 0;
> - int iovec_idx;
> - int page_idx;
> - int err;
> -
> - /* this needs as-yet-unimplemented buf-to-buff, so punt. */
> - /* TODO: use dma for this */
> - if (!chan || !pinned_list) {
> - u8 *vaddr = kmap(page);
> - err = memcpy_toiovec(iov, vaddr + offset, len);
> - kunmap(page);
> - return err;
> - }
> -
> - iovec_idx = 0;
> - while (iovec_idx < pinned_list->nr_iovecs) {
> - struct dma_page_list *page_list;
> -
> - /* skip already used-up iovecs */
> - while (!iov[iovec_idx].iov_len)
> - iovec_idx++;
> -
> - page_list = &pinned_list->page_list[iovec_idx];
> -
> - iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
> - page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
> - - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
> -
> - /* break up copies to not cross page boundary */
> - while (iov[iovec_idx].iov_len) {
> - copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
> - copy = min_t(int, copy, iov[iovec_idx].iov_len);
> -
> - dma_cookie = dma_async_memcpy_pg_to_pg(chan,
> - page_list->pages[page_idx],
> - iov_byte_offset,
> - page,
> - offset,
> - copy);
> - /* poll for a descriptor slot */
> - if (unlikely(dma_cookie < 0)) {
> - dma_async_issue_pending(chan);
> - continue;
> - }
> -
> - len -= copy;
> - iov[iovec_idx].iov_len -= copy;
> - iov[iovec_idx].iov_base += copy;
> -
> - if (!len)
> - return dma_cookie;
> -
> - offset += copy;
> - iov_byte_offset = 0;
> - page_idx++;
> - }
> - iovec_idx++;
> - }
> -
> - /* really bad if we ever run out of iovecs */
> - BUG();
> - return -EFAULT;
> -}
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index 41cf0c399288..890545871af0 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -875,18 +875,6 @@ static inline void dmaengine_put(void)
> }
> #endif
>
> -#ifdef CONFIG_NET_DMA
> -#define net_dmaengine_get() dmaengine_get()
> -#define net_dmaengine_put() dmaengine_put()
> -#else
> -static inline void net_dmaengine_get(void)
> -{
> -}
> -static inline void net_dmaengine_put(void)
> -{
> -}
> -#endif
> -
> #ifdef CONFIG_ASYNC_TX_DMA
> #define async_dmaengine_get() dmaengine_get()
> #define async_dmaengine_put() dmaengine_put()
> @@ -908,16 +896,8 @@ async_dma_find_channel(enum dma_transaction_type type)
> return NULL;
> }
> #endif /* CONFIG_ASYNC_TX_DMA */
> -
> -dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
> - void *dest, void *src, size_t len);
> -dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
> - struct page *page, unsigned int offset, void *kdata, size_t len);
> -dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
> - struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
> - unsigned int src_off, size_t len);
> void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
> - struct dma_chan *chan);
> + struct dma_chan *chan);
>
> static inline void async_tx_ack(struct dma_async_tx_descriptor *tx)
> {
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bec1cc7d5e3c..ac4f84dfa84b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -28,7 +28,6 @@
> #include <linux/textsearch.h>
> #include <net/checksum.h>
> #include <linux/rcupdate.h>
> -#include <linux/dmaengine.h>
> #include <linux/hrtimer.h>
> #include <linux/dma-mapping.h>
> #include <linux/netdev_features.h>
> @@ -496,11 +495,8 @@ struct sk_buff {
> /* 6/8 bit hole (depending on ndisc_nodetype presence) */
> kmemcheck_bitfield_end(flags2);
>
> -#if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
> - union {
> - unsigned int napi_id;
> - dma_cookie_t dma_cookie;
> - };
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> + unsigned int napi_id;
> #endif
> #ifdef CONFIG_NETWORK_SECMARK
> __u32 secmark;
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d68633452d9b..26f16021ce1d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -19,7 +19,6 @@
>
>
> #include <linux/skbuff.h>
> -#include <linux/dmaengine.h>
> #include <net/sock.h>
> #include <net/inet_connection_sock.h>
> #include <net/inet_timewait_sock.h>
> @@ -169,13 +168,6 @@ struct tcp_sock {
> struct iovec *iov;
> int memory;
> int len;
> -#ifdef CONFIG_NET_DMA
> - /* members for async copy */
> - struct dma_chan *dma_chan;
> - int wakeup;
> - struct dma_pinned_list *pinned_list;
> - dma_cookie_t dma_cookie;
> -#endif
> } ucopy;
>
> u32 snd_wl1; /* Sequence for window update */
> diff --git a/include/net/netdma.h b/include/net/netdma.h
> deleted file mode 100644
> index 8ba8ce284eeb..000000000000
> --- a/include/net/netdma.h
> +++ /dev/null
> @@ -1,32 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -#ifndef NETDMA_H
> -#define NETDMA_H
> -#ifdef CONFIG_NET_DMA
> -#include <linux/dmaengine.h>
> -#include <linux/skbuff.h>
> -
> -int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
> - struct sk_buff *skb, int offset, struct iovec *to,
> - size_t len, struct dma_pinned_list *pinned_list);
> -
> -#endif /* CONFIG_NET_DMA */
> -#endif /* NETDMA_H */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e3a18ff0c38b..9d5f716e921e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -231,7 +231,6 @@ struct cg_proto;
> * @sk_receive_queue: incoming packets
> * @sk_wmem_alloc: transmit queue bytes committed
> * @sk_write_queue: Packet sending queue
> - * @sk_async_wait_queue: DMA copied packets
> * @sk_omem_alloc: "o" is "option" or "other"
> * @sk_wmem_queued: persistent queue size
> * @sk_forward_alloc: space allocated forward
> @@ -354,10 +353,6 @@ struct sock {
> struct sk_filter __rcu *sk_filter;
> struct socket_wq __rcu *sk_wq;
>
> -#ifdef CONFIG_NET_DMA
> - struct sk_buff_head sk_async_wait_queue;
> -#endif
> -
> #ifdef CONFIG_XFRM
> struct xfrm_policy *sk_policy[2];
> #endif
> @@ -2200,27 +2195,15 @@ void sock_tx_timestamp(struct sock *sk, __u8 *tx_flags);
> * sk_eat_skb - Release a skb if it is no longer needed
> * @sk: socket to eat this skb from
> * @skb: socket buffer to eat
> - * @copied_early: flag indicating whether DMA operations copied this data early
> *
> * This routine must be called with interrupts disabled or with the socket
> * locked so that the sk_buff queue operation is ok.
> */
> -#ifdef CONFIG_NET_DMA
> -static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
> -{
> - __skb_unlink(skb, &sk->sk_receive_queue);
> - if (!copied_early)
> - __kfree_skb(skb);
> - else
> - __skb_queue_tail(&sk->sk_async_wait_queue, skb);
> -}
> -#else
> -static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
> +static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
> {
> __skb_unlink(skb, &sk->sk_receive_queue);
> __kfree_skb(skb);
> }
> -#endif
>
> static inline
> struct net *sock_net(const struct sock *sk)
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 70e55d200610..084c163e9d40 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -27,7 +27,6 @@
> #include <linux/cache.h>
> #include <linux/percpu.h>
> #include <linux/skbuff.h>
> -#include <linux/dmaengine.h>
> #include <linux/crypto.h>
> #include <linux/cryptohash.h>
> #include <linux/kref.h>
> @@ -267,7 +266,6 @@ extern int sysctl_tcp_adv_win_scale;
> extern int sysctl_tcp_tw_reuse;
> extern int sysctl_tcp_frto;
> extern int sysctl_tcp_low_latency;
> -extern int sysctl_tcp_dma_copybreak;
> extern int sysctl_tcp_nometrics_save;
> extern int sysctl_tcp_moderate_rcvbuf;
> extern int sysctl_tcp_tso_win_divisor;
> @@ -1032,12 +1030,6 @@ static inline void tcp_prequeue_init(struct tcp_sock *tp)
> tp->ucopy.len = 0;
> tp->ucopy.memory = 0;
> skb_queue_head_init(&tp->ucopy.prequeue);
> -#ifdef CONFIG_NET_DMA
> - tp->ucopy.dma_chan = NULL;
> - tp->ucopy.wakeup = 0;
> - tp->ucopy.pinned_list = NULL;
> - tp->ucopy.dma_cookie = 0;
> -#endif
> }
>
> bool tcp_prequeue(struct sock *sk, struct sk_buff *skb);
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 653cbbd9e7ad..d457005acedf 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -390,7 +390,6 @@ static const struct bin_table bin_net_ipv4_table[] = {
> { CTL_INT, NET_TCP_MTU_PROBING, "tcp_mtu_probing" },
> { CTL_INT, NET_TCP_BASE_MSS, "tcp_base_mss" },
> { CTL_INT, NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS, "tcp_workaround_signed_windows" },
> - { CTL_INT, NET_TCP_DMA_COPYBREAK, "tcp_dma_copybreak" },
> { CTL_INT, NET_TCP_SLOW_START_AFTER_IDLE, "tcp_slow_start_after_idle" },
> { CTL_INT, NET_CIPSOV4_CACHE_ENABLE, "cipso_cache_enable" },
> { CTL_INT, NET_CIPSOV4_CACHE_BUCKET_SIZE, "cipso_cache_bucket_size" },
> diff --git a/net/core/Makefile b/net/core/Makefile
> index b33b996f5dd6..5f98e5983bd3 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -16,7 +16,6 @@ obj-y += net-sysfs.o
> obj-$(CONFIG_PROC_FS) += net-procfs.o
> obj-$(CONFIG_NET_PKTGEN) += pktgen.o
> obj-$(CONFIG_NETPOLL) += netpoll.o
> -obj-$(CONFIG_NET_DMA) += user_dma.o
> obj-$(CONFIG_FIB_RULES) += fib_rules.o
> obj-$(CONFIG_TRACEPOINTS) += net-traces.o
> obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ba3b7ea5ebb3..677a5a4dcca7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1262,7 +1262,6 @@ static int __dev_open(struct net_device *dev)
> clear_bit(__LINK_STATE_START, &dev->state);
> else {
> dev->flags |= IFF_UP;
> - net_dmaengine_get();
> dev_set_rx_mode(dev);
> dev_activate(dev);
> add_device_randomness(dev->dev_addr, dev->addr_len);
> @@ -1338,7 +1337,6 @@ static int __dev_close_many(struct list_head *head)
> ops->ndo_stop(dev);
>
> dev->flags &= ~IFF_UP;
> - net_dmaengine_put();
> }
>
> return 0;
> @@ -4362,14 +4360,6 @@ static void net_rx_action(struct softirq_action *h)
> out:
> net_rps_action_and_irq_enable(sd);
>
> -#ifdef CONFIG_NET_DMA
> - /*
> - * There may not be any more sk_buffs coming right now, so push
> - * any pending DMA copies to hardware
> - */
> - dma_issue_pending_all();
> -#endif
> -
> return;
>
> softnet_break:
> diff --git a/net/core/sock.c b/net/core/sock.c
> index ab20ed9b0f31..411dab3a5726 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1461,9 +1461,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> atomic_set(&newsk->sk_omem_alloc, 0);
> skb_queue_head_init(&newsk->sk_receive_queue);
> skb_queue_head_init(&newsk->sk_write_queue);
> -#ifdef CONFIG_NET_DMA
> - skb_queue_head_init(&newsk->sk_async_wait_queue);
> -#endif
>
> spin_lock_init(&newsk->sk_dst_lock);
> rwlock_init(&newsk->sk_callback_lock);
> @@ -2290,9 +2287,6 @@ void sock_init_data(struct socket *sock, struct sock *sk)
> skb_queue_head_init(&sk->sk_receive_queue);
> skb_queue_head_init(&sk->sk_write_queue);
> skb_queue_head_init(&sk->sk_error_queue);
> -#ifdef CONFIG_NET_DMA
> - skb_queue_head_init(&sk->sk_async_wait_queue);
> -#endif
>
> sk->sk_send_head = NULL;
>
> diff --git a/net/core/user_dma.c b/net/core/user_dma.c
> deleted file mode 100644
> index 1b5fefdb8198..000000000000
> --- a/net/core/user_dma.c
> +++ /dev/null
> @@ -1,131 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - * Portions based on net/core/datagram.c and copyrighted by their authors.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -
> -/*
> - * This code allows the net stack to make use of a DMA engine for
> - * skb to iovec copies.
> - */
> -
> -#include <linux/dmaengine.h>
> -#include <linux/socket.h>
> -#include <linux/export.h>
> -#include <net/tcp.h>
> -#include <net/netdma.h>
> -
> -#define NET_DMA_DEFAULT_COPYBREAK 4096
> -
> -int sysctl_tcp_dma_copybreak = NET_DMA_DEFAULT_COPYBREAK;
> -EXPORT_SYMBOL(sysctl_tcp_dma_copybreak);
> -
> -/**
> - * dma_skb_copy_datagram_iovec - Copy a datagram to an iovec.
> - * @skb - buffer to copy
> - * @offset - offset in the buffer to start copying from
> - * @iovec - io vector to copy to
> - * @len - amount of data to copy from buffer to iovec
> - * @pinned_list - locked iovec buffer data
> - *
> - * Note: the iovec is modified during the copy.
> - */
> -int dma_skb_copy_datagram_iovec(struct dma_chan *chan,
> - struct sk_buff *skb, int offset, struct iovec *to,
> - size_t len, struct dma_pinned_list *pinned_list)
> -{
> - int start = skb_headlen(skb);
> - int i, copy = start - offset;
> - struct sk_buff *frag_iter;
> - dma_cookie_t cookie = 0;
> -
> - /* Copy header. */
> - if (copy > 0) {
> - if (copy > len)
> - copy = len;
> - cookie = dma_memcpy_to_iovec(chan, to, pinned_list,
> - skb->data + offset, copy);
> - if (cookie < 0)
> - goto fault;
> - len -= copy;
> - if (len == 0)
> - goto end;
> - offset += copy;
> - }
> -
> - /* Copy paged appendix. Hmm... why does this look so complicated? */
> - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> - int end;
> - const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> -
> - WARN_ON(start > offset + len);
> -
> - end = start + skb_frag_size(frag);
> - copy = end - offset;
> - if (copy > 0) {
> - struct page *page = skb_frag_page(frag);
> -
> - if (copy > len)
> - copy = len;
> -
> - cookie = dma_memcpy_pg_to_iovec(chan, to, pinned_list, page,
> - frag->page_offset + offset - start, copy);
> - if (cookie < 0)
> - goto fault;
> - len -= copy;
> - if (len == 0)
> - goto end;
> - offset += copy;
> - }
> - start = end;
> - }
> -
> - skb_walk_frags(skb, frag_iter) {
> - int end;
> -
> - WARN_ON(start > offset + len);
> -
> - end = start + frag_iter->len;
> - copy = end - offset;
> - if (copy > 0) {
> - if (copy > len)
> - copy = len;
> - cookie = dma_skb_copy_datagram_iovec(chan, frag_iter,
> - offset - start,
> - to, copy,
> - pinned_list);
> - if (cookie < 0)
> - goto fault;
> - len -= copy;
> - if (len == 0)
> - goto end;
> - offset += copy;
> - }
> - start = end;
> - }
> -
> -end:
> - if (!len) {
> - skb->dma_cookie = cookie;
> - return cookie;
> - }
> -
> -fault:
> - return -EFAULT;
> -}
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index eb892b4f4814..f9076f295b13 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -848,7 +848,7 @@ int dccp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> default:
> dccp_pr_debug("packet_type=%s\n",
> dccp_packet_name(dh->dccph_type));
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> }
> verify_sock_status:
> if (sock_flag(sk, SOCK_DONE)) {
> @@ -905,7 +905,7 @@ verify_sock_status:
> len = skb->len;
> found_fin_ok:
> if (!(flags & MSG_PEEK))
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> break;
> } while (1);
> out:
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 3d69ec8dac57..79a90b92e12d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -642,15 +642,6 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> -#ifdef CONFIG_NET_DMA
> - {
> - .procname = "tcp_dma_copybreak",
> - .data = &sysctl_tcp_dma_copybreak,
> - .maxlen = sizeof(int),
> - .mode = 0644,
> - .proc_handler = proc_dointvec
> - },
> -#endif
> {
> .procname = "tcp_slow_start_after_idle",
> .data = &sysctl_tcp_slow_start_after_idle,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c4638e6f0238..8dc913dfbaef 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -274,7 +274,6 @@
> #include <net/tcp.h>
> #include <net/xfrm.h>
> #include <net/ip.h>
> -#include <net/netdma.h>
> #include <net/sock.h>
>
> #include <asm/uaccess.h>
> @@ -1409,39 +1408,6 @@ static void tcp_prequeue_process(struct sock *sk)
> tp->ucopy.memory = 0;
> }
>
> -#ifdef CONFIG_NET_DMA
> -static void tcp_service_net_dma(struct sock *sk, bool wait)
> -{
> - dma_cookie_t done, used;
> - dma_cookie_t last_issued;
> - struct tcp_sock *tp = tcp_sk(sk);
> -
> - if (!tp->ucopy.dma_chan)
> - return;
> -
> - last_issued = tp->ucopy.dma_cookie;
> - dma_async_issue_pending(tp->ucopy.dma_chan);
> -
> - do {
> - if (dma_async_is_tx_complete(tp->ucopy.dma_chan,
> - last_issued, &done,
> - &used) == DMA_COMPLETE) {
> - /* Safe to free early-copied skbs now */
> - __skb_queue_purge(&sk->sk_async_wait_queue);
> - break;
> - } else {
> - struct sk_buff *skb;
> - while ((skb = skb_peek(&sk->sk_async_wait_queue)) &&
> - (dma_async_is_complete(skb->dma_cookie, done,
> - used) == DMA_COMPLETE)) {
> - __skb_dequeue(&sk->sk_async_wait_queue);
> - kfree_skb(skb);
> - }
> - }
> - } while (wait);
> -}
> -#endif
> -
> static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
> {
> struct sk_buff *skb;
> @@ -1459,7 +1425,7 @@ static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
> * splitted a fat GRO packet, while we released socket lock
> * in skb_splice_bits()
> */
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> }
> return NULL;
> }
> @@ -1525,11 +1491,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> continue;
> }
> if (tcp_hdr(skb)->fin) {
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> ++seq;
> break;
> }
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> if (!desc->count)
> break;
> tp->copied_seq = seq;
> @@ -1567,7 +1533,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> int target; /* Read at least this many bytes */
> long timeo;
> struct task_struct *user_recv = NULL;
> - bool copied_early = false;
> struct sk_buff *skb;
> u32 urg_hole = 0;
>
> @@ -1610,28 +1575,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>
> target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
>
> -#ifdef CONFIG_NET_DMA
> - tp->ucopy.dma_chan = NULL;
> - preempt_disable();
> - skb = skb_peek_tail(&sk->sk_receive_queue);
> - {
> - int available = 0;
> -
> - if (skb)
> - available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
> - if ((available < target) &&
> - (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
> - !sysctl_tcp_low_latency &&
> - net_dma_find_channel()) {
> - preempt_enable_no_resched();
> - tp->ucopy.pinned_list =
> - dma_pin_iovec_pages(msg->msg_iov, len);
> - } else {
> - preempt_enable_no_resched();
> - }
> - }
> -#endif
> -
> do {
> u32 offset;
>
> @@ -1762,16 +1705,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> /* __ Set realtime policy in scheduler __ */
> }
>
> -#ifdef CONFIG_NET_DMA
> - if (tp->ucopy.dma_chan) {
> - if (tp->rcv_wnd == 0 &&
> - !skb_queue_empty(&sk->sk_async_wait_queue)) {
> - tcp_service_net_dma(sk, true);
> - tcp_cleanup_rbuf(sk, copied);
> - } else
> - dma_async_issue_pending(tp->ucopy.dma_chan);
> - }
> -#endif
> if (copied >= target) {
> /* Do not sleep, just process backlog. */
> release_sock(sk);
> @@ -1779,11 +1712,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> } else
> sk_wait_data(sk, &timeo);
>
> -#ifdef CONFIG_NET_DMA
> - tcp_service_net_dma(sk, false); /* Don't block */
> - tp->ucopy.wakeup = 0;
> -#endif
> -
> if (user_recv) {
> int chunk;
>
> @@ -1841,43 +1769,13 @@ do_prequeue:
> }
>
> if (!(flags & MSG_TRUNC)) {
> -#ifdef CONFIG_NET_DMA
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> -
> - if (tp->ucopy.dma_chan) {
> - tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
> - tp->ucopy.dma_chan, skb, offset,
> - msg->msg_iov, used,
> - tp->ucopy.pinned_list);
> -
> - if (tp->ucopy.dma_cookie < 0) {
> -
> - pr_alert("%s: dma_cookie < 0\n",
> - __func__);
> -
> - /* Exception. Bailout! */
> - if (!copied)
> - copied = -EFAULT;
> - break;
> - }
> -
> - dma_async_issue_pending(tp->ucopy.dma_chan);
> -
> - if ((offset + used) == skb->len)
> - copied_early = true;
> -
> - } else
> -#endif
> - {
> - err = skb_copy_datagram_iovec(skb, offset,
> - msg->msg_iov, used);
> - if (err) {
> - /* Exception. Bailout! */
> - if (!copied)
> - copied = -EFAULT;
> - break;
> - }
> + err = skb_copy_datagram_iovec(skb, offset,
> + msg->msg_iov, used);
> + if (err) {
> + /* Exception. Bailout! */
> + if (!copied)
> + copied = -EFAULT;
> + break;
> }
> }
>
> @@ -1897,19 +1795,15 @@ skip_copy:
>
> if (tcp_hdr(skb)->fin)
> goto found_fin_ok;
> - if (!(flags & MSG_PEEK)) {
> - sk_eat_skb(sk, skb, copied_early);
> - copied_early = false;
> - }
> + if (!(flags & MSG_PEEK))
> + sk_eat_skb(sk, skb);
> continue;
>
> found_fin_ok:
> /* Process the FIN. */
> ++*seq;
> - if (!(flags & MSG_PEEK)) {
> - sk_eat_skb(sk, skb, copied_early);
> - copied_early = false;
> - }
> + if (!(flags & MSG_PEEK))
> + sk_eat_skb(sk, skb);
> break;
> } while (len > 0);
>
> @@ -1932,16 +1826,6 @@ skip_copy:
> tp->ucopy.len = 0;
> }
>
> -#ifdef CONFIG_NET_DMA
> - tcp_service_net_dma(sk, true); /* Wait for queue to drain */
> - tp->ucopy.dma_chan = NULL;
> -
> - if (tp->ucopy.pinned_list) {
> - dma_unpin_iovec_pages(tp->ucopy.pinned_list);
> - tp->ucopy.pinned_list = NULL;
> - }
> -#endif
> -
> /* According to UNIX98, msg_name/msg_namelen are ignored
> * on connected socket. I was just happy when found this 8) --ANK
> */
> @@ -2285,9 +2169,6 @@ int tcp_disconnect(struct sock *sk, int flags)
> __skb_queue_purge(&sk->sk_receive_queue);
> tcp_write_queue_purge(sk);
> __skb_queue_purge(&tp->out_of_order_queue);
> -#ifdef CONFIG_NET_DMA
> - __skb_queue_purge(&sk->sk_async_wait_queue);
> -#endif
>
> inet->inet_dport = 0;
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c53b7f35c51d..33ef18e550c5 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -73,7 +73,6 @@
> #include <net/inet_common.h>
> #include <linux/ipsec.h>
> #include <asm/unaligned.h>
> -#include <net/netdma.h>
>
> int sysctl_tcp_timestamps __read_mostly = 1;
> int sysctl_tcp_window_scaling __read_mostly = 1;
> @@ -4967,53 +4966,6 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
> __tcp_checksum_complete_user(sk, skb);
> }
>
> -#ifdef CONFIG_NET_DMA
> -static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
> - int hlen)
> -{
> - struct tcp_sock *tp = tcp_sk(sk);
> - int chunk = skb->len - hlen;
> - int dma_cookie;
> - bool copied_early = false;
> -
> - if (tp->ucopy.wakeup)
> - return false;
> -
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> -
> - if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) {
> -
> - dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
> - skb, hlen,
> - tp->ucopy.iov, chunk,
> - tp->ucopy.pinned_list);
> -
> - if (dma_cookie < 0)
> - goto out;
> -
> - tp->ucopy.dma_cookie = dma_cookie;
> - copied_early = true;
> -
> - tp->ucopy.len -= chunk;
> - tp->copied_seq += chunk;
> - tcp_rcv_space_adjust(sk);
> -
> - if ((tp->ucopy.len == 0) ||
> - (tcp_flag_word(tcp_hdr(skb)) & TCP_FLAG_PSH) ||
> - (atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1))) {
> - tp->ucopy.wakeup = 1;
> - sk->sk_data_ready(sk, 0);
> - }
> - } else if (chunk > 0) {
> - tp->ucopy.wakeup = 1;
> - sk->sk_data_ready(sk, 0);
> - }
> -out:
> - return copied_early;
> -}
> -#endif /* CONFIG_NET_DMA */
> -
> /* Does PAWS and seqno based validation of an incoming segment, flags will
> * play significant role here.
> */
> @@ -5198,14 +5150,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>
> if (tp->copied_seq == tp->rcv_nxt &&
> len - tcp_header_len <= tp->ucopy.len) {
> -#ifdef CONFIG_NET_DMA
> - if (tp->ucopy.task == current &&
> - sock_owned_by_user(sk) &&
> - tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
> - copied_early = 1;
> - eaten = 1;
> - }
> -#endif
> if (tp->ucopy.task == current &&
> sock_owned_by_user(sk) && !copied_early) {
> __set_current_state(TASK_RUNNING);
> @@ -5271,11 +5215,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
> if (!copied_early || tp->rcv_nxt != tp->rcv_wup)
> __tcp_ack_snd_check(sk, 0);
> no_ack:
> -#ifdef CONFIG_NET_DMA
> - if (copied_early)
> - __skb_queue_tail(&sk->sk_async_wait_queue, skb);
> - else
> -#endif
> if (eaten)
> kfree_skb_partial(skb, fragstolen);
> sk->sk_data_ready(sk, 0);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 59a6f8b90cd9..dc92ba9d0350 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -72,7 +72,6 @@
> #include <net/inet_common.h>
> #include <net/timewait_sock.h>
> #include <net/xfrm.h>
> -#include <net/netdma.h>
> #include <net/secure_seq.h>
> #include <net/tcp_memcontrol.h>
> #include <net/busy_poll.h>
> @@ -2000,18 +1999,8 @@ process:
> bh_lock_sock_nested(sk);
> ret = 0;
> if (!sock_owned_by_user(sk)) {
> -#ifdef CONFIG_NET_DMA
> - struct tcp_sock *tp = tcp_sk(sk);
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> - if (tp->ucopy.dma_chan)
> + if (!tcp_prequeue(sk, skb))
> ret = tcp_v4_do_rcv(sk, skb);
> - else
> -#endif
> - {
> - if (!tcp_prequeue(sk, skb))
> - ret = tcp_v4_do_rcv(sk, skb);
> - }
> } else if (unlikely(sk_add_backlog(sk, skb,
> sk->sk_rcvbuf + sk->sk_sndbuf))) {
> bh_unlock_sock(sk);
> @@ -2170,11 +2159,6 @@ void tcp_v4_destroy_sock(struct sock *sk)
> }
> #endif
>
> -#ifdef CONFIG_NET_DMA
> - /* Cleans up our sk_async_wait_queue */
> - __skb_queue_purge(&sk->sk_async_wait_queue);
> -#endif
> -
> /* Clean prequeue, it must be empty really */
> __skb_queue_purge(&tp->ucopy.prequeue);
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 0740f93a114a..e27972590379 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -59,7 +59,6 @@
> #include <net/snmp.h>
> #include <net/dsfield.h>
> #include <net/timewait_sock.h>
> -#include <net/netdma.h>
> #include <net/inet_common.h>
> #include <net/secure_seq.h>
> #include <net/tcp_memcontrol.h>
> @@ -1504,18 +1503,8 @@ process:
> bh_lock_sock_nested(sk);
> ret = 0;
> if (!sock_owned_by_user(sk)) {
> -#ifdef CONFIG_NET_DMA
> - struct tcp_sock *tp = tcp_sk(sk);
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> - if (tp->ucopy.dma_chan)
> + if (!tcp_prequeue(sk, skb))
> ret = tcp_v6_do_rcv(sk, skb);
> - else
> -#endif
> - {
> - if (!tcp_prequeue(sk, skb))
> - ret = tcp_v6_do_rcv(sk, skb);
> - }
> } else if (unlikely(sk_add_backlog(sk, skb,
> sk->sk_rcvbuf + sk->sk_sndbuf))) {
> bh_unlock_sock(sk);
> diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
> index 7b01b9f5846c..e1b46709f8d6 100644
> --- a/net/llc/af_llc.c
> +++ b/net/llc/af_llc.c
> @@ -838,7 +838,7 @@ static int llc_ui_recvmsg(struct kiocb *iocb, struct socket *sock,
>
> if (!(flags & MSG_PEEK)) {
> spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> *seq = 0;
> }
> @@ -860,10 +860,10 @@ copy_uaddr:
> llc_cmsg_rcv(msg, skb);
>
> if (!(flags & MSG_PEEK)) {
> - spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> - sk_eat_skb(sk, skb, false);
> - spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> - *seq = 0;
> + spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> + sk_eat_skb(sk, skb);
> + spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> + *seq = 0;
> }
>
> goto out;
>

2014-01-15 21:31:56

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] net_dma: simple removal

On Wed, Jan 15, 2014 at 1:20 PM, saeed bishara <[email protected]> wrote:
> Hi Dan,
>
> I'm using net_dma on my system and I achieve a meaningful performance
> boost when running an Iperf receive test.
>
> As far as I know the net_dma is used by many embedded systems out
> there and might affect their performance.
> Can you please elaborate on the exact scenario that causes the memory corruption?
>
> Is the scenario mentioned here caused by a "real life" application, or
> is this more of a theoretical issue found through manual testing? I was
> trying to find the thread describing the failing scenario and couldn't
> find it; any pointer would be appreciated.

Did you see the referenced commit?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=77873803363c

This is a real issue in that any app that forks() while receiving data
can cause the dma data to be lost. The problem is that the copy
operation falls back to the cpu at many locations. Any one of those
instances could touch a mapped page and trigger a copy-on-write event.
The dma completes to the wrong location.
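
Roughly, a minimal (hypothetical) reproducer would look like the sketch
below -- the socket setup, buffer size, and timing here are illustrative
only, not taken from an actual test case:

/*
 * Hypothetical sketch: one thread sits in a large recv() so the NET_DMA
 * offload path is eligible, while the main thread forks.  fork()
 * write-protects the buffer pages for copy-on-write; after that, any
 * cpu fallback copy inside tcp_recvmsg() breaks COW and moves the
 * parent to a fresh page, while dma that was already issued still
 * completes into the original page.
 */
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static char buf[1 << 20];
static int sock_fd;             /* assumed: an established TCP socket */

static void *rx_thread(void *arg)
{
        (void)arg;
        /* length well above tcp_dma_copybreak, so the copy may be offloaded */
        recv(sock_fd, buf, sizeof(buf), 0);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, rx_thread, NULL);
        usleep(1000);                   /* let the receive get in flight */
        if (fork() == 0)                /* buf's pages become copy-on-write */
                _exit(0);
        /*
         * From here on, a cpu fallback in the receiving thread writes to
         * buf via copy_to_user(), breaks COW in the parent, and data the
         * dma engine was asked to deliver lands in the old page instead.
         */
        pthread_join(t, NULL);
        return 0;
}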

--
Dan

2014-01-15 21:33:56

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] net_dma: simple removal

On Wed, Jan 15, 2014 at 1:31 PM, Dan Williams <[email protected]> wrote:
> On Wed, Jan 15, 2014 at 1:20 PM, saeed bishara <[email protected]> wrote:
>> Hi Dan,
>>
>> I'm using net_dma on my system and I achieve a meaningful performance
>> boost when running an Iperf receive test.
>>
>> As far as I know the net_dma is used by many embedded systems out
>> there and might affect their performance.
>> Can you please elaborate on the exact scenario that causes the memory corruption?
>>
>> Is the scenario mentioned here caused by a "real life" application, or
>> is this more of a theoretical issue found through manual testing? I was
>> trying to find the thread describing the failing scenario and couldn't
>> find it; any pointer would be appreciated.
>
> Did you see the referenced commit?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=77873803363c
>
> This is a real issue in that any app that forks() while receiving data
> can cause the dma data to be lost. The problem is that the copy
> operation falls back to the cpu at many locations. Any one of those
> instances could touch a mapped page and trigger a copy-on-write event.
> The dma completes to the wrong location.
>

Btw, do you have benchmark data showing that NET_DMA is beneficial on
these platforms? I would have expected worse performance on platforms
without i/o coherent caches.

2014-01-17 20:16:30

by saeed bishara

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] net_dma: simple removal

Dan,

Isn't this issue similar to the direct io case?
Can you please look at the following article:
http://lwn.net/Articles/322795/

Regarding the performance improvement using NET_DMA, I don't have concrete
numbers, but it should be around 15-20%. My system is i/o coherent.

saeed


2014-01-21 09:45:12

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] net_dma: simple removal

On Fri, Jan 17, 2014 at 12:16 PM, saeed bishara <[email protected]> wrote:
> Dan,
>
>> Isn't this issue similar to the direct io case?
>> Can you please look at the following article:
> http://lwn.net/Articles/322795/

I guess it's similar, but the NET_DMA dma api violation is more
blatant. The same thread that requested DMA is also writing to those
same pages with the cpu. The fix is either guaranteeing that only the
dma engine ever touches the gup'd pages or synchronizing dma before
every cpu fallback.
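
To make the second option concrete, here is a rough sketch, written as
if it lived in net/ipv4/tcp.c and reusing the helpers from the net_dma
code removed above (tcp_service_net_dma() is local to tcp.c, so treat
this purely as illustration, not a proposed patch):

/* Illustration only: drain outstanding dma before any cpu fallback
 * touches the get_user_pages()'d iovec pages.
 */
static int copy_skb_to_iovec(struct sock *sk, struct sk_buff *skb,
                             int offset, struct iovec *to, int len)
{
        struct tcp_sock *tp = tcp_sk(sk);

        if (tp->ucopy.dma_chan) {
                /* offload path: only the dma engine touches the pages */
                if (dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan, skb,
                                                offset, to, len,
                                                tp->ucopy.pinned_list) < 0)
                        return -EFAULT;
                return 0;
        }

        /* cpu fallback: wait for dma already queued against these pages */
        tcp_service_net_dma(sk, true);
        return skb_copy_datagram_iovec(skb, offset, to, len);
}
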

> Regarding the performance improvement using NET_DMA, I don't have concrete
> numbers, but it should be around 15-20%. My system is i/o coherent.

That sounds too high... is that throughput or cpu utilization? It
sounds high because NET_DMA also makes the data cache cold while the
cpu copy warms the data before handing it to the application.

Can you measure relative numbers and share your testing details? You
will need to fix the data corruption and verify that the performance
advantage is still there before proposing NET_DMA be restored.

I have a new dma_debug capability in Andrew's tree that can help you
identify holes in the implementation.

http://ozlabs.org/~akpm/mmots/broken-out/dma-debug-introduce-debug_dma_assert_idle.patch
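
The idea, roughly (the real call site is in the patch above and lands in
mm/memory.c; the wrapper and names below are illustrative only):

#include <linux/dma-debug.h>
#include <linux/highmem.h>
#include <linux/mm.h>

/* Illustrative sketch: before the cpu duplicates a page for
 * copy-on-write, warn if dma-debug still sees an active mapping
 * overlapping that page -- i.e. someone did get_user_pages(), kicked
 * off dma, and is now racing it with the cpu.
 */
static void cow_one_page(struct page *new_page, struct page *old_page,
                         unsigned long addr, struct vm_area_struct *vma)
{
        debug_dma_assert_idle(old_page);
        copy_user_highpage(new_page, old_page, addr, vma);
}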

--
Dan


2014-01-22 10:38:32

by saeed bishara

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] net_dma: simple removal

On Tue, Jan 21, 2014 at 11:44 AM, Dan Williams <[email protected]> wrote:
> On Fri, Jan 17, 2014 at 12:16 PM, saeed bishara <[email protected]> wrote:
>> Dan,
>>
>> Isn't this issue similar to the direct io case?
>> Can you please look at the following article:
>> http://lwn.net/Articles/322795/
>
> I guess it's similar, but the NET_DMA dma api violation is more
> blatant. The same thread that requested DMA is also writing to those
> same pages with the cpu. The fix is either guaranteeing that only the
> dma engine ever touches the gup'd pages or synchronizing dma before
> every cpu fallback.
>
>> Regarding the performance improvement using NET_DMA, I don't have concrete
>> numbers, but it should be around 15-20%. My system is i/o coherent.
>
> That sounds too high... is that throughput or cpu utilization? It
That's the throughput improvement; my test is an iperf server (no special
flags, 1500 mtu).
The iperf process and the 10G eth interrupts are bound to the same cpu,
which is the bottleneck in my case.
I ran the following configurations:
a. NET_DMA=n
b. NET_DMA=y
c. NET_DMA=y + your dma_debug patch below
d. same as c. but with my simple fix patch

results in Gbps:
a. 5.41
b. 6.17 (+14%)
c. 5.93 (+9%)
d. 5.92 (+9%)

Dan, my simple fix is just to call tcp_service_net_dma(sk, true)
whenever the cpu is going to copy the data; a proper fix can of course
be smarter.
Do you think this is sufficient?


--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1295,6 +1295,7 @@ static int tcp_recv_urg(struct sock *sk, struct msghdr *msg, int len, int flags)
*/
return -EAGAIN;
}
+static void tcp_service_net_dma(struct sock *sk, bool wait);

static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
{
@@ -1302,6 +1303,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
int copied = 0, err = 0;

/* XXX -- need to support SO_PEEK_OFF */
+ tcp_service_net_dma(sk, true); /* Wait for queue to drain */

skb_queue_walk(&sk->sk_write_queue, skb) {
err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, skb->len);
@@ -1861,6 +1863,8 @@ do_prequeue:
} else
#endif
{
+ tcp_service_net_dma(sk, true); /* Wait for queue to drain */
+
err = skb_copy_datagram_iovec(skb, offset,
msg->msg_iov, used);
if (err) {




> sounds high because NET_DMA also makes the data cache cold while the
> cpu copy warms the data before handing it to the application.
For the iperf case the test doesn't touch the data.
Also, for some applications, especially storage, the data can also be
moved using dma, so this can actually be a big advantage.
>
> Can you measure relative numbers and share your testing details? You
> will need to fix the data corruption and verify that the performance
> advantage is still there before proposing NET_DMA be restored.
See above.
>
> I have a new dma_debug capability in Andrew's tree that can help you
> identify holes in the implementation.
>
> http://ozlabs.org/~akpm/mmots/broken-out/dma-debug-introduce-debug_dma_assert_idle.patch
>
> --
> Dan