2012-10-04 00:38:25

by Duyck, Alexander H

Subject: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

While working on 10Gb/s routing performance I found a significant amount of
time was being spent in the swiotlb DMA handler. Further digging found that
much of this was due to virtual to physical address translation and the cost
of calling the function that performed it; together these accounted for
nearly 60% of the total overhead.

This patch set works to resolve that by changing the io_tlb_start address and
io_tlb_overflow_buffer address from virtual addresses to physical addresses.
By doing this, devices that are not making use of bounce buffers can
significantly reduce their overhead. In addition I carried the cleanup
through to the point that the only functions that still require the virtual
address of the DMA buffer are the init, free, and bounce functions.
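
To make the hot path concrete, here is a rough before/after sketch of the
is_swiotlb_buffer() check, which runs on every unmap/sync even when no bounce
buffer is in use. This is illustrative only (the _old/_new suffixes and the
io_tlb_start_virt/io_tlb_end_virt names are made up for the comparison); the
real changes are in the patches below.

/* Before: the region bounds are kernel virtual addresses, so every call
 * pays for virt_to_phys() (the __phys_addr hits in the perf data below). */
static int is_swiotlb_buffer_old(phys_addr_t paddr)
{
	return paddr >= virt_to_phys(io_tlb_start_virt) &&
	       paddr < virt_to_phys(io_tlb_end_virt);
}

/* After: io_tlb_start is already a phys_addr_t, so the check is a shift
 * and two compares with no address translation at all. */
static int is_swiotlb_buffer_new(phys_addr_t paddr)
{
	return paddr >= io_tlb_start &&
	       paddr < io_tlb_start + (io_tlb_nslabs << IO_TLB_SHIFT);
}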

When running a routing throughput test using small packets I saw roughly a 5%
increase in packet rates after applying these patches. This appears to match
up with the CPU overhead reduction I was tracking via perf.

Before:
Results 10.29Mpps
# Overhead Symbol
# ........ ...........................................................................................................
#
1.97% [k] __phys_addr
|
|--24.97%-- swiotlb_sync_single
|
|--16.55%-- is_swiotlb_buffer
|
|--11.25%-- unmap_single
|
--2.71%-- swiotlb_dma_mapping_error
1.66% [k] swiotlb_sync_single
1.45% [k] is_swiotlb_buffer
0.53% [k] unmap_single
0.52% [k] swiotlb_map_page
0.47% [k] swiotlb_sync_single_for_device
0.43% [k] swiotlb_sync_single_for_cpu
0.42% [k] swiotlb_dma_mapping_error
0.34% [k] swiotlb_unmap_page

After:
Results 10.99Mpps
# Overhead Symbol
# ........ ...........................................................................................................
#
0.50% [k] swiotlb_map_page
0.50% [k] swiotlb_sync_single
0.36% [k] swiotlb_sync_single_for_cpu
0.35% [k] swiotlb_sync_single_for_device
0.25% [k] swiotlb_unmap_page
0.17% [k] swiotlb_dma_mapping_error

---

Alexander Duyck (7):
swiotlb: Do not export swiotlb_bounce since there are no external consumers
swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single
swiotlb: Use physical addresses for swiotlb_tbl_unmap_single
swiotlb: Return physical addresses when calling swiotlb_tbl_map_single
swiotlb: Make io_tlb_overflow_buffer a physical address
swiotlb: Make io_tlb_start a physical address instead of a virtual address
swiotlb: Instead of tracking the end of the swiotlb region just calculate it


drivers/xen/swiotlb-xen.c | 25 ++---
include/linux/swiotlb.h | 20 ++--
lib/swiotlb.c | 247 +++++++++++++++++++++++----------------------
3 files changed, 150 insertions(+), 142 deletions(-)

--


2012-10-04 00:38:31

by Duyck, Alexander H

Subject: [RFC PATCH 1/7] swiotlb: Instead of tracking the end of the swiotlb region just calculate it

In the case of swiotlb we already have the start of the region and the number
of slabs, which gives us the region size. Instead of having to call
virt_to_phys on two pointers, we can take advantage of the fact that the
region is linear and simply compute the end from the start plus the size.

Signed-off-by: Alexander Duyck <[email protected]>
---

lib/swiotlb.c | 25 ++++++++++++-------------
1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index f114bf6..5cc4d4e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -57,11 +57,11 @@ int swiotlb_force;
* swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
* API.
*/
-static char *io_tlb_start, *io_tlb_end;
+static char *io_tlb_start;

/*
- * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
- * io_tlb_end. This is command line adjustable via setup_io_tlb_npages.
+ * The number of IO TLB blocks (in groups of 64).
+ * This is command line adjustable via setup_io_tlb_npages.
*/
static unsigned long io_tlb_nslabs;

@@ -128,11 +128,11 @@ void swiotlb_print_info(void)
phys_addr_t pstart, pend;

pstart = virt_to_phys(io_tlb_start);
- pend = virt_to_phys(io_tlb_end);
+ pend = pstart + bytes;

printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
(unsigned long long)pstart, (unsigned long long)pend - 1,
- bytes >> 20, io_tlb_start, io_tlb_end - 1);
+ bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
}

void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -143,12 +143,10 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)

io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
- io_tlb_end = io_tlb_start + bytes;

/*
* Allocate and initialize the free list array. This array is used
* to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
- * between io_tlb_start and io_tlb_end.
*/
io_tlb_list = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
for (i = 0; i < io_tlb_nslabs; i++)
@@ -254,14 +252,12 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)

io_tlb_nslabs = nslabs;
io_tlb_start = tlb;
- io_tlb_end = io_tlb_start + bytes;

memset(io_tlb_start, 0, bytes);

/*
* Allocate and initialize the free list array. This array is used
* to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
- * between io_tlb_start and io_tlb_end.
*/
io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
get_order(io_tlb_nslabs * sizeof(int)));
@@ -304,7 +300,6 @@ cleanup3:
sizeof(int)));
io_tlb_list = NULL;
cleanup2:
- io_tlb_end = NULL;
io_tlb_start = NULL;
io_tlb_nslabs = 0;
return -ENOMEM;
@@ -339,8 +334,10 @@ void __init swiotlb_free(void)

static int is_swiotlb_buffer(phys_addr_t paddr)
{
- return paddr >= virt_to_phys(io_tlb_start) &&
- paddr < virt_to_phys(io_tlb_end);
+ phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);
+
+ return paddr >= swiotlb_start &&
+ paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
}

/*
@@ -938,6 +935,8 @@ EXPORT_SYMBOL(swiotlb_dma_mapping_error);
int
swiotlb_dma_supported(struct device *hwdev, u64 mask)
{
- return swiotlb_virt_to_bus(hwdev, io_tlb_end - 1) <= mask;
+ unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+
+ return swiotlb_virt_to_bus(hwdev, io_tlb_start + bytes - 1) <= mask;
}
EXPORT_SYMBOL(swiotlb_dma_supported);

2012-10-04 00:38:41

by Duyck, Alexander H

Subject: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

This change makes it so that io_tlb_start contains a physical address instead
of a virtual address. The advantage of this is that we can avoid costly
translations between virtual and physical addresses when comparing
io_tlb_start against DMA addresses.

Signed-off-by: Alexander Duyck <[email protected]>
---

lib/swiotlb.c | 61 +++++++++++++++++++++++++++++----------------------------
1 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 5cc4d4e..02abb72 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -57,7 +57,7 @@ int swiotlb_force;
* swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
* API.
*/
-static char *io_tlb_start;
+phys_addr_t io_tlb_start;

/*
* The number of IO TLB blocks (in groups of 64).
@@ -125,14 +125,15 @@ static dma_addr_t swiotlb_virt_to_bus(struct device *hwdev,
void swiotlb_print_info(void)
{
unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
- phys_addr_t pstart, pend;
+ unsigned char *vstart, *vend;

- pstart = virt_to_phys(io_tlb_start);
- pend = pstart + bytes;
+ vstart = phys_to_virt(io_tlb_start);
+ vend = vstart + bytes;

printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
- (unsigned long long)pstart, (unsigned long long)pend - 1,
- bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
+ (unsigned long long)io_tlb_start,
+ (unsigned long long)io_tlb_start + bytes - 1,
+ bytes >> 20, vstart, vend - 1);
}

void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
@@ -142,7 +143,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
bytes = nslabs << IO_TLB_SHIFT;

io_tlb_nslabs = nslabs;
- io_tlb_start = tlb;
+ io_tlb_start = __pa(tlb);

/*
* Allocate and initialize the free list array. This array is used
@@ -171,6 +172,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
static void __init
swiotlb_init_with_default_size(size_t default_size, int verbose)
{
+ unsigned char *vstart;
unsigned long bytes;

if (!io_tlb_nslabs) {
@@ -183,11 +185,11 @@ swiotlb_init_with_default_size(size_t default_size, int verbose)
/*
* Get IO TLB memory from the low pages
*/
- io_tlb_start = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
- if (!io_tlb_start)
+ vstart = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
+ if (!vstart)
panic("Cannot allocate SWIOTLB buffer");

- swiotlb_init_with_tbl(io_tlb_start, io_tlb_nslabs, verbose);
+ swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose);
}

void __init
@@ -205,6 +207,7 @@ int
swiotlb_late_init_with_default_size(size_t default_size)
{
unsigned long bytes, req_nslabs = io_tlb_nslabs;
+ unsigned char *vstart = NULL;
unsigned int order;
int rc = 0;

@@ -221,14 +224,14 @@ swiotlb_late_init_with_default_size(size_t default_size)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;

while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
- io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
- order);
- if (io_tlb_start)
+ vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
+ order);
+ if (vstart)
break;
order--;
}

- if (!io_tlb_start) {
+ if (!vstart) {
io_tlb_nslabs = req_nslabs;
return -ENOMEM;
}
@@ -237,9 +240,9 @@ swiotlb_late_init_with_default_size(size_t default_size)
"for software IO TLB\n", (PAGE_SIZE << order) >> 20);
io_tlb_nslabs = SLABS_PER_PAGE << order;
}
- rc = swiotlb_late_init_with_tbl(io_tlb_start, io_tlb_nslabs);
+ rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs);
if (rc)
- free_pages((unsigned long)io_tlb_start, order);
+ free_pages((unsigned long)vstart, order);
return rc;
}

@@ -251,9 +254,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
bytes = nslabs << IO_TLB_SHIFT;

io_tlb_nslabs = nslabs;
- io_tlb_start = tlb;
+ io_tlb_start = virt_to_phys(tlb);

- memset(io_tlb_start, 0, bytes);
+ memset(tlb, 0, bytes);

/*
* Allocate and initialize the free list array. This array is used
@@ -300,7 +303,7 @@ cleanup3:
sizeof(int)));
io_tlb_list = NULL;
cleanup2:
- io_tlb_start = NULL;
+ io_tlb_start = 0;
io_tlb_nslabs = 0;
return -ENOMEM;
}
@@ -317,7 +320,7 @@ void __init swiotlb_free(void)
get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
sizeof(int)));
- free_pages((unsigned long)io_tlb_start,
+ free_pages((unsigned long)phys_to_virt(io_tlb_start),
get_order(io_tlb_nslabs << IO_TLB_SHIFT));
} else {
free_bootmem_late(__pa(io_tlb_overflow_buffer),
@@ -326,7 +329,7 @@ void __init swiotlb_free(void)
PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
free_bootmem_late(__pa(io_tlb_list),
PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
- free_bootmem_late(__pa(io_tlb_start),
+ free_bootmem_late(io_tlb_start,
PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
}
io_tlb_nslabs = 0;
@@ -334,10 +337,8 @@ void __init swiotlb_free(void)

static int is_swiotlb_buffer(phys_addr_t paddr)
{
- phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);
-
- return paddr >= swiotlb_start &&
- paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
+ return paddr >= io_tlb_start &&
+ paddr < (io_tlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
}

/*
@@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
io_tlb_list[i] = 0;
for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
io_tlb_list[i] = ++count;
- dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
+ dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);

/*
* Update the indices to avoid searching in the next
@@ -494,7 +495,7 @@ static void *
map_single(struct device *hwdev, phys_addr_t phys, size_t size,
enum dma_data_direction dir)
{
- dma_addr_t start_dma_addr = swiotlb_virt_to_bus(hwdev, io_tlb_start);
+ dma_addr_t start_dma_addr = phys_to_dma(hwdev, io_tlb_start);

return swiotlb_tbl_map_single(hwdev, start_dma_addr, phys, size, dir);
}
@@ -508,7 +509,7 @@ swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
{
unsigned long flags;
int i, count, nslots = ALIGN(size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
- int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
+ int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
phys_addr_t phys = io_tlb_orig_addr[index];

/*
@@ -549,7 +550,7 @@ swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
enum dma_data_direction dir,
enum dma_sync_target target)
{
- int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
+ int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
phys_addr_t phys = io_tlb_orig_addr[index];

phys += ((unsigned long)dma_addr & ((1 << IO_TLB_SHIFT) - 1));
@@ -937,6 +938,6 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask)
{
unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;

- return swiotlb_virt_to_bus(hwdev, io_tlb_start + bytes - 1) <= mask;
+ return phys_to_dma(hwdev, io_tlb_start + bytes - 1) <= mask;
}
EXPORT_SYMBOL(swiotlb_dma_supported);

2012-10-04 00:38:48

by Duyck, Alexander H

Subject: [RFC PATCH 4/7] swiotlb: Return physical addresses when calling swiotlb_tbl_map_single

This change makes it so that swiotlb_tbl_map_single will return a physical
address instead of a virtual address when called. The advantage of this, once
again, is that we avoid a number of virt_to_phys and phys_to_virt
translations by working with everything as a physical address.

One change I had to make in order to support using physical addresses is that
I could no longer trust 0 to be an invalid physical address on all platforms.
So instead ~0 is returned on error. This should never be a valid return
value, as it would imply that only one byte of physical address space was
available for use.

Signed-off-by: Alexander Duyck <[email protected]>
---

drivers/xen/swiotlb-xen.c | 22 +++++++-------
include/linux/swiotlb.h | 11 +++++--
lib/swiotlb.c | 73 +++++++++++++++++++++++----------------------
3 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 58db6df..8a6035a 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -338,9 +338,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
enum dma_data_direction dir,
struct dma_attrs *attrs)
{
- phys_addr_t phys = page_to_phys(page) + offset;
+ phys_addr_t map, phys = page_to_phys(page) + offset;
dma_addr_t dev_addr = xen_phys_to_bus(phys);
- void *map;

BUG_ON(dir == DMA_NONE);
/*
@@ -356,16 +355,16 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
* Oh well, have to allocate and map a bounce buffer.
*/
map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
- if (!map)
+ if (map == SWIOTLB_MAP_ERROR)
return DMA_ERROR_CODE;

- dev_addr = xen_virt_to_bus(map);
+ dev_addr = xen_phys_to_bus(map);

/*
* Ensure that the address returned is DMA'ble
*/
if (!dma_capable(dev, dev_addr, size)) {
- swiotlb_tbl_unmap_single(dev, map, size, dir);
+ swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -494,11 +493,12 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
if (swiotlb_force ||
!dma_capable(hwdev, dev_addr, sg->length) ||
range_straddles_page_boundary(paddr, sg->length)) {
- void *map = swiotlb_tbl_map_single(hwdev,
- start_dma_addr,
- sg_phys(sg),
- sg->length, dir);
- if (!map) {
+ phys_addr_t map = swiotlb_tbl_map_single(hwdev,
+ start_dma_addr,
+ sg_phys(sg),
+ sg->length,
+ dir);
+ if (map == SWIOTLB_MAP_ERROR) {
/* Don't panic here, we expect map_sg users
to do proper error handling. */
xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
@@ -506,7 +506,7 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
sgl[0].dma_length = 0;
return DMA_ERROR_CODE;
}
- sg->dma_address = xen_virt_to_bus(map);
+ sg->dma_address = xen_phys_to_bus(map);
} else
sg->dma_address = dev_addr;
sg->dma_length = sg->length;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8d08b3e..1995f3e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -34,9 +34,14 @@ enum dma_sync_target {
SYNC_FOR_CPU = 0,
SYNC_FOR_DEVICE = 1,
};
-extern void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
- phys_addr_t phys, size_t size,
- enum dma_data_direction dir);
+
+/* define the last possible byte of physical address space as a mapping error */
+#define SWIOTLB_MAP_ERROR (~(phys_addr_t)0x0)
+
+extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
+ dma_addr_t tbl_dma_addr,
+ phys_addr_t phys, size_t size,
+ enum dma_data_direction dir);

extern void swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr,
size_t size, enum dma_data_direction dir);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 62848fb..55e052e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -389,12 +389,13 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
}
EXPORT_SYMBOL_GPL(swiotlb_bounce);

-void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
- phys_addr_t phys, size_t size,
- enum dma_data_direction dir)
+phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
+ dma_addr_t tbl_dma_addr,
+ phys_addr_t phys, size_t size,
+ enum dma_data_direction dir)
{
unsigned long flags;
- char *dma_addr;
+ phys_addr_t dma_addr;
unsigned int nslots, stride, index, wrap;
int i;
unsigned long mask;
@@ -458,7 +459,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
io_tlb_list[i] = 0;
for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
io_tlb_list[i] = ++count;
- dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);
+ dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);

/*
* Update the indices to avoid searching in the next
@@ -476,7 +477,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,

not_found:
spin_unlock_irqrestore(&io_tlb_lock, flags);
- return NULL;
+ return SWIOTLB_MAP_ERROR;
found:
spin_unlock_irqrestore(&io_tlb_lock, flags);

@@ -488,7 +489,7 @@ found:
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = phys + (i << IO_TLB_SHIFT);
if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
- swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+ swiotlb_bounce(phys, phys_to_virt(dma_addr), size, DMA_TO_DEVICE);

return dma_addr;
}
@@ -498,9 +499,8 @@ EXPORT_SYMBOL_GPL(swiotlb_tbl_map_single);
* Allocates bounce buffer and returns its kernel virtual address.
*/

-static void *
-map_single(struct device *hwdev, phys_addr_t phys, size_t size,
- enum dma_data_direction dir)
+phys_addr_t map_single(struct device *hwdev, phys_addr_t phys, size_t size,
+ enum dma_data_direction dir)
{
dma_addr_t start_dma_addr = phys_to_dma(hwdev, io_tlb_start);

@@ -594,12 +594,15 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_mask = hwdev->coherent_dma_mask;

ret = (void *)__get_free_pages(flags, order);
- if (ret && swiotlb_virt_to_bus(hwdev, ret) + size - 1 > dma_mask) {
- /*
- * The allocated memory isn't reachable by the device.
- */
- free_pages((unsigned long) ret, order);
- ret = NULL;
+ if (ret) {
+ dev_addr = swiotlb_virt_to_bus(hwdev, ret);
+ if (dev_addr + size - 1 > dma_mask) {
+ /*
+ * The allocated memory isn't reachable by the device.
+ */
+ free_pages((unsigned long) ret, order);
+ ret = NULL;
+ }
}
if (!ret) {
/*
@@ -607,13 +610,13 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
* GFP_DMA memory; fall back on map_single(), which
* will grab memory from the lowest available address range.
*/
- ret = map_single(hwdev, 0, size, DMA_FROM_DEVICE);
- if (!ret)
+ phys_addr_t paddr = map_single(hwdev, 0, size, DMA_FROM_DEVICE);
+ if (paddr == SWIOTLB_MAP_ERROR)
return NULL;
- }

- memset(ret, 0, size);
- dev_addr = swiotlb_virt_to_bus(hwdev, ret);
+ ret = phys_to_virt(paddr);
+ dev_addr = phys_to_dma(hwdev, paddr);
+ }

/* Confirm address can be DMA'd by device */
if (dev_addr + size - 1 > dma_mask) {
@@ -625,7 +628,10 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,
swiotlb_tbl_unmap_single(hwdev, ret, size, DMA_TO_DEVICE);
return NULL;
}
+
*dma_handle = dev_addr;
+ memset(ret, 0, size);
+
return ret;
}
EXPORT_SYMBOL(swiotlb_alloc_coherent);
@@ -682,9 +688,8 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
enum dma_data_direction dir,
struct dma_attrs *attrs)
{
- phys_addr_t phys = page_to_phys(page) + offset;
+ phys_addr_t map, phys = page_to_phys(page) + offset;
dma_addr_t dev_addr = phys_to_dma(dev, phys);
- void *map;

BUG_ON(dir == DMA_NONE);
/*
@@ -695,22 +700,18 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
if (dma_capable(dev, dev_addr, size) && !swiotlb_force)
return dev_addr;

- /*
- * Oh well, have to allocate and map a bounce buffer.
- */
+ /* Oh well, have to allocate and map a bounce buffer. */
map = map_single(dev, phys, size, dir);
- if (!map) {
+ if (map == SWIOTLB_MAP_ERROR) {
swiotlb_full(dev, size, dir, 1);
return phys_to_dma(dev, io_tlb_overflow_buffer);
}

- dev_addr = swiotlb_virt_to_bus(dev, map);
+ dev_addr = phys_to_dma(dev, map);

- /*
- * Ensure that the address returned is DMA'ble
- */
+ /* Ensure that the address returned is DMA'ble */
if (!dma_capable(dev, dev_addr, size)) {
- swiotlb_tbl_unmap_single(dev, map, size, dir);
+ swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
return phys_to_dma(dev, io_tlb_overflow_buffer);
}

@@ -836,9 +837,9 @@ swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, int nelems,

if (swiotlb_force ||
!dma_capable(hwdev, dev_addr, sg->length)) {
- void *map = map_single(hwdev, sg_phys(sg),
- sg->length, dir);
- if (!map) {
+ phys_addr_t map = map_single(hwdev, sg_phys(sg),
+ sg->length, dir);
+ if (map == SWIOTLB_MAP_ERROR) {
/* Don't panic here, we expect map_sg users
to do proper error handling. */
swiotlb_full(hwdev, sg->length, dir, 0);
@@ -847,7 +848,7 @@ swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl, int nelems,
sgl[0].dma_length = 0;
return 0;
}
- sg->dma_address = swiotlb_virt_to_bus(hwdev, map);
+ sg->dma_address = phys_to_dma(hwdev, map);
} else
sg->dma_address = dev_addr;
sg->dma_length = sg->length;

2012-10-04 00:38:53

by Duyck, Alexander H

Subject: [RFC PATCH 5/7] swiotlb: Use physical addresses for swiotlb_tbl_unmap_single

This change makes it so that the unmap functionality also uses physical
addresses. This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

Signed-off-by: Alexander Duyck <[email protected]>
---

drivers/xen/swiotlb-xen.c | 4 ++--
include/linux/swiotlb.h | 3 ++-
lib/swiotlb.c | 35 ++++++++++++++++++-----------------
3 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 8a6035a..4cedc28 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -364,7 +364,7 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
* Ensure that the address returned is DMA'ble
*/
if (!dma_capable(dev, dev_addr, size)) {
- swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
+ swiotlb_tbl_unmap_single(dev, map, size, dir);
dev_addr = 0;
}
return dev_addr;
@@ -388,7 +388,7 @@ static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr,

/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
- swiotlb_tbl_unmap_single(hwdev, phys_to_virt(paddr), size, dir);
+ swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
return;
}

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 1995f3e..5a5a654 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -43,7 +43,8 @@ extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
phys_addr_t phys, size_t size,
enum dma_data_direction dir);

-extern void swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_unmap_single(struct device *hwdev,
+ phys_addr_t dma_addr,
size_t size, enum dma_data_direction dir);

extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 55e052e..41e1d9a 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -510,20 +510,20 @@ phys_addr_t map_single(struct device *hwdev, phys_addr_t phys, size_t size,
/*
* dma_addr is the kernel virtual address of the bounce buffer to unmap.
*/
-void
-swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
- enum dma_data_direction dir)
+void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir)
{
unsigned long flags;
int i, count, nslots = ALIGN(size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
- int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
+ int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
phys_addr_t phys = io_tlb_orig_addr[index];

/*
* First, sync the memory before unmapping the entry
*/
if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
- swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+ swiotlb_bounce(phys, phys_to_virt(dma_addr),
+ size, DMA_FROM_DEVICE);

/*
* Return the buffer to the free list by setting the corresponding
@@ -616,17 +616,18 @@ swiotlb_alloc_coherent(struct device *hwdev, size_t size,

ret = phys_to_virt(paddr);
dev_addr = phys_to_dma(hwdev, paddr);
- }

- /* Confirm address can be DMA'd by device */
- if (dev_addr + size - 1 > dma_mask) {
- printk("hwdev DMA mask = 0x%016Lx, dev_addr = 0x%016Lx\n",
- (unsigned long long)dma_mask,
- (unsigned long long)dev_addr);
+ /* Confirm address can be DMA'd by device */
+ if (dev_addr + size - 1 > dma_mask) {
+ printk("hwdev DMA mask = 0x%016Lx, dev_addr = 0x%016Lx\n",
+ (unsigned long long)dma_mask,
+ (unsigned long long)dev_addr);

- /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
- swiotlb_tbl_unmap_single(hwdev, ret, size, DMA_TO_DEVICE);
- return NULL;
+ /* DMA_TO_DEVICE to avoid memcpy in unmap_single */
+ swiotlb_tbl_unmap_single(hwdev, paddr,
+ size, DMA_TO_DEVICE);
+ return NULL;
+ }
}

*dma_handle = dev_addr;
@@ -647,7 +648,7 @@ swiotlb_free_coherent(struct device *hwdev, size_t size, void *vaddr,
free_pages((unsigned long)vaddr, get_order(size));
else
/* DMA_TO_DEVICE to avoid memcpy in swiotlb_tbl_unmap_single */
- swiotlb_tbl_unmap_single(hwdev, vaddr, size, DMA_TO_DEVICE);
+ swiotlb_tbl_unmap_single(hwdev, paddr, size, DMA_TO_DEVICE);
}
EXPORT_SYMBOL(swiotlb_free_coherent);

@@ -711,7 +712,7 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,

/* Ensure that the address returned is DMA'ble */
if (!dma_capable(dev, dev_addr, size)) {
- swiotlb_tbl_unmap_single(dev, phys_to_virt(map), size, dir);
+ swiotlb_tbl_unmap_single(dev, map, size, dir);
return phys_to_dma(dev, io_tlb_overflow_buffer);
}

@@ -735,7 +736,7 @@ static void unmap_single(struct device *hwdev, dma_addr_t dev_addr,
BUG_ON(dir == DMA_NONE);

if (is_swiotlb_buffer(paddr)) {
- swiotlb_tbl_unmap_single(hwdev, phys_to_virt(paddr), size, dir);
+ swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
return;
}

2012-10-04 00:39:00

by Duyck, Alexander H

Subject: [RFC PATCH 6/7] swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single

This change makes it so that the sync functionality also uses physical
addresses. This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.

Signed-off-by: Alexander Duyck <[email protected]>
---

drivers/xen/swiotlb-xen.c | 3 +--
include/linux/swiotlb.h | 3 ++-
lib/swiotlb.c | 18 +++++++++---------
3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4cedc28..af47e75 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -433,8 +433,7 @@ xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,

/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
- swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
- target);
+ swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 5a5a654..ba1bd38 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -47,7 +47,8 @@ extern void swiotlb_tbl_unmap_single(struct device *hwdev,
phys_addr_t dma_addr,
size_t size, enum dma_data_direction dir);

-extern void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
+extern void swiotlb_tbl_sync_single(struct device *hwdev,
+ phys_addr_t dma_addr,
size_t size, enum dma_data_direction dir,
enum dma_sync_target target);

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 41e1d9a..7cfe850 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -552,12 +552,11 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t dma_addr,
}
EXPORT_SYMBOL_GPL(swiotlb_tbl_unmap_single);

-void
-swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
- enum dma_data_direction dir,
- enum dma_sync_target target)
+void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir,
+ enum dma_sync_target target)
{
- int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
+ int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
phys_addr_t phys = io_tlb_orig_addr[index];

phys += ((unsigned long)dma_addr & ((1 << IO_TLB_SHIFT) - 1));
@@ -565,13 +564,15 @@ swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
+ swiotlb_bounce(phys, phys_to_virt(dma_addr),
+ size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
+ swiotlb_bounce(phys, phys_to_virt(dma_addr),
+ size, DMA_TO_DEVICE);
else
BUG_ON(dir != DMA_FROM_DEVICE);
break;
@@ -780,8 +781,7 @@ swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
BUG_ON(dir == DMA_NONE);

if (is_swiotlb_buffer(paddr)) {
- swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
- target);
+ swiotlb_tbl_sync_single(hwdev, paddr, size, dir, target);
return;
}

2012-10-04 00:39:06

by Duyck, Alexander H

Subject: [RFC PATCH 7/7] swiotlb: Do not export swiotlb_bounce since there are no external consumers

Currently swiotlb is the only consumer of swiotlb_bounce. Since that is the
case it doesn't make much sense to keep exporting it, so make it a static
function.

In addition we can save a few more lines of code by making it so that it
accepts the DMA address as a physical address instead of a virtual one. This
is the last piece in essentially pushing all of the DMA address values to use
physical addresses in swiotlb.

Signed-off-by: Alexander Duyck <[email protected]>
---

include/linux/swiotlb.h | 3 ---
lib/swiotlb.c | 30 +++++++++++++-----------------
2 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ba1bd38..8e635d1 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -53,9 +53,6 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev,
enum dma_sync_target target);

/* Accessory functions. */
-extern void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
- enum dma_data_direction dir);
-
extern void
*swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 7cfe850..a2ad781 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -351,10 +351,11 @@ static int is_swiotlb_buffer(phys_addr_t paddr)
/*
* Bounce: copy the swiotlb buffer back to the original dma location
*/
-void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
- enum dma_data_direction dir)
+static void swiotlb_bounce(phys_addr_t phys, phys_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir)
{
unsigned long pfn = PFN_DOWN(phys);
+ unsigned char *vaddr = phys_to_virt(dma_addr);

if (PageHighMem(pfn_to_page(pfn))) {
/* The buffer does not have a mapping. Map it in and copy */
@@ -369,25 +370,23 @@ void swiotlb_bounce(phys_addr_t phys, char *dma_addr, size_t size,
local_irq_save(flags);
buffer = kmap_atomic(pfn_to_page(pfn));
if (dir == DMA_TO_DEVICE)
- memcpy(dma_addr, buffer + offset, sz);
+ memcpy(vaddr, buffer + offset, sz);
else
- memcpy(buffer + offset, dma_addr, sz);
+ memcpy(buffer + offset, vaddr, sz);
kunmap_atomic(buffer);
local_irq_restore(flags);

size -= sz;
pfn++;
- dma_addr += sz;
+ vaddr += sz;
offset = 0;
}
+ } else if (dir == DMA_TO_DEVICE) {
+ memcpy(vaddr, phys_to_virt(phys), size);
} else {
- if (dir == DMA_TO_DEVICE)
- memcpy(dma_addr, phys_to_virt(phys), size);
- else
- memcpy(phys_to_virt(phys), dma_addr, size);
+ memcpy(phys_to_virt(phys), vaddr, size);
}
}
-EXPORT_SYMBOL_GPL(swiotlb_bounce);

phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
dma_addr_t tbl_dma_addr,
@@ -489,7 +488,7 @@ found:
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = phys + (i << IO_TLB_SHIFT);
if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
- swiotlb_bounce(phys, phys_to_virt(dma_addr), size, DMA_TO_DEVICE);
+ swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);

return dma_addr;
}
@@ -522,8 +521,7 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t dma_addr,
* First, sync the memory before unmapping the entry
*/
if (phys && ((dir == DMA_FROM_DEVICE) || (dir == DMA_BIDIRECTIONAL)))
- swiotlb_bounce(phys, phys_to_virt(dma_addr),
- size, DMA_FROM_DEVICE);
+ swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);

/*
* Return the buffer to the free list by setting the corresponding
@@ -564,15 +562,13 @@ void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t dma_addr,
switch (target) {
case SYNC_FOR_CPU:
if (likely(dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(phys, phys_to_virt(dma_addr),
- size, DMA_FROM_DEVICE);
+ swiotlb_bounce(phys, dma_addr, size, DMA_FROM_DEVICE);
else
BUG_ON(dir != DMA_TO_DEVICE);
break;
case SYNC_FOR_DEVICE:
if (likely(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
- swiotlb_bounce(phys, phys_to_virt(dma_addr),
- size, DMA_TO_DEVICE);
+ swiotlb_bounce(phys, dma_addr, size, DMA_TO_DEVICE);
else
BUG_ON(dir != DMA_FROM_DEVICE);
break;

2012-10-04 00:39:44

by Duyck, Alexander H

Subject: [RFC PATCH 3/7] swiotlb: Make io_tlb_overflow_buffer a physical address

This change makes it so that we can avoid the virt_to_phys overhead when using
the io_tlb_overflow_buffer. My original plan was to completely remove the
value and replace it with a constant, but I had seen recent patches stating
that this couldn't be done until all device drivers that depended on that
functionality were updated.

Signed-off-by: Alexander Duyck <[email protected]>
---

lib/swiotlb.c | 61 ++++++++++++++++++++++++++++++++-------------------------
1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 02abb72..62848fb 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -70,7 +70,7 @@ static unsigned long io_tlb_nslabs;
*/
static unsigned long io_tlb_overflow = 32*1024;

-static void *io_tlb_overflow_buffer;
+phys_addr_t io_tlb_overflow_buffer;

/*
* This is a free list describing the number of free entries available from
@@ -138,6 +138,7 @@ void swiotlb_print_info(void)

void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
{
+ void *v_overflow_buffer;
unsigned long i, bytes;

bytes = nslabs << IO_TLB_SHIFT;
@@ -146,6 +147,15 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
io_tlb_start = __pa(tlb);

/*
+ * Get the overflow emergency buffer
+ */
+ v_overflow_buffer = alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
+ if (!v_overflow_buffer)
+ panic("Cannot allocate SWIOTLB overflow buffer!\n");
+
+ io_tlb_overflow_buffer = __pa(v_overflow_buffer);
+
+ /*
* Allocate and initialize the free list array. This array is used
* to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
*/
@@ -155,12 +165,6 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
io_tlb_index = 0;
io_tlb_orig_addr = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));

- /*
- * Get the overflow emergency buffer
- */
- io_tlb_overflow_buffer = alloc_bootmem_low_pages(PAGE_ALIGN(io_tlb_overflow));
- if (!io_tlb_overflow_buffer)
- panic("Cannot allocate SWIOTLB overflow buffer!\n");
if (verbose)
swiotlb_print_info();
}
@@ -250,6 +254,7 @@ int
swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
{
unsigned long i, bytes;
+ unsigned char *v_overflow_buffer;

bytes = nslabs << IO_TLB_SHIFT;

@@ -259,13 +264,23 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
memset(tlb, 0, bytes);

/*
+ * Get the overflow emergency buffer
+ */
+ v_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
+ get_order(io_tlb_overflow));
+ if (!v_overflow_buffer)
+ goto cleanup2;
+
+ io_tlb_overflow_buffer = virt_to_phys(v_overflow_buffer);
+
+ /*
* Allocate and initialize the free list array. This array is used
* to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
*/
io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
get_order(io_tlb_nslabs * sizeof(int)));
if (!io_tlb_list)
- goto cleanup2;
+ goto cleanup3;

for (i = 0; i < io_tlb_nslabs; i++)
io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
@@ -276,18 +291,10 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
get_order(io_tlb_nslabs *
sizeof(phys_addr_t)));
if (!io_tlb_orig_addr)
- goto cleanup3;
+ goto cleanup4;

memset(io_tlb_orig_addr, 0, io_tlb_nslabs * sizeof(phys_addr_t));

- /*
- * Get the overflow emergency buffer
- */
- io_tlb_overflow_buffer = (void *)__get_free_pages(GFP_DMA,
- get_order(io_tlb_overflow));
- if (!io_tlb_overflow_buffer)
- goto cleanup4;
-
swiotlb_print_info();

late_alloc = 1;
@@ -295,13 +302,13 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
return 0;

cleanup4:
- free_pages((unsigned long)io_tlb_orig_addr,
- get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
- io_tlb_orig_addr = NULL;
-cleanup3:
free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
sizeof(int)));
io_tlb_list = NULL;
+cleanup3:
+ free_pages((unsigned long)v_overflow_buffer,
+ get_order(io_tlb_overflow));
+ io_tlb_overflow_buffer = 0;
cleanup2:
io_tlb_start = 0;
io_tlb_nslabs = 0;
@@ -310,11 +317,11 @@ cleanup2:

void __init swiotlb_free(void)
{
- if (!io_tlb_overflow_buffer)
+ if (!io_tlb_orig_addr)
return;

if (late_alloc) {
- free_pages((unsigned long)io_tlb_overflow_buffer,
+ free_pages((unsigned long)phys_to_virt(io_tlb_overflow_buffer),
get_order(io_tlb_overflow));
free_pages((unsigned long)io_tlb_orig_addr,
get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
@@ -323,7 +330,7 @@ void __init swiotlb_free(void)
free_pages((unsigned long)phys_to_virt(io_tlb_start),
get_order(io_tlb_nslabs << IO_TLB_SHIFT));
} else {
- free_bootmem_late(__pa(io_tlb_overflow_buffer),
+ free_bootmem_late(io_tlb_overflow_buffer,
PAGE_ALIGN(io_tlb_overflow));
free_bootmem_late(__pa(io_tlb_orig_addr),
PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
@@ -694,7 +701,7 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
map = map_single(dev, phys, size, dir);
if (!map) {
swiotlb_full(dev, size, dir, 1);
- map = io_tlb_overflow_buffer;
+ return phys_to_dma(dev, io_tlb_overflow_buffer);
}

dev_addr = swiotlb_virt_to_bus(dev, map);
@@ -704,7 +711,7 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
*/
if (!dma_capable(dev, dev_addr, size)) {
swiotlb_tbl_unmap_single(dev, map, size, dir);
- dev_addr = swiotlb_virt_to_bus(dev, io_tlb_overflow_buffer);
+ return phys_to_dma(dev, io_tlb_overflow_buffer);
}

return dev_addr;
@@ -923,7 +930,7 @@ EXPORT_SYMBOL(swiotlb_sync_sg_for_device);
int
swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
{
- return (dma_addr == swiotlb_virt_to_bus(hwdev, io_tlb_overflow_buffer));
+ return (dma_addr == phys_to_dma(hwdev, io_tlb_overflow_buffer));
}
EXPORT_SYMBOL(swiotlb_dma_mapping_error);

2012-10-04 13:07:15

by Konrad Rzeszutek Wilk

Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On Wed, Oct 03, 2012 at 05:38:41PM -0700, Alexander Duyck wrote:
> While working on 10Gb/s routing performance I found a significant amount of
> time was being spent in the swiotlb DMA handler. Further digging found that a
> significant amount of this was due to the fact that virtual to physical
> address translation and calling the function that did it. It accounted for
> nearly 60% of the total overhead.
>
> This patch set works to resolve that by changing the io_tlb_start address and
> io_tlb_overflow_buffer address from virtual addresses to physical addresses.
> By doing this, devices that are not making use of bounce buffers can
> significantly reduce their overhead. In addition I followed through with the

.. but are still using SWIOTLB for their DMA operations, right?

> cleanup to the point that the only functions that really require the virtual
> address for the dma buffer are the init, free, and bounce functions.
>
> When running a routing throughput test using small packets I saw roughly a 5%
> increase in packets rates after applying these patches. This appears to match
> up with the CPU overhead reduction I was tracking via perf.
>
> Before:
> Results 10.29Mps
> # Overhead Symbol
> # ........ ...........................................................................................................
> #
> 1.97% [k] __phys_addr
> |
> |--24.97%-- swiotlb_sync_single
> |
> |--16.55%-- is_swiotlb_buffer
> |
> |--11.25%-- unmap_single
> |
> --2.71%-- swiotlb_dma_mapping_error
> 1.66% [k] swiotlb_sync_single
> 1.45% [k] is_swiotlb_buffer
> 0.53% [k] unmap_single
> 0.52% [k] swiotlb_map_page
> 0.47% [k] swiotlb_sync_single_for_device
> 0.43% [k] swiotlb_sync_single_for_cpu
> 0.42% [k] swiotlb_dma_mapping_error
> 0.34% [k] swiotlb_unmap_page
>
> After:
> Results 10.99Mps
> # Overhead Symbol
> # ........ ...........................................................................................................
> #
> 0.50% [k] swiotlb_map_page
> 0.50% [k] swiotlb_sync_single
> 0.36% [k] swiotlb_sync_single_for_cpu
> 0.35% [k] swiotlb_sync_single_for_device
> 0.25% [k] swiotlb_unmap_page
> 0.17% [k] swiotlb_dma_mapping_error
>
> ---
>
> Alexander Duyck (7):
> swiotlb: Do not export swiotlb_bounce since there are no external consumers
> swiotlb: Use physical addresses instead of virtual in swiotlb_tbl_sync_single
> swiotlb: Use physical addresses for swiotlb_tbl_unmap_single
> swiotlb: Return physical addresses when calling swiotlb_tbl_map_single
> swiotlb: Make io_tlb_overflow_buffer a physical address
> swiotlb: Make io_tlb_start a physical address instead of a virtual address
> swiotlb: Instead of tracking the end of the swiotlb region just calculate it
>
>
> drivers/xen/swiotlb-xen.c | 25 ++---
> include/linux/swiotlb.h | 20 ++--
> lib/swiotlb.c | 247 +++++++++++++++++++++++----------------------
> 3 files changed, 150 insertions(+), 142 deletions(-)
>
> --
>

2012-10-04 13:12:36

by Konrad Rzeszutek Wilk

Subject: Re: [RFC PATCH 1/7] swiotlb: Instead of tracking the end of the swiotlb region just calculate it

On Wed, Oct 03, 2012 at 05:38:47PM -0700, Alexander Duyck wrote:
> In the case of swiotlb we already have the start of the region and the number
> of slabs that give us the region size. Instead of having to call
> virt_to_phys on two pointers we can just take advantage of the fact that the
> region is linear and just compute the end based on the start plus the size.

Why not take advantage of 'the fact that the region is linear' and just
pre-compute the end in swiotlb_init_with_tbl?

That way the logic in is_swiotlb_buffer is even simpler?
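
Something along these lines is what I have in mind - just a sketch, and the
swiotlb_set_region() helper plus the io_tlb_pstart/io_tlb_pend names are made
up for illustration:

/* Pre-compute the physical end once at init time so the hot-path check is
 * two compares, with no virt_to_phys() and no shift per call. */
static phys_addr_t io_tlb_pstart, io_tlb_pend;

static void swiotlb_set_region(char *tlb, unsigned long nslabs)
{
	io_tlb_pstart = virt_to_phys(tlb);
	io_tlb_pend = io_tlb_pstart + ((phys_addr_t)nslabs << IO_TLB_SHIFT);
}

static int is_swiotlb_buffer(phys_addr_t paddr)
{
	return paddr >= io_tlb_pstart && paddr < io_tlb_pend;
}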

>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
>
> lib/swiotlb.c | 25 ++++++++++++-------------
> 1 files changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index f114bf6..5cc4d4e 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -57,11 +57,11 @@ int swiotlb_force;
> * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
> * API.
> */
> -static char *io_tlb_start, *io_tlb_end;
> +static char *io_tlb_start;
>
> /*
> - * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
> - * io_tlb_end. This is command line adjustable via setup_io_tlb_npages.
> + * The number of IO TLB blocks (in groups of 64).
> + * This is command line adjustable via setup_io_tlb_npages.
> */
> static unsigned long io_tlb_nslabs;
>
> @@ -128,11 +128,11 @@ void swiotlb_print_info(void)
> phys_addr_t pstart, pend;
>
> pstart = virt_to_phys(io_tlb_start);
> - pend = virt_to_phys(io_tlb_end);
> + pend = pstart + bytes;
>
> printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
> (unsigned long long)pstart, (unsigned long long)pend - 1,
> - bytes >> 20, io_tlb_start, io_tlb_end - 1);
> + bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
> }
>
> void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
> @@ -143,12 +143,10 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
>
> io_tlb_nslabs = nslabs;
> io_tlb_start = tlb;
> - io_tlb_end = io_tlb_start + bytes;
>
> /*
> * Allocate and initialize the free list array. This array is used
> * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
> - * between io_tlb_start and io_tlb_end.
> */
> io_tlb_list = alloc_bootmem_pages(PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
> for (i = 0; i < io_tlb_nslabs; i++)
> @@ -254,14 +252,12 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
>
> io_tlb_nslabs = nslabs;
> io_tlb_start = tlb;
> - io_tlb_end = io_tlb_start + bytes;
>
> memset(io_tlb_start, 0, bytes);
>
> /*
> * Allocate and initialize the free list array. This array is used
> * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
> - * between io_tlb_start and io_tlb_end.
> */
> io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
> get_order(io_tlb_nslabs * sizeof(int)));
> @@ -304,7 +300,6 @@ cleanup3:
> sizeof(int)));
> io_tlb_list = NULL;
> cleanup2:
> - io_tlb_end = NULL;
> io_tlb_start = NULL;
> io_tlb_nslabs = 0;
> return -ENOMEM;
> @@ -339,8 +334,10 @@ void __init swiotlb_free(void)
>
> static int is_swiotlb_buffer(phys_addr_t paddr)
> {
> - return paddr >= virt_to_phys(io_tlb_start) &&
> - paddr < virt_to_phys(io_tlb_end);
> + phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);
> +
> + return paddr >= swiotlb_start &&
> + paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
> }
>
> /*
> @@ -938,6 +935,8 @@ EXPORT_SYMBOL(swiotlb_dma_mapping_error);
> int
> swiotlb_dma_supported(struct device *hwdev, u64 mask)
> {
> - return swiotlb_virt_to_bus(hwdev, io_tlb_end - 1) <= mask;
> + unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
> +
> + return swiotlb_virt_to_bus(hwdev, io_tlb_start + bytes - 1) <= mask;
> }
> EXPORT_SYMBOL(swiotlb_dma_supported);
>

2012-10-04 13:29:59

by Konrad Rzeszutek Wilk

Subject: Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

On Wed, Oct 03, 2012 at 05:38:53PM -0700, Alexander Duyck wrote:
> This change makes it so that io_tlb_start contains a physical address instead
> of a virtual address. The advantage to this is that we can avoid costly
> translations between virtual and physical addresses when comparing the
> io_tlb_start against DMA addresses.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
>
> lib/swiotlb.c | 61 +++++++++++++++++++++++++++++----------------------------
> 1 files changed, 31 insertions(+), 30 deletions(-)
>
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 5cc4d4e..02abb72 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -57,7 +57,7 @@ int swiotlb_force;
> * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
> * API.
> */
> -static char *io_tlb_start;
> +phys_addr_t io_tlb_start;
>
> /*
> * The number of IO TLB blocks (in groups of 64).
> @@ -125,14 +125,15 @@ static dma_addr_t swiotlb_virt_to_bus(struct device *hwdev,
> void swiotlb_print_info(void)
> {
> unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
> - phys_addr_t pstart, pend;
> + unsigned char *vstart, *vend;
>
> - pstart = virt_to_phys(io_tlb_start);
> - pend = pstart + bytes;
> + vstart = phys_to_virt(io_tlb_start);
> + vend = vstart + bytes;
>
> printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
> - (unsigned long long)pstart, (unsigned long long)pend - 1,
> - bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
> + (unsigned long long)io_tlb_start,
> + (unsigned long long)io_tlb_start + bytes - 1,
> + bytes >> 20, vstart, vend - 1);
> }
>
> void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
> @@ -142,7 +143,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
> bytes = nslabs << IO_TLB_SHIFT;
>
> io_tlb_nslabs = nslabs;
> - io_tlb_start = tlb;
> + io_tlb_start = __pa(tlb);

Why not 'virt_to_phys' ?

>
> /*
> * Allocate and initialize the free list array. This array is used
> @@ -171,6 +172,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
> static void __init
> swiotlb_init_with_default_size(size_t default_size, int verbose)
> {
> + unsigned char *vstart;
> unsigned long bytes;
>
> if (!io_tlb_nslabs) {
> @@ -183,11 +185,11 @@ swiotlb_init_with_default_size(size_t default_size, int verbose)
> /*
> * Get IO TLB memory from the low pages
> */
> - io_tlb_start = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
> - if (!io_tlb_start)
> + vstart = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
> + if (!vstart)
> panic("Cannot allocate SWIOTLB buffer");
>
> - swiotlb_init_with_tbl(io_tlb_start, io_tlb_nslabs, verbose);
> + swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose);
> }
>
> void __init
> @@ -205,6 +207,7 @@ int
> swiotlb_late_init_with_default_size(size_t default_size)
> {
> unsigned long bytes, req_nslabs = io_tlb_nslabs;
> + unsigned char *vstart = NULL;
> unsigned int order;
> int rc = 0;
>
> @@ -221,14 +224,14 @@ swiotlb_late_init_with_default_size(size_t default_size)
> bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>
> while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
> - io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
> - order);
> - if (io_tlb_start)
> + vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
> + order);
> + if (vstart)
> break;
> order--;
> }
>
> - if (!io_tlb_start) {
> + if (!vstart) {
> io_tlb_nslabs = req_nslabs;
> return -ENOMEM;
> }
> @@ -237,9 +240,9 @@ swiotlb_late_init_with_default_size(size_t default_size)
> "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
> io_tlb_nslabs = SLABS_PER_PAGE << order;
> }
> - rc = swiotlb_late_init_with_tbl(io_tlb_start, io_tlb_nslabs);
> + rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs);
> if (rc)
> - free_pages((unsigned long)io_tlb_start, order);
> + free_pages((unsigned long)vstart, order);
> return rc;
> }
>
> @@ -251,9 +254,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
> bytes = nslabs << IO_TLB_SHIFT;
>
> io_tlb_nslabs = nslabs;
> - io_tlb_start = tlb;
> + io_tlb_start = virt_to_phys(tlb);
>
> - memset(io_tlb_start, 0, bytes);
> + memset(tlb, 0, bytes);
>
> /*
> * Allocate and initialize the free list array. This array is used
> @@ -300,7 +303,7 @@ cleanup3:
> sizeof(int)));
> io_tlb_list = NULL;
> cleanup2:
> - io_tlb_start = NULL;
> + io_tlb_start = 0;
> io_tlb_nslabs = 0;
> return -ENOMEM;
> }
> @@ -317,7 +320,7 @@ void __init swiotlb_free(void)
> get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
> free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
> sizeof(int)));
> - free_pages((unsigned long)io_tlb_start,
> + free_pages((unsigned long)phys_to_virt(io_tlb_start),
> get_order(io_tlb_nslabs << IO_TLB_SHIFT));
> } else {
> free_bootmem_late(__pa(io_tlb_overflow_buffer),
> @@ -326,7 +329,7 @@ void __init swiotlb_free(void)
> PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
> free_bootmem_late(__pa(io_tlb_list),
> PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
> - free_bootmem_late(__pa(io_tlb_start),
> + free_bootmem_late(io_tlb_start,
> PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
> }
> io_tlb_nslabs = 0;
> @@ -334,10 +337,8 @@ void __init swiotlb_free(void)
>
> static int is_swiotlb_buffer(phys_addr_t paddr)
> {
> - phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);
> -
> - return paddr >= swiotlb_start &&
> - paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
> + return paddr >= io_tlb_start &&
> + paddr < (io_tlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
> }
>
> /*
> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
> io_tlb_list[i] = 0;
> for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
> io_tlb_list[i] = ++count;
> - dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
> + dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);

I think this is going to fall flat with the other user of
swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
and does its magic to make sure it's under 4GB, the io_tlb_start swath
of memory ends up consisting of 2MB chunks of contiguous space. But each
chunk is not linear in the DMA space (though it is in the CPU space).

Meaning the io_tlb_start region 0-2MB can fall within the DMA address space
of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 1024MB->1026MB,
and so on (depending on the availability of memory under 4GB).

In short, virt_to_phys(x) != virt_to_dma(x).
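
A minimal sketch of the distinction, for illustration only (not code from the
series; xen_phys_to_bus() is the per-page translation swiotlb-xen.c already
uses, and the helper name here is made up):

/* "Translate the base, then add the offset" is only safe while the offset
 * stays inside one machine-contiguous chunk; translating the slot's own
 * address works regardless. */
static int offset_stays_linear(phys_addr_t tlb_pstart, unsigned long offset)
{
	dma_addr_t by_base   = xen_phys_to_bus(tlb_pstart) + offset;
	dma_addr_t by_offset = xen_phys_to_bus(tlb_pstart + offset);

	/* Once the offset crosses a 2MB chunk boundary these two can point
	 * at completely different bus addresses. */
	return by_base == by_offset;
}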

>
> /*
> * Update the indices to avoid searching in the next
> @@ -494,7 +495,7 @@ static void *
> map_single(struct device *hwdev, phys_addr_t phys, size_t size,
> enum dma_data_direction dir)
> {
> - dma_addr_t start_dma_addr = swiotlb_virt_to_bus(hwdev, io_tlb_start);
> + dma_addr_t start_dma_addr = phys_to_dma(hwdev, io_tlb_start);
>
> return swiotlb_tbl_map_single(hwdev, start_dma_addr, phys, size, dir);
> }
> @@ -508,7 +509,7 @@ swiotlb_tbl_unmap_single(struct device *hwdev, char *dma_addr, size_t size,
> {
> unsigned long flags;
> int i, count, nslots = ALIGN(size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
> - int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
> + int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;

Ditto with explanation above. This I think will cause issues.

> phys_addr_t phys = io_tlb_orig_addr[index];
>
> /*
> @@ -549,7 +550,7 @@ swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr, size_t size,
> enum dma_data_direction dir,
> enum dma_sync_target target)
> {
> - int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
> + int index = (dma_addr - (char *)phys_to_virt(io_tlb_start)) >> IO_TLB_SHIFT;
> phys_addr_t phys = io_tlb_orig_addr[index];
>
> phys += ((unsigned long)dma_addr & ((1 << IO_TLB_SHIFT) - 1));
> @@ -937,6 +938,6 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask)
> {
> unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>
> - return swiotlb_virt_to_bus(hwdev, io_tlb_start + bytes - 1) <= mask;
> + return phys_to_dma(hwdev, io_tlb_start + bytes - 1) <= mask;
> }
> EXPORT_SYMBOL(swiotlb_dma_supported);
>

2012-10-04 13:44:49

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On Wed, Oct 03, 2012 at 05:38:41PM -0700, Alexander Duyck wrote:
> While working on 10Gb/s routing performance I found a significant amount of
> time was being spent in the swiotlb DMA handler. Further digging found that a
> significant amount of this was due to the fact that virtual to physical
> address translation and calling the function that did it. It accounted for
> nearly 60% of the total overhead.
>
> This patch set works to resolve that by changing the io_tlb_start address and
> io_tlb_overflow_buffer address from virtual addresses to physical addresses.

The assertion in your patches that the DMA addresses (bus addresses)
are linear is not applicable (unfortunately). Meaning virt_to_phys() !=
virt_to_dma().

Now, on x86 and ia64 it is true - but one of the users of the swiotlb
library is the Xen swiotlb - which cannot guarantee that the physical
addresses are 1-1 with the bus addresses. Hence the bulk of dealing with
figuring out the right physical-to-bus address is done in Xen-SWIOTLB,
and the basics of finding an entry in the bounce buffer (if needed),
mapping it, unmapping it, etc., are carried out by the generic library.

This is sad - b/c your patches are a good move forward.

Perhaps another way to do this is by having your patches split the
lookups into "chunks", wherein we validate in swiotlb_init_*tbl that the
'tbl' (so the bounce buffer) is linear - if not, we split it up into
chunks. Perhaps the various backends can be responsible for this since
they would know which of their memory regions are linear or not. But
that sounds complicated and we don't want to complicate this library.

Or another way would be to provide 'phys_to_virt' and 'virt_to_phys'
functions for the swiotlb_tbl_{map|unmap}_single and the main library
(lib/swiotlb.c) can decide to use them. If they are NULL, then it
would do what your patches suggested. If they are defined, then
carry out those lookups on the 'empty-to-be-used' bounce buffer
address. Hm, that sounds like a better way of doing it.
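
A very rough sketch of that second idea (all names here are hypothetical,
not existing swiotlb API): the backend registers optional translation
hooks, and the library falls back to the plain linear helpers when none
are provided.

/* hypothetical hook structure - not part of the posted patches */
struct swiotlb_addr_ops {
        void *(*tlb_phys_to_virt)(phys_addr_t tlb_addr);
        phys_addr_t (*tlb_virt_to_phys)(void *vaddr);
};

static const struct swiotlb_addr_ops *swiotlb_addr_ops;  /* NULL by default */

static void *swiotlb_tlb_to_virt(phys_addr_t tlb_addr)
{
        if (swiotlb_addr_ops && swiotlb_addr_ops->tlb_phys_to_virt)
                return swiotlb_addr_ops->tlb_phys_to_virt(tlb_addr);
        return phys_to_virt(tlb_addr);          /* linear case */
}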

2012-10-04 15:49:59

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On 10/04/2012 05:55 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:41PM -0700, Alexander Duyck wrote:
>> While working on 10Gb/s routing performance I found a significant amount of
>> time was being spent in the swiotlb DMA handler. Further digging found that a
>> significant amount of this was due to the fact that virtual to physical
>> address translation and calling the function that did it. It accounted for
>> nearly 60% of the total overhead.
>>
>> This patch set works to resolve that by changing the io_tlb_start address and
>> io_tlb_overflow_buffer address from virtual addresses to physical addresses.
>> By doing this, devices that are not making use of bounce buffers can
>> significantly reduce their overhead. In addition I followed through with the
> .. but are still using SWIOTLB for their DMA operations, right?
>

That is correct. I tested with the bounce buffers in use as well, but
didn't really see any difference since almost all of the overhead was
due to the locking required in obtaining and releasing the bounce
buffers in map/unmap calls.

Thanks,

Alex

2012-10-04 15:54:05

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] swiotlb: Instead of tracking the end of the swiotlb region just calculate it

On 10/04/2012 06:01 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:47PM -0700, Alexander Duyck wrote:
>> In the case of swiotlb we already have the start of the region and the number
>> of slabs that give us the region size. Instead of having to call
>> virt_to_phys on two pointers we can just take advantage of the fact that the
>> region is linear and just compute the end based on the start plus the size.
> Why not take advantage of 'the fact that the region is linear' and just
> pre-compute the end in swiotlb_init_with_tbl?
>
> That way the logic in is_swiotlb_buffer is even simpler?
>

Using a pre-computed end point based on a virtual address is more
expensive in the x86_64 case, since each call to __phys_addr is a
separate, out-of-line function call. By just using the physical address
of the start and adding the offset, I can avoid the second call, and the
compiler will take advantage of the smaller function size. The result is
that is_swiotlb_buffer will be inlined.
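
For reference, the two forms side by side, as they appear in the hunk
quoted earlier in this thread (the "before" version assumes io_tlb_start
is still a char *, the "after" assumes it is a phys_addr_t):

/* before: needs an out-of-line virt_to_phys()/__phys_addr() call */
static int is_swiotlb_buffer(phys_addr_t paddr)
{
        phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);

        return paddr >= swiotlb_start &&
               paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
}

/* after: pure arithmetic on globals, small enough to be inlined */
static int is_swiotlb_buffer(phys_addr_t paddr)
{
        return paddr >= io_tlb_start &&
               paddr < (io_tlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
}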

Thanks,

Alex

2012-10-04 16:42:52

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] swiotlb: Instead of tracking the end of the swiotlb region just calculate it

On Thu, Oct 04, 2012 at 08:54:09AM -0700, Alexander Duyck wrote:
> On 10/04/2012 06:01 AM, Konrad Rzeszutek Wilk wrote:
> > On Wed, Oct 03, 2012 at 05:38:47PM -0700, Alexander Duyck wrote:
> >> In the case of swiotlb we already have the start of the region and the number
> >> of slabs that give us the region size. Instead of having to call
> >> virt_to_phys on two pointers we can just take advantage of the fact that the
> >> region is linear and just compute the end based on the start plus the size.
> > Why not take advantage of 'the fact that the region is linear' and just
> > pre-compute the end in swiotlb_init_with_tbl?
> >
> > That way the logic in is_swiotlb_buffer is even simpler?
> >
>
> Using a pre-computed end point based on a virtual address is more
> expensive in the x86_64 case. The calls to __phys_addr require a
> separate function call. By just using the physical address of the start
> and adding the offset I can avoid the second call and the compiler will
> take advantage of the smaller function size. The result is that
> is_swiotlb_buffer will be inlined.

Right, that is not what I was thinking. My thought was that since you
are already computing the start of the DMA address, you can also
compute the end - and save both values in global variables.

Then use said global variables to check for your DMA instead of
doing the extra computation of shifting to find the end of the DMA address.
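
A sketch of what that could look like (this is not part of the posted
patches; just the shape of the suggestion):

/* pre-compute the end once, e.g. in swiotlb_init_with_tbl():
 *      io_tlb_end = io_tlb_start + (io_tlb_nslabs << IO_TLB_SHIFT);
 */
static phys_addr_t io_tlb_start, io_tlb_end;

static int is_swiotlb_buffer(phys_addr_t paddr)
{
        return paddr >= io_tlb_start && paddr < io_tlb_end;
}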

>
> Thanks,
>
> Alex

2012-10-04 17:11:24

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

On 10/04/2012 06:18 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:53PM -0700, Alexander Duyck wrote:
>> This change makes it so that io_tlb_start contains a physical address instead
>> of a virtual address. The advantage to this is that we can avoid costly
>> translations between virtual and physical addresses when comparing the
>> io_tlb_start against DMA addresses.
>>
>> Signed-off-by: Alexander Duyck <[email protected]>
>> ---
>>
>> lib/swiotlb.c | 61 +++++++++++++++++++++++++++++----------------------------
>> 1 files changed, 31 insertions(+), 30 deletions(-)
>>
>> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
>> index 5cc4d4e..02abb72 100644
>> --- a/lib/swiotlb.c
>> +++ b/lib/swiotlb.c
>> @@ -57,7 +57,7 @@ int swiotlb_force;
>> * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
>> * API.
>> */
>> -static char *io_tlb_start;
>> +phys_addr_t io_tlb_start;
>>
>> /*
>> * The number of IO TLB blocks (in groups of 64).
>> @@ -125,14 +125,15 @@ static dma_addr_t swiotlb_virt_to_bus(struct device *hwdev,
>> void swiotlb_print_info(void)
>> {
>> unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>> - phys_addr_t pstart, pend;
>> + unsigned char *vstart, *vend;
>>
>> - pstart = virt_to_phys(io_tlb_start);
>> - pend = pstart + bytes;
>> + vstart = phys_to_virt(io_tlb_start);
>> + vend = vstart + bytes;
>>
>> printk(KERN_INFO "software IO TLB [mem %#010llx-%#010llx] (%luMB) mapped at [%p-%p]\n",
>> - (unsigned long long)pstart, (unsigned long long)pend - 1,
>> - bytes >> 20, io_tlb_start, io_tlb_start + bytes - 1);
>> + (unsigned long long)io_tlb_start,
>> + (unsigned long long)io_tlb_start + bytes - 1,
>> + bytes >> 20, vstart, vend - 1);
>> }
>>
>> void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
>> @@ -142,7 +143,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
>> bytes = nslabs << IO_TLB_SHIFT;
>>
>> io_tlb_nslabs = nslabs;
>> - io_tlb_start = tlb;
>> + io_tlb_start = __pa(tlb);
> Why not 'virt_to_phys' ?

I had originally done it as virt_to_phys; however, I then noticed in
swiotlb_free that the bootmem was being converted to a physical address
via __pa. I did a bit of digging and everything seemed to indicate that
the preferred approach in early boot to get a physical address was __pa,
so I decided to switch it from virt_to_phys to __pa for the early init
versions of the calls. If virt_to_phys is preferred though I can switch
it back.

>>
>> /*
>> * Allocate and initialize the free list array. This array is used
>> @@ -171,6 +172,7 @@ void __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
>> static void __init
>> swiotlb_init_with_default_size(size_t default_size, int verbose)
>> {
>> + unsigned char *vstart;
>> unsigned long bytes;
>>
>> if (!io_tlb_nslabs) {
>> @@ -183,11 +185,11 @@ swiotlb_init_with_default_size(size_t default_size, int verbose)
>> /*
>> * Get IO TLB memory from the low pages
>> */
>> - io_tlb_start = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
>> - if (!io_tlb_start)
>> + vstart = alloc_bootmem_low_pages(PAGE_ALIGN(bytes));
>> + if (!vstart)
>> panic("Cannot allocate SWIOTLB buffer");
>>
>> - swiotlb_init_with_tbl(io_tlb_start, io_tlb_nslabs, verbose);
>> + swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose);
>> }
>>
>> void __init
>> @@ -205,6 +207,7 @@ int
>> swiotlb_late_init_with_default_size(size_t default_size)
>> {
>> unsigned long bytes, req_nslabs = io_tlb_nslabs;
>> + unsigned char *vstart = NULL;
>> unsigned int order;
>> int rc = 0;
>>
>> @@ -221,14 +224,14 @@ swiotlb_late_init_with_default_size(size_t default_size)
>> bytes = io_tlb_nslabs << IO_TLB_SHIFT;
>>
>> while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
>> - io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
>> - order);
>> - if (io_tlb_start)
>> + vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
>> + order);
>> + if (vstart)
>> break;
>> order--;
>> }
>>
>> - if (!io_tlb_start) {
>> + if (!vstart) {
>> io_tlb_nslabs = req_nslabs;
>> return -ENOMEM;
>> }
>> @@ -237,9 +240,9 @@ swiotlb_late_init_with_default_size(size_t default_size)
>> "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
>> io_tlb_nslabs = SLABS_PER_PAGE << order;
>> }
>> - rc = swiotlb_late_init_with_tbl(io_tlb_start, io_tlb_nslabs);
>> + rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs);
>> if (rc)
>> - free_pages((unsigned long)io_tlb_start, order);
>> + free_pages((unsigned long)vstart, order);
>> return rc;
>> }
>>
>> @@ -251,9 +254,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
>> bytes = nslabs << IO_TLB_SHIFT;
>>
>> io_tlb_nslabs = nslabs;
>> - io_tlb_start = tlb;
>> + io_tlb_start = virt_to_phys(tlb);
>>
>> - memset(io_tlb_start, 0, bytes);
>> + memset(tlb, 0, bytes);
>>
>> /*
>> * Allocate and initialize the free list array. This array is used
>> @@ -300,7 +303,7 @@ cleanup3:
>> sizeof(int)));
>> io_tlb_list = NULL;
>> cleanup2:
>> - io_tlb_start = NULL;
>> + io_tlb_start = 0;
>> io_tlb_nslabs = 0;
>> return -ENOMEM;
>> }
>> @@ -317,7 +320,7 @@ void __init swiotlb_free(void)
>> get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
>> free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
>> sizeof(int)));
>> - free_pages((unsigned long)io_tlb_start,
>> + free_pages((unsigned long)phys_to_virt(io_tlb_start),
>> get_order(io_tlb_nslabs << IO_TLB_SHIFT));
>> } else {
>> free_bootmem_late(__pa(io_tlb_overflow_buffer),
>> @@ -326,7 +329,7 @@ void __init swiotlb_free(void)
>> PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
>> free_bootmem_late(__pa(io_tlb_list),
>> PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
>> - free_bootmem_late(__pa(io_tlb_start),
>> + free_bootmem_late(io_tlb_start,
>> PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
>> }
>> io_tlb_nslabs = 0;
>> @@ -334,10 +337,8 @@ void __init swiotlb_free(void)
>>
>> static int is_swiotlb_buffer(phys_addr_t paddr)
>> {
>> - phys_addr_t swiotlb_start = virt_to_phys(io_tlb_start);
>> -
>> - return paddr >= swiotlb_start &&
>> - paddr < (swiotlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
>> + return paddr >= io_tlb_start &&
>> + paddr < (io_tlb_start + (io_tlb_nslabs << IO_TLB_SHIFT));
>> }
>>
>> /*
>> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
>> io_tlb_list[i] = 0;
>> for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
>> io_tlb_list[i] = ++count;
>> - dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
>> + dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);
> I think this is going to fall flat with the other user of
> swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
> and does it magic to make sure its under 4GB - the io_tlb_start swath
> of memory, ends up consisting of 2MB chunks of contingous spaces. But each
> chunk is not linearly in the DMA space (thought it is in the CPU space).
>
> Meaning the io_tlb_start region 0-2MB can fall within the DMA address space
> of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 1024MB->1026MB,
> and so on (depending on the availability of memory under 4GB).
>
> There is a clear virt_to_phys(x) != virt_to_dma(x).

Just to be sure I understand, you are talking about DMA address space,
not physical address space, correct? I am fully aware that DMA address
space can be all over the place. When I was writing the patch set, the
big reason I decided to stop at physical address space was that DMA
address spaces are device specific.

I understand that virt_to_phys(x) != virt_to_dma(x) for many platforms;
however, that is not my assertion. My assertion is (virt_to_phys(x) + y)
== virt_to_phys(x + y). This should be true for any large block of
contiguous memory that is DMA accessible, since the CPU and the device
should be able to view the memory in the same layout. If that weren't
true, I don't think is_swiotlb_buffer would be working correctly, since
it is essentially operating on the same assumption prior to my patches.
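
Within the kernel's direct mapping, where virt_to_phys() is effectively
an offset subtraction, the identity is just:

        virt_to_phys(x + y) = (x + y) - PAGE_OFFSET
                            = (x - PAGE_OFFSET) + y
                            = virt_to_phys(x) + y

(assuming x and x + y both sit inside the direct mapping, which is the
case for the bounce buffer allocations here).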

If you take a look at patches 4 and 5 I do address changes that end up
needing to be made to Xen SWIOTLB since it makes use of
swiotlb_tbl_map_single. All that I effectively end up changing is that
instead of messing with a void pointer we instead are dealing with a
physical address, and instead of calling xen_virt_to_bus we end up
calling xen_phys_to_bus and thereby drop one extra virt_to_phys call in
the process.

Thanks,

Alex

2012-10-04 17:30:54

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

> >> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
> >> io_tlb_list[i] = 0;
> >> for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
> >> io_tlb_list[i] = ++count;
> >> - dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
> >> + dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);
> > I think this is going to fall flat with the other user of
> > swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
> > and does it magic to make sure its under 4GB - the io_tlb_start swath
> > of memory, ends up consisting of 2MB chunks of contingous spaces. But each
> > chunk is not linearly in the DMA space (thought it is in the CPU space).
> >
> > Meaning the io_tlb_start region 0-2MB can fall within the DMA address space
> > of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 1024MB->1026MB,
> > and so on (depending on the availability of memory under 4GB).
> >
> > There is a clear virt_to_phys(x) != virt_to_dma(x).
>
> Just to be sure I understand you are talking about DMA address space,
> not physical address space correct? I am fully aware that DMA address
> space can be all over the place. When I was writing the patch set the
> big reason why I decided to stop at physical address space was because
> DMA address spaces are device specific.
>
> I understand that virt_to_phys(x) != virt_to_dma(x) for many platforms,
> however that is not my assertion. My assertion is (virt_to_phys(x) + y)
> == virt_to_phys(x + y). This should be true for any large block of
> contiguous memory that is DMA accessible since the CPU and the device
> should be able to view the memory in the same layout. If that wasn't

That is true mostly for x86 but not all platforms do this.

> true I don't think is_swiotlb_buffer would be working correctly since it
> is essentially operating on the same assumption prior to my patches.

There are two pieces here - the is_swiotlb_buffer and the swiotlb_tbl_[map|unmap]
functions.

The is_swiotlb_buffer is operating on that principle (and your change
to reflect that is OK). The swiotlb_tbl_[*] is not.
>
> If you take a look at patches 4 and 5 I do address changes that end up
> needing to be made to Xen SWIOTLB since it makes use of
> swiotlb_tbl_map_single. All that I effectively end up changing is that
> instead of messing with a void pointer we instead are dealing with a
> physical address, and instead of calling xen_virt_to_bus we end up
> calling xen_phys_to_bus and thereby drop one extra virt_to_phys call in
> the process.

Sure that is OK. All of those changes when we bypass the bounce
buffer look OK (though I should double-check the patch again to make
sure and also just take it for a little test spin).

The issue is when we do _use_ the bounce buffer. At that point we
run into the allocation from the bounce buffer, where the patches
assume that the 64MB swath of bounce buffer memory is contiguous in
bus (or DMA) memory. And that is sadly not the case.

>
> Thanks,
>
> Alex

2012-10-04 17:56:57

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On 10/04/2012 06:33 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 03, 2012 at 05:38:41PM -0700, Alexander Duyck wrote:
>> While working on 10Gb/s routing performance I found a significant amount of
>> time was being spent in the swiotlb DMA handler. Further digging found that a
>> significant amount of this was due to the fact that virtual to physical
>> address translation and calling the function that did it. It accounted for
>> nearly 60% of the total overhead.
>>
>> This patch set works to resolve that by changing the io_tlb_start address and
>> io_tlb_overflow_buffer address from virtual addresses to physical addresses.
> The assertion in your patches is that the DMA addresses (bus address)
> are linear is not applicable (unfortunatly). Meaning virt_to_phys() !=
> virt_to_dma().

That isn't my assertion. My assertion is that virt_to_phys(x + y) ==
(virt_to_phys(x) + y).

> Now, on x86 and ia64 it is true - but one of the users of swiotlb
> library is the Xen swiotlb - which cannot guarantee that the physical
> address are 1-1 with the bus addresses. Hence the bulk of dealing with
> figuring out the right physical to bus address is done in Xen-SWIOTLB
> and the basics of finding an entry in the bounce buffer (if needed),
> mapping it, unmapping ,etc is carried out by the generic library.
>
> This is sad - b/c your patches are a good move forward.

I think if you take a second look you will find you might be taking
things one logical step too far. From what I can tell my assertion is
correct. I believe the bits you are talking about don't apply until you
use the xen_virt_to_bus/xen_phys_to_bus call, and the only difference
between those two calls is a virt_to_phys, which is what I am eliminating.

> Perhaps another way to do this is by having your patches split the
> lookups in "chunks". Wherein we validate in swiotlb_init_*tbl that the
> 'tbl' (so the bounce buffer) is linear - if not, we split it up in
> chunks. Perhaps the various backends can be responsible for this since
> they would know which of their memory regions are linear or not. But
> that sounds complicated and we don't want to complicate this library.
>
> Or another way would be to provide 'phys_to_virt' and 'virt_to_phys'
> functions for the swiotlb_tbl_{map|unmap}_single and the main library
> (lib/swiotlb.c) can decide to use them. If they are NULL, then it
> would do what your patches suggested. If they are defined, then
> carry out those lookups on the 'empty-to-be-used' bounce buffer
> address. Hm, that sounds like a better way of doing it.
>

I don't see any special phys_to_virt or virt_to_phys calls available for
Xen. The only calls I do see are phys_to_machine and machine_to_phys,
which seem to translate between physical addresses and those used
for DMA. If that is the case I should be fine, because I am not going as
far as translating io_tlb_start into a DMA address; I am only taking
it to a physical one.

I am not asserting that the DMA memory is contiguous. I am asserting
that from the CPU perspective the physical memory is contiguous, which I
believe you already agreed with. From what I can tell this should be
fine, since almost all of the virt_to_phys calls out there are just doing
offset manipulation and not breaking the memory up into discrete
chunks. The sectioning up of the segments in Xen should happen after we
have taken care of the virt_to_phys work, so the bounce buffer case
should work out correctly.

Thanks,

Alex

2012-10-04 20:23:01

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

On 10/04/2012 10:19 AM, Konrad Rzeszutek Wilk wrote:
>>>> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
>>>> io_tlb_list[i] = 0;
>>>> for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
>>>> io_tlb_list[i] = ++count;
>>>> - dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
>>>> + dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);
>>> I think this is going to fall flat with the other user of
>>> swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
>>> and does it magic to make sure its under 4GB - the io_tlb_start swath
>>> of memory, ends up consisting of 2MB chunks of contingous spaces. But each
>>> chunk is not linearly in the DMA space (thought it is in the CPU space).
>>>
>>> Meaning the io_tlb_start region 0-2MB can fall within the DMA address space
>>> of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 1024MB->1026MB,
>>> and so on (depending on the availability of memory under 4GB).
>>>
>>> There is a clear virt_to_phys(x) != virt_to_dma(x).
>> Just to be sure I understand you are talking about DMA address space,
>> not physical address space correct? I am fully aware that DMA address
>> space can be all over the place. When I was writing the patch set the
>> big reason why I decided to stop at physical address space was because
>> DMA address spaces are device specific.
>>
>> I understand that virt_to_phys(x) != virt_to_dma(x) for many platforms,
>> however that is not my assertion. My assertion is (virt_to_phys(x) + y)
>> == virt_to_phys(x + y). This should be true for any large block of
>> contiguous memory that is DMA accessible since the CPU and the device
>> should be able to view the memory in the same layout. If that wasn't
> That is true mostly for x86 but not all platforms do this.
>
>> true I don't think is_swiotlb_buffer would be working correctly since it
>> is essentially operating on the same assumption prior to my patches.
> There are two pieces here - the is_swiotlb_buffer and the swiotlb_tbl_[map|unmap]
> functions.
>
> The is_swiotlb_buffer is operating on that principle (and your change
> to reflect that is OK). The swiotlb_tbl_[*] is not.
>> If you take a look at patches 4 and 5 I do address changes that end up
>> needing to be made to Xen SWIOTLB since it makes use of
>> swiotlb_tbl_map_single. All that I effectively end up changing is that
>> instead of messing with a void pointer we instead are dealing with a
>> physical address, and instead of calling xen_virt_to_bus we end up
>> calling xen_phys_to_bus and thereby drop one extra virt_to_phys call in
>> the process.
> Sure that is OK. All of those changes when we bypass the bounce
> buffer look OK (thought I should double-check again the patch to make
> sure and also just take it for a little test spin).

I'm interested in finding out what the results of your test spin are.

> The issue is when we do _use_ the bounce buffer. At that point we
> run into the allocation from the bounce buffer where the patches
> assume that the 64MB swath of bounce buffer memory is bus (or DMA)
> memory contingous. And that is not the case sadly.

I think I understand what you are saying now. However, I don't think
the issue applies to my patches.

If I am not mistaken, what you are talking about is pseudo-physical
memory versus machine memory. I understand the 64MB block is not
machine-memory contiguous, but it should be contiguous in pseudo-physical
memory. As such, using pseudo-physical addresses instead of virtual
addresses should function the same way as using true physical addresses
to replace virtual addresses. All of the physical-to-machine memory
translation is happening in xen_phys_to_bus, and all of the changes I
have made take place before that, so the bounce buffers should still be
working correctly. In addition, none of the changes I have made alter
the bounce buffer boundary assumptions, so we should have no bounce
buffers mapped across the 2MB boundaries.
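
Put another way, the map path only ever does offset arithmetic on the
(contiguous) pseudo-physical side, and the pseudo-physical-to-machine
translation happens as the very last step. Conceptually (simplified
pseudo-code, with xen_phys_to_bus being the existing Xen helper):

        /* linear arithmetic on pseudo-physical addresses */
        phys_addr_t tlb_addr = io_tlb_start + (index << IO_TLB_SHIFT);

        /* per-chunk pseudo-physical -> machine lookup, done last */
        dma_addr_t dev_addr = xen_phys_to_bus(tlb_addr);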

Thanks,

Alex

2012-10-05 16:55:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

Alexander Duyck <[email protected]> writes:

> While working on 10Gb/s routing performance I found a significant amount of
> time was being spent in the swiotlb DMA handler. Further digging found that a
> significant amount of this was due to the fact that virtual to physical
> address translation and calling the function that did it. It accounted for
> nearly 60% of the total overhead.

Can you find out why that is? Traditionally virt_to_phys was just a
subtraction. Then later on it was an if and a subtraction.

It cannot really be that expensive. Do you have some debugging enabled?

Really virt_to_phys should be fixed. Such fundamental operations
shouldn't be slow. I don't think hacking up all the users to work
around this is the right way.

Looking at the code a bit someone (crazy) made it out of line.
But that cannot explain that much overhead.


-Andi

--
[email protected] -- Speaking for myself only

2012-10-05 19:35:11

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On 10/05/2012 09:55 AM, Andi Kleen wrote:
> Alexander Duyck <[email protected]> writes:
>
>> While working on 10Gb/s routing performance I found a significant amount of
>> time was being spent in the swiotlb DMA handler. Further digging found that a
>> significant amount of this was due to the fact that virtual to physical
>> address translation and calling the function that did it. It accounted for
>> nearly 60% of the total overhead.
> Can you find out why that is? Traditionally virt_to_phys was just a
> subtraction. Then later on it was a if and a subtraction.
>
> It cannot really be that expensive. Do you have some debugging enabled?
>
> Really virt_to_phys should be fixed. Such fundamental operations
> shouldn't slow. I don't think hacking up all the users to work
> around this is the r ight way.
>
> Looking at the code a bit someone (crazy) made it out of line.
> But that cannot explain that much overhead.
>
>
> -Andi
>

I was thinking the issue was all of the calls to relatively small
functions occurring in quick succession. The way most of this code is
set up, it seems like it is one small function call in turn calling
another, and then another, and I would imagine the code fragmentation
can have a significant negative impact.

For example, just the first patch in the series is enough to see a
significant performance gain, and that is simply due to the fact that
is_swiotlb_buffer becomes inlined when I built it on my system. The
basic idea I had with these patches was to avoid making multiple calls
in quick succession and instead just have all the data right there, so
that the swiotlb functions don't need to make many external calls, at
least not until they are actually dealing with bounce buffers, which
are slower due to locking anyway.

Thanks,

Alex

2012-10-05 20:02:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

> I was thinking the issue was all of the calls to relatively small
> functions occurring in quick succession. The way most of this code is
> setup it seems like it is one small function call in turn calling
> another, and then another, and I would imagine the code fragmentation
> can have a significant negative impact.

Maybe. Can you just inline everything and see if it's faster then?

This was out of line when the "text cost at all costs" drive was still
in vogue, but luckily we're not doing that anymore.

-Andi

--
[email protected] -- Speaking for myself only.

2012-10-05 23:23:26

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On 10/05/2012 01:02 PM, Andi Kleen wrote:
>> I was thinking the issue was all of the calls to relatively small
>> functions occurring in quick succession. The way most of this code is
>> setup it seems like it is one small function call in turn calling
>> another, and then another, and I would imagine the code fragmentation
>> can have a significant negative impact.
> Maybe. Can you just inline everything and see if it it's faster then?
>
> This was out of line when the "text cost at all costs" drive was still
> envogue, but luckily we're not doing that anymore.
>
> -Andiu
>

Inlining everything did speed things up a bit, but I still didn't reach
the same speed I achieved using the patch set. However, I did notice the
resulting swiotlb code was considerably larger.

I did a bit more digging and the issue may actually be simple repetition
of the calls. By my math it would seem we end up calling
is_swiotlb_buffer 3 times per packet in the routing test case: once in
sync_for_cpu and once in sync_for_device in the Rx cleanup path, and
once in unmap_page in the Tx cleanup path. Each call to
is_swiotlb_buffer will result in 2 calls to __phys_addr. In freeing the
skb we end up doing a call to virt_to_head_page, which will call
__phys_addr. In addition we end up mapping the skb using map_single, so
we end up using __phys_addr to do a virt_to_page translation in the
xmit_frame_ring path, and then call __phys_addr when we check
dma_mapping_error. So in total that ends up being 3 calls to
is_swiotlb_buffer and 9 calls to __phys_addr per packet routed.

With the patches the is_swiotlb_buffer function, which was 25 lines of
assembly, is replaced with 8 lines of assembly and becomes inline. In
addition we drop the number of calls to __phys_addr from 9 to 2 by
dropping them all from swiotlb. By my math I am probably saving about
120 instructions per packet. I suspect cutting that many instructions
per packet is enough to account for a 5% difference when you consider
I am running at about 1.5Mpps per core on a 2.7GHz processor.
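
As a back-of-the-envelope check on those numbers: 2.7GHz divided by
1.5Mpps is roughly 1800 cycles per packet, so a 5% gain is on the order
of 90 cycles - which is in the same ballpark as ~120 saved instructions
at roughly one instruction per cycle.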

Thanks,

Alex

2012-10-06 17:57:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

> Inlining everything did speed things up a bit, but I still didn't reach
> the same speed I achieved using the patch set. However I did notice the
> resulting swiotlb code was considerably larger.

Thanks. So your patch makes sense, but imho you should pursue the
inlining in parallel for the other call sites.

> assembly, is replaced with 8 lines of assembly and becomes inline. In
> addition we drop the number of calls to __phys_addr from 9 to 2 by
> dropping them all from swiotlb. By my math I am probably saving about
> 120 instructions per packet. I suspect all of that would probably be
> cutting the number of instructions per packet enough to probably account
> for a 5% difference when you consider I am running at about 1.5Mpps per
> core on a 2.7Ghz processor.

Maybe it's just me, but that's somehow sad for one if() and a subtraction

BTW __pa used to be a simple subtraction, the if () was just added to
handle the few call sites for x86-64 that do __pa(&text_symbol).
Maybe we should just go back to the old __pa_symbol() for those cases,
then __pa could be the simple subtraction it used to be again
and it could be inlined and everyone would be happy.
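
For context, a simplified sketch of the two shapes being discussed
(loosely based on the x86_64 code of that era; the debug checks are
elided, and __pa_direct is just a made-up name for the old behavior):

/* old-style: a plain subtraction, trivially inlinable */
static inline unsigned long __pa_direct(unsigned long x)
{
        return x - PAGE_OFFSET;
}

/* current-style: the extra branch that also handles __pa(&text_symbol) */
unsigned long __phys_addr(unsigned long x)
{
        if (x >= __START_KERNEL_map)
                return x - __START_KERNEL_map + phys_base;
        return x - PAGE_OFFSET;
}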

-Andi
--
[email protected] -- Speaking for myself only.

2012-10-06 18:56:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On 10/06/2012 10:57 AM, Andi Kleen wrote:
>
> Maybe it's just me, but that's somehow sad for one if() and a subtraction
>
> BTW __pa used to be a simple subtraction, the if () was just added to
> handle the few call sites for x86-64 that do __pa(&text_symbol).
> Maybe we should just go back to the old __pa_symbol() for those cases,
> then __pa could be the simple subtraction it used to was again
> and it could be inlined and everyone would be happy.
>

I wonder how much the double-mapping actually buys us, if we could get a
PIE-type memory model that doesn't try to create GOT references.

We would unambiguously lose out on the cases where we currently can do
symbol(index), but I wonder how much the impact actually is.

The advantage would be that it would give us way more room to do kernel
space ASR.

-hpa

2012-10-08 15:43:15

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On 10/06/2012 10:57 AM, Andi Kleen wrote:
>> Inlining everything did speed things up a bit, but I still didn't reach
>> the same speed I achieved using the patch set. However I did notice the
>> resulting swiotlb code was considerably larger.
> Thanks. So your patch makes sense, but imho should pursue the inlining
> in parallel for other call sites.

I'll try to take a look at getting that done this morning.

>> assembly, is replaced with 8 lines of assembly and becomes inline. In
>> addition we drop the number of calls to __phys_addr from 9 to 2 by
>> dropping them all from swiotlb. By my math I am probably saving about
>> 120 instructions per packet. I suspect all of that would probably be
>> cutting the number of instructions per packet enough to probably account
>> for a 5% difference when you consider I am running at about 1.5Mpps per
>> core on a 2.7Ghz processor.
> Maybe it's just me, but that's somehow sad for one if() and a subtraction

Well there is also all of the setup of the call on the function stack.
By my count just the portion that is used in the standard case is about
9 lines of assembly. By inlining it and dropping the if case we can
probably drop it to 1.

> BTW __pa used to be a simple subtraction, the if () was just added to
> handle the few call sites for x86-64 that do __pa(&text_symbol).
> Maybe we should just go back to the old __pa_symbol() for those cases,
> then __pa could be the simple subtraction it used to was again
> and it could be inlined and everyone would be happy.
>
> -Andi

What I am probably looking at doing is splitting the function in two, as
you suggest, with a separate function for the text symbol case.
I will probably also take the 32-bit approach and add a debug version
that is still a separate function, for uses such as determining if we
have any callers who should be using __pa_symbol instead of __pa.

Thanks,

Alex

2012-10-09 16:55:35

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

On Thu, Oct 04, 2012 at 01:22:58PM -0700, Alexander Duyck wrote:
> On 10/04/2012 10:19 AM, Konrad Rzeszutek Wilk wrote:
> >>>> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
> >>>> io_tlb_list[i] = 0;
> >>>> for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
> >>>> io_tlb_list[i] = ++count;
> >>>> - dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
> >>>> + dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);
> >>> I think this is going to fall flat with the other user of
> >>> swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
> >>> and does it magic to make sure its under 4GB - the io_tlb_start swath
> >>> of memory, ends up consisting of 2MB chunks of contingous spaces. But each
> >>> chunk is not linearly in the DMA space (thought it is in the CPU space).
> >>>
> >>> Meaning the io_tlb_start region 0-2MB can fall within the DMA address space
> >>> of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 1024MB->1026MB,
> >>> and so on (depending on the availability of memory under 4GB).
> >>>
> >>> There is a clear virt_to_phys(x) != virt_to_dma(x).
> >> Just to be sure I understand you are talking about DMA address space,
> >> not physical address space correct? I am fully aware that DMA address
> >> space can be all over the place. When I was writing the patch set the
> >> big reason why I decided to stop at physical address space was because
> >> DMA address spaces are device specific.
> >>
> >> I understand that virt_to_phys(x) != virt_to_dma(x) for many platforms,
> >> however that is not my assertion. My assertion is (virt_to_phys(x) + y)
> >> == virt_to_phys(x + y). This should be true for any large block of
> >> contiguous memory that is DMA accessible since the CPU and the device
> >> should be able to view the memory in the same layout. If that wasn't
> > That is true mostly for x86 but not all platforms do this.
> >
> >> true I don't think is_swiotlb_buffer would be working correctly since it
> >> is essentially operating on the same assumption prior to my patches.
> > There are two pieces here - the is_swiotlb_buffer and the swiotlb_tbl_[map|unmap]
> > functions.
> >
> > The is_swiotlb_buffer is operating on that principle (and your change
> > to reflect that is OK). The swiotlb_tbl_[*] is not.
> >> If you take a look at patches 4 and 5 I do address changes that end up
> >> needing to be made to Xen SWIOTLB since it makes use of
> >> swiotlb_tbl_map_single. All that I effectively end up changing is that
> >> instead of messing with a void pointer we instead are dealing with a
> >> physical address, and instead of calling xen_virt_to_bus we end up
> >> calling xen_phys_to_bus and thereby drop one extra virt_to_phys call in
> >> the process.
> > Sure that is OK. All of those changes when we bypass the bounce
> > buffer look OK (thought I should double-check again the patch to make
> > sure and also just take it for a little test spin).
>
> I'm interesting in finding out what the results of your test spin are.

Haven't gotten to that yet :-(
>
> > The issue is when we do _use_ the bounce buffer. At that point we
> > run into the allocation from the bounce buffer where the patches
> > assume that the 64MB swath of bounce buffer memory is bus (or DMA)
> > memory contingous. And that is not the case sadly.
>
> I think I understand what you are saying now. However, I don't think
> the issue applies to my patches.

Great.
>
> If I am not mistaken what you are talking about is the pseudo-physical
> memory versus machine memory. I understand the 64MB block is not
> machine-memory contiguous, but it should be pseudo-physical contiguous
> memory. As such using the pseudo-physical addresses instead of virtual
> addresses should function the same way as using true physical addresses
> to replace virtual addresses. All of the physical memory translation to
> machine memory translation is happening in xen_phys_to_bus and all of
> the changes I have made take place before that so the bounce buffers
> should still be working correctly. In addition none of the changes I

OK.
> have made change the bounce buffer boundary assumptions so we should
> have no bounce buffers mapped across the 2MB boundaries.
>
> Thanks,
>
> Alex
>

2012-10-09 19:05:12

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Improve swiotlb performance by using physical addresses

On 10/08/2012 08:43 AM, Alexander Duyck wrote:
> On 10/06/2012 10:57 AM, Andi Kleen wrote:
>> BTW __pa used to be a simple subtraction, the if () was just added to
>> handle the few call sites for x86-64 that do __pa(&text_symbol).
>> Maybe we should just go back to the old __pa_symbol() for those cases,
>> then __pa could be the simple subtraction it used to was again
>> and it could be inlined and everyone would be happy.
>>
>> -Andi
> What I am probably looking at doing is splitting the function in two as
> you suggest where we have a separate function for the text symbol case.
> I will probably also take the 32 bit approach and add a debug version
> that is still a separate function for uses such as determining if we
> have any callers who should be using __pa_symbol instead of __pa.
>
> Thanks,
>
> Alex

I gave up on trying to split __pa and __pa_symbol. Yesterday I
realized there is way too much code that depends on the two resolving to
the same function, and many cases are pretty well hidden. Instead I
just mailed out a patch that inlines an optimized version of
__phys_addr. I figure it is probably as good as it is going to get
without having to rip the entire x86 portion of the kernel apart to
separate uses of __pa and __pa_symbol.

Thanks,

Alex

2012-10-09 19:11:41

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] swiotlb: Make io_tlb_start a physical address instead of a virtual address

On 10/09/2012 09:43 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 04, 2012 at 01:22:58PM -0700, Alexander Duyck wrote:
>> On 10/04/2012 10:19 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> @@ -450,7 +451,7 @@ void *swiotlb_tbl_map_single(struct device *hwdev, dma_addr_t tbl_dma_addr,
>>>>>> io_tlb_list[i] = 0;
>>>>>> for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
>>>>>> io_tlb_list[i] = ++count;
>>>>>> - dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
>>>>>> + dma_addr = (char *)phys_to_virt(io_tlb_start) + (index << IO_TLB_SHIFT);
>>>>> I think this is going to fall flat with the other user of
>>>>> swiotlb_tbl_map_single - Xen SWIOTLB. When it allocates the io_tlb_start
>>>>> and does it magic to make sure its under 4GB - the io_tlb_start swath
>>>>> of memory, ends up consisting of 2MB chunks of contingous spaces. But each
>>>>> chunk is not linearly in the DMA space (thought it is in the CPU space).
>>>>>
>>>>> Meaning the io_tlb_start region 0-2MB can fall within the DMA address space
>>>>> of 2048MB->2032MB, and io_tlb_start offset 2MB->4MB, can fall within 1024MB->1026MB,
>>>>> and so on (depending on the availability of memory under 4GB).
>>>>>
>>>>> There is a clear virt_to_phys(x) != virt_to_dma(x).
>>>> Just to be sure I understand you are talking about DMA address space,
>>>> not physical address space correct? I am fully aware that DMA address
>>>> space can be all over the place. When I was writing the patch set the
>>>> big reason why I decided to stop at physical address space was because
>>>> DMA address spaces are device specific.
>>>>
>>>> I understand that virt_to_phys(x) != virt_to_dma(x) for many platforms,
>>>> however that is not my assertion. My assertion is (virt_to_phys(x) + y)
>>>> == virt_to_phys(x + y). This should be true for any large block of
>>>> contiguous memory that is DMA accessible since the CPU and the device
>>>> should be able to view the memory in the same layout. If that wasn't
>>> That is true mostly for x86 but not all platforms do this.
>>>
>>>> true I don't think is_swiotlb_buffer would be working correctly since it
>>>> is essentially operating on the same assumption prior to my patches.
>>> There are two pieces here - the is_swiotlb_buffer and the swiotlb_tbl_[map|unmap]
>>> functions.
>>>
>>> The is_swiotlb_buffer is operating on that principle (and your change
>>> to reflect that is OK). The swiotlb_tbl_[*] is not.
>>>> If you take a look at patches 4 and 5 I do address changes that end up
>>>> needing to be made to Xen SWIOTLB since it makes use of
>>>> swiotlb_tbl_map_single. All that I effectively end up changing is that
>>>> instead of messing with a void pointer we instead are dealing with a
>>>> physical address, and instead of calling xen_virt_to_bus we end up
>>>> calling xen_phys_to_bus and thereby drop one extra virt_to_phys call in
>>>> the process.
>>> Sure that is OK. All of those changes when we bypass the bounce
>>> buffer look OK (thought I should double-check again the patch to make
>>> sure and also just take it for a little test spin).
>> I'm interesting in finding out what the results of your test spin are.
> Haven't gotten to that yet :-(
>>> The issue is when we do _use_ the bounce buffer. At that point we
>>> run into the allocation from the bounce buffer where the patches
>>> assume that the 64MB swath of bounce buffer memory is bus (or DMA)
>>> memory contingous. And that is not the case sadly.
>> I think I understand what you are saying now. However, I don't think
>> the issue applies to my patches.
> Great.
>> If I am not mistaken what you are talking about is the pseudo-physical
>> memory versus machine memory. I understand the 64MB block is not
>> machine-memory contiguous, but it should be pseudo-physical contiguous
>> memory. As such using the pseudo-physical addresses instead of virtual
>> addresses should function the same way as using true physical addresses
>> to replace virtual addresses. All of the physical memory translation to
>> machine memory translation is happening in xen_phys_to_bus and all of
>> the changes I have made take place before that so the bounce buffers
>> should still be working correctly. In addition none of the changes I
> OK.
>

I don't know if you saw, but I submitted the patches, non-RFC.

I think I might have realized the point of confusion and attempted to
address it. I went through and renamed several variables in the updated
patches. Specifically, I renamed things like "char *dma_addr" to
"phys_addr_t tlb_addr". I figure it will help to avoid any confusion, as
I can see how "phys_addr_t dma_addr" could make someone think I am using
physical addresses as DMA addresses.
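
For example (paraphrasing; see the non-RFC posting for the exact
signatures), a prototype like

        void swiotlb_tbl_sync_single(struct device *hwdev, char *dma_addr,
                                     size_t size, enum dma_data_direction dir,
                                     enum dma_sync_target target);

becomes roughly

        void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t tlb_addr,
                                     size_t size, enum dma_data_direction dir,
                                     enum dma_sync_target target);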

Thanks,

Alex