2015-06-05 21:21:50

by Dan Williams

Subject: [PATCH v4 0/9] introduce __pfn_t, evacuate struct page from sgls

Introduce __pfn_t which:

1/ Allows kernel internal DAX mappings to adhere to the lifetime of the
underlying block device. In general, it enables a mechanism to allow
any device driver to advertise "device memory" (CONFIG_DEV_PFN) to
other parts of the kernel.

2/ Replaces usage of struct page in struct scatterlist. A scatterlist
need only carry enough information to generate a dma address, and
removing struct page from scatterlists is a precursor to allowing DMA to
device memory. Some dma mapping implementations are not ready for a
scatterlist-pfn to reference unmapped device memory; those
implementations are disabled by CONFIG_DEV_PFN=y.

Changes since v3 [1]:

1/ Drop the bio_vec conversion of struct page to __pfn_t for now. Wait
until there's a hierarchical block driver that would make use of direct
dma to pmem. (Christoph)

2/ Reorder the patch set to put the dax fixes first.

3/ Unconditionally convert struct scatterlist to use a pfn. Strictly
speaking the scatterlist conversion could also be deferred until we have
a driver that attempts dma to pmem, but struct scatterlist really has no
valid reason to carry a struct page. (Christoph)

4/ Rebased on block.git/for-next

---

Dan Williams (9):
introduce __pfn_t for scatterlists and pmem
x86: support kmap_atomic_pfn_t() for persistent memory
dax: drop size parameter to ->direct_access()
dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
dma-mapping: allow archs to optionally specify a ->map_pfn() operation
scatterlist: use sg_phys()
scatterlist: cleanup sg_chain() and sg_unmark_end()
scatterlist: convert to __pfn_t
x86: convert dma_map_ops to support mapping a __pfn_t.


arch/Kconfig | 6 +
arch/arm/mm/dma-mapping.c | 2
arch/microblaze/kernel/dma.c | 2
arch/powerpc/sysdev/axonram.c | 26 ++++--
arch/x86/Kconfig | 7 ++
arch/x86/kernel/amd_gart_64.c | 22 ++++-
arch/x86/kernel/pci-nommu.c | 22 ++++-
arch/x86/kernel/pci-swiotlb.c | 4 +
arch/x86/pci/sta2x11-fixup.c | 4 +
block/blk-merge.c | 2
drivers/block/brd.c | 9 --
drivers/block/pmem.c | 16 +++
drivers/crypto/omap-sham.c | 2
drivers/dma/imx-dma.c | 8 --
drivers/dma/ste_dma40.c | 5 -
drivers/iommu/amd_iommu.c | 21 +++--
drivers/iommu/intel-iommu.c | 26 ++++--
drivers/iommu/iommu.c | 2
drivers/mmc/card/queue.c | 4 -
drivers/pci/Kconfig | 2
drivers/s390/block/dcssblk.c | 26 +++++-
drivers/staging/android/ion/ion_chunk_heap.c | 4 -
fs/block_dev.c | 4 -
fs/dax.c | 62 +++++++++++--
include/asm-generic/dma-mapping-common.h | 30 +++++++
include/asm-generic/memory_model.h | 1
include/asm-generic/pfn.h | 120 ++++++++++++++++++++++++++
include/crypto/scatterwalk.h | 9 --
include/linux/blkdev.h | 7 +-
include/linux/dma-debug.h | 23 ++++-
include/linux/dma-mapping.h | 8 ++
include/linux/highmem.h | 23 +++++
include/linux/mm.h | 1
include/linux/scatterlist.h | 103 ++++++++++++++++------
include/linux/swiotlb.h | 4 +
init/Kconfig | 13 +++
lib/dma-debug.c | 10 +-
lib/swiotlb.c | 20 +++-
mm/Makefile | 1
mm/pfn.c | 98 +++++++++++++++++++++
samples/kfifo/dma-example.c | 8 +-
41 files changed, 626 insertions(+), 141 deletions(-)
create mode 100644 include/asm-generic/pfn.h
create mode 100644 mm/pfn.c


2015-06-05 21:22:27

by Dan Williams

Subject: [PATCH v4 1/9] introduce __pfn_t for scatterlists and pmem

Introduce a type that encapsulates a page-frame-number and can also be
used to encode other information. This other information is the
traditional "page_link" encoding in a scatterlist, but it can also
denote "device memory": a set of pfns that are not part of the kernel's
linear mapping, but are accessed via the same memory controller as ram.
The motivation for this conversion is large capacity persistent memory
that does not, by default, enjoy struct page coverage (entries in the
memmap).

This type will be used to replace usage of 'struct page *' in cases
where only a pfn is required, i.e. scatterlists for drivers, the dma
mapping api, and later biovecs for the block layer. The operations in
those i/o
paths that formerly required a 'struct page *' are converted to
use __pfn_t aware equivalent helpers. The goal is that existing use
cases of data structures referencing struct page are binary identical to
the __pfn_t converted type. Only new use cases for pmem will require
the new __pfn_t helper routines, i.e. __pfn_t is a struct page alias in
the absence of pmem.

It turns out that while 'struct page' references are used broadly in the
kernel I/O stacks the usage of 'struct page' based capabilities is very
shallow for block-i/o. It is only used for populating bio_vecs and
scatterlists for the retrieval of dma addresses, and for temporary
kernel mappings (kmap). Aside from kmap, these usages can be trivially
converted to operate on a pfn.

kmap_atomic() is more problematic, however, as it uses mm
infrastructure, via struct page, to set up and track temporary kernel
mappings. It would be unfortunate if the kmap infrastructure escaped
its 32-bit/HIGHMEM bonds and leaked into 64-bit code. Thankfully, it
seems all that is needed here is to convert kmap_atomic() callers that
want to opt in to supporting persistent memory to use a new
kmap_atomic_pfn_t(), which re-uses the existing ioremap() mapping
established by the driver for persistent memory.

Note that, as far as conceptually understanding __pfn_t is concerned,
'persistent memory' is really any address range in host memory not
covered by the memmap. Contrast this with pure iomem on an mmio-mapped
bus like PCI, which cannot be converted to a dma_addr_t by "pfn <<
PAGE_SHIFT".

Cc: H. Peter Anvin <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/asm-generic/memory_model.h | 1
include/asm-generic/pfn.h | 86 ++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1
init/Kconfig | 13 +++++
4 files changed, 100 insertions(+), 1 deletion(-)
create mode 100644 include/asm-generic/pfn.h

diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..1b0ae21fd8ff 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -70,7 +70,6 @@
#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */

#define page_to_pfn __page_to_pfn
-#define pfn_to_page __pfn_to_page

#endif /* __ASSEMBLY__ */

diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
new file mode 100644
index 000000000000..2f4ae40dc6a7
--- /dev/null
+++ b/include/asm-generic/pfn.h
@@ -0,0 +1,86 @@
+#ifndef __ASM_PFN_H
+#define __ASM_PFN_H
+
+/*
+ * Default pfn to physical address conversion, like most arch
+ * page_to_phys() implementations this resolves to a dma_addr_t as it
+ * should be the size needed for a device to reference this address.
+ */
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn) ((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline struct page *pfn_to_page(unsigned long pfn)
+{
+ return __pfn_to_page(pfn);
+}
+
+/*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page). This type will be used in place of a
+ * 'struct page *' instance in contexts where unmapped memory (usually
+ * persistent memory) is being referenced (scatterlists for drivers,
+ * biovecs for the block layer, etc). Whether a __pfn_t has a struct
+ * page backing is indicated by flags in the low bits of @data.
+ */
+typedef struct {
+ union {
+ unsigned long data;
+ struct page *page;
+ };
+} __pfn_t;
+
+enum {
+#if BITS_PER_LONG == 64
+ PFN_SHIFT = 3,
+ /* device-pfn not covered by memmap */
+ PFN_DEV = (1UL << 2),
+#else
+ PFN_SHIFT = 2,
+#endif
+ PFN_MASK = (1UL << PFN_SHIFT) - 1,
+ PFN_SG_CHAIN = (1UL << 0),
+ PFN_SG_LAST = (1UL << 1),
+};
+
+#ifdef CONFIG_DEV_PFN
+static inline bool __pfn_t_has_page(__pfn_t pfn)
+{
+ return (pfn.data & PFN_MASK) == 0;
+}
+
+#else
+static inline bool __pfn_t_has_page(__pfn_t pfn)
+{
+ return true;
+}
+#endif
+
+static inline struct page *__pfn_t_to_page(__pfn_t pfn)
+{
+ if (!__pfn_t_has_page(pfn))
+ return NULL;
+ return pfn.page;
+}
+
+static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
+{
+ if (__pfn_t_has_page(pfn))
+ return page_to_pfn(pfn.page);
+ return pfn.data >> PFN_SHIFT;
+}
+
+static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
+{
+ if (!__pfn_t_has_page(pfn))
+ return __pfn_to_phys(__pfn_t_to_pfn(pfn));
+ return __pfn_to_phys(page_to_pfn(pfn.page));
+}
+
+static inline __pfn_t page_to_pfn_t(struct page *page)
+{
+ __pfn_t pfn = { .page = page };
+
+ return pfn;
+}
+#endif /* __ASM_PFN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4024543b4203..ae6f9965f3dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -53,6 +53,7 @@ extern int sysctl_legacy_va_layout;
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/processor.h>
+#include <asm-generic/pfn.h>

#ifndef __pa_symbol
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
diff --git a/init/Kconfig b/init/Kconfig
index d4f763332f9f..907ab91a5557 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1769,6 +1769,19 @@ config PROFILING
Say Y here to enable the extended profiling support mechanisms used
by profilers such as OProfile.

+config DEV_PFN
+ depends on 64BIT
+ bool "Support for device provided (pmem, graphics, etc) memory" if EXPERT
+ help
+ Say Y here to enable I/O to/from device provided memory,
+ i.e. reference memory that is not mapped. This is usually
+ the case if you have large quantities of persistent memory
+ relative to DRAM. Enabling this option may increase the
+ kernel size by a few kilobytes as it instructs the kernel
+ that a __pfn_t may reference unmapped memory. Disabling
+ this option instructs the kernel that a __pfn_t always
+ references mapped platform memory.
+
#
# Place an empty function call at each tracepoint site. Can be
# dynamically changed for a probe function.

2015-06-05 21:21:59

by Dan Williams

Subject: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

It would be unfortunate if the kmap infrastructure escaped its current
32-bit/HIGHMEM bonds and leaked into 64-bit code. Instead, if the user
has enabled CONFIG_DEV_PFN we direct the kmap_atomic_pfn_t()
implementation to scan a list of pre-mapped persistent memory address
ranges inserted by the pmem driver.

The __pfn_t to resource lookup is an admittedly inefficient walk of a
linked list, but there are two mitigating factors:

1/ The number of persistent memory ranges is bounded by the number of
DIMMs, which is on the order of tens, not hundreds.

2/ The lookup yields the entire range; if it becomes inefficient to do
a kmap_atomic_pfn_t() one PAGE_SIZE at a time, the caller can take
advantage of the fact that the lookup can be amortized across all kmap
operations it needs to perform in a given range.

Signed-off-by: Dan Williams <[email protected]>
---
arch/Kconfig | 3 +
arch/x86/Kconfig | 5 ++
include/linux/highmem.h | 23 +++++++++++
mm/Makefile | 1
mm/pfn.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 130 insertions(+)
create mode 100644 mm/pfn.c

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..9b98732e0f3b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
config HAVE_DMA_CONTIGUOUS
bool

+config HAVE_KMAP_PFN
+ bool
+
config GENERIC_SMP_IDLE_THREAD
bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..130d1a4c2efc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1432,6 +1432,11 @@ config X86_PMEM_LEGACY

Say Y if unsure.

+config X86_PMEM_DMA
+ depends on !HIGHMEM
+ def_bool DEV_PFN
+ select HAVE_KMAP_PFN
+
config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
depends on HIGHMEM
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 9286a46b7d69..85fd52d43a9a 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -83,6 +83,29 @@ static inline void __kunmap_atomic(void *addr)

#endif /* CONFIG_HIGHMEM */

+#ifdef CONFIG_HAVE_KMAP_PFN
+extern void *kmap_atomic_pfn_t(__pfn_t pfn);
+extern void kunmap_atomic_pfn_t(void *addr);
+extern int devm_register_kmap_pfn_range(struct device *dev,
+ struct resource *res, void *base);
+#else
+static inline void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+ return kmap_atomic(__pfn_t_to_page(pfn));
+}
+
+static inline void kunmap_atomic_pfn_t(void *addr)
+{
+ __kunmap_atomic(addr);
+}
+
+static inline int devm_register_kmap_pfn_range(struct device *dev,
+ struct resource *res, void *base)
+{
+ return 0;
+}
+#endif /* CONFIG_HAVE_KMAP_PFN */
+
#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)

DECLARE_PER_CPU(int, __kmap_atomic_idx);
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eaeabdcb..66e30c2addfe 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_HAVE_KMAP_PFN) += pfn.o
diff --git a/mm/pfn.c b/mm/pfn.c
new file mode 100644
index 000000000000..0e046b49aebf
--- /dev/null
+++ b/mm/pfn.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/highmem.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+static LIST_HEAD(ranges);
+
+struct kmap {
+ struct list_head list;
+ struct resource *res;
+ struct device *dev;
+ void *base;
+};
+
+static void teardown_kmap(void *data)
+{
+ struct kmap *kmap = data;
+
+ dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
+ list_del_rcu(&kmap->list);
+ synchronize_rcu();
+ kfree(kmap);
+}
+
+int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
+ void *base)
+{
+ struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
+ int rc;
+
+ if (!kmap)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&kmap->list);
+ kmap->res = res;
+ kmap->base = base;
+ kmap->dev = dev;
+ rc = devm_add_action(dev, teardown_kmap, kmap);
+ if (rc) {
+ kfree(kmap);
+ return rc;
+ }
+ dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
+ list_add_rcu(&kmap->list, &ranges);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
+
+void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+ struct page *page = __pfn_t_to_page(pfn);
+ resource_size_t addr;
+ struct kmap *kmap;
+
+ if (page)
+ return kmap_atomic(page);
+ addr = __pfn_t_to_phys(pfn);
+ rcu_read_lock();
+ list_for_each_entry_rcu(kmap, &ranges, list)
+ if (addr >= kmap->res->start && addr <= kmap->res->end)
+ return kmap->base + addr - kmap->res->start;
+
+ /* only unlock in the error case */
+ rcu_read_unlock();
+ return NULL;
+}
+EXPORT_SYMBOL(kmap_atomic_pfn_t);
+
+void kunmap_atomic_pfn_t(void *addr)
+{
+ /*
+ * If the original __pfn_t had an entry in the memmap then
+ * 'addr' will be outside of vmalloc space i.e. it came from
+ * page_address()
+ */
+ if (!is_vmalloc_addr(addr)) {
+ kunmap_atomic(addr);
+ return;
+ }
+
+ /* signal that we are done with the range */
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(kunmap_atomic_pfn_t);

2015-06-05 21:22:07

by Dan Williams

Subject: [PATCH v4 3/9] dax: drop size parameter to ->direct_access()

None of the implementations currently use it. The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().

Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 2 +-
drivers/block/brd.c | 6 +-----
drivers/block/pmem.c | 2 +-
drivers/s390/block/dcssblk.c | 4 ++--
fs/block_dev.c | 2 +-
include/linux/blkdev.h | 2 +-
6 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ee90db17b097..e8657d3bc588 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,7 +141,7 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
*/
static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+ void **kaddr, unsigned long *pfn)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 64ab4951e9d6..41528857c70d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,

#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+ void **kaddr, unsigned long *pfn)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -384,10 +384,6 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
*kaddr = page_address(page);
*pfn = page_to_pfn(page);

- /*
- * TODO: If size > PAGE_SIZE, we could look to see if the next page in
- * the file happens to be mapped to the next page of physical RAM.
- */
return PAGE_SIZE;
}
#else
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index eabf4a8d0085..1f5b5a6288c0 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -98,7 +98,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
}

static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+ void **kaddr, unsigned long *pfn)
{
struct pmem_device *pmem = bdev->bd_disk->private_data;
size_t offset = sector << 9;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index da212813f2d5..2f1734ba0e22 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn, long size);
+ void **kaddr, unsigned long *pfn);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -879,7 +879,7 @@ fail:

static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn, long size)
+ void **kaddr, unsigned long *pfn)
{
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index f04c873a7365..19750f058495 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -453,7 +453,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
sector += get_start_sect(bdev);
if (sector % (PAGE_SIZE / 512))
return -EINVAL;
- avail = ops->direct_access(bdev, sector, addr, pfn, size);
+ avail = ops->direct_access(bdev, sector, addr, pfn);
if (!avail)
return -ERANGE;
return min(avail, size);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 31cd832cebb7..535b82f790a6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1553,7 +1553,7 @@ struct block_device_operations {
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
long (*direct_access)(struct block_device *, sector_t,
- void **, unsigned long *pfn, long size);
+ void **, unsigned long *pfn);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */

2015-06-05 21:25:08

by Dan Williams

Subject: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()

The primary source for non-page-backed page-frames to enter the system
is via the pmem driver's ->direct_access() method. The pfns returned
by the top-level bdev_direct_access() may be passed to any other
subsystem in the kernel, and those sub-systems either need to assume
that the pfn is page backed (CONFIG_DEV_PFN=n) or be prepared to handle
the non-page-backed case (CONFIG_DEV_PFN=y). Currently the pfns
returned by ->direct_access() are only ever used by vm_insert_mixed(),
which does not care if the pfn is mapped. As we go to add more usages
of these pfns, add the type-safety of __pfn_t.

This also simplifies the calling convention of ->direct_access() by not
returning the virtual address in the same call. This annotates cases
where the kernel is directly accessing pmem outside the driver, and
makes the valid lifetime of the reference explicit. This property may
be useful in the future for invalidating mappings to pmem, but for now
it provides some protection against the "pmem disable vs still-in-use"
race.

Note that axon_ram_direct_access and dcssblk_direct_access were
previously making potentially incorrect assumptions about the addresses
they passed to virt_to_phys().

Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 26 ++++++++++++-----
drivers/block/brd.c | 5 +--
drivers/block/pmem.c | 16 ++++++++---
drivers/s390/block/dcssblk.c | 26 ++++++++++++++---
fs/block_dev.c | 4 +--
fs/dax.c | 62 ++++++++++++++++++++++++++++++++---------
include/asm-generic/pfn.h | 25 +++++++++++++++++
include/linux/blkdev.h | 7 ++---
8 files changed, 132 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index e8657d3bc588..20725006592e 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,22 +139,23 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
-static long
+static __maybe_unused long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ __pfn_t *pfn)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

- *kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+ *pfn = phys_to_pfn_t(bank->ph_addr + offset);

return bank->size - offset;
}

static const struct block_device_operations axon_ram_devops = {
.owner = THIS_MODULE,
+#ifdef HAVE_KMAP_PFN
.direct_access = axon_ram_direct_access
+#endif
};

/**
@@ -165,9 +166,13 @@ static int axon_ram_probe(struct platform_device *device)
{
static int axon_ram_bank_id = -1;
struct axon_ram_bank *bank;
- struct resource resource;
+ struct resource *resource;
int rc = 0;

+ resource = devm_kzalloc(&device->dev, sizeof(*resource), GFP_KERNEL);
+ if (!resource)
+ return -ENOMEM;
+
axon_ram_bank_id++;

dev_info(&device->dev, "Found memory controller on %s\n",
@@ -184,13 +189,13 @@ static int axon_ram_probe(struct platform_device *device)

bank->device = device;

- if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0) {
+ if (of_address_to_resource(device->dev.of_node, 0, resource) != 0) {
dev_err(&device->dev, "Cannot access device tree\n");
rc = -EFAULT;
goto failed;
}

- bank->size = resource_size(&resource);
+ bank->size = resource_size(resource);

if (bank->size == 0) {
dev_err(&device->dev, "No DDR2 memory found for %s%d\n",
@@ -202,7 +207,7 @@ static int axon_ram_probe(struct platform_device *device)
dev_info(&device->dev, "Register DDR2 memory device %s%d with %luMB\n",
AXON_RAM_DEVICE_NAME, axon_ram_bank_id, bank->size >> 20);

- bank->ph_addr = resource.start;
+ bank->ph_addr = resource->start;
bank->io_addr = (unsigned long) ioremap_prot(
bank->ph_addr, bank->size, _PAGE_NO_CACHE);
if (bank->io_addr == 0) {
@@ -211,6 +216,11 @@ static int axon_ram_probe(struct platform_device *device)
goto failed;
}

+ rc = devm_register_kmap_pfn_range(&device->dev, resource,
+ (void *) bank->io_addr);
+ if (rc)
+ goto failed;
+
bank->disk = alloc_disk(AXON_RAM_MINORS_PER_DISK);
if (bank->disk == NULL) {
dev_err(&device->dev, "Cannot register disk\n");
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 41528857c70d..6c4b21a4e915 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,

#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ __pfn_t *pfn)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -381,8 +381,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
- *kaddr = page_address(page);
- *pfn = page_to_pfn(page);
+ *pfn = page_to_pfn_t(page);

return PAGE_SIZE;
}
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 1f5b5a6288c0..0cb66aa00b4b 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -23,6 +23,8 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>

#define PMEM_MINORS 16

@@ -97,8 +99,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
return 0;
}

-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static __maybe_unused long pmem_direct_access(struct block_device *bdev,
+ sector_t sector, __pfn_t *pfn)
{
struct pmem_device *pmem = bdev->bd_disk->private_data;
size_t offset = sector << 9;
@@ -106,8 +108,7 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
if (!pmem)
return -ENODEV;

- *kaddr = pmem->virt_addr + offset;
- *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+ *pfn = phys_to_pfn_t(pmem->phys_addr + offset);

return pmem->size - offset;
}
@@ -115,7 +116,9 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
static const struct block_device_operations pmem_fops = {
.owner = THIS_MODULE,
.rw_page = pmem_rw_page,
+#ifdef HAVE_KMAP_PFN
.direct_access = pmem_direct_access,
+#endif
};

static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
@@ -147,6 +150,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
if (!pmem->virt_addr)
goto out_release_region;

+ err = devm_register_kmap_pfn_range(dev, res, pmem->virt_addr);
+ if (err)
+ goto out_unmap;
+
+ err = -ENOMEM;
pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
if (!pmem->pmem_queue)
goto out_unmap;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 2f1734ba0e22..a7b9743c546f 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+ __pfn_t *pfn);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -38,7 +38,9 @@ static const struct block_device_operations dcssblk_devops = {
.owner = THIS_MODULE,
.open = dcssblk_open,
.release = dcssblk_release,
+#ifdef HAVE_KMAP_PFN
.direct_access = dcssblk_direct_access,
+#endif
};

struct dcssblk_dev_info {
@@ -520,12 +522,18 @@ static const struct attribute_group *dcssblk_dev_attr_groups[] = {
static ssize_t
dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
{
+ struct resource *res = devm_kzalloc(dev, sizeof(*res), GFP_KERNEL);
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
char *local_buf;
unsigned long seg_byte_size;

+ if (!res) {
+ rc = -ENOMEM;
+ goto out_nobuf;
+ }
+
dev_info = NULL;
seg_info = NULL;
if (dev != dcssblk_root_dev) {
@@ -652,6 +660,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
if (rc)
goto put_dev;

+ res->start = dev_info->start;
+ res->end = dev_info->end - 1;
+ rc = devm_register_kmap_pfn_range(&dev_info->dev, res,
+ (void *) dev_info->start);
+ if (rc)
+ goto put_dev;
+
get_device(&dev_info->dev);
add_disk(dev_info->gd);

@@ -699,6 +714,8 @@ seg_list_del:
out:
kfree(local_buf);
out_nobuf:
+ if (res)
+ devm_kfree(dev, res);
return rc;
}

@@ -877,9 +894,9 @@ fail:
bio_io_error(bio);
}

-static long
+static __maybe_unused long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ __pfn_t *pfn)
{
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;
@@ -889,8 +906,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
return -ENODEV;
dev_sz = dev_info->end - dev_info->start;
offset = secnum * 512;
- *kaddr = (void *) (dev_info->start + offset);
- *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+ *pfn = phys_to_pfn_t(dev_info->start + offset);

return dev_sz - offset;
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 19750f058495..0f05b33b924f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -438,7 +438,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
* accessible at this address.
*/
long bdev_direct_access(struct block_device *bdev, sector_t sector,
- void **addr, unsigned long *pfn, long size)
+ __pfn_t *pfn, long size)
{
long avail;
const struct block_device_operations *ops = bdev->bd_disk->fops;
@@ -453,7 +453,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
sector += get_start_sect(bdev);
if (sector % (PAGE_SIZE / 512))
return -EINVAL;
- avail = ops->direct_access(bdev, sector, addr, pfn);
+ avail = ops->direct_access(bdev, sector, pfn);
if (!avail)
return -ERANGE;
return min(avail, size);
diff --git a/fs/dax.c b/fs/dax.c
index 6f65f00e58ec..7d302bfff48a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -35,13 +35,16 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
might_sleep();
do {
void *addr;
- unsigned long pfn;
+ __pfn_t pfn;
long count;

- count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+ count = bdev_direct_access(bdev, sector, &pfn, size);
if (count < 0)
return count;
BUG_ON(size < count);
+ addr = kmap_atomic_pfn_t(pfn);
+ if (!addr)
+ return -EIO;
while (count > 0) {
unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
if (pgsz > count)
@@ -57,17 +60,17 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
sector += pgsz / 512;
cond_resched();
}
+ kunmap_atomic_pfn_t(addr);
} while (size);

return 0;
}
EXPORT_SYMBOL_GPL(dax_clear_blocks);

-static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
+static long dax_get_pfn(struct buffer_head *bh, __pfn_t *pfn, unsigned blkbits)
{
- unsigned long pfn;
sector_t sector = bh->b_blocknr << (blkbits - 9);
- return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+ return bdev_direct_access(bh->b_bdev, sector, pfn, bh->b_size);
}

static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
@@ -106,7 +109,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
loff_t pos = start;
loff_t max = start;
loff_t bh_max = start;
- void *addr;
+ void *addr = NULL, *kmap = NULL;
+ __pfn_t pfn;
bool hole = false;

if (iov_iter_rw(iter) != WRITE)
@@ -142,9 +146,19 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
addr = NULL;
size = bh->b_size - first;
} else {
- retval = dax_get_addr(bh, &addr, blkbits);
+ if (kmap) {
+ kunmap_atomic_pfn_t(kmap);
+ kmap = NULL;
+ }
+ retval = dax_get_pfn(bh, &pfn, blkbits);
if (retval < 0)
break;
+ kmap = kmap_atomic_pfn_t(pfn);
+ if (!kmap) {
+ retval = -EIO;
+ break;
+ }
+ addr = kmap;
if (buffer_unwritten(bh) || buffer_new(bh))
dax_new_buf(addr, retval, first, pos,
end);
@@ -168,6 +182,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
addr += len;
}

+ if (kmap)
+ kunmap_atomic_pfn_t(kmap);
+
return (pos == start) ? retval : pos - start;
}

@@ -259,11 +276,17 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
unsigned blkbits, unsigned long vaddr)
{
void *vfrom, *vto;
- if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+ __pfn_t pfn;
+
+ if (dax_get_pfn(bh, &pfn, blkbits) < 0)
+ return -EIO;
+ vfrom = kmap_atomic_pfn_t(pfn);
+ if (!vfrom)
return -EIO;
vto = kmap_atomic(to);
copy_user_page(vto, vfrom, vaddr, to);
kunmap_atomic(vto);
+ kunmap_atomic_pfn_t(vfrom);
return 0;
}

@@ -274,7 +297,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
unsigned long vaddr = (unsigned long)vmf->virtual_address;
void *addr;
- unsigned long pfn;
+ __pfn_t pfn;
pgoff_t size;
int error;

@@ -293,7 +316,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
goto out;
}

- error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
+ error = bdev_direct_access(bh->b_bdev, sector, &pfn, bh->b_size);
if (error < 0)
goto out;
if (error < PAGE_SIZE) {
@@ -301,10 +324,17 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
goto out;
}

- if (buffer_unwritten(bh) || buffer_new(bh))
+ if (buffer_unwritten(bh) || buffer_new(bh)) {
+ addr = kmap_atomic_pfn_t(pfn);
+ if (!addr) {
+ error = -EIO;
+ goto out;
+ }
clear_page(addr);
+ kunmap_atomic_pfn_t(addr);
+ }

- error = vm_insert_mixed(vma, vaddr, pfn);
+ error = vm_insert_mixed(vma, vaddr, __pfn_t_to_pfn(pfn));

out:
i_mmap_unlock_read(mapping);
@@ -517,10 +547,16 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
return err;
if (buffer_written(&bh)) {
void *addr;
- err = dax_get_addr(&bh, &addr, inode->i_blkbits);
+ __pfn_t pfn;
+
+ err = dax_get_pfn(&bh, &pfn, inode->i_blkbits);
if (err < 0)
return err;
+ addr = kmap_atomic_pfn_t(pfn);
+ if (!addr)
+ return -EIO;
memset(addr + offset, 0, length);
+ kunmap_atomic_pfn_t(addr);
}

return 0;
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
index 2f4ae40dc6a7..e9fed20d606a 100644
--- a/include/asm-generic/pfn.h
+++ b/include/asm-generic/pfn.h
@@ -49,7 +49,32 @@ static inline bool __pfn_t_has_page(__pfn_t pfn)
return (pfn.data & PFN_MASK) == 0;
}

+static inline __pfn_t pfn_to_pfn_t(unsigned long pfn)
+{
+ __pfn_t pfn_t = { .data = (pfn << PFN_SHIFT) | PFN_DEV };
+
+ return pfn_t;
+}
+
+static inline __pfn_t phys_to_pfn_t(dma_addr_t addr)
+{
+ return pfn_to_pfn_t(addr >> PAGE_SHIFT);
+}
#else
+static inline __pfn_t phys_to_pfn_t(dma_addr_t addr)
+{
+ __pfn_t pfn_t = { .data = 0 };
+
+ /*
+ * With CONFIG_DEV_PFN=n the kernel expects every __pfn_t to
+ * have a corresponding entry in the kernel linear address map
+ * (struct page).
+ */
+ BUG();
+
+ return pfn_t;
+}
+
static inline bool __pfn_t_has_page(__pfn_t pfn)
{
return true;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 535b82f790a6..35e404d598b2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1552,8 +1552,7 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- long (*direct_access)(struct block_device *, sector_t,
- void **, unsigned long *pfn);
+ long (*direct_access)(struct block_device *, sector_t, __pfn_t *pfn);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1571,8 +1570,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
-extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
- unsigned long *pfn, long size);
+extern long bdev_direct_access(struct block_device *, sector_t,
+ __pfn_t *pfn, long size);
#else /* CONFIG_BLOCK */

struct block_device;
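The map/check/use/unmap bracketing that the dax conversion above adopts can be sketched in miniature in user space. Everything below (mock_kmap_pfn, mock_pmem, mock_zero_range) is an illustrative stand-in, not the kernel API: the point is only that kmap_atomic_pfn_t() may fail once the backing device is gone, so every caller must check for NULL and pair each map with an unmap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for a pmem region and its liveness; names are mocks. */
static char mock_pmem[4096];
static int mock_device_alive = 1;

/* mock of kmap_atomic_pfn_t(): returns NULL if the device went away */
static void *mock_kmap_pfn(unsigned long pfn)
{
	(void)pfn;
	return mock_device_alive ? mock_pmem : NULL;
}

/* mock of kunmap_atomic_pfn_t() */
static void mock_kunmap_pfn(void *addr)
{
	(void)addr;
}

/* mirrors the dax_zero_page_range() flow: map, check, memset, unmap */
static int mock_zero_range(unsigned long pfn, size_t offset, size_t len)
{
	void *addr = mock_kmap_pfn(pfn);

	if (!addr)
		return -5;	/* stands in for -EIO */
	memset((char *)addr + offset, 0, len);
	mock_kunmap_pfn(addr);
	return 0;
}
```

The NULL check is the new obligation this patch introduces; with the old void **addr interface the address was handed out unconditionally.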

2015-06-05 21:22:16

by Dan Williams

Subject: [PATCH v4 5/9] dma-mapping: allow archs to optionally specify a ->map_pfn() operation

Initially this enables struct scatterlist to carry a __pfn_t instead of a
struct page * reference. Later it enables block device drivers to perform
DMA to/from persistent memory, which may not have a backing struct page
entry.
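The essential property of a ->map_pfn() operation can be sketched in user space. The types and names below (pfn_t_mock, MOCK_PFN_SHIFT, mock_map_pfn) are illustrative stand-ins for the kernel's __pfn_t machinery, not the actual implementation: a pfn-based mapping derives the bus address purely from the pfn value, with no struct page lookup anywhere on the path.

```c
#include <assert.h>
#include <stdint.h>

/* Mock of a pfn carrier with flag bits stolen from the low bits,
 * in the spirit of __pfn_t. All names here are hypothetical. */
typedef struct { unsigned long data; } pfn_t_mock;
typedef uint64_t dma_addr_mock;

#define MOCK_PAGE_SHIFT 12	/* 4K pages */
#define MOCK_PFN_SHIFT  2	/* low two bits reserved for flags */

/* The essence of a ->map_pfn() implementation: compute the physical/bus
 * address directly from the pfn plus the intra-page offset. */
static dma_addr_mock mock_map_pfn(pfn_t_mock pfn, unsigned long offset)
{
	unsigned long raw = pfn.data >> MOCK_PFN_SHIFT;

	return ((dma_addr_mock)raw << MOCK_PAGE_SHIFT) + offset;
}
```

Contrast with ->map_page(), which starts from a struct page * and must call page_to_phys() — exactly the dependency that page-less device memory cannot satisfy.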

Signed-off-by: Dan Williams <[email protected]>
---
arch/Kconfig | 3 +++
include/asm-generic/dma-mapping-common.h | 30 ++++++++++++++++++++++++++++++
include/linux/dma-debug.h | 23 +++++++++++++++++++----
include/linux/dma-mapping.h | 8 +++++++-
lib/dma-debug.c | 10 ++++++----
5 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 9b98732e0f3b..69d3a3fa21af 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
config HAVE_DMA_CONTIGUOUS
bool

+config HAVE_DMA_PFN
+ bool
+
config HAVE_KMAP_PFN
bool

diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 940d5ec122c9..7305efb1bac6 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -17,9 +17,15 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,

kmemcheck_mark_initialized(ptr, size);
BUG_ON(!valid_dma_direction(dir));
+#ifdef CONFIG_HAVE_DMA_PFN
+ addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+ (unsigned long)ptr & ~PAGE_MASK, size,
+ dir, attrs);
+#else
addr = ops->map_page(dev, virt_to_page(ptr),
(unsigned long)ptr & ~PAGE_MASK, size,
dir, attrs);
+#endif
debug_dma_map_page(dev, virt_to_page(ptr),
(unsigned long)ptr & ~PAGE_MASK, size,
dir, addr, true);
@@ -73,6 +79,29 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
ops->unmap_sg(dev, sg, nents, dir, attrs);
}

+#ifdef CONFIG_HAVE_DMA_PFN
+static inline dma_addr_t dma_map_pfn(struct device *dev, __pfn_t pfn,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+ struct dma_map_ops *ops = get_dma_ops(dev);
+ dma_addr_t addr;
+
+ BUG_ON(!valid_dma_direction(dir));
+ addr = ops->map_pfn(dev, pfn, offset, size, dir, NULL);
+ debug_dma_map_pfn(dev, pfn, offset, size, dir, addr, false);
+
+ return addr;
+}
+
+static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+ kmemcheck_mark_initialized(page_address(page) + offset, size);
+ return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+}
+#else
static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
size_t offset, size_t size,
enum dma_data_direction dir)
@@ -87,6 +116,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,

return addr;
}
+#endif /* CONFIG_HAVE_DMA_PFN */

static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir)
diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fe8cb610deac..a3b4c8c0cd68 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -34,10 +34,18 @@ extern void dma_debug_init(u32 num_entries);

extern int dma_debug_resize_entries(u32 num_entries);

-extern void debug_dma_map_page(struct device *dev, struct page *page,
- size_t offset, size_t size,
- int direction, dma_addr_t dma_addr,
- bool map_single);
+extern void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
+ size_t size, int direction, dma_addr_t dma_addr,
+ bool map_single);
+
+static inline void debug_dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ int direction, dma_addr_t dma_addr,
+ bool map_single)
+{
+ return debug_dma_map_pfn(dev, page_to_pfn_t(page), offset, size,
+ direction, dma_addr, map_single);
+}

extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);

@@ -109,6 +117,13 @@ static inline void debug_dma_map_page(struct device *dev, struct page *page,
{
}

+static inline void debug_dma_map_pfn(struct device *dev, __pfn_t pfn,
+ size_t offset, size_t size,
+ int direction, dma_addr_t dma_addr,
+ bool map_single)
+{
+}
+
static inline void debug_dma_mapping_error(struct device *dev,
dma_addr_t dma_addr)
{
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ac07ff090919..d6437b493300 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -26,11 +26,17 @@ struct dma_map_ops {

int (*get_sgtable)(struct device *dev, struct sg_table *sgt, void *,
dma_addr_t, size_t, struct dma_attrs *attrs);
-
+#ifdef CONFIG_HAVE_DMA_PFN
+ dma_addr_t (*map_pfn)(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs);
+#else
dma_addr_t (*map_page)(struct device *dev, struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction dir,
struct dma_attrs *attrs);
+#endif
void (*unmap_page)(struct device *dev, dma_addr_t dma_handle,
size_t size, enum dma_data_direction dir,
struct dma_attrs *attrs);
diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index ae4b65e17e64..c24de1cd8f81 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -1250,11 +1250,12 @@ out:
put_hash_bucket(bucket, &flags);
}

-void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
+void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
size_t size, int direction, dma_addr_t dma_addr,
bool map_single)
{
struct dma_debug_entry *entry;
+ struct page *page;

if (unlikely(dma_debug_disabled()))
return;
@@ -1268,7 +1269,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,

entry->dev = dev;
entry->type = dma_debug_page;
- entry->pfn = page_to_pfn(page);
+ entry->pfn = __pfn_t_to_pfn(pfn);
entry->offset = offset,
entry->dev_addr = dma_addr;
entry->size = size;
@@ -1278,7 +1279,8 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
if (map_single)
entry->type = dma_debug_single;

- if (!PageHighMem(page)) {
+ page = __pfn_t_to_page(pfn);
+ if (page && !PageHighMem(page)) {
void *addr = page_address(page) + offset;

check_for_stack(dev, addr);
@@ -1287,7 +1289,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,

add_dma_entry(entry);
}
-EXPORT_SYMBOL(debug_dma_map_page);
+EXPORT_SYMBOL(debug_dma_map_pfn);

void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
{

2015-06-05 21:22:32

by Dan Williams

Subject: [PATCH v4 6/9] scatterlist: use sg_phys()

Coccinelle cleanup to replace open-coded scatterlist-to-physical-address
translations. This is in preparation for introducing scatterlists that
reference __pfn_t.

// sg_phys.cocci: convert usage of page_to_phys(sg_page(sg)) to sg_phys(sg)
// usage: make coccicheck COCCI=sg_phys.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg)) + sg->offset
+ sg_phys(sg)

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg))
+ sg_phys(sg) - sg->offset
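The two rewrite rules rest on one identity, which a toy model makes explicit. The struct and names below (sg_mock, mock_sg_phys) are illustrative stand-ins: sg_phys() is the page's physical address plus the entry offset, so page_to_phys(sg_page(sg)) + sg->offset becomes sg_phys(sg) directly, while a bare page_to_phys(sg_page(sg)) must have the offset subtracted back out.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a scatterlist entry; page_phys stands in for
 * page_to_phys(sg_page(sg)). Names are hypothetical. */
struct sg_mock {
	uint64_t page_phys;
	unsigned int offset;
};

/* mirrors sg_phys(): physical address of the start of the sg data */
static uint64_t mock_sg_phys(const struct sg_mock *sg)
{
	return sg->page_phys + sg->offset;
}
```

Both rewrite directions of the semantic patch are then just the two rearrangements of this one equation.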

Cc: Julia Lawall <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/mm/dma-mapping.c | 2 +-
arch/microblaze/kernel/dma.c | 2 +-
drivers/iommu/intel-iommu.c | 4 ++--
drivers/iommu/iommu.c | 2 +-
drivers/staging/android/ion/ion_chunk_heap.c | 4 ++--
5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 09c5fe3d30c2..43cc6a8fdacc 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1502,7 +1502,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
return -ENOMEM;

for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
- phys_addr_t phys = page_to_phys(sg_page(s));
+ phys_addr_t phys = sg_phys(s) - s->offset;
unsigned int len = PAGE_ALIGN(s->offset + s->length);

if (!is_coherent &&
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ed7ba8a11822..dcb3c594d626 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
/* FIXME this part of code is untested */
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg);
- __dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
+ __dma_sync(sg_phys(sg),
sg->length, direction);
}

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 68d43beccb7e..9b9ada71e0d3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1998,7 +1998,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
sg_res = aligned_nrpages(sg->offset, sg->length);
sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
sg->dma_length = sg->length;
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (sg_phys(sg) - sg->offset) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}

@@ -3302,7 +3302,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,

for_each_sg(sglist, sg, nelems, i) {
BUG_ON(!sg_page(sg));
- sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
+ sg->dma_address = sg_phys(sg);
sg->dma_length = sg->length;
}
return nelems;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d4f527e56679..59808fc9110d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1147,7 +1147,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);

for_each_sg(sg, s, nents, i) {
- phys_addr_t phys = page_to_phys(sg_page(s)) + s->offset;
+ phys_addr_t phys = sg_phys(s);

/*
* We are mapping on IOMMU page boundaries, so offset within
diff --git a/drivers/staging/android/ion/ion_chunk_heap.c b/drivers/staging/android/ion/ion_chunk_heap.c
index 3e6ec2ee6802..b7da5d142aa9 100644
--- a/drivers/staging/android/ion/ion_chunk_heap.c
+++ b/drivers/staging/android/ion/ion_chunk_heap.c
@@ -81,7 +81,7 @@ static int ion_chunk_heap_allocate(struct ion_heap *heap,
err:
sg = table->sgl;
for (i -= 1; i >= 0; i--) {
- gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+ gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
sg->length);
sg = sg_next(sg);
}
@@ -109,7 +109,7 @@ static void ion_chunk_heap_free(struct ion_buffer *buffer)
DMA_BIDIRECTIONAL);

for_each_sg(table->sgl, sg, table->nents, i) {
- gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+ gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
sg->length);
}
chunk_heap->allocated -= allocated_size;

2015-06-05 21:24:56

by Dan Williams

Subject: [PATCH v4 7/9] scatterlist: cleanup sg_chain() and sg_unmark_end()

Replace open-coded sg_chain() and sg_unmark_end() instances with the
existing helpers.
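What the helpers encapsulate is the low-bit stealing on page_link — bit 0 marks a chain entry, bit 1 marks the last entry. A minimal sketch, using mock names (mock_sg, mock_sg_chain, mock_sg_unmark_end) rather than the kernel's, of the bit manipulation that the open-coded call sites were duplicating:

```c
#include <assert.h>

/* The two stolen low bits, as in the kernel's scatterlist encoding. */
#define MOCK_SG_CHAIN 0x01ul
#define MOCK_SG_LAST  0x02ul

struct mock_sg { unsigned long page_link; };

/* mirrors sg_chain(): store the next-table pointer in the final entry,
 * set the chain bit, and clear a stray termination bit */
static void mock_sg_chain(struct mock_sg *prv, unsigned int prv_nents,
			  struct mock_sg *sgl)
{
	prv[prv_nents - 1].page_link =
		((unsigned long)sgl | MOCK_SG_CHAIN) & ~MOCK_SG_LAST;
}

/* mirrors sg_unmark_end(): clear only the termination bit */
static void mock_sg_unmark_end(struct mock_sg *sg)
{
	sg->page_link &= ~MOCK_SG_LAST;
}
```

Call sites such as `(__sg++)->page_link &= ~0x02;` were open-coding exactly the unmark-end line above, which is why the conversion is mechanical.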

Signed-off-by: Dan Williams <[email protected]>
---
block/blk-merge.c | 2 +-
drivers/crypto/omap-sham.c | 2 +-
drivers/dma/imx-dma.c | 8 ++------
drivers/dma/ste_dma40.c | 5 +----
drivers/mmc/card/queue.c | 4 ++--
include/crypto/scatterwalk.h | 9 ++-------
6 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 30a0d9f89017..f4a3e87623dc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -266,7 +266,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
if (rq->cmd_flags & REQ_WRITE)
memset(q->dma_drain_buffer, 0, q->dma_drain_size);

- sg->page_link &= ~0x02;
+ sg_unmark_end(sg);
sg = sg_next(sg);
sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
q->dma_drain_size,
diff --git a/drivers/crypto/omap-sham.c b/drivers/crypto/omap-sham.c
index 4d63e0d4da9a..df8b23e19b90 100644
--- a/drivers/crypto/omap-sham.c
+++ b/drivers/crypto/omap-sham.c
@@ -582,7 +582,7 @@ static int omap_sham_xmit_dma(struct omap_sham_dev *dd, dma_addr_t dma_addr,
* the dmaengine may try to DMA the incorrect amount of data.
*/
sg_init_table(&ctx->sgl, 1);
- ctx->sgl.page_link = ctx->sg->page_link;
+ sg_assign_page(&ctx->sgl, sg_page(ctx->sg));
ctx->sgl.offset = ctx->sg->offset;
sg_dma_len(&ctx->sgl) = len32;
sg_dma_address(&ctx->sgl) = sg_dma_address(ctx->sg);
diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
index eed405976ea9..081fbfc87f6b 100644
--- a/drivers/dma/imx-dma.c
+++ b/drivers/dma/imx-dma.c
@@ -886,18 +886,14 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_cyclic(
sg_init_table(imxdmac->sg_list, periods);

for (i = 0; i < periods; i++) {
- imxdmac->sg_list[i].page_link = 0;
- imxdmac->sg_list[i].offset = 0;
+ sg_set_page(&imxdmac->sg_list[i], NULL, period_len, 0);
imxdmac->sg_list[i].dma_address = dma_addr;
sg_dma_len(&imxdmac->sg_list[i]) = period_len;
dma_addr += period_len;
}

/* close the loop */
- imxdmac->sg_list[periods].offset = 0;
- sg_dma_len(&imxdmac->sg_list[periods]) = 0;
- imxdmac->sg_list[periods].page_link =
- ((unsigned long)imxdmac->sg_list | 0x01) & ~0x02;
+ sg_chain(imxdmac->sg_list, periods + 1, imxdmac->sg_list);

desc->type = IMXDMA_DESC_CYCLIC;
desc->sg = imxdmac->sg_list;
diff --git a/drivers/dma/ste_dma40.c b/drivers/dma/ste_dma40.c
index 3c10f034d4b9..e8c00642cacb 100644
--- a/drivers/dma/ste_dma40.c
+++ b/drivers/dma/ste_dma40.c
@@ -2562,10 +2562,7 @@ dma40_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t dma_addr,
dma_addr += period_len;
}

- sg[periods].offset = 0;
- sg_dma_len(&sg[periods]) = 0;
- sg[periods].page_link =
- ((unsigned long)sg | 0x01) & ~0x02;
+ sg_chain(sg, periods + 1, sg);

txd = d40_prep_sg(chan, sg, sg, periods, direction,
DMA_PREP_INTERRUPT);
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 236d194c2883..127f76294e71 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -469,7 +469,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
sg_set_buf(__sg, buf + offset, len);
offset += len;
remain -= len;
- (__sg++)->page_link &= ~0x02;
+ sg_unmark_end(__sg++);
sg_len++;
} while (remain);
}
@@ -477,7 +477,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
list_for_each_entry(req, &packed->list, queuelist) {
sg_len += blk_rq_map_sg(mq->queue, req, __sg);
__sg = sg + (sg_len - 1);
- (__sg++)->page_link &= ~0x02;
+ sg_unmark_end(__sg++);
}
sg_mark_end(sg + (sg_len - 1));
return sg_len;
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 20e4226a2e14..4529889b0f07 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -25,13 +25,8 @@
#include <linux/scatterlist.h>
#include <linux/sched.h>

-static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
- struct scatterlist *sg2)
-{
- sg_set_page(&sg1[num - 1], (void *)sg2, 0, 0);
- sg1[num - 1].page_link &= ~0x02;
- sg1[num - 1].page_link |= 0x01;
-}
+#define scatterwalk_sg_chain(prv, num, sgl) sg_chain(prv, num, sgl)
+#define scatterwalk_sg_next(sgl) sg_next(sgl)

static inline void scatterwalk_crypto_chain(struct scatterlist *head,
struct scatterlist *sg,

2015-06-05 21:22:44

by Dan Williams

Subject: [PATCH v4 8/9] scatterlist: convert to __pfn_t

__pfn_t replaces the struct page reference in struct scatterlist. Since
__pfn_t implements the same bits, at the same bit positions, for denoting
sg_is_chain() and sg_is_last(), this conversion is binary-identical to the
previous encoding.
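The binary-identical claim can be checked in a toy model: whether the word holds a struct page pointer (old page_link) or a __pfn_t data value (new encoding), marking an entry as last performs the same operation on the same two low bits. The names below are illustrative, not the kernel's.

```c
#include <assert.h>

#define MOCK_SG_CHAIN 0x01ul
#define MOCK_SG_LAST  0x02ul

/* old-style: page_link word holding a struct page pointer + flags */
static unsigned long old_mark_end(unsigned long page_link)
{
	return (page_link | MOCK_SG_LAST) & ~MOCK_SG_CHAIN;
}

/* new-style: __pfn_t data word with the same two stolen low bits */
static unsigned long new_mark_end(unsigned long pfn_data)
{
	return (pfn_data | MOCK_SG_LAST) & ~MOCK_SG_CHAIN;
}
```

Because the flag bits occupy identical positions, a memory dump of a marked scatterlist entry is unchanged by the conversion; only the interpretation of the upper bits (page pointer vs. pfn) differs.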

Signed-off-by: Dan Williams <[email protected]>
---
include/asm-generic/pfn.h | 9 ++++
include/linux/scatterlist.h | 103 ++++++++++++++++++++++++++++++-------------
samples/kfifo/dma-example.c | 8 ++-
3 files changed, 86 insertions(+), 34 deletions(-)

diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
index e9fed20d606a..f826c50ed025 100644
--- a/include/asm-generic/pfn.h
+++ b/include/asm-generic/pfn.h
@@ -108,4 +108,13 @@ static inline __pfn_t page_to_pfn_t(struct page *page)

return pfn;
}
+
+static inline __pfn_t nth_pfn(__pfn_t pfn, unsigned int n)
+{
+ __pfn_t ret;
+
+ ret.data = (__pfn_t_to_pfn(pfn) + n) << PFN_SHIFT
+ | (pfn.data & PFN_MASK);
+ return ret;
+}
#endif /* __ASM_PFN_H */
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index eca1ec93775c..49054374646e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -11,7 +11,7 @@ struct scatterlist {
#ifdef CONFIG_DEBUG_SG
unsigned long sg_magic;
#endif
- unsigned long page_link;
+ __pfn_t pfn;
unsigned int offset;
unsigned int length;
dma_addr_t dma_address;
@@ -44,14 +44,14 @@ struct sg_table {
/*
* Notes on SG table design.
*
- * We use the unsigned long page_link field in the scatterlist struct to place
+ * We use the __pfn_t pfn field in the scatterlist struct to place
* the page pointer AND encode information about the sg table as well. The two
* lower bits are reserved for this information.
*
- * If bit 0 is set, then the page_link contains a pointer to the next sg
+ * If PFN_SG_CHAIN is set, then the pfn contains a pointer to the next sg
* table list. Otherwise the next entry is at sg + 1.
*
- * If bit 1 is set, then this sg entry is the last element in a list.
+ * If PFN_SG_LAST is set, then this sg entry is the last element in a list.
*
* See sg_next().
*
@@ -64,10 +64,30 @@ struct sg_table {
* a valid sg entry, or whether it points to the start of a new scatterlist.
* Those low bits are there for everyone! (thanks mason :-)
*/
-#define sg_is_chain(sg) ((sg)->page_link & 0x01)
-#define sg_is_last(sg) ((sg)->page_link & 0x02)
-#define sg_chain_ptr(sg) \
- ((struct scatterlist *) ((sg)->page_link & ~0x03))
+static inline bool sg_is_chain(struct scatterlist *sg)
+{
+ return (sg->pfn.data & PFN_MASK) == PFN_SG_CHAIN;
+}
+
+static inline bool sg_is_last(struct scatterlist *sg)
+{
+ return (sg->pfn.data & PFN_MASK) == PFN_SG_LAST;
+}
+
+static inline struct scatterlist *sg_chain_ptr(struct scatterlist *sg)
+{
+ return (struct scatterlist *) (sg->pfn.data & ~PFN_MASK);
+}
+
+static inline void sg_assign_pfn(struct scatterlist *sg, __pfn_t pfn)
+{
+#ifdef CONFIG_DEBUG_SG
+ BUG_ON(sg->sg_magic != SG_MAGIC);
+ BUG_ON(sg_is_chain(sg));
+#endif
+ pfn.data &= ~PFN_SG_LAST;
+ sg->pfn.data = (sg->pfn.data & PFN_SG_LAST) | pfn.data;
+}

/**
* sg_assign_page - Assign a given page to an SG entry
@@ -81,18 +101,23 @@ struct sg_table {
**/
static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
- unsigned long page_link = sg->page_link & 0x3;
+ __pfn_t pfn = page_to_pfn_t(page);

/*
* In order for the low bit stealing approach to work, pages
- * must be aligned at a 32-bit boundary as a minimum.
+ * must be aligned at a sizeof(unsigned long) boundary.
*/
- BUG_ON((unsigned long) page & 0x03);
-#ifdef CONFIG_DEBUG_SG
- BUG_ON(sg->sg_magic != SG_MAGIC);
- BUG_ON(sg_is_chain(sg));
-#endif
- sg->page_link = page_link | (unsigned long) page;
+ BUG_ON(pfn.data & PFN_MASK);
+
+ sg_assign_pfn(sg, pfn);
+}
+
+static inline void sg_set_pfn(struct scatterlist *sg, __pfn_t pfn,
+ unsigned int len, unsigned int offset)
+{
+ sg_assign_pfn(sg, pfn);
+ sg->offset = offset;
+ sg->length = len;
}

/**
@@ -112,18 +137,30 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
unsigned int len, unsigned int offset)
{
- sg_assign_page(sg, page);
- sg->offset = offset;
- sg->length = len;
+ sg_set_pfn(sg, page_to_pfn_t(page), len, offset);
}

static inline struct page *sg_page(struct scatterlist *sg)
{
+ __pfn_t pfn = sg->pfn;
+ struct page *page;
+
#ifdef CONFIG_DEBUG_SG
BUG_ON(sg->sg_magic != SG_MAGIC);
BUG_ON(sg_is_chain(sg));
#endif
- return (struct page *)((sg)->page_link & ~0x3);
+
+ pfn.data &= ~PFN_SG_LAST;
+ page = __pfn_t_to_page(pfn);
+
+ /* don't use sg_page() on non linear-mapped memory */
+ BUG_ON(!page);
+ return page;
+}
+
+static inline unsigned long sg_pfn(struct scatterlist *sg)
+{
+ return __pfn_t_to_pfn(sg->pfn);
}

/**
@@ -175,7 +212,8 @@ static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
* Set lowest bit to indicate a link pointer, and make sure to clear
* the termination bit if it happens to be set.
*/
- prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
+ prv[prv_nents - 1].pfn.data = ((unsigned long) sgl | PFN_SG_CHAIN)
+ & ~PFN_SG_LAST;
}

/**
@@ -195,8 +233,8 @@ static inline void sg_mark_end(struct scatterlist *sg)
/*
* Set termination bit, clear potential chain bit
*/
- sg->page_link |= 0x02;
- sg->page_link &= ~0x01;
+ sg->pfn.data |= PFN_SG_LAST;
+ sg->pfn.data &= ~PFN_SG_CHAIN;
}

/**
@@ -212,7 +250,7 @@ static inline void sg_unmark_end(struct scatterlist *sg)
#ifdef CONFIG_DEBUG_SG
BUG_ON(sg->sg_magic != SG_MAGIC);
#endif
- sg->page_link &= ~0x02;
+ sg->pfn.data &= ~PFN_SG_LAST;
}

/**
@@ -220,14 +258,13 @@ static inline void sg_unmark_end(struct scatterlist *sg)
* @sg: SG entry
*
* Description:
- * This calls page_to_phys() on the page in this sg entry, and adds the
- * sg offset. The caller must know that it is legal to call page_to_phys()
- * on the sg page.
+ * This calls __pfn_t_to_phys() on the pfn in this sg entry, and adds the
+ * sg offset.
*
**/
static inline dma_addr_t sg_phys(struct scatterlist *sg)
{
- return page_to_phys(sg_page(sg)) + sg->offset;
+ return __pfn_t_to_phys(sg->pfn) + sg->offset;
}

/**
@@ -281,7 +318,7 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
#define SG_MAX_SINGLE_ALLOC (PAGE_SIZE / sizeof(struct scatterlist))

/*
- * sg page iterator
+ * sg page / pfn iterator
*
* Iterates over sg entries page-by-page. On each successful iteration,
* you can call sg_page_iter_page(@piter) and sg_page_iter_dma_address(@piter)
@@ -304,13 +341,19 @@ bool __sg_page_iter_next(struct sg_page_iter *piter);
void __sg_page_iter_start(struct sg_page_iter *piter,
struct scatterlist *sglist, unsigned int nents,
unsigned long pgoffset);
+
+static inline __pfn_t sg_page_iter_pfn(struct sg_page_iter *piter)
+{
+ return nth_pfn(piter->sg->pfn, piter->sg_pgoffset);
+}
+
/**
* sg_page_iter_page - get the current page held by the page iterator
* @piter: page iterator holding the page
*/
static inline struct page *sg_page_iter_page(struct sg_page_iter *piter)
{
- return nth_page(sg_page(piter->sg), piter->sg_pgoffset);
+ return __pfn_t_to_page(sg_page_iter_pfn(piter));
}

/**
diff --git a/samples/kfifo/dma-example.c b/samples/kfifo/dma-example.c
index aa243db93f01..3eeff9a56e0e 100644
--- a/samples/kfifo/dma-example.c
+++ b/samples/kfifo/dma-example.c
@@ -75,8 +75,8 @@ static int __init example_init(void)
for (i = 0; i < nents; i++) {
printk(KERN_INFO
"sg[%d] -> "
- "page_link 0x%.8lx offset 0x%.8x length 0x%.8x\n",
- i, sg[i].page_link, sg[i].offset, sg[i].length);
+ "pfn_data 0x%.8lx offset 0x%.8x length 0x%.8x\n",
+ i, sg[i].pfn.data, sg[i].offset, sg[i].length);

if (sg_is_last(&sg[i]))
break;
@@ -104,8 +104,8 @@ static int __init example_init(void)
for (i = 0; i < nents; i++) {
printk(KERN_INFO
"sg[%d] -> "
- "page_link 0x%.8lx offset 0x%.8x length 0x%.8x\n",
- i, sg[i].page_link, sg[i].offset, sg[i].length);
+ "pfn_data 0x%.8lx offset 0x%.8x length 0x%.8x\n",
+ i, sg[i].pfn.data, sg[i].offset, sg[i].length);

if (sg_is_last(&sg[i]))
break;

2015-06-05 21:23:30

by Dan Williams

Subject: [PATCH v4 9/9] x86: convert dma_map_ops to support mapping a __pfn_t.

As long as a dma_map_sg() implementation avoids sg_page() conversions, it
can support scatterlists that carry "page-less" __pfn_t entries.
However, a couple of implementations require that __pfn_t_has_page() is
always true. The Xen swiotlb implementation's entanglement with ARM and
the Calgary MMU's requirement for a pre-existing virtual mapping make
them unable to support this conversion (i.e. these now have 'depends on
!HAVE_DMA_PFN').
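The constraint can be sketched as a gate: a backend that needs a kernel virtual mapping (and therefore a struct page) must reject page-less pfns, which is why such backends are excluded at Kconfig time rather than failing at map time. All names below (pfn_t_mock, MOCK_PFN_DEV, mock_map_needs_page) are illustrative mocks.

```c
#include <assert.h>

typedef struct { unsigned long data; } pfn_t_mock;

/* hypothetical "device memory, no struct page" flag bit */
#define MOCK_PFN_DEV 0x04ul

/* mirrors __pfn_t_has_page(): device memory has no struct page */
static int mock_pfn_has_page(pfn_t_mock pfn)
{
	return !(pfn.data & MOCK_PFN_DEV);
}

/* returns 0 on success, -1 if the backend would need a struct page
 * (e.g. Calgary's pre-existing virtual mapping requirement) */
static int mock_map_needs_page(pfn_t_mock pfn, int backend_needs_page)
{
	if (backend_needs_page && !mock_pfn_has_page(pfn))
		return -1;
	return 0;
}
```

Backends that compute the bus address from the pfn alone pass `backend_needs_page = 0` in this model and are unaffected by the conversion.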

Cc: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/Kconfig | 2 ++
arch/x86/kernel/amd_gart_64.c | 22 +++++++++++++++++-----
arch/x86/kernel/pci-nommu.c | 22 +++++++++++++++++-----
arch/x86/kernel/pci-swiotlb.c | 4 ++++
arch/x86/pci/sta2x11-fixup.c | 4 ++++
drivers/iommu/amd_iommu.c | 21 ++++++++++++++++-----
drivers/iommu/intel-iommu.c | 22 +++++++++++++++++-----
drivers/pci/Kconfig | 2 +-
include/asm-generic/dma-mapping-common.h | 4 ++--
include/linux/scatterlist.h | 4 ++--
include/linux/swiotlb.h | 4 ++++
lib/swiotlb.c | 20 +++++++++++++++-----
12 files changed, 101 insertions(+), 30 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 130d1a4c2efc..2fd7690ed0e2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -796,6 +796,7 @@ config CALGARY_IOMMU
bool "IBM Calgary IOMMU support"
select SWIOTLB
depends on X86_64 && PCI
+ depends on !HAVE_DMA_PFN
---help---
Support for hardware IOMMUs in IBM's xSeries x366 and x460
systems. Needed to run systems with more than 3GB of memory
@@ -1436,6 +1437,7 @@ config X86_PMEM_DMA
depends on !HIGHMEM
def_bool DEV_PFN
select HAVE_KMAP_PFN
+ select HAVE_DMA_PFN

config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 8e3842fc8bea..8fad83c8dfd2 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -239,13 +239,13 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
}

/* Map a single area into the IOMMU */
-static dma_addr_t gart_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t gart_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
unsigned long bus;
- phys_addr_t paddr = page_to_phys(page) + offset;
+ phys_addr_t paddr = __pfn_t_to_phys(pfn) + offset;

if (!dev)
dev = &x86_dma_fallback_dev;
@@ -259,6 +259,14 @@ static dma_addr_t gart_map_page(struct device *dev, struct page *page,
return bus;
}

+static __maybe_unused dma_addr_t gart_map_page(struct device *dev,
+ struct page *page, unsigned long offset, size_t size,
+ enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+ return gart_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
+
/*
* Free a DMA mapping.
*/
@@ -699,7 +707,11 @@ static __init int init_amd_gatt(struct agp_kern_info *info)
static struct dma_map_ops gart_dma_ops = {
.map_sg = gart_map_sg,
.unmap_sg = gart_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = gart_map_pfn,
+#else
.map_page = gart_map_page,
+#endif
.unmap_page = gart_unmap_page,
.alloc = gart_alloc_coherent,
.free = gart_free_coherent,
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index da15918d1c81..876dacfbabf6 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -25,12 +25,12 @@ check_addr(char *name, struct device *hwdev, dma_addr_t bus, size_t size)
return 1;
}

-static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t nommu_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
- dma_addr_t bus = page_to_phys(page) + offset;
+ dma_addr_t bus = __pfn_t_to_phys(pfn) + offset;
WARN_ON(size == 0);
if (!check_addr("map_single", dev, bus, size))
return DMA_ERROR_CODE;
@@ -38,6 +38,14 @@ static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
return bus;
}

+static __maybe_unused dma_addr_t nommu_map_page(struct device *dev,
+ struct page *page, unsigned long offset, size_t size,
+ enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+ return nommu_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
+
/* Map a set of buffers described by scatterlist in streaming
* mode for DMA. This is the scatter-gather version of the
* above pci_map_single interface. Here the scatter gather list
@@ -92,7 +100,11 @@ struct dma_map_ops nommu_dma_ops = {
.alloc = dma_generic_alloc_coherent,
.free = dma_generic_free_coherent,
.map_sg = nommu_map_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = nommu_map_pfn,
+#else
.map_page = nommu_map_page,
+#endif
.sync_single_for_device = nommu_sync_single_for_device,
.sync_sg_for_device = nommu_sync_sg_for_device,
.is_phys = 1,
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 77dd0ad58be4..5351eb8c8f7f 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -48,7 +48,11 @@ static struct dma_map_ops swiotlb_dma_ops = {
.sync_sg_for_device = swiotlb_sync_sg_for_device,
.map_sg = swiotlb_map_sg_attrs,
.unmap_sg = swiotlb_unmap_sg_attrs,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = swiotlb_map_pfn,
+#else
.map_page = swiotlb_map_page,
+#endif
.unmap_page = swiotlb_unmap_page,
.dma_supported = NULL,
};
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 5ceda85b8687..d1c6e3808bb5 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -182,7 +182,11 @@ static void *sta2x11_swiotlb_alloc_coherent(struct device *dev,
static struct dma_map_ops sta2x11_dma_ops = {
.alloc = sta2x11_swiotlb_alloc_coherent,
.free = x86_swiotlb_free_coherent,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = swiotlb_map_pfn,
+#else
.map_page = swiotlb_map_page,
+#endif
.unmap_page = swiotlb_unmap_page,
.map_sg = swiotlb_map_sg_attrs,
.unmap_sg = swiotlb_unmap_sg_attrs,
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index e43d48956dea..ee8f70224b73 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2754,16 +2754,15 @@ static void __unmap_single(struct dma_ops_domain *dma_dom,
/*
* The exported map_single function for dma_ops.
*/
-static dma_addr_t map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t map_pfn(struct device *dev, __pfn_t pfn, unsigned long offset,
+ size_t size, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
unsigned long flags;
struct protection_domain *domain;
dma_addr_t addr;
u64 dma_mask;
- phys_addr_t paddr = page_to_phys(page) + offset;
+ phys_addr_t paddr = __pfn_t_to_phys(pfn) + offset;

INC_STATS_COUNTER(cnt_map_single);

@@ -2788,6 +2787,14 @@ out:
spin_unlock_irqrestore(&domain->lock, flags);

return addr;
+
+}
+
+static __maybe_unused dma_addr_t map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ return map_pfn(dev, page_to_pfn_t(page), offset, size, dir, attrs);
}

/*
@@ -3062,7 +3069,11 @@ static void __init prealloc_protection_domains(void)
static struct dma_map_ops amd_iommu_dma_ops = {
.alloc = alloc_coherent,
.free = free_coherent,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = map_pfn,
+#else
.map_page = map_page,
+#endif
.unmap_page = unmap_page,
.map_sg = map_sg,
.unmap_sg = unmap_sg,
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 9b9ada71e0d3..6d9a0f85b827 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3086,15 +3086,23 @@ error:
return 0;
}

-static dma_addr_t intel_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t intel_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
- return __intel_map_single(dev, page_to_phys(page) + offset, size,
+ return __intel_map_single(dev, __pfn_t_to_phys(pfn) + offset, size,
dir, *dev->dma_mask);
}

+static __maybe_unused dma_addr_t intel_map_page(struct device *dev,
+ struct page *page, unsigned long offset, size_t size,
+ enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+ return intel_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
+
static void flush_unmaps(void)
{
int i, j;
@@ -3380,7 +3388,11 @@ struct dma_map_ops intel_dma_ops = {
.free = intel_free_coherent,
.map_sg = intel_map_sg,
.unmap_sg = intel_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = intel_map_pfn,
+#else
.map_page = intel_map_page,
+#endif
.unmap_page = intel_unmap_page,
.mapping_error = intel_mapping_error,
};
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 7a8f1c5e65af..e56b04e24ec6 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -56,7 +56,7 @@ config PCI_STUB

config XEN_PCIDEV_FRONTEND
tristate "Xen PCI Frontend"
- depends on PCI && X86 && XEN
+ depends on PCI && X86 && XEN && !HAVE_DMA_PFN
select PCI_XEN
select XEN_XENBUS_FRONTEND
default y
diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 7305efb1bac6..e031b079ce4e 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -18,7 +18,7 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
kmemcheck_mark_initialized(ptr, size);
BUG_ON(!valid_dma_direction(dir));
#ifdef CONFIG_HAVE_DMA_PFN
- addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+ addr = ops->map_pfn(dev, page_to_pfn_t(virt_to_page(ptr)),
(unsigned long)ptr & ~PAGE_MASK, size,
dir, attrs);
#else
@@ -99,7 +99,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
enum dma_data_direction dir)
{
kmemcheck_mark_initialized(page_address(page) + offset, size);
- return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+ return dma_map_pfn(dev, page_to_pfn_t(page), offset, size, dir);
}
#else
static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 49054374646e..20334222d0c9 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -158,9 +158,9 @@ static inline struct page *sg_page(struct scatterlist *sg)
return page;
}

-static inline unsigned long sg_pfn(struct scatterlist *sg)
+static inline __pfn_t sg_pfn(struct scatterlist *sg)
{
- return __pfn_t_to_pfn(sg->pfn);
+ return sg->pfn;
}

/**
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e7a018eaf3a2..5093fc8d2825 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -66,6 +66,10 @@ extern dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction dir,
struct dma_attrs *attrs);
+extern dma_addr_t swiotlb_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs);
extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
size_t size, enum dma_data_direction dir,
struct dma_attrs *attrs);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 341268841b31..5ab7566735ba 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -727,12 +727,12 @@ swiotlb_full(struct device *dev, size_t size, enum dma_data_direction dir,
* Once the device is given the dma address, the device owns this memory until
* either swiotlb_unmap_page or swiotlb_dma_sync_single is performed.
*/
-dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+dma_addr_t swiotlb_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
- phys_addr_t map, phys = page_to_phys(page) + offset;
+ phys_addr_t map, phys = __pfn_t_to_phys(pfn) + offset;
dma_addr_t dev_addr = phys_to_dma(dev, phys);

BUG_ON(dir == DMA_NONE);
@@ -763,6 +763,16 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,

return dev_addr;
}
+EXPORT_SYMBOL_GPL(swiotlb_map_pfn);
+
+dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ return swiotlb_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
EXPORT_SYMBOL_GPL(swiotlb_map_page);

/*

2015-06-05 21:23:46

by Dan Williams

Subject: Re: [PATCH v4 0/9] introduce __pfn_t, evacuate struct page from sgls

On Fri, Jun 5, 2015 at 2:19 PM, Dan Williams <[email protected]> wrote:
> Introduce __pfn_t which:
>
> 1/ Allows kernel internal DAX mappings to adhere to the lifetime of
> the underlying block device. In general, it enables a mechanism to
> allow any device driver to advertise "device memory" (CONFIG_DEV_PFN)
> to other parts of the kernel.
>
> 2/ Replaces usage of struct page in struct scatterlist. A scatterlist
> need only carry enough information to generate a dma address, and
> removing struct page from scatterlists is a precursor to allowing DMA to
> device memory. Some dma mapping implementations are not ready for a
> scatterlist-pfn to reference unmapped device memory; those
> implementations are disabled by CONFIG_DEV_PFN=y.
>
> Changes since v4 [1]:

v3: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000749.html

2015-06-05 21:37:20

by Linus Torvalds

Subject: Re: [PATCH v4 1/9] introduce __pfn_t for scatterlists and pmem

On Fri, Jun 5, 2015 at 2:19 PM, Dan Williams <[email protected]> wrote:
> +enum {
> +#if BITS_PER_LONG == 64
> + PFN_SHIFT = 3,
> + /* device-pfn not covered by memmap */
> + PFN_DEV = (1UL << 2),
> +#else
> + PFN_SHIFT = 2,
> +#endif
> + PFN_MASK = (1UL << PFN_SHIFT) - 1,
> + PFN_SG_CHAIN = (1UL << 0),
> + PFN_SG_LAST = (1UL << 1),
> +};

Ugh. Just make PFN_SHIFT unconditional. Make it 2, unconditionally.
Or, if you want to have more bits, make it three unconditionally, and
make 'struct page' just be at least 8-byte aligned even on 32-bit.

Even on 32-bit architectures, there's plenty of bits. There's no
reason to "pack" this optimally. Remember: it's a page frame number,
so there's the page size shifting going on in physical memory, and
even if you shift the PFN by 3 - or four or five - bits
unconditionally (rather than try to shift it by some minimal number),
you're covering a *lot* of physical memory.

Say you're a 32-bit architecture with a 4k page size, and you lose
three bits to "type" bits. You still have 32+12-3=41 bits of physical
address space. Which is way more than realistic for a 32-bit
architecture anyway, even with PAE (or PXE or whatever ARM calls it).
Not that I see persistent memory being all that relevant on 32-bit
hardware anyway.

So I think if you actually do want that third bit, you're better off
just marking "struct page" as being __aligned__((8)) and getting the
three bits unconditionally. Just make the rule be that mem_map[] has
to be 8-byte aligned.

Even 16-byte alignment would probably be fine. No?

Linus

2015-06-05 22:12:58

by Dan Williams

Subject: Re: [PATCH v4 1/9] introduce __pfn_t for scatterlists and pmem

On Fri, Jun 5, 2015 at 2:37 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jun 5, 2015 at 2:19 PM, Dan Williams <[email protected]> wrote:
>> +enum {
>> +#if BITS_PER_LONG == 64
>> + PFN_SHIFT = 3,
>> + /* device-pfn not covered by memmap */
>> + PFN_DEV = (1UL << 2),
>> +#else
>> + PFN_SHIFT = 2,
>> +#endif
>> + PFN_MASK = (1UL << PFN_SHIFT) - 1,
>> + PFN_SG_CHAIN = (1UL << 0),
>> + PFN_SG_LAST = (1UL << 1),
>> +};
>
> Ugh. Just make PFN_SHIFT unconditional. Make it 2, unconditionally.
> Or, if you want to have more bits, make it three unconditionally, and
> make 'struct page' just be at least 8-byte aligned even on 32-bit.
>
> Even on 32-bit architectures, there's plenty of bits. There's no
> reason to "pack" this optimally. Remember: it's a page frame number,
> so there's the page size shifting going on in physical memory, and
> even if you shift the PFN by 3 - or four or five - bits
> unconditionally (rather than try to shift it by some minimal number),
> you're covering a *lot* of physical memory.

It is a page frame number, but page_to_pfn_t() just stores the value
of the struct page pointer directly, so we really only have the
pointer alignment bits. I do this so that kmap_atomic_pfn_t() can
optionally call kmap_atomic() if the pfn is mapped.

>
> Say you're a 32-bit architecture with a 4k page size, and you lose
> three bits to "type" bits. You still have 32+12-3=41 bits of physical
> address space. Which is way more than realistic for a 32-bit
> architecture anyway, even with PAE (or PXE or whatever ARM calls it).
> Not that I see persistent memory being all that relevant on 32-bit
> hardware anyway.
>
> So I think if you actually do want that third bit, you're better off
> just marking "struct page" as being __aligned__((8)) and getting the
> three bits unconditionally. Just make the rule be that mem_map[] has
> to be 8-byte aligned.
>
> Even 16-byte alignment would probably be fine. No?
>

Ooh, that's great, I was already lamenting the fact that I had run out
of bits. One of the reasons to go to 16-byte alignment is to have
another bit to further qualify the pfn as persistent memory not just
un-mapped memory. The rationale would be to generate, and verify
proper usage of, __pmem annotated pointers.

...but I'm still waiting for someone to tell me I'm needlessly
complicating things with a __pmem annotation [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-June/001087.html

2015-06-06 11:37:53

by Matthew Wilcox

Subject: Re: [PATCH v4 3/9] dax: drop size parameter to ->direct_access()

On Fri, Jun 05, 2015 at 05:19:18PM -0400, Dan Williams wrote:
> None of the implementations currently use it. The common
> bdev_direct_access() entry point handles all the size checks before
> calling ->direct_access().

Interesting. The reason it was still there, even after moving all
the size checks to bdev_direct_access() was to support the case of a
hypothetical future driver that exposed several discontiguous ranges as
a single block device.

But we don't actually need it for that, as long as the discontiguous
ranges are relatively large. As the comment you deleted noted, the brd
driver could do this, but I certainly haven't taken the time to figure
out whether it'd be a performance win if the pages "just happen" to line
up right.

So I don't think I'm objecting to this patch (except that it conflicts
textually with my dynamic mapping patch that you've seen but hasn't been
posted publicly yet). Just something to think about.

2015-06-06 11:58:38

by Matthew Wilcox

Subject: Re: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()

On Fri, Jun 05, 2015 at 05:19:24PM -0400, Dan Williams wrote:
> @@ -35,13 +35,16 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
> might_sleep();
> do {
> void *addr;
> - unsigned long pfn;
> + __pfn_t pfn;
> long count;
>
> - count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
> + count = bdev_direct_access(bdev, sector, &pfn, size);
> if (count < 0)
> return count;
> BUG_ON(size < count);
> + addr = kmap_atomic_pfn_t(pfn);
> + if (!addr)
> + return -EIO;
> while (count > 0) {
> unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
> if (pgsz > count)

This part is incomplete. When bdev_direct_access() could return an
address, it was possible for that address to be unaligned (eg when
'sector' was not a multiple of 8). DAX has never had full support for
devices that weren't a 4k sector size, but I was trying to not make that
assumption in more places than I had to. So this function needs a lot
more simplification (or it needs to add '(sector & 7) << 9' to addr ...
assuming that the partition this bdev represents actually starts at a
multiple of 8 ... bleh!).

>
> -static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
> +static long dax_get_pfn(struct buffer_head *bh, __pfn_t *pfn, unsigned blkbits)
> {
> - unsigned long pfn;
> sector_t sector = bh->b_blocknr << (blkbits - 9);
> - return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
> + return bdev_direct_access(bh->b_bdev, sector, pfn, bh->b_size);
> }

This function should just be deleted. It offers essentially nothing
over just calling bdev_direct_access().

> @@ -142,9 +146,19 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
> addr = NULL;
> size = bh->b_size - first;
> } else {
> - retval = dax_get_addr(bh, &addr, blkbits);
> + if (kmap) {
> + kunmap_atomic_pfn_t(kmap);
> + kmap = NULL;
> + }
> + retval = dax_get_pfn(bh, &pfn, blkbits);
> if (retval < 0)
> break;
> + kmap = kmap_atomic_pfn_t(pfn);
> + if (!kmap) {
> + retval = -EIO;
> + break;
> + }
> + addr = kmap;
> if (buffer_unwritten(bh) || buffer_new(bh))
> dax_new_buf(addr, retval, first, pos,
> end);

Interesting approach. The patch I sent you was more complex ... this
probably ends up working out better since it has fewer places to check
for kmap returning an error.

Subject: RE: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()

> -----Original Message-----
> From: Linux-nvdimm [mailto:[email protected]] On Behalf
> Of Dan Williams
> Sent: Friday, June 05, 2015 3:19 PM
> Subject: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to
> __pfn_t + kmap_atomic_pfn_t()
...
> diff --git a/arch/powerpc/sysdev/axonram.c
> b/arch/powerpc/sysdev/axonram.c
> index e8657d3bc588..20725006592e 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
...
> @@ -165,9 +166,13 @@ static int axon_ram_probe(struct platform_device
> *device)
> {
> static int axon_ram_bank_id = -1;
> struct axon_ram_bank *bank;
> - struct resource resource;
> + struct resource *resource;
> int rc = 0;
>
> + resource = devm_kzalloc(&device->dev, sizeof(*resource),
> GFP_KERNEL);
> + if (!resource)
> + return -ENOMEM;
> +

Since resource is now a pointer...

...
> @@ -184,13 +189,13 @@ static int axon_ram_probe(struct platform_device
> *device)
>
> bank->device = device;
>
> - if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0)
> {
> + if (of_address_to_resource(device->dev.of_node, 0, resource) != 0) {
> dev_err(&device->dev, "Cannot access device tree\n");
> rc = -EFAULT;
> goto failed;
> }
...

... I'd expect to see a devm_kfree call added after the failed label,
like was done here:

> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 2f1734ba0e22..a7b9743c546f 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
...
> struct dcssblk_dev_info {
> @@ -520,12 +522,18 @@ static const struct attribute_group
> *dcssblk_dev_attr_groups[] = {
> static ssize_t
> dcssblk_add_store(struct device *dev, struct device_attribute *attr, const
> char *buf, size_t count)
> {
> + struct resource *res = devm_kzalloc(dev, sizeof(*res), GFP_KERNEL);
> int rc, i, j, num_of_segments;
> struct dcssblk_dev_info *dev_info;
> struct segment_info *seg_info, *temp;
> char *local_buf;
> unsigned long seg_byte_size;
>
> + if (!res) {
> + rc = -ENOMEM;
> + goto out_nobuf;
> + }
> +
> dev_info = NULL;
> seg_info = NULL;
> if (dev != dcssblk_root_dev) {
> @@ -652,6 +660,13 @@ dcssblk_add_store(struct device *dev, struct
> device_attribute *attr, const char
> if (rc)
> goto put_dev;
>
> + res->start = dev_info->start;
> + res->end = dev_info->end - 1;
> + rc = devm_register_kmap_pfn_range(&dev_info->dev, res,
> + (void *) dev_info->start);
> + if (rc)
> + goto put_dev;
> +
> get_device(&dev_info->dev);
> add_disk(dev_info->gd);
>
> @@ -699,6 +714,8 @@ seg_list_del:
> out:
> kfree(local_buf);
> out_nobuf:
> + if (res)
> + devm_kfree(dev, res);
> return rc;
> }

2015-06-08 16:36:56

by Dan Williams

Subject: Re: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()

On Mon, Jun 8, 2015 at 9:29 AM, Elliott, Robert (Server Storage)
<[email protected]> wrote:
>> -----Original Message-----
>> From: Linux-nvdimm [mailto:[email protected]] On Behalf
>> Of Dan Williams
>> Sent: Friday, June 05, 2015 3:19 PM
>> Subject: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to
>> __pfn_t + kmap_atomic_pfn_t()
> ...
>> diff --git a/arch/powerpc/sysdev/axonram.c
>> b/arch/powerpc/sysdev/axonram.c
>> index e8657d3bc588..20725006592e 100644
>> --- a/arch/powerpc/sysdev/axonram.c
>> +++ b/arch/powerpc/sysdev/axonram.c
> ...
>> @@ -165,9 +166,13 @@ static int axon_ram_probe(struct platform_device
>> *device)
>> {
>> static int axon_ram_bank_id = -1;
>> struct axon_ram_bank *bank;
>> - struct resource resource;
>> + struct resource *resource;
>> int rc = 0;
>>
>> + resource = devm_kzalloc(&device->dev, sizeof(*resource),
>> GFP_KERNEL);
>> + if (!resource)
>> + return -ENOMEM;
>> +
>
> Since resource is now a pointer...
>
> ...
>> @@ -184,13 +189,13 @@ static int axon_ram_probe(struct platform_device
>> *device)
>>
>> bank->device = device;
>>
>> - if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0)
>> {
>> + if (of_address_to_resource(device->dev.of_node, 0, resource) != 0) {
>> dev_err(&device->dev, "Cannot access device tree\n");
>> rc = -EFAULT;
>> goto failed;
>> }
> ...
>
> ... I'd expect to see a devm_kfree call added after the failed label,
> like was done here:

No, because unlike dcssblk the axon_ram driver actually adheres to the
device model, so it will be auto-freed on failed probe. The dcssblk
implementation is not a proper "driver", so we don't get the
devres_release_all() until the device is unregistered. That causes
the "resource" object to stick around longer than necessary unless we
explicitly devm_kfree() it.

2015-06-09 06:50:12

by Christoph Hellwig

Subject: Re: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

The code is mostly generic, so the x86 in the subject is a little misleading.
What keeps the Kconfig symbol in x86 anyway? Is there a reason why
it can't be made entirely generic?

2015-06-09 06:56:01

by Christoph Hellwig

Subject: Re: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()

On Fri, Jun 05, 2015 at 05:19:24PM -0400, Dan Williams wrote:
> The primary source for non-page-backed page-frames to enter the system
> is via the pmem driver's ->direct_access() method. The pfns returned by
> the top-level bdev_direct_access() may be passed to any other subsystem
> in the kernel and those sub-systems either need to assume that the pfn
> is page backed (CONFIG_DEV_PFN=n) or be prepared to handle the
> non-page-backed case (CONFIG_DEV_PFN=y). Currently the pfns returned by
> ->direct_access() are only ever used by vm_insert_mixed() which does not
> care if the pfn is mapped. As we go to add more usages of these pfns
> add the type-safety of __pfn_t.
>
> This also simplifies the calling convention of ->direct_access() by not
> returning the virtual address in the same call. This annotates cases
> where the kernel is directly accessing pmem outside the driver, and
> makes the valid lifetime of the reference explicit. This property may
> be useful in the future for invalidating mappings to pmem, but for now
> it provides some protection against the "pmem disable vs still-in-use"
> race.
>
> Note that axon_ram_direct_access and dcssblk_direct_access were
> previously making potentially incorrect assumptions about the addresses
> they passed to virt_to_phys().

This looks generally good.

The subject line is a little confusing, as the main change is that
it moves the responsibility for mapping a PFN returned from
->direct_access to a kernel virtual address onto the caller, so I'd
suggest rewording it to reflect this.

Second, all the ifdef HAVE_KMAP_PFN is horribly ugly. I'd suggest
trying hard to offer it on all architectures and selecting it from
the drivers that need it.

Third, it seems like this breaks axonram and dcssblk, as powerpc
and s390 don't define HAVE_KMAP_PFN yet, unless I missed something.

2015-06-09 06:58:58

by Christoph Hellwig

Subject: Re: [PATCH v4 9/9] x86: convert dma_map_ops to support mapping a __pfn_t.

On Fri, Jun 05, 2015 at 05:19:55PM -0400, Dan Williams wrote:
> As long as a dma_map_sg() implementation avoids sg_page() conversions it
> can support scatterlists that carry "page-less" __pfn_t entries.
> However, a couple implementations require that __pfn_t_has_page() is
> always true. The Xen swiotlb implementation's entanglements with ARM and
> the Calgary MMUs requirement to have a pre-existing virtual mapping make
> them unable to support this conversion (i.e. these now have 'depends on
> !HAVE_DMA_PFN').

That's why we really need a whole-kernel conversion and not just a piecemeal
one. Given how trivial this patch is, that doesn't look like too big a
task anyway.

2015-06-09 07:00:23

by Christoph Hellwig

Subject: Re: [PATCH v4 6/9] scatterlist: use sg_phys()

Please send this and the next one to Jens for this cycle.

2015-06-09 13:51:29

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH v4 9/9] x86: convert dma_map_ops to support mapping a __pfn_t.

On Tue, Jun 09, 2015 at 08:58:54AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 05, 2015 at 05:19:55PM -0400, Dan Williams wrote:
> > As long as a dma_map_sg() implementation avoids sg_page() conversions it
> > can support scatterlists that carry "page-less" __pfn_t entries.
> > However, a couple implementations require that __pfn_t_has_page() is
> > always true. The Xen swiotlb implementation's entanglements with ARM and
> > the Calgary MMUs requirement to have a pre-existing virtual mapping make
> > them unable to support this conversion (i.e. these now have 'depends on
> > !HAVE_DMA_PFN').
>
> That's why we really need a whole-kernel conversion and not just a piecemeal
> one. Given how trivial this patch is, that doesn't look like too big a
> task anyway.

Aye, and the SWIOTLB (baremetal), Xen SWIOTLB (x86), Xen SWIOTLB (ARM)
can surely be easily tested by the Xen folks if you have patches. Please
just CC the [email protected] on the patches and shout
out for testing help.

2015-06-10 12:12:09

by Christoph Hellwig

Subject: Re: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

Btw, I don't think this actually is safe without refcounting your kmap
structure.

The driver model ->remove callback can be called at any time, which
will iounmap the memory and remove the kmap structure. But at this
point a user might still be using it.

2015-06-10 15:04:37

by Matthew Wilcox

Subject: Re: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

On Wed, Jun 10, 2015 at 02:12:02PM +0200, Christoph Hellwig wrote:
> Btw, I don't think this actually is safe without refcounting your kmap
> structure.
>
> The driver model ->remove callback can be called at any time, which
> will ioremap the memory and remap the kmap structure. But at this
> point a user might still be using it.

Won't the device data structures be pinned by the refcount on the bdev?

2015-06-10 15:11:09

by Christoph Hellwig

Subject: Re: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

On Wed, Jun 10, 2015 at 11:03:35AM -0400, Matthew Wilcox wrote:
> On Wed, Jun 10, 2015 at 02:12:02PM +0200, Christoph Hellwig wrote:
> > Btw, I don't think this actually is safe without refcounting your kmap
> > structure.
> >
> > The driver model ->remove callback can be called at any time, which
> > will ioremap the memory and remap the kmap structure. But at this
> > point a user might still be using it.
>
> Won't the device data structures be pinned by the refcount on the bdev?

An open filesystem only keeps a reference on the request_queue. The
underlying driver model ->remove method will still be called on
a surprise removal.

2015-06-10 15:37:08

by Dan Williams

Subject: Re: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

On Wed, Jun 10, 2015 at 8:11 AM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Jun 10, 2015 at 11:03:35AM -0400, Matthew Wilcox wrote:
>> On Wed, Jun 10, 2015 at 02:12:02PM +0200, Christoph Hellwig wrote:
>> > Btw, I don't think this actually is safe without refcounting your kmap
>> > structure.
>> >
>> > The driver model ->remove callback can be called at any time, which
>> > will ioremap the memory and remap the kmap structure. But at this
>> > point a user might still be using it.
>>
>> Won't the device data structures be pinned by the refcount on the bdev?
>
> An open filesystem only keeps a reference on the request_queue. The
> underlying driver model ->remove method will still be called on
> a surprise removal.

On surprise removal my expectation is that the driver keeps the
ioremap mapping alive for at least a synchronize_rcu() period. With
that in place the rcu_read_lock() in kmap_atomic_pfn_t() should keep
the mapping alive for the short duration, or otherwise prevent new
mappings. However, I missed converting the driver to order its
iounmap() with respect to the pfn range registration via devres, will
fix.

2015-06-10 16:12:10

by Christoph Hellwig

Subject: Re: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

On Wed, Jun 10, 2015 at 08:36:48AM -0700, Dan Williams wrote:
> On surprise removal my expectation is that the driver keeps the
> ioremap mapping alive for at least a synchronize_rcu() period. With
> that in place the rcu_read_lock() in kmap_atomic_pfn_t() should keep
> the mapping alive for the short duration, or otherwise prevent new
> mappings. However, I missed converting the driver to order its
> iounmap() with respect to the pfn range registration via devres, will
> fix.

Right. I guess the best idea is to replace devm_register_kmap_pfn_range
with an interface that does the ioremap, in which case the cleanup
function can take care of the proper ordering behind the driver's back.

2015-06-10 16:17:53

by Dan Williams

Subject: Re: [PATCH v4 2/9] x86: support kmap_atomic_pfn_t() for persistent memory

On Wed, Jun 10, 2015 at 9:11 AM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Jun 10, 2015 at 08:36:48AM -0700, Dan Williams wrote:
>> On surprise removal my expectation is that the driver keeps the
>> ioremap mapping alive for at least a synchronize_rcu() period. With
>> that in place the rcu_read_lock() in kmap_atomic_pfn_t() should keep
>> the mapping alive for the short duration, or otherwise prevent new
>> mappings. However, I missed converting the driver to order its
>> iounmap() with respect to the pfn range registration via devres, will
>> fix.
>
> Right. I guess the best idea is to replace devm_register_kmap_pfn_range
> with an interface that does the ioremap, in which case the cleanup
> function can take care of the proper ordering behind the driver's back.

Even better than what I was thinking... will do.

2015-08-07 23:54:12

by Dan Williams

Subject: Re: [PATCH v4 4/9] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()

On Sat, Jun 6, 2015 at 4:58 AM, Matthew Wilcox <[email protected]> wrote:
> On Fri, Jun 05, 2015 at 05:19:24PM -0400, Dan Williams wrote:
>> @@ -35,13 +35,16 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>> might_sleep();
>> do {
>> void *addr;
>> - unsigned long pfn;
>> + __pfn_t pfn;
>> long count;
>>
>> - count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
>> + count = bdev_direct_access(bdev, sector, &pfn, size);
>> if (count < 0)
>> return count;
>> BUG_ON(size < count);
>> + addr = kmap_atomic_pfn_t(pfn);
>> + if (!addr)
>> + return -EIO;
>> while (count > 0) {
>> unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
>> if (pgsz > count)
>
> This part is incomplete. When bdev_direct_access() could return an
> address, it was possible for that address to be unaligned (eg when
> 'sector' was not a multiple of 8). DAX has never had full support for
> devices that weren't a 4k sector size, but I was trying to not make that
> assumption in more places than I had to. So this function needs a lot
> more simplification (or it needs to add '(sector & 7) << 9' to addr ...
> assuming that the partition this bdev represents actually starts at a
> multiple of 8 ... bleh!).

Isn't this already handled by the:

if (sector % (PAGE_SIZE / 512))
return -EINVAL;

...check in bdev_direct_access()? As long as the driver's mapping is
4K aligned, which appears to be the case for all DAX-enabled drivers,
then we should be good to go.

>>
>> -static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
>> +static long dax_get_pfn(struct buffer_head *bh, __pfn_t *pfn, unsigned blkbits)
>> {
>> - unsigned long pfn;
>> sector_t sector = bh->b_blocknr << (blkbits - 9);
>> - return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
>> + return bdev_direct_access(bh->b_bdev, sector, pfn, bh->b_size);
>> }
>
> This function should just be deleted. It offers essentially nothing
> over just calling bdev_direct_access().

Ok.