2015-05-06 20:07:38

by Dan Williams

Subject: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

Changes since v1 [1]:

1/ added include/asm-generic/pfn.h for the __pfn_t definition and helpers.

2/ added kmap_atomic_pfn_t()

3/ rebased on v4.1-rc2

[1]: http://marc.info/?l=linux-kernel&m=142653770511970&w=2

---

A lead-in note: this looks scarier than it is. Most of the code churn
is automated via Coccinelle. Also, the subtle differences between an
'unsigned long pfn' and a '__pfn_t' are mitigated by type safety and a
Kconfig option (default-disabled CONFIG_PMEM_IO) that globally controls
whether a pfn and a __pfn_t are equivalent.
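
As an illustrative sketch only (using the helpers introduced in patch
01): because __pfn_t is a distinct struct type, a raw pfn cannot be
handed to a __pfn_t consumer without an explicit conversion:

	struct page *page = alloc_page(GFP_KERNEL);
	__pfn_t pfn;

	pfn = page_to_pfn_t(page);	/* explicit, type-checked conversion */
	/* pfn = page_to_pfn(page); -- would not compile, an unsigned long
	 * is not a __pfn_t */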

The motivation for this change is persistent memory and the desire to
use it not only via the pmem driver, but also as a memory target for I/O
(DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel. Aside
from the pmem driver and DAX, persistent memory cannot be used in these
I/O scenarios due to the lack of a backing struct page, i.e. persistent
memory is not part of the memmap. This patchset takes the position that
the solution is to teach I/O paths that want to operate on persistent
memory to do so by referencing a __pfn_t. The alternatives are
discussed in the changelog for "[PATCH v2 01/10] arch: introduce
__pfn_t for persistent memory i/o", copied here:

Alternatives:

1/ Provide struct page coverage for persistent memory in
DRAM. The expectation is that persistent memory capacities make
this untenable in the long term.

2/ Provide struct page coverage for persistent memory with
persistent memory. While persistent memory may have near-DRAM
performance characteristics, it may not have the same
write endurance as DRAM. Given the update frequency of struct
page objects, it may not be suitable for persistent memory.

3/ Dynamically allocate struct page. This appears to be on
the order of the complexity of converting code paths to use
__pfn_t references instead of struct page, and the amount of
setup required to establish a valid struct page reference is
mostly wasted when the only usage in the block stack is to
perform a page_to_pfn() conversion for dma-mapping. Instances
of kmap() / kmap_atomic() usage appear to be the only occasions
in the block stack where struct page is non-trivially used. A
new kmap_atomic_pfn_t() is proposed to handle those cases.

---

Dan Williams (9):
arch: introduce __pfn_t for persistent memory i/o
block: add helpers for accessing a bio_vec page
block: convert .bv_page to .bv_pfn bio_vec
dma-mapping: allow archs to optionally specify a ->map_pfn() operation
scatterlist: use sg_phys()
x86: support dma_map_pfn()
x86: support kmap_atomic_pfn_t() for persistent memory
dax: convert to __pfn_t
block: base support for pfn i/o

Matthew Wilcox (1):
scatterlist: support "page-less" (__pfn_t only) entries


arch/Kconfig | 6 ++
arch/arm/mm/dma-mapping.c | 2 -
arch/microblaze/kernel/dma.c | 2 -
arch/powerpc/sysdev/axonram.c | 6 +-
arch/x86/Kconfig | 7 ++
arch/x86/kernel/Makefile | 1
arch/x86/kernel/amd_gart_64.c | 22 +++++-
arch/x86/kernel/kmap.c | 95 ++++++++++++++++++++++++++
arch/x86/kernel/pci-nommu.c | 22 +++++-
arch/x86/kernel/pci-swiotlb.c | 4 +
arch/x86/pci/sta2x11-fixup.c | 4 +
arch/x86/xen/pci-swiotlb-xen.c | 4 +
block/bio-integrity.c | 8 +-
block/bio.c | 82 ++++++++++++++++------
block/blk-core.c | 13 +++-
block/blk-integrity.c | 7 +-
block/blk-lib.c | 2 -
block/blk-merge.c | 15 ++--
block/bounce.c | 26 ++++---
drivers/block/aoe/aoecmd.c | 8 +-
drivers/block/brd.c | 6 +-
drivers/block/drbd/drbd_bitmap.c | 5 +
drivers/block/drbd/drbd_main.c | 6 +-
drivers/block/drbd/drbd_receiver.c | 4 +
drivers/block/drbd/drbd_worker.c | 3 +
drivers/block/floppy.c | 6 +-
drivers/block/loop.c | 13 ++--
drivers/block/nbd.c | 8 +-
drivers/block/nvme-core.c | 2 -
drivers/block/pktcdvd.c | 11 ++-
drivers/block/pmem.c | 16 +++-
drivers/block/ps3disk.c | 2 -
drivers/block/ps3vram.c | 2 -
drivers/block/rbd.c | 2 -
drivers/block/rsxx/dma.c | 2 -
drivers/block/umem.c | 2 -
drivers/block/zram/zram_drv.c | 10 +--
drivers/dma/ste_dma40.c | 5 -
drivers/iommu/amd_iommu.c | 21 ++++--
drivers/iommu/intel-iommu.c | 26 +++++--
drivers/iommu/iommu.c | 2 -
drivers/md/bcache/btree.c | 4 +
drivers/md/bcache/debug.c | 6 +-
drivers/md/bcache/movinggc.c | 2 -
drivers/md/bcache/request.c | 6 +-
drivers/md/bcache/super.c | 10 +--
drivers/md/bcache/util.c | 5 +
drivers/md/bcache/writeback.c | 2 -
drivers/md/dm-crypt.c | 12 ++-
drivers/md/dm-io.c | 2 -
drivers/md/dm-log-writes.c | 14 ++--
drivers/md/dm-verity.c | 2 -
drivers/md/raid1.c | 50 +++++++-------
drivers/md/raid10.c | 38 +++++-----
drivers/md/raid5.c | 6 +-
drivers/mmc/card/queue.c | 4 +
drivers/s390/block/dasd_diag.c | 2 -
drivers/s390/block/dasd_eckd.c | 14 ++--
drivers/s390/block/dasd_fba.c | 6 +-
drivers/s390/block/dcssblk.c | 8 +-
drivers/s390/block/scm_blk.c | 2 -
drivers/s390/block/scm_blk_cluster.c | 2 -
drivers/s390/block/xpram.c | 2 -
drivers/scsi/mpt2sas/mpt2sas_transport.c | 6 +-
drivers/scsi/mpt3sas/mpt3sas_transport.c | 6 +-
drivers/scsi/sd_dif.c | 4 +
drivers/staging/android/ion/ion_chunk_heap.c | 4 +
drivers/staging/lustre/lustre/llite/lloop.c | 2 -
drivers/target/target_core_file.c | 4 +
drivers/xen/biomerge.c | 4 +
drivers/xen/swiotlb-xen.c | 29 +++++---
fs/9p/vfs_addr.c | 2 -
fs/block_dev.c | 2 -
fs/btrfs/check-integrity.c | 6 +-
fs/btrfs/compression.c | 12 ++-
fs/btrfs/disk-io.c | 5 +
fs/btrfs/extent_io.c | 8 +-
fs/btrfs/file-item.c | 8 +-
fs/btrfs/inode.c | 19 +++--
fs/btrfs/raid56.c | 4 +
fs/btrfs/volumes.c | 2 -
fs/buffer.c | 4 +
fs/dax.c | 9 +-
fs/direct-io.c | 2 -
fs/exofs/ore.c | 4 +
fs/exofs/ore_raid.c | 2 -
fs/ext4/page-io.c | 2 -
fs/ext4/readpage.c | 4 +
fs/f2fs/data.c | 4 +
fs/f2fs/segment.c | 2 -
fs/gfs2/lops.c | 4 +
fs/jfs/jfs_logmgr.c | 4 +
fs/logfs/dev_bdev.c | 10 +--
fs/mpage.c | 2 -
fs/splice.c | 2 -
include/asm-generic/dma-mapping-common.h | 30 ++++++++
include/asm-generic/memory_model.h | 1
include/asm-generic/pfn.h | 67 ++++++++++++++++++
include/asm-generic/scatterlist.h | 10 +++
include/crypto/scatterwalk.h | 10 +++
include/linux/bio.h | 24 ++++---
include/linux/blk_types.h | 20 +++++
include/linux/blkdev.h | 6 +-
include/linux/dma-debug.h | 23 +++++-
include/linux/dma-mapping.h | 8 ++
include/linux/highmem.h | 23 ++++++
include/linux/mm.h | 1
include/linux/scatterlist.h | 91 ++++++++++++++++++++++---
include/linux/swiotlb.h | 4 +
init/Kconfig | 13 ++++
kernel/power/block_io.c | 2 -
lib/dma-debug.c | 10 ++-
lib/iov_iter.c | 22 +++---
lib/swiotlb.c | 20 ++++-
mm/page_io.c | 10 +--
net/ceph/messenger.c | 2 -
116 files changed, 896 insertions(+), 372 deletions(-)
create mode 100644 arch/x86/kernel/kmap.c
create mode 100644 include/asm-generic/pfn.h


2015-05-06 20:07:45

by Dan Williams

Subject: [PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o

Introduce a type that encapsulates a page-frame number that is
optionally backed by memmap (struct page). This type will be used in
place of 'struct page *' instances in contexts where persistent memory
is being referenced (scatterlists for drivers, biovecs for the block
layer, etc). The operations in those i/o paths that formerly required a
'struct page *' are to be converted to use __pfn_t-aware equivalent
helpers. Otherwise, in the absence of persistent memory, there is no
functional change and __pfn_t is an alias for a normal memory page.

It turns out that while 'struct page' references are used broadly in
the kernel I/O stacks, the usage of 'struct page' based capabilities is
very shallow. struct page is only used for populating bio_vecs and
scatterlists for the retrieval of dma addresses, and for temporary
kernel mappings (kmap). Aside from kmap, these usages can be trivially
converted to operate on a pfn.
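
For example, a scatterlist consumer that only dereferences the page to
compute a physical address can drop the struct page lookup entirely; an
illustrative sketch of the kind of conversion done by "scatterlist: use
sg_phys()" later in the series:

	/* before: struct page consulted only to compute a physical address */
	phys_addr_t phys = page_to_phys(sg_page(sg)) + sg->offset;

	/* after: the same value via sg_phys(), which can later be taught
	 * to handle "page-less" (__pfn_t only) scatterlist entries */
	phys_addr_t phys = sg_phys(sg);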

In contrast, kmap_atomic() is more problematic as it uses mm
infrastructure, via struct page, to set up and track temporary kernel
mappings. It would be unfortunate if the kmap infrastructure escaped
its 32-bit/HIGHMEM bonds and leaked into 64-bit code. Thankfully, it
seems all that is needed here is to convert the kmap_atomic() callers
that want to opt in to supporting persistent memory to use a new
kmap_atomic_pfn_t(), which re-uses the existing ioremap() mapping
established by the driver for persistent memory.
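
The x86 implementation lands in arch/x86/kernel/kmap.c ("x86: support
kmap_atomic_pfn_t() for persistent memory"), not quoted here. A minimal
sketch of the idea, where pmem_kmap_lookup() is a hypothetical stand-in
for whatever bookkeeping the real patch uses to find the driver's
ioremap() mapping:

void *kmap_atomic_pfn_t(__pfn_t pfn)
{
	struct page *page = __pfn_t_to_page(pfn);

	/* memmap-backed memory: fall back to the normal kmap path */
	if (page)
		return kmap_atomic(page);

	/*
	 * Unmapped (persistent) memory: reuse the ioremap() mapping the
	 * pmem driver already established (pmem_kmap_lookup() is
	 * hypothetical, for illustration only).
	 */
	return pmem_kmap_lookup(__pfn_t_to_phys(pfn));
}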

Note that, as far as conceptually understanding __pfn_t is concerned,
'persistent memory' is really any address range in host memory not
covered by the memmap. Contrast this with pure iomem that sits on an
MMIO-mapped bus like PCI and cannot be converted to a dma_addr_t by
"pfn << PAGE_SHIFT".

Cc: H. Peter Anvin <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/asm-generic/memory_model.h | 1 -
include/asm-generic/pfn.h | 51 ++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 +
init/Kconfig | 13 +++++++++
4 files changed, 65 insertions(+), 1 deletion(-)
create mode 100644 include/asm-generic/pfn.h

diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..1b0ae21fd8ff 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -70,7 +70,6 @@
#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */

#define page_to_pfn __page_to_pfn
-#define pfn_to_page __pfn_to_page

#endif /* __ASSEMBLY__ */

diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
new file mode 100644
index 000000000000..91171e0285d9
--- /dev/null
+++ b/include/asm-generic/pfn.h
@@ -0,0 +1,51 @@
+#ifndef __ASM_PFN_H
+#define __ASM_PFN_H
+
+#ifndef __pfn_to_phys
+#define __pfn_to_phys(pfn) ((dma_addr_t)(pfn) << PAGE_SHIFT)
+#endif
+
+static inline struct page *pfn_to_page(unsigned long pfn)
+{
+ return __pfn_to_page(pfn);
+}
+
+/*
+ * __pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page). This type will be used in place of a
+ * 'struct page *' instance in contexts where unmapped memory (usually
+ * persistent memory) is being referenced (scatterlists for drivers,
+ * biovecs for the block layer, etc).
+ */
+typedef struct {
+ union {
+ unsigned long pfn;
+ struct page *page;
+ };
+} __pfn_t;
+
+static inline struct page *__pfn_t_to_page(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+ if (pfn.pfn < PAGE_OFFSET)
+ return NULL;
+#endif
+ return pfn.page;
+}
+
+static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+ if (pfn.pfn < PAGE_OFFSET)
+ return __pfn_to_phys(pfn.pfn);
+#endif
+ return __pfn_to_phys(page_to_pfn(pfn.page));
+}
+
+static inline __pfn_t page_to_pfn_t(struct page *page)
+{
+ __pfn_t pfn = { .page = page };
+
+ return pfn;
+}
+#endif /* __ASM_PFN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..9d35cff41c12 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -52,6 +52,7 @@ extern int sysctl_legacy_va_layout;
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/processor.h>
+#include <asm-generic/pfn.h>

#ifndef __pa_symbol
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec60232..7d2ad350fd29 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1764,6 +1764,19 @@ config PROFILING
Say Y here to enable the extended profiling support mechanisms used
by profilers such as OProfile.

+config PMEM_IO
+ default n
+ bool "Support for I/O, DAX, DMA, RDMA to unmapped (persistent) memory" if EXPERT
+ help
+ Say Y here to enable the Block and Networking stacks to
+ reference memory that is not mapped. This is usually the
+ case if you have large quantities of persistent memory
+ relative to DRAM. Enabling this option may increase the
+ kernel size by a few kilobytes as it instructs the kernel
+ that a __pfn_t may reference unmapped memory. Disabling
+ this option instructs the kernel that a __pfn_t always
+ references mapped memory.
+
#
# Place an empty function call at each tracepoint site. Can be
# dynamically changed for a probe function.

2015-05-06 20:07:56

by Dan Williams

Subject: [PATCH v2 02/10] block: add helpers for accessing a bio_vec page

In preparation for converting struct bio_vec to carry a __pfn_t instead
of struct page.

This change is prompted by the desire to add in-kernel DMA support
(O_DIRECT, hierarchical storage, RDMA, etc) for persistent memory which
lacks struct page coverage.

Alternatives:

1/ Provide struct page coverage for persistent memory in DRAM. The
expectation is that persistent memory capacities make this untenable
in the long term.

2/ Provide struct page coverage for persistent memory with persistent
memory. While persistent memory may have near-DRAM performance
characteristics, it may not have the same write endurance as DRAM.
Given the update frequency of struct page objects, it may not be
suitable for persistent memory.

3/ Dynamically allocate struct page. This appears to be on the order
of the complexity of converting code paths to use __pfn_t references
instead of struct page, and the amount of setup required to establish
a valid struct page reference is mostly wasted when the only usage in
the block stack is to perform a page_to_pfn() conversion for
dma-mapping. Instances of kmap() / kmap_atomic() usage appear to be
the only occasions in the block stack where struct page is
non-trivially used. A new kmap_atomic_pfn_t() is proposed to handle
those cases.
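
The bvec_page()/bvec_set_page() accessors themselves are added in the
include/linux/blk_types.h hunk of this patch. At this point in the
series, before the .bv_page to .bv_pfn switch in the next patch, they
are presumably just trivial wrappers along these lines:

static inline struct page *bvec_page(const struct bio_vec *bvec)
{
	return bvec->bv_page;
}

static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
{
	bvec->bv_page = page;
}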

Generated with the following semantic patch:

// bv_page.cocci: convert usage of ->bv_page to use set/get helpers
// usage: make coccicheck COCCI=bv_page.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct bio_vec bvec;
expression E;
type T;
@@

- bvec.bv_page = (T)E
+ bvec_set_page(&bvec, E)

@@
struct bio_vec *bvec;
expression E;
type T;
@@

- bvec->bv_page = (T)E
+ bvec_set_page(bvec, E)

@@
struct bio_vec bvec;
type T;
@@

- (T)bvec.bv_page
+ bvec_page(&bvec)

@@
struct bio_vec *bvec;
type T;
@@

- (T)bvec->bv_page
+ bvec_page(bvec)

@@
struct bio *bio;
expression E;
expression F;
type T;
@@

- bio->bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&bio->bi_io_vec[F], E)

@@
struct bio *bio;
expression E;
type T;
@@

- bio->bi_io_vec->bv_page = (T)E
+ bvec_set_page(bio->bi_io_vec, E)

@@
struct cached_dev *dc;
expression E;
type T;
@@

- dc->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(dc->sb_bio.bi_io_vec, E)

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_io_vec[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_io_vec[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_io_vec[F].bv_page
+ bvec_page(&ca->sb_bio.bi_io_vec[F])

@@
struct cache *ca;
expression E;
expression F;
type T;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page = (T)E
+ bvec_set_page(&ca->sb_bio.bi_inline_vecs[F], E)

@@
struct cache *ca;
expression F;
@@

- ca->sb_bio.bi_inline_vecs[F].bv_page
+ bvec_page(&ca->sb_bio.bi_inline_vecs[F])


@@
struct cache *ca;
expression E;
type T;
@@

- ca->sb_bio.bi_io_vec->bv_page = (T)E
+ bvec_set_page(ca->sb_bio.bi_io_vec, E)

@@
struct bio *bio;
expression F;
@@

- bio->bi_io_vec[F].bv_page
+ bvec_page(&bio->bi_io_vec[F])

@@
struct bio bio;
expression F;
@@

- bio.bi_io_vec[F].bv_page
+ bvec_page(&bio.bi_io_vec[F])

@@
struct bio *bio;
@@

- bio->bi_io_vec->bv_page
+ bvec_page(bio->bi_io_vec)

@@
struct cached_dev *dc;
@@

- dc->sb_bio.bi_io_vec->bv_page
+ bvec_page(dc->sb_bio.bi_io_vec)


@@
struct bio bio;
@@

- bio.bi_io_vec->bv_page
+ bvec_page(bio.bi_io_vec)

@@
struct bio_integrity_payload *bip;
expression E;
type T;
@@

- bip->bip_vec->bv_page = (T)E
+ bvec_set_page(bip->bip_vec, E)

@@
struct bio_integrity_payload *bip;
@@

- bip->bip_vec->bv_page
+ bvec_page(bip->bip_vec)

@@
struct bio_integrity_payload bip;
@@

- bip.bip_vec->bv_page
+ bvec_page(bip.bip_vec)

Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Martin K. Petersen <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 2 +
block/bio-integrity.c | 8 ++--
block/bio.c | 40 +++++++++++-----------
block/blk-core.c | 4 +-
block/blk-integrity.c | 3 +-
block/blk-lib.c | 2 +
block/blk-merge.c | 7 ++--
block/bounce.c | 24 ++++++-------
drivers/block/aoe/aoecmd.c | 8 ++--
drivers/block/brd.c | 2 +
drivers/block/drbd/drbd_bitmap.c | 5 ++-
drivers/block/drbd/drbd_main.c | 6 ++-
drivers/block/drbd/drbd_receiver.c | 4 +-
drivers/block/drbd/drbd_worker.c | 3 +-
drivers/block/floppy.c | 6 ++-
drivers/block/loop.c | 13 ++++---
drivers/block/nbd.c | 8 ++--
drivers/block/nvme-core.c | 2 +
drivers/block/pktcdvd.c | 11 +++---
drivers/block/pmem.c | 2 +
drivers/block/ps3disk.c | 2 +
drivers/block/ps3vram.c | 2 +
drivers/block/rbd.c | 2 +
drivers/block/rsxx/dma.c | 2 +
drivers/block/umem.c | 2 +
drivers/block/zram/zram_drv.c | 10 +++--
drivers/md/bcache/btree.c | 2 +
drivers/md/bcache/debug.c | 6 ++-
drivers/md/bcache/movinggc.c | 2 +
drivers/md/bcache/request.c | 6 ++-
drivers/md/bcache/super.c | 10 +++--
drivers/md/bcache/util.c | 5 +--
drivers/md/bcache/writeback.c | 2 +
drivers/md/dm-crypt.c | 12 +++---
drivers/md/dm-io.c | 2 +
drivers/md/dm-log-writes.c | 14 ++++----
drivers/md/dm-verity.c | 2 +
drivers/md/raid1.c | 50 ++++++++++++++-------------
drivers/md/raid10.c | 38 ++++++++++-----------
drivers/md/raid5.c | 6 ++-
drivers/s390/block/dasd_diag.c | 2 +
drivers/s390/block/dasd_eckd.c | 14 ++++----
drivers/s390/block/dasd_fba.c | 6 ++-
drivers/s390/block/dcssblk.c | 2 +
drivers/s390/block/scm_blk.c | 2 +
drivers/s390/block/scm_blk_cluster.c | 2 +
drivers/s390/block/xpram.c | 2 +
drivers/scsi/mpt2sas/mpt2sas_transport.c | 6 ++-
drivers/scsi/mpt3sas/mpt3sas_transport.c | 6 ++-
drivers/scsi/sd_dif.c | 4 +-
drivers/staging/lustre/lustre/llite/lloop.c | 2 +
drivers/target/target_core_file.c | 4 +-
drivers/xen/biomerge.c | 4 +-
fs/9p/vfs_addr.c | 2 +
fs/btrfs/check-integrity.c | 6 ++-
fs/btrfs/compression.c | 12 +++---
fs/btrfs/disk-io.c | 5 ++-
fs/btrfs/extent_io.c | 8 ++--
fs/btrfs/file-item.c | 8 ++--
fs/btrfs/inode.c | 19 ++++++----
fs/btrfs/raid56.c | 4 +-
fs/btrfs/volumes.c | 2 +
fs/buffer.c | 4 +-
fs/direct-io.c | 2 +
fs/exofs/ore.c | 4 +-
fs/exofs/ore_raid.c | 2 +
fs/ext4/page-io.c | 2 +
fs/ext4/readpage.c | 4 +-
fs/f2fs/data.c | 4 +-
fs/f2fs/segment.c | 2 +
fs/gfs2/lops.c | 4 +-
fs/jfs/jfs_logmgr.c | 4 +-
fs/logfs/dev_bdev.c | 10 +++--
fs/mpage.c | 2 +
fs/splice.c | 2 +
include/linux/blk_types.h | 10 +++++
kernel/power/block_io.c | 2 +
mm/page_io.c | 6 ++-
net/ceph/messenger.c | 2 +
79 files changed, 275 insertions(+), 250 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ee90db17b097..9bb5da7f2c0c 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -123,7 +123,7 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
return;
}

- user_mem = page_address(vec.bv_page) + vec.bv_offset;
+ user_mem = page_address(bvec_page(&vec)) + vec.bv_offset;
if (bio_data_dir(bio) == READ)
memcpy(user_mem, (void *) phys_mem, vec.bv_len);
else
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5cbd5d9ea61d..3add34cba048 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -101,7 +101,7 @@ void bio_integrity_free(struct bio *bio)
struct bio_set *bs = bio->bi_pool;

if (bip->bip_flags & BIP_BLOCK_INTEGRITY)
- kfree(page_address(bip->bip_vec->bv_page) +
+ kfree(page_address(bvec_page(bip->bip_vec)) +
bip->bip_vec->bv_offset);

if (bs) {
@@ -140,7 +140,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,

iv = bip->bip_vec + bip->bip_vcnt;

- iv->bv_page = page;
+ bvec_set_page(iv, page);
iv->bv_len = len;
iv->bv_offset = offset;
bip->bip_vcnt++;
@@ -220,7 +220,7 @@ static int bio_integrity_process(struct bio *bio,
struct bio_vec bv;
struct bio_integrity_payload *bip = bio_integrity(bio);
unsigned int ret = 0;
- void *prot_buf = page_address(bip->bip_vec->bv_page) +
+ void *prot_buf = page_address(bvec_page(bip->bip_vec)) +
bip->bip_vec->bv_offset;

iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
@@ -229,7 +229,7 @@ static int bio_integrity_process(struct bio *bio,
iter.prot_buf = prot_buf;

bio_for_each_segment(bv, bio, bviter) {
- void *kaddr = kmap_atomic(bv.bv_page);
+ void *kaddr = kmap_atomic(bvec_page(&bv));

iter.data_buf = kaddr + bv.bv_offset;
iter.data_size = bv.bv_len;
diff --git a/block/bio.c b/block/bio.c
index f66a4eae16ee..7100fd6d5898 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -508,7 +508,7 @@ void zero_fill_bio(struct bio *bio)
bio_for_each_segment(bv, bio, iter) {
char *data = bvec_kmap_irq(&bv, &flags);
memset(data, 0, bv.bv_len);
- flush_dcache_page(bv.bv_page);
+ flush_dcache_page(bvec_page(&bv));
bvec_kunmap_irq(data, &flags);
}
}
@@ -723,7 +723,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
if (bio->bi_vcnt > 0) {
struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];

- if (page == prev->bv_page &&
+ if (page == bvec_page(prev) &&
offset == prev->bv_offset + prev->bv_len) {
unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
@@ -768,7 +768,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
* cannot add the page
*/
bvec = &bio->bi_io_vec[bio->bi_vcnt];
- bvec->bv_page = page;
+ bvec_set_page(bvec, page);
bvec->bv_len = len;
bvec->bv_offset = offset;
bio->bi_vcnt++;
@@ -818,7 +818,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
return len;

failed:
- bvec->bv_page = NULL;
+ bvec_set_page(bvec, NULL);
bvec->bv_len = 0;
bvec->bv_offset = 0;
bio->bi_vcnt--;
@@ -948,10 +948,10 @@ int bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
struct bio_vec *bv;

bio_for_each_segment_all(bv, bio, i) {
- bv->bv_page = alloc_page(gfp_mask);
- if (!bv->bv_page) {
+ bvec_set_page(bv, alloc_page(gfp_mask));
+ if (!bvec_page(bv)) {
while (--bv >= bio->bi_io_vec)
- __free_page(bv->bv_page);
+ __free_page(bvec_page(bv));
return -ENOMEM;
}
}
@@ -1004,8 +1004,8 @@ void bio_copy_data(struct bio *dst, struct bio *src)

bytes = min(src_bv.bv_len, dst_bv.bv_len);

- src_p = kmap_atomic(src_bv.bv_page);
- dst_p = kmap_atomic(dst_bv.bv_page);
+ src_p = kmap_atomic(bvec_page(&src_bv));
+ dst_p = kmap_atomic(bvec_page(&dst_bv));

memcpy(dst_p + dst_bv.bv_offset,
src_p + src_bv.bv_offset,
@@ -1052,7 +1052,7 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter iter)
bio_for_each_segment_all(bvec, bio, i) {
ssize_t ret;

- ret = copy_page_from_iter(bvec->bv_page,
+ ret = copy_page_from_iter(bvec_page(bvec),
bvec->bv_offset,
bvec->bv_len,
&iter);
@@ -1083,7 +1083,7 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
bio_for_each_segment_all(bvec, bio, i) {
ssize_t ret;

- ret = copy_page_to_iter(bvec->bv_page,
+ ret = copy_page_to_iter(bvec_page(bvec),
bvec->bv_offset,
bvec->bv_len,
&iter);
@@ -1104,7 +1104,7 @@ static void bio_free_pages(struct bio *bio)
int i;

bio_for_each_segment_all(bvec, bio, i)
- __free_page(bvec->bv_page);
+ __free_page(bvec_page(bvec));
}

/**
@@ -1406,9 +1406,9 @@ static void __bio_unmap_user(struct bio *bio)
*/
bio_for_each_segment_all(bvec, bio, i) {
if (bio_data_dir(bio) == READ)
- set_page_dirty_lock(bvec->bv_page);
+ set_page_dirty_lock(bvec_page(bvec));

- page_cache_release(bvec->bv_page);
+ page_cache_release(bvec_page(bvec));
}

bio_put(bio);
@@ -1499,7 +1499,7 @@ static void bio_copy_kern_endio_read(struct bio *bio, int err)
int i;

bio_for_each_segment_all(bvec, bio, i) {
- memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
+ memcpy(p, page_address(bvec_page(bvec)), bvec->bv_len);
p += bvec->bv_len;
}

@@ -1611,7 +1611,7 @@ void bio_set_pages_dirty(struct bio *bio)
int i;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

if (page && !PageCompound(page))
set_page_dirty_lock(page);
@@ -1624,7 +1624,7 @@ static void bio_release_pages(struct bio *bio)
int i;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

if (page)
put_page(page);
@@ -1678,11 +1678,11 @@ void bio_check_pages_dirty(struct bio *bio)
int i;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

if (PageDirty(page) || PageCompound(page)) {
page_cache_release(page);
- bvec->bv_page = NULL;
+ bvec_set_page(bvec, NULL);
} else {
nr_clean_pages++;
}
@@ -1736,7 +1736,7 @@ void bio_flush_dcache_pages(struct bio *bi)
struct bvec_iter iter;

bio_for_each_segment(bvec, bi, iter)
- flush_dcache_page(bvec.bv_page);
+ flush_dcache_page(bvec_page(&bvec));
}
EXPORT_SYMBOL(bio_flush_dcache_pages);
#endif
diff --git a/block/blk-core.c b/block/blk-core.c
index fd154b94447a..94d2c6ccf801 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1442,7 +1442,7 @@ void blk_add_request_payload(struct request *rq, struct page *page,
{
struct bio *bio = rq->bio;

- bio->bi_io_vec->bv_page = page;
+ bvec_set_page(bio->bi_io_vec, page);
bio->bi_io_vec->bv_offset = 0;
bio->bi_io_vec->bv_len = len;

@@ -2868,7 +2868,7 @@ void rq_flush_dcache_pages(struct request *rq)
struct bio_vec bvec;

rq_for_each_segment(bvec, rq, iter)
- flush_dcache_page(bvec.bv_page);
+ flush_dcache_page(bvec_page(&bvec));
}
EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
#endif
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 79ffb4855af0..0458f31f075a 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -117,7 +117,8 @@ new_segment:
sg = sg_next(sg);
}

- sg_set_page(sg, iv.bv_page, iv.bv_len, iv.bv_offset);
+ sg_set_page(sg, bvec_page(&iv),
+ iv.bv_len, iv.bv_offset);
segments++;
}

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3f5d72..7931a09f86d6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -187,7 +187,7 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
bio->bi_bdev = bdev;
bio->bi_private = &bb;
bio->bi_vcnt = 1;
- bio->bi_io_vec->bv_page = page;
+ bvec_set_page(bio->bi_io_vec, page);
bio->bi_io_vec->bv_offset = 0;
bio->bi_io_vec->bv_len = bdev_logical_block_size(bdev);

diff --git a/block/blk-merge.c b/block/blk-merge.c
index fd3fee81c23c..47ceefacd320 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -51,7 +51,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
* never considered part of another segment, since
* that might change with the bounce page.
*/
- high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
+ high = page_to_pfn(bvec_page(&bv)) > queue_bounce_pfn(q);
if (!high && !highprv && cluster) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -192,7 +192,7 @@ new_segment:
*sg = sg_next(*sg);
}

- sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+ sg_set_page(*sg, bvec_page(bvec), nbytes, bvec->bv_offset);
(*nsegs)++;
}
*bvprv = *bvec;
@@ -228,7 +228,8 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
single_segment:
*sg = sglist;
bvec = bio_iovec(bio);
- sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
+ sg_set_page(*sg, bvec_page(&bvec),
+ bvec.bv_len, bvec.bv_offset);
return 1;
}

diff --git a/block/bounce.c b/block/bounce.c
index ab21ba203d5c..0390e44d6e1b 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -55,7 +55,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
unsigned char *vto;

local_irq_save(flags);
- vto = kmap_atomic(to->bv_page);
+ vto = kmap_atomic(bvec_page(to));
memcpy(vto + to->bv_offset, vfrom, to->bv_len);
kunmap_atomic(vto);
local_irq_restore(flags);
@@ -105,17 +105,17 @@ static void copy_to_high_bio_irq(struct bio *to, struct bio *from)
struct bvec_iter iter;

bio_for_each_segment(tovec, to, iter) {
- if (tovec.bv_page != fromvec->bv_page) {
+ if (bvec_page(&tovec) != bvec_page(fromvec)) {
/*
* fromvec->bv_offset and fromvec->bv_len might have
* been modified by the block layer, so use the original
* copy, bounce_copy_vec already uses tovec->bv_len
*/
- vfrom = page_address(fromvec->bv_page) +
+ vfrom = page_address(bvec_page(fromvec)) +
tovec.bv_offset;

bounce_copy_vec(&tovec, vfrom);
- flush_dcache_page(tovec.bv_page);
+ flush_dcache_page(bvec_page(&tovec));
}

fromvec++;
@@ -136,11 +136,11 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err)
*/
bio_for_each_segment_all(bvec, bio, i) {
org_vec = bio_orig->bi_io_vec + i;
- if (bvec->bv_page == org_vec->bv_page)
+ if (bvec_page(bvec) == bvec_page(org_vec))
continue;

- dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
- mempool_free(bvec->bv_page, pool);
+ dec_zone_page_state(bvec_page(bvec), NR_BOUNCE);
+ mempool_free(bvec_page(bvec), pool);
}

bio_endio(bio_orig, err);
@@ -208,7 +208,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
if (force)
goto bounce;
bio_for_each_segment(from, *bio_orig, iter)
- if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
+ if (page_to_pfn(bvec_page(&from)) > queue_bounce_pfn(q))
goto bounce;

return;
@@ -216,20 +216,20 @@ bounce:
bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);

bio_for_each_segment_all(to, bio, i) {
- struct page *page = to->bv_page;
+ struct page *page = bvec_page(to);

if (page_to_pfn(page) <= queue_bounce_pfn(q) && !force)
continue;

- inc_zone_page_state(to->bv_page, NR_BOUNCE);
- to->bv_page = mempool_alloc(pool, q->bounce_gfp);
+ inc_zone_page_state(bvec_page(to), NR_BOUNCE);
+ bvec_set_page(to, mempool_alloc(pool, q->bounce_gfp));

if (rw == WRITE) {
char *vto, *vfrom;

flush_dcache_page(page);

- vto = page_address(to->bv_page) + to->bv_offset;
+ vto = page_address(bvec_page(to)) + to->bv_offset;
vfrom = kmap_atomic(page) + to->bv_offset;
memcpy(vto, vfrom, to->bv_len);
kunmap_atomic(vfrom);
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 422b7d84f686..f0cbfe8c4bd8 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -300,7 +300,7 @@ skb_fillup(struct sk_buff *skb, struct bio *bio, struct bvec_iter iter)
struct bio_vec bv;

__bio_for_each_segment(bv, bio, iter, iter)
- skb_fill_page_desc(skb, frag++, bv.bv_page,
+ skb_fill_page_desc(skb, frag++, bvec_page(&bv),
bv.bv_offset, bv.bv_len);
}

@@ -874,7 +874,7 @@ bio_pageinc(struct bio *bio)
/* Non-zero page count for non-head members of
* compound pages is no longer allowed by the kernel.
*/
- page = compound_head(bv.bv_page);
+ page = compound_head(bvec_page(&bv));
atomic_inc(&page->_count);
}
}
@@ -887,7 +887,7 @@ bio_pagedec(struct bio *bio)
struct bvec_iter iter;

bio_for_each_segment(bv, bio, iter) {
- page = compound_head(bv.bv_page);
+ page = compound_head(bvec_page(&bv));
atomic_dec(&page->_count);
}
}
@@ -1092,7 +1092,7 @@ bvcpy(struct sk_buff *skb, struct bio *bio, struct bvec_iter iter, long cnt)
iter.bi_size = cnt;

__bio_for_each_segment(bv, bio, iter, iter) {
- char *p = page_address(bv.bv_page) + bv.bv_offset;
+ char *p = page_address(bvec_page(&bv)) + bv.bv_offset;
skb_copy_bits(skb, soff, p, bv.bv_len);
soff += bv.bv_len;
}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 64ab4951e9d6..115c6cf9cb43 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -349,7 +349,7 @@ static void brd_make_request(struct request_queue *q, struct bio *bio)

bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
- err = brd_do_bvec(brd, bvec.bv_page, len,
+ err = brd_do_bvec(brd, bvec_page(&bvec), len,
bvec.bv_offset, rw, sector);
if (err)
break;
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 434c77dcc99e..37ba0f533e4b 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -946,7 +946,7 @@ static void drbd_bm_endio(struct bio *bio, int error)
struct drbd_bm_aio_ctx *ctx = bio->bi_private;
struct drbd_device *device = ctx->device;
struct drbd_bitmap *b = device->bitmap;
- unsigned int idx = bm_page_to_idx(bio->bi_io_vec[0].bv_page);
+ unsigned int idx = bm_page_to_idx(bvec_page(&bio->bi_io_vec[0]));
int uptodate = bio_flagged(bio, BIO_UPTODATE);


@@ -979,7 +979,8 @@ static void drbd_bm_endio(struct bio *bio, int error)
bm_page_unlock_io(device, idx);

if (ctx->flags & BM_AIO_COPY_PAGES)
- mempool_free(bio->bi_io_vec[0].bv_page, drbd_md_io_page_pool);
+ mempool_free(bvec_page(&bio->bi_io_vec[0]),
+ drbd_md_io_page_pool);

bio_put(bio);

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 81fde9ef7f8e..dc759609b2a6 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1554,7 +1554,8 @@ static int _drbd_send_bio(struct drbd_peer_device *peer_device, struct bio *bio)
bio_for_each_segment(bvec, bio, iter) {
int err;

- err = _drbd_no_send_page(peer_device, bvec.bv_page,
+ err = _drbd_no_send_page(peer_device,
+ bvec_page(&bvec),
bvec.bv_offset, bvec.bv_len,
bio_iter_last(bvec, iter)
? 0 : MSG_MORE);
@@ -1573,7 +1574,8 @@ static int _drbd_send_zc_bio(struct drbd_peer_device *peer_device, struct bio *b
bio_for_each_segment(bvec, bio, iter) {
int err;

- err = _drbd_send_page(peer_device, bvec.bv_page,
+ err = _drbd_send_page(peer_device,
+ bvec_page(&bvec),
bvec.bv_offset, bvec.bv_len,
bio_iter_last(bvec, iter) ? 0 : MSG_MORE);
if (err)
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index cee20354ac37..b4f16c6a0d73 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1729,10 +1729,10 @@ static int recv_dless_read(struct drbd_peer_device *peer_device, struct drbd_req
D_ASSERT(peer_device->device, sector == bio->bi_iter.bi_sector);

bio_for_each_segment(bvec, bio, iter) {
- void *mapped = kmap(bvec.bv_page) + bvec.bv_offset;
+ void *mapped = kmap(bvec_page(&bvec)) + bvec.bv_offset;
expect = min_t(int, data_size, bvec.bv_len);
err = drbd_recv_all_warn(peer_device->connection, mapped, expect);
- kunmap(bvec.bv_page);
+ kunmap(bvec_page(&bvec));
if (err)
return err;
data_size -= expect;
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index d0fae55d871d..d4b6e432bf35 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -332,7 +332,8 @@ void drbd_csum_bio(struct crypto_hash *tfm, struct bio *bio, void *digest)
crypto_hash_init(&desc);

bio_for_each_segment(bvec, bio, iter) {
- sg_set_page(&sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
+ sg_set_page(&sg, bvec_page(&bvec),
+ bvec.bv_len, bvec.bv_offset);
crypto_hash_update(&desc, &sg, sg.length);
}
crypto_hash_final(&desc, digest);
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index a08cda955285..6eae02e31731 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -2374,7 +2374,7 @@ static int buffer_chain_size(void)
size = 0;

rq_for_each_segment(bv, current_req, iter) {
- if (page_address(bv.bv_page) + bv.bv_offset != base + size)
+ if (page_address(bvec_page(&bv)) + bv.bv_offset != base + size)
break;

size += bv.bv_len;
@@ -2444,7 +2444,7 @@ static void copy_buffer(int ssize, int max_sector, int max_sector_2)
size = bv.bv_len;
SUPBOUND(size, remaining);

- buffer = page_address(bv.bv_page) + bv.bv_offset;
+ buffer = page_address(bvec_page(&bv)) + bv.bv_offset;
if (dma_buffer + size >
floppy_track_buffer + (max_buffer_sectors << 10) ||
dma_buffer < floppy_track_buffer) {
@@ -3805,7 +3805,7 @@ static int __floppy_read_block_0(struct block_device *bdev, int drive)

bio_init(&bio);
bio.bi_io_vec = &bio_vec;
- bio_vec.bv_page = page;
+ bvec_set_page(&bio_vec, page);
bio_vec.bv_len = size;
bio_vec.bv_offset = 0;
bio.bi_vcnt = 1;
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ae3fcb4199e9..08a52b42126a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -261,12 +261,13 @@ static int lo_write_transfer(struct loop_device *lo, struct request *rq,
return -ENOMEM;

rq_for_each_segment(bvec, rq, iter) {
- ret = lo_do_transfer(lo, WRITE, page, 0, bvec.bv_page,
+ ret = lo_do_transfer(lo, WRITE, page, 0,
+ bvec_page(&bvec),
bvec.bv_offset, bvec.bv_len, pos >> 9);
if (unlikely(ret))
break;

- b.bv_page = page;
+ bvec_set_page(&b, page);
b.bv_offset = 0;
b.bv_len = bvec.bv_len;
ret = lo_write_bvec(lo->lo_backing_file, &b, &pos);
@@ -292,7 +293,7 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq,
if (len < 0)
return len;

- flush_dcache_page(bvec.bv_page);
+ flush_dcache_page(bvec_page(&bvec));

if (len != bvec.bv_len) {
struct bio *bio;
@@ -324,7 +325,7 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
rq_for_each_segment(bvec, rq, iter) {
loff_t offset = pos;

- b.bv_page = page;
+ bvec_set_page(&b, page);
b.bv_offset = 0;
b.bv_len = bvec.bv_len;

@@ -335,12 +336,12 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
goto out_free_page;
}

- ret = lo_do_transfer(lo, READ, page, 0, bvec.bv_page,
+ ret = lo_do_transfer(lo, READ, page, 0, bvec_page(&bvec),
bvec.bv_offset, len, offset >> 9);
if (ret)
goto out_free_page;

- flush_dcache_page(bvec.bv_page);
+ flush_dcache_page(bvec_page(&bvec));

if (len != bvec.bv_len) {
struct bio *bio;
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 39e5f7fae3ef..dbab11437d2e 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -217,10 +217,10 @@ static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec *bvec,
int flags)
{
int result;
- void *kaddr = kmap(bvec->bv_page);
+ void *kaddr = kmap(bvec_page(bvec));
result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
bvec->bv_len, flags);
- kunmap(bvec->bv_page);
+ kunmap(bvec_page(bvec));
return result;
}

@@ -303,10 +303,10 @@ static struct request *nbd_find_request(struct nbd_device *nbd,
static inline int sock_recv_bvec(struct nbd_device *nbd, struct bio_vec *bvec)
{
int result;
- void *kaddr = kmap(bvec->bv_page);
+ void *kaddr = kmap(bvec_page(bvec));
result = sock_xmit(nbd, 0, kaddr + bvec->bv_offset, bvec->bv_len,
MSG_WAITALL);
- kunmap(bvec->bv_page);
+ kunmap(bvec_page(bvec));
return result;
}

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 85b8036deaa3..2727840266bf 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -516,7 +516,7 @@ static void nvme_dif_remap(struct request *req,
if (!bip)
return;

- pmap = kmap_atomic(bip->bip_vec->bv_page) + bip->bip_vec->bv_offset;
+ pmap = kmap_atomic(bvec_page(bip->bip_vec)) + bip->bip_vec->bv_offset;

p = pmap;
virt = bip_get_seed(bip);
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628dafd9d..c873290bd8bb 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -958,12 +958,12 @@ static void pkt_make_local_copy(struct packet_data *pkt, struct bio_vec *bvec)
p = 0;
offs = 0;
for (f = 0; f < pkt->frames; f++) {
- if (bvec[f].bv_page != pkt->pages[p]) {
- void *vfrom = kmap_atomic(bvec[f].bv_page) + bvec[f].bv_offset;
+ if (bvec_page(&bvec[f]) != pkt->pages[p]) {
+ void *vfrom = kmap_atomic(bvec_page(&bvec[f])) + bvec[f].bv_offset;
void *vto = page_address(pkt->pages[p]) + offs;
memcpy(vto, vfrom, CD_FRAMESIZE);
kunmap_atomic(vfrom);
- bvec[f].bv_page = pkt->pages[p];
+ bvec_set_page(&bvec[f], pkt->pages[p]);
bvec[f].bv_offset = offs;
} else {
BUG_ON(bvec[f].bv_offset != offs);
@@ -1307,9 +1307,10 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)

/* XXX: locking? */
for (f = 0; f < pkt->frames; f++) {
- bvec[f].bv_page = pkt->pages[(f * CD_FRAMESIZE) / PAGE_SIZE];
+ bvec_set_page(&bvec[f],
+ pkt->pages[(f * CD_FRAMESIZE) / PAGE_SIZE]);
bvec[f].bv_offset = (f * CD_FRAMESIZE) % PAGE_SIZE;
- if (!bio_add_page(pkt->w_bio, bvec[f].bv_page, CD_FRAMESIZE, bvec[f].bv_offset))
+ if (!bio_add_page(pkt->w_bio, bvec_page(&bvec[f]), CD_FRAMESIZE, bvec[f].bv_offset))
BUG();
}
pkt_dbg(2, pd, "vcnt=%d\n", pkt->w_bio->bi_vcnt);
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index eabf4a8d0085..41bb424533e6 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -77,7 +77,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
rw = bio_data_dir(bio);
sector = bio->bi_iter.bi_sector;
bio_for_each_segment(bvec, bio, iter) {
- pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
+ pmem_do_bvec(pmem, bvec_page(&bvec), bvec.bv_len, bvec.bv_offset,
rw, sector);
sector += bvec.bv_len >> 9;
}
diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
index c120d70d3fb3..07ad0d9d9480 100644
--- a/drivers/block/ps3disk.c
+++ b/drivers/block/ps3disk.c
@@ -112,7 +112,7 @@ static void ps3disk_scatter_gather(struct ps3_storage_device *dev,
else
memcpy(buf, dev->bounce_buf+offset, size);
offset += size;
- flush_kernel_dcache_page(bvec.bv_page);
+ flush_kernel_dcache_page(bvec_page(&bvec));
bvec_kunmap_irq(buf, &flags);
i++;
}
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index ef45cfb98fd2..5db3311c2865 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -561,7 +561,7 @@ static struct bio *ps3vram_do_bio(struct ps3_system_bus_device *dev,

bio_for_each_segment(bvec, bio, iter) {
/* PS3 is ppc64, so we don't handle highmem */
- char *ptr = page_address(bvec.bv_page) + bvec.bv_offset;
+ char *ptr = page_address(bvec_page(&bvec)) + bvec.bv_offset;
size_t len = bvec.bv_len, retlen;

dev_dbg(&dev->core, " %s %zu bytes at offset %llu\n", op,
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ec6c5c6e1ac9..8aa209d929d4 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1257,7 +1257,7 @@ static void zero_bio_chain(struct bio *chain, int start_ofs)
buf = bvec_kmap_irq(&bv, &flags);
memset(buf + remainder, 0,
bv.bv_len - remainder);
- flush_dcache_page(bv.bv_page);
+ flush_dcache_page(bvec_page(&bv));
bvec_kunmap_irq(buf, &flags);
}
pos += bv.bv_len;
diff --git a/drivers/block/rsxx/dma.c b/drivers/block/rsxx/dma.c
index cf8cd293abb5..6a7e128f9c32 100644
--- a/drivers/block/rsxx/dma.c
+++ b/drivers/block/rsxx/dma.c
@@ -737,7 +737,7 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
st = rsxx_queue_dma(card, &dma_list[tgt],
bio_data_dir(bio),
dma_off, dma_len,
- laddr, bvec.bv_page,
+ laddr, bvec_page(&bvec),
bv_off, cb, cb_data);
if (st)
goto bvec_err;
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4cf81b5bf0f7..c7f65e4ec874 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -366,7 +366,7 @@ static int add_bio(struct cardinfo *card)
vec = bio_iter_iovec(bio, card->current_iter);

dma_handle = pci_map_page(card->dev,
- vec.bv_page,
+ bvec_page(&vec),
vec.bv_offset,
vec.bv_len,
(rw == READ) ?
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index c94386aa563d..79e3b33c736c 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -409,7 +409,7 @@ static int page_zero_filled(void *ptr)

static void handle_zero_page(struct bio_vec *bvec)
{
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);
void *user_mem;

user_mem = kmap_atomic(page);
@@ -497,7 +497,7 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
struct page *page;
unsigned char *user_mem, *uncmem = NULL;
struct zram_meta *meta = zram->meta;
- page = bvec->bv_page;
+ page = bvec_page(bvec);

bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
if (unlikely(!meta->table[index].handle) ||
@@ -568,7 +568,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
bool locked = false;
unsigned long alloced_pages;

- page = bvec->bv_page;
+ page = bvec_page(bvec);
if (is_partial_io(bvec)) {
/*
* This is a partial IO. We need to read the full page
@@ -924,7 +924,7 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
*/
struct bio_vec bv;

- bv.bv_page = bvec.bv_page;
+ bvec_set_page(&bv, bvec_page(&bvec));
bv.bv_len = max_transfer_size;
bv.bv_offset = bvec.bv_offset;

@@ -1011,7 +1011,7 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
index = sector >> SECTORS_PER_PAGE_SHIFT;
offset = sector & (SECTORS_PER_PAGE - 1) << SECTOR_SHIFT;

- bv.bv_page = page;
+ bvec_set_page(&bv, page);
bv.bv_len = PAGE_SIZE;
bv.bv_offset = 0;

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 00cde40db572..2e76e8b62902 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -366,7 +366,7 @@ static void btree_node_write_done(struct closure *cl)
int n;

bio_for_each_segment_all(bv, b->bio, n)
- __free_page(bv->bv_page);
+ __free_page(bvec_page(bv));

__btree_node_write_done(cl);
}
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 8b1f1d5c1819..c355a02b94dd 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -120,8 +120,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
submit_bio_wait(READ_SYNC, check);

bio_for_each_segment(bv, bio, iter) {
- void *p1 = kmap_atomic(bv.bv_page);
- void *p2 = page_address(check->bi_io_vec[iter.bi_idx].bv_page);
+ void *p1 = kmap_atomic(bvec_page(&bv));
+ void *p2 = page_address(bvec_page(&check->bi_io_vec[iter.bi_idx]));

cache_set_err_on(memcmp(p1 + bv.bv_offset,
p2 + bv.bv_offset,
@@ -135,7 +135,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
}

bio_for_each_segment_all(bv2, check, i)
- __free_page(bv2->bv_page);
+ __free_page(bvec_page(bv2));
out_put:
bio_put(check);
}
diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index cd7490311e51..744e7af4b160 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -48,7 +48,7 @@ static void write_moving_finish(struct closure *cl)
int i;

bio_for_each_segment_all(bv, bio, i)
- __free_page(bv->bv_page);
+ __free_page(bvec_page(bv));

if (io->op.replace_collision)
trace_bcache_gc_copy_collision(&io->w->key);
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ab43faddb447..e6378a998618 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -42,9 +42,9 @@ static void bio_csum(struct bio *bio, struct bkey *k)
uint64_t csum = 0;

bio_for_each_segment(bv, bio, iter) {
- void *d = kmap(bv.bv_page) + bv.bv_offset;
+ void *d = kmap(bvec_page(&bv)) + bv.bv_offset;
csum = bch_crc64_update(csum, d, bv.bv_len);
- kunmap(bv.bv_page);
+ kunmap(bvec_page(&bv));
}

k->ptr[KEY_PTRS(k)] = csum & (~0ULL >> 1);
@@ -690,7 +690,7 @@ static void cached_dev_cache_miss_done(struct closure *cl)
struct bio_vec *bv;

bio_for_each_segment_all(bv, s->iop.bio, i)
- __free_page(bv->bv_page);
+ __free_page(bvec_page(bv));
}

cached_dev_bio_complete(cl);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4dd2bb7167f0..8d7cbba7ff7e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -231,7 +231,7 @@ static void write_bdev_super_endio(struct bio *bio, int error)

static void __write_super(struct cache_sb *sb, struct bio *bio)
{
- struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
+ struct cache_sb *out = page_address(bvec_page(&bio->bi_io_vec[0]));
unsigned i;

bio->bi_iter.bi_sector = SB_SECTOR;
@@ -1172,7 +1172,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
bio_init(&dc->sb_bio);
dc->sb_bio.bi_max_vecs = 1;
dc->sb_bio.bi_io_vec = dc->sb_bio.bi_inline_vecs;
- dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
+ bvec_set_page(dc->sb_bio.bi_io_vec, sb_page);
get_page(sb_page);

if (cached_dev_init(dc, sb->block_size << 9))
@@ -1811,8 +1811,8 @@ void bch_cache_release(struct kobject *kobj)
for (i = 0; i < RESERVE_NR; i++)
free_fifo(&ca->free[i]);

- if (ca->sb_bio.bi_inline_vecs[0].bv_page)
- put_page(ca->sb_bio.bi_io_vec[0].bv_page);
+ if (bvec_page(&ca->sb_bio.bi_inline_vecs[0]))
+ put_page(bvec_page(&ca->sb_bio.bi_io_vec[0]));

if (!IS_ERR_OR_NULL(ca->bdev))
blkdev_put(ca->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
@@ -1870,7 +1870,7 @@ static void register_cache(struct cache_sb *sb, struct page *sb_page,
bio_init(&ca->sb_bio);
ca->sb_bio.bi_max_vecs = 1;
ca->sb_bio.bi_io_vec = ca->sb_bio.bi_inline_vecs;
- ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
+ bvec_set_page(&ca->sb_bio.bi_io_vec[0], sb_page);
get_page(sb_page);

if (blk_queue_discard(bdev_get_queue(ca->bdev)))
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index db3ae4c2b223..d02f6d626529 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -238,9 +238,8 @@ void bch_bio_map(struct bio *bio, void *base)
start: bv->bv_len = min_t(size_t, PAGE_SIZE - bv->bv_offset,
size);
if (base) {
- bv->bv_page = is_vmalloc_addr(base)
- ? vmalloc_to_page(base)
- : virt_to_page(base);
+ bvec_set_page(bv,
+ is_vmalloc_addr(base) ? vmalloc_to_page(base) : virt_to_page(base));

base += bv->bv_len;
}
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f1986bcd1bf0..6e9901c5dd66 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -133,7 +133,7 @@ static void write_dirty_finish(struct closure *cl)
int i;

bio_for_each_segment_all(bv, &io->bio, i)
- __free_page(bv->bv_page);
+ __free_page(bvec_page(bv));

/* This is kind of a dumb way of signalling errors. */
if (KEY_DIRTY(&w->key)) {
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 9eeea196328a..61784d3e9ac3 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -849,11 +849,11 @@ static int crypt_convert_block(struct crypt_config *cc,
dmreq->iv_sector = ctx->cc_sector;
dmreq->ctx = ctx;
sg_init_table(&dmreq->sg_in, 1);
- sg_set_page(&dmreq->sg_in, bv_in.bv_page, 1 << SECTOR_SHIFT,
+ sg_set_page(&dmreq->sg_in, bvec_page(&bv_in), 1 << SECTOR_SHIFT,
bv_in.bv_offset);

sg_init_table(&dmreq->sg_out, 1);
- sg_set_page(&dmreq->sg_out, bv_out.bv_page, 1 << SECTOR_SHIFT,
+ sg_set_page(&dmreq->sg_out, bvec_page(&bv_out), 1 << SECTOR_SHIFT,
bv_out.bv_offset);

bio_advance_iter(ctx->bio_in, &ctx->iter_in, 1 << SECTOR_SHIFT);
@@ -1002,7 +1002,7 @@ retry:
len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;

bvec = &clone->bi_io_vec[clone->bi_vcnt++];
- bvec->bv_page = page;
+ bvec_set_page(bvec, page);
bvec->bv_len = len;
bvec->bv_offset = 0;

@@ -1024,9 +1024,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
struct bio_vec *bv;

bio_for_each_segment_all(bv, clone, i) {
- BUG_ON(!bv->bv_page);
- mempool_free(bv->bv_page, cc->page_pool);
- bv->bv_page = NULL;
+ BUG_ON(!bvec_page(bv));
+ mempool_free(bvec_page(bv), cc->page_pool);
+ bvec_set_page(bv, NULL);
}
}

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 74adcd2c967e..b0537d3073a2 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -204,7 +204,7 @@ static void bio_get_page(struct dpages *dp, struct page **p,
unsigned long *len, unsigned *offset)
{
struct bio_vec *bvec = dp->context_ptr;
- *p = bvec->bv_page;
+ *p = bvec_page(bvec);
*len = bvec->bv_len - dp->context_u;
*offset = bvec->bv_offset + dp->context_u;
}
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 93e08446a87d..d015f29b4a1c 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -162,7 +162,7 @@ static void log_end_io(struct bio *bio, int err)
}

bio_for_each_segment_all(bvec, bio, i)
- __free_page(bvec->bv_page);
+ __free_page(bvec_page(bvec));

put_io_block(lc);
bio_put(bio);
@@ -178,8 +178,8 @@ static void free_pending_block(struct log_writes_c *lc,
int i;

for (i = 0; i < block->vec_cnt; i++) {
- if (block->vecs[i].bv_page)
- __free_page(block->vecs[i].bv_page);
+ if (bvec_page(&block->vecs[i]))
+ __free_page(bvec_page(&block->vecs[i]));
}
kfree(block->data);
kfree(block);
@@ -277,7 +277,7 @@ static int log_one_block(struct log_writes_c *lc,
* The page offset is always 0 because we allocate a new page
* for every bvec in the original bio for simplicity sake.
*/
- ret = bio_add_page(bio, block->vecs[i].bv_page,
+ ret = bio_add_page(bio, bvec_page(&block->vecs[i]),
block->vecs[i].bv_len, 0);
if (ret != block->vecs[i].bv_len) {
atomic_inc(&lc->io_blocks);
@@ -294,7 +294,7 @@ static int log_one_block(struct log_writes_c *lc,
bio->bi_private = lc;
set_bit(BIO_UPTODATE, &bio->bi_flags);

- ret = bio_add_page(bio, block->vecs[i].bv_page,
+ ret = bio_add_page(bio, bvec_page(&block->vecs[i]),
block->vecs[i].bv_len, 0);
if (ret != block->vecs[i].bv_len) {
DMERR("Couldn't add page on new bio?");
@@ -641,12 +641,12 @@ static int log_writes_map(struct dm_target *ti, struct bio *bio)
return -ENOMEM;
}

- src = kmap_atomic(bv.bv_page);
+ src = kmap_atomic(bvec_page(&bv));
dst = kmap_atomic(page);
memcpy(dst, src + bv.bv_offset, bv.bv_len);
kunmap_atomic(dst);
kunmap_atomic(src);
- block->vecs[i].bv_page = page;
+ bvec_set_page(&block->vecs[i], page);
block->vecs[i].bv_len = bv.bv_len;
block->vec_cnt++;
i++;
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 66616db33e6f..d56914eac6f2 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -408,7 +408,7 @@ test_block_hash:
unsigned len;
struct bio_vec bv = bio_iter_iovec(bio, io->iter);

- page = kmap_atomic(bv.bv_page);
+ page = kmap_atomic(bvec_page(&bv));
len = bv.bv_len;
if (likely(len >= todo))
len = todo;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 9157a29c8dbf..78bc83fab933 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -134,8 +134,8 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
if (!test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery)) {
for (i=0; i<RESYNC_PAGES ; i++)
for (j=1; j<pi->raid_disks; j++)
- r1_bio->bios[j]->bi_io_vec[i].bv_page =
- r1_bio->bios[0]->bi_io_vec[i].bv_page;
+ bvec_set_page(&r1_bio->bios[j]->bi_io_vec[i],
+ bvec_page(&r1_bio->bios[0]->bi_io_vec[i]));
}

r1_bio->master_bio = NULL;
@@ -147,7 +147,7 @@ out_free_pages:
struct bio_vec *bv;

bio_for_each_segment_all(bv, r1_bio->bios[j], i)
- __free_page(bv->bv_page);
+ __free_page(bvec_page(bv));
}

out_free_bio:
@@ -166,9 +166,9 @@ static void r1buf_pool_free(void *__r1_bio, void *data)
for (i = 0; i < RESYNC_PAGES; i++)
for (j = pi->raid_disks; j-- ;) {
if (j == 0 ||
- r1bio->bios[j]->bi_io_vec[i].bv_page !=
- r1bio->bios[0]->bi_io_vec[i].bv_page)
- safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
+ bvec_page(&r1bio->bios[j]->bi_io_vec[i]) !=
+ bvec_page(&r1bio->bios[0]->bi_io_vec[i]))
+ safe_put_page(bvec_page(&r1bio->bios[j]->bi_io_vec[i]));
}
for (i=0 ; i < pi->raid_disks; i++)
bio_put(r1bio->bios[i]);
@@ -369,7 +369,7 @@ static void close_write(struct r1bio *r1_bio)
/* free extra copy of the data pages */
int i = r1_bio->behind_page_count;
while (i--)
- safe_put_page(r1_bio->behind_bvecs[i].bv_page);
+ safe_put_page(bvec_page(&r1_bio->behind_bvecs[i]));
kfree(r1_bio->behind_bvecs);
r1_bio->behind_bvecs = NULL;
}
@@ -1010,13 +1010,13 @@ static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)

bio_for_each_segment_all(bvec, bio, i) {
bvecs[i] = *bvec;
- bvecs[i].bv_page = alloc_page(GFP_NOIO);
- if (unlikely(!bvecs[i].bv_page))
+ bvec_set_page(&bvecs[i], alloc_page(GFP_NOIO));
+ if (unlikely(!bvec_page(&bvecs[i])))
goto do_sync_io;
- memcpy(kmap(bvecs[i].bv_page) + bvec->bv_offset,
- kmap(bvec->bv_page) + bvec->bv_offset, bvec->bv_len);
- kunmap(bvecs[i].bv_page);
- kunmap(bvec->bv_page);
+ memcpy(kmap(bvec_page(&bvecs[i])) + bvec->bv_offset,
+ kmap(bvec_page(bvec)) + bvec->bv_offset, bvec->bv_len);
+ kunmap(bvec_page(&bvecs[i]));
+ kunmap(bvec_page(bvec));
}
r1_bio->behind_bvecs = bvecs;
r1_bio->behind_page_count = bio->bi_vcnt;
@@ -1025,8 +1025,8 @@ static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)

do_sync_io:
for (i = 0; i < bio->bi_vcnt; i++)
- if (bvecs[i].bv_page)
- put_page(bvecs[i].bv_page);
+ if (bvec_page(&bvecs[i]))
+ put_page(bvec_page(&bvecs[i]));
kfree(bvecs);
pr_debug("%dB behind alloc failed, doing sync I/O\n",
bio->bi_iter.bi_size);
@@ -1397,7 +1397,8 @@ read_again:
* We trimmed the bio, so _all is legit
*/
bio_for_each_segment_all(bvec, mbio, j)
- bvec->bv_page = r1_bio->behind_bvecs[j].bv_page;
+ bvec_set_page(bvec,
+ bvec_page(&r1_bio->behind_bvecs[j]));
if (test_bit(WriteMostly, &conf->mirrors[i].rdev->flags))
atomic_inc(&r1_bio->behind_remaining);
}
@@ -1861,7 +1862,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
*/
rdev = conf->mirrors[d].rdev;
if (sync_page_io(rdev, sect, s<<9,
- bio->bi_io_vec[idx].bv_page,
+ bvec_page(&bio->bi_io_vec[idx]),
READ, false)) {
success = 1;
break;
@@ -1917,7 +1918,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
continue;
rdev = conf->mirrors[d].rdev;
if (r1_sync_page_io(rdev, sect, s,
- bio->bi_io_vec[idx].bv_page,
+ bvec_page(&bio->bi_io_vec[idx]),
WRITE) == 0) {
r1_bio->bios[d]->bi_end_io = NULL;
rdev_dec_pending(rdev, mddev);
@@ -1932,7 +1933,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
continue;
rdev = conf->mirrors[d].rdev;
if (r1_sync_page_io(rdev, sect, s,
- bio->bi_io_vec[idx].bv_page,
+ bvec_page(&bio->bi_io_vec[idx]),
READ) != 0)
atomic_add(s, &rdev->corrected_errors);
}
@@ -2016,8 +2017,8 @@ static void process_checks(struct r1bio *r1_bio)
if (uptodate) {
for (j = vcnt; j-- ; ) {
struct page *p, *s;
- p = pbio->bi_io_vec[j].bv_page;
- s = sbio->bi_io_vec[j].bv_page;
+ p = bvec_page(&pbio->bi_io_vec[j]);
+ s = bvec_page(&sbio->bi_io_vec[j]);
if (memcmp(page_address(p),
page_address(s),
sbio->bi_io_vec[j].bv_len))
@@ -2226,7 +2227,7 @@ static int narrow_write_error(struct r1bio *r1_bio, int i)
unsigned vcnt = r1_bio->behind_page_count;
struct bio_vec *vec = r1_bio->behind_bvecs;

- while (!vec->bv_page) {
+ while (!bvec_page(vec)) {
vec++;
vcnt--;
}
@@ -2700,10 +2701,11 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
for (i = 0 ; i < conf->raid_disks * 2; i++) {
bio = r1_bio->bios[i];
if (bio->bi_end_io) {
- page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
+ page = bvec_page(&bio->bi_io_vec[bio->bi_vcnt]);
if (bio_add_page(bio, page, len, 0) == 0) {
/* stop here */
- bio->bi_io_vec[bio->bi_vcnt].bv_page = page;
+ bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt],
+ page);
while (i > 0) {
i--;
bio = r1_bio->bios[i];
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e793ab6b3570..61e0e6d415c7 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -181,16 +181,16 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
/* we can share bv_page's during recovery
* and reshape */
struct bio *rbio = r10_bio->devs[0].bio;
- page = rbio->bi_io_vec[i].bv_page;
+ page = bvec_page(&rbio->bi_io_vec[i]);
get_page(page);
} else
page = alloc_page(gfp_flags);
if (unlikely(!page))
goto out_free_pages;

- bio->bi_io_vec[i].bv_page = page;
+ bvec_set_page(&bio->bi_io_vec[i], page);
if (rbio)
- rbio->bi_io_vec[i].bv_page = page;
+ bvec_set_page(&rbio->bi_io_vec[i], page);
}
}

@@ -198,10 +198,10 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)

out_free_pages:
for ( ; i > 0 ; i--)
- safe_put_page(bio->bi_io_vec[i-1].bv_page);
+ safe_put_page(bvec_page(&bio->bi_io_vec[i - 1]));
while (j--)
for (i = 0; i < RESYNC_PAGES ; i++)
- safe_put_page(r10_bio->devs[j].bio->bi_io_vec[i].bv_page);
+ safe_put_page(bvec_page(&r10_bio->devs[j].bio->bi_io_vec[i]));
j = 0;
out_free_bio:
for ( ; j < nalloc; j++) {
@@ -225,8 +225,8 @@ static void r10buf_pool_free(void *__r10_bio, void *data)
struct bio *bio = r10bio->devs[j].bio;
if (bio) {
for (i = 0; i < RESYNC_PAGES; i++) {
- safe_put_page(bio->bi_io_vec[i].bv_page);
- bio->bi_io_vec[i].bv_page = NULL;
+ safe_put_page(bvec_page(&bio->bi_io_vec[i]));
+ bvec_set_page(&bio->bi_io_vec[i], NULL);
}
bio_put(bio);
}
@@ -2074,8 +2074,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
int len = PAGE_SIZE;
if (sectors < (len / 512))
len = sectors * 512;
- if (memcmp(page_address(fbio->bi_io_vec[j].bv_page),
- page_address(tbio->bi_io_vec[j].bv_page),
+ if (memcmp(page_address(bvec_page(&fbio->bi_io_vec[j])),
+ page_address(bvec_page(&tbio->bi_io_vec[j])),
len))
break;
sectors -= len/512;
@@ -2104,8 +2104,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
tbio->bi_io_vec[j].bv_offset = 0;
tbio->bi_io_vec[j].bv_len = PAGE_SIZE;

- memcpy(page_address(tbio->bi_io_vec[j].bv_page),
- page_address(fbio->bi_io_vec[j].bv_page),
+ memcpy(page_address(bvec_page(&tbio->bi_io_vec[j])),
+ page_address(bvec_page(&fbio->bi_io_vec[j])),
PAGE_SIZE);
}
tbio->bi_end_io = end_sync_write;
@@ -2132,8 +2132,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
if (r10_bio->devs[i].bio->bi_end_io != end_sync_write
&& r10_bio->devs[i].bio != fbio)
for (j = 0; j < vcnt; j++)
- memcpy(page_address(tbio->bi_io_vec[j].bv_page),
- page_address(fbio->bi_io_vec[j].bv_page),
+ memcpy(page_address(bvec_page(&tbio->bi_io_vec[j])),
+ page_address(bvec_page(&fbio->bi_io_vec[j])),
PAGE_SIZE);
d = r10_bio->devs[i].devnum;
atomic_inc(&r10_bio->remaining);
@@ -2191,7 +2191,7 @@ static void fix_recovery_read_error(struct r10bio *r10_bio)
ok = sync_page_io(rdev,
addr,
s << 9,
- bio->bi_io_vec[idx].bv_page,
+ bvec_page(&bio->bi_io_vec[idx]),
READ, false);
if (ok) {
rdev = conf->mirrors[dw].rdev;
@@ -2199,7 +2199,7 @@ static void fix_recovery_read_error(struct r10bio *r10_bio)
ok = sync_page_io(rdev,
addr,
s << 9,
- bio->bi_io_vec[idx].bv_page,
+ bvec_page(&bio->bi_io_vec[idx]),
WRITE, false);
if (!ok) {
set_bit(WriteErrorSeen, &rdev->flags);
@@ -3355,12 +3355,12 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
break;
for (bio= biolist ; bio ; bio=bio->bi_next) {
struct bio *bio2;
- page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
+ page = bvec_page(&bio->bi_io_vec[bio->bi_vcnt]);
if (bio_add_page(bio, page, len, 0))
continue;

/* stop here */
- bio->bi_io_vec[bio->bi_vcnt].bv_page = page;
+ bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page);
for (bio2 = biolist;
bio2 && bio2 != bio;
bio2 = bio2->bi_next) {
@@ -4430,7 +4430,7 @@ read_more:

nr_sectors = 0;
for (s = 0 ; s < max_sectors; s += PAGE_SIZE >> 9) {
- struct page *page = r10_bio->devs[0].bio->bi_io_vec[s/(PAGE_SIZE>>9)].bv_page;
+ struct page *page = bvec_page(&r10_bio->devs[0].bio->bi_io_vec[s / (PAGE_SIZE >> 9)]);
int len = (max_sectors - s) << 9;
if (len > PAGE_SIZE)
len = PAGE_SIZE;
@@ -4587,7 +4587,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
success = sync_page_io(rdev,
addr,
s << 9,
- bvec[idx].bv_page,
+ bvec_page(&bvec[idx]),
READ, false);
if (success)
break;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 77dfd720aaa0..6ec297699621 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1006,7 +1006,7 @@ again:

if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
- sh->dev[i].vec.bv_page = sh->dev[i].page;
+ bvec_set_page(&sh->dev[i].vec, sh->dev[i].page);
bi->bi_vcnt = 1;
bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
bi->bi_io_vec[0].bv_offset = 0;
@@ -1055,7 +1055,7 @@ again:
+ rrdev->data_offset);
if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
- sh->dev[i].rvec.bv_page = sh->dev[i].page;
+ bvec_set_page(&sh->dev[i].rvec, sh->dev[i].page);
rbi->bi_vcnt = 1;
rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
rbi->bi_io_vec[0].bv_offset = 0;
@@ -1132,7 +1132,7 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,

if (clen > 0) {
b_offset += bvl.bv_offset;
- bio_page = bvl.bv_page;
+ bio_page = bvec_page(&bvl);
if (frombio) {
if (sh->raid_conf->skip_copy &&
b_offset == 0 && page_offset == 0 &&
diff --git a/drivers/s390/block/dasd_diag.c b/drivers/s390/block/dasd_diag.c
index c062f1620c58..89f39d00077d 100644
--- a/drivers/s390/block/dasd_diag.c
+++ b/drivers/s390/block/dasd_diag.c
@@ -545,7 +545,7 @@ static struct dasd_ccw_req *dasd_diag_build_cp(struct dasd_device *memdev,
dbio = dreq->bio;
recid = first_rec;
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
for (off = 0; off < bv.bv_len; off += blksize) {
memset(dbio, 0, sizeof (struct dasd_diag_bio));
dbio->type = rw_cmd;
diff --git a/drivers/s390/block/dasd_eckd.c b/drivers/s390/block/dasd_eckd.c
index 6215f6455eb8..926d458e5376 100644
--- a/drivers/s390/block/dasd_eckd.c
+++ b/drivers/s390/block/dasd_eckd.c
@@ -2612,7 +2612,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_single(
/* Eckd can only do full blocks. */
return ERR_PTR(-EINVAL);
count += bv.bv_len >> (block->s2b_shift + 9);
- if (idal_is_needed (page_address(bv.bv_page), bv.bv_len))
+ if (idal_is_needed (page_address(bvec_page(&bv)), bv.bv_len))
cidaw += bv.bv_len >> (block->s2b_shift + 9);
}
/* Paranoia. */
@@ -2683,7 +2683,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_single(
last_rec - recid + 1, cmd, basedev, blksize);
}
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
if (dasd_page_cache) {
char *copy = kmem_cache_alloc(dasd_page_cache,
GFP_DMA | __GFP_NOWARN);
@@ -2846,7 +2846,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_cmd_track(
idaw_dst = NULL;
idaw_len = 0;
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
seg_len = bv.bv_len;
while (seg_len) {
if (new_track) {
@@ -3158,7 +3158,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_tpm_track(
new_track = 1;
recid = first_rec;
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
seg_len = bv.bv_len;
while (seg_len) {
if (new_track) {
@@ -3191,7 +3191,7 @@ static struct dasd_ccw_req *dasd_eckd_build_cp_tpm_track(
}
} else {
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
last_tidaw = itcw_add_tidaw(itcw, 0x00,
dst, bv.bv_len);
if (IS_ERR(last_tidaw)) {
@@ -3411,7 +3411,7 @@ static struct dasd_ccw_req *dasd_raw_build_cp(struct dasd_device *startdev,
idaws = idal_create_words(idaws, rawpadpage, PAGE_SIZE);
}
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
seg_len = bv.bv_len;
if (cmd == DASD_ECKD_CCW_READ_TRACK)
memset(dst, 0, seg_len);
@@ -3475,7 +3475,7 @@ dasd_eckd_free_cp(struct dasd_ccw_req *cqr, struct request *req)
if (private->uses_cdl == 0 || recid > 2*blk_per_trk)
ccw++;
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
for (off = 0; off < bv.bv_len; off += blksize) {
/* Skip locate record. */
if (private->uses_cdl && recid <= 2*blk_per_trk)
diff --git a/drivers/s390/block/dasd_fba.c b/drivers/s390/block/dasd_fba.c
index c9262e78938b..a51cdc5db6dc 100644
--- a/drivers/s390/block/dasd_fba.c
+++ b/drivers/s390/block/dasd_fba.c
@@ -287,7 +287,7 @@ static struct dasd_ccw_req *dasd_fba_build_cp(struct dasd_device * memdev,
/* Fba can only do full blocks. */
return ERR_PTR(-EINVAL);
count += bv.bv_len >> (block->s2b_shift + 9);
- if (idal_is_needed (page_address(bv.bv_page), bv.bv_len))
+ if (idal_is_needed (page_address(bvec_page(&bv)), bv.bv_len))
cidaw += bv.bv_len / blksize;
}
/* Paranoia. */
@@ -324,7 +324,7 @@ static struct dasd_ccw_req *dasd_fba_build_cp(struct dasd_device * memdev,
}
recid = first_rec;
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
if (dasd_page_cache) {
char *copy = kmem_cache_alloc(dasd_page_cache,
GFP_DMA | __GFP_NOWARN);
@@ -397,7 +397,7 @@ dasd_fba_free_cp(struct dasd_ccw_req *cqr, struct request *req)
if (private->rdc_data.mode.bits.data_chain != 0)
ccw++;
rq_for_each_segment(bv, req, iter) {
- dst = page_address(bv.bv_page) + bv.bv_offset;
+ dst = page_address(bvec_page(&bv)) + bv.bv_offset;
for (off = 0; off < bv.bv_len; off += blksize) {
/* Skip locate record. */
if (private->rdc_data.mode.bits.data_chain == 0)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index da212813f2d5..5da8515b8fb9 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -857,7 +857,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
index = (bio->bi_iter.bi_sector >> 3);
bio_for_each_segment(bvec, bio, iter) {
page_addr = (unsigned long)
- page_address(bvec.bv_page) + bvec.bv_offset;
+ page_address(bvec_page(&bvec)) + bvec.bv_offset;
source_addr = dev_info->start + (index<<12) + bytes_done;
if (unlikely((page_addr & 4095) != 0) || (bvec.bv_len & 4095) != 0)
// More paranoia.
diff --git a/drivers/s390/block/scm_blk.c b/drivers/s390/block/scm_blk.c
index 75d9896deccb..9bf2d42c1946 100644
--- a/drivers/s390/block/scm_blk.c
+++ b/drivers/s390/block/scm_blk.c
@@ -203,7 +203,7 @@ static int scm_request_prepare(struct scm_request *scmrq)
rq_for_each_segment(bv, req, iter) {
WARN_ON(bv.bv_offset);
msb->blk_count += bv.bv_len >> 12;
- aidaw->data_addr = (u64) page_address(bv.bv_page);
+ aidaw->data_addr = (u64) page_address(bvec_page(&bv));
aidaw++;
}

diff --git a/drivers/s390/block/scm_blk_cluster.c b/drivers/s390/block/scm_blk_cluster.c
index 7497ddde2dd6..a7e2fcb8f185 100644
--- a/drivers/s390/block/scm_blk_cluster.c
+++ b/drivers/s390/block/scm_blk_cluster.c
@@ -181,7 +181,7 @@ static int scm_prepare_cluster_request(struct scm_request *scmrq)
i++;
}
rq_for_each_segment(bv, req, iter) {
- aidaw->data_addr = (u64) page_address(bv.bv_page);
+ aidaw->data_addr = (u64) page_address(bvec_page(&bv));
aidaw++;
i++;
}
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index 7d4e9397ac31..44e80e13b643 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -202,7 +202,7 @@ static void xpram_make_request(struct request_queue *q, struct bio *bio)
index = (bio->bi_iter.bi_sector >> 3) + xdev->offset;
bio_for_each_segment(bvec, bio, iter) {
page_addr = (unsigned long)
- kmap(bvec.bv_page) + bvec.bv_offset;
+ kmap(bvec_page(&bvec)) + bvec.bv_offset;
bytes = bvec.bv_len;
if ((page_addr & 4095) != 0 || (bytes & 4095) != 0)
/* More paranoia. */
diff --git a/drivers/scsi/mpt2sas/mpt2sas_transport.c b/drivers/scsi/mpt2sas/mpt2sas_transport.c
index ff2500ab9ba4..788de1c250a3 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_transport.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_transport.c
@@ -1956,7 +1956,7 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,

bio_for_each_segment(bvec, req->bio, iter) {
memcpy(pci_addr_out + offset,
- page_address(bvec.bv_page) + bvec.bv_offset,
+ page_address(bvec_page(&bvec)) + bvec.bv_offset,
bvec.bv_len);
offset += bvec.bv_len;
}
@@ -2107,12 +2107,12 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
le16_to_cpu(mpi_reply->ResponseDataLength);
bio_for_each_segment(bvec, rsp->bio, iter) {
if (bytes_to_copy <= bvec.bv_len) {
- memcpy(page_address(bvec.bv_page) +
+ memcpy(page_address(bvec_page(&bvec)) +
bvec.bv_offset, pci_addr_in +
offset, bytes_to_copy);
break;
} else {
- memcpy(page_address(bvec.bv_page) +
+ memcpy(page_address(bvec_page(&bvec)) +
bvec.bv_offset, pci_addr_in +
offset, bvec.bv_len);
bytes_to_copy -= bvec.bv_len;
diff --git a/drivers/scsi/mpt3sas/mpt3sas_transport.c b/drivers/scsi/mpt3sas/mpt3sas_transport.c
index efb98afc46e0..f187a1a05b9b 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_transport.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_transport.c
@@ -1939,7 +1939,7 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,

bio_for_each_segment(bvec, req->bio, iter) {
memcpy(pci_addr_out + offset,
- page_address(bvec.bv_page) + bvec.bv_offset,
+ page_address(bvec_page(&bvec)) + bvec.bv_offset,
bvec.bv_len);
offset += bvec.bv_len;
}
@@ -2068,12 +2068,12 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
le16_to_cpu(mpi_reply->ResponseDataLength);
bio_for_each_segment(bvec, rsp->bio, iter) {
if (bytes_to_copy <= bvec.bv_len) {
- memcpy(page_address(bvec.bv_page) +
+ memcpy(page_address(bvec_page(&bvec)) +
bvec.bv_offset, pci_addr_in +
offset, bytes_to_copy);
break;
} else {
- memcpy(page_address(bvec.bv_page) +
+ memcpy(page_address(bvec_page(&bvec)) +
bvec.bv_offset, pci_addr_in +
offset, bvec.bv_len);
bytes_to_copy -= bvec.bv_len;
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 5c06d292b94c..9e838bd5f2c3 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -134,7 +134,7 @@ void sd_dif_prepare(struct scsi_cmnd *scmd)
virt = bip_get_seed(bip) & 0xffffffff;

bip_for_each_vec(iv, bip, iter) {
- pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
+ pi = kmap_atomic(bvec_page(&iv)) + iv.bv_offset;

for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {

@@ -181,7 +181,7 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int good_bytes)
virt = bip_get_seed(bip) & 0xffffffff;

bip_for_each_vec(iv, bip, iter) {
- pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
+ pi = kmap_atomic(bvec_page(&iv)) + iv.bv_offset;

for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {

diff --git a/drivers/staging/lustre/lustre/llite/lloop.c b/drivers/staging/lustre/lustre/llite/lloop.c
index 413a8408e3f5..044c435fae28 100644
--- a/drivers/staging/lustre/lustre/llite/lloop.c
+++ b/drivers/staging/lustre/lustre/llite/lloop.c
@@ -221,7 +221,7 @@ static int do_bio_lustrebacked(struct lloop_device *lo, struct bio *head)
BUG_ON(bvec.bv_offset != 0);
BUG_ON(bvec.bv_len != PAGE_CACHE_SIZE);

- pages[page_count] = bvec.bv_page;
+ pages[page_count] = bvec_page(&bvec);
offsets[page_count] = offset;
page_count++;
offset += bvec.bv_len;
diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index f7e6e51aed36..47fffb9522fc 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -336,7 +336,7 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
}

for_each_sg(sgl, sg, sgl_nents, i) {
- bvec[i].bv_page = sg_page(sg);
+ bvec_set_page(&bvec[i], sg_page(sg));
bvec[i].bv_len = sg->length;
bvec[i].bv_offset = sg->offset;

@@ -462,7 +462,7 @@ fd_execute_write_same(struct se_cmd *cmd)
return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;

for (i = 0; i < nolb; i++) {
- bvec[i].bv_page = sg_page(&cmd->t_data_sg[0]);
+ bvec_set_page(&bvec[i], sg_page(&cmd->t_data_sg[0]));
bvec[i].bv_len = cmd->t_data_sg[0].length;
bvec[i].bv_offset = cmd->t_data_sg[0].offset;

diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
index 0edb91c0de6b..7fcdcb2265f1 100644
--- a/drivers/xen/biomerge.c
+++ b/drivers/xen/biomerge.c
@@ -6,8 +6,8 @@
bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
const struct bio_vec *vec2)
{
- unsigned long mfn1 = pfn_to_mfn(page_to_pfn(vec1->bv_page));
- unsigned long mfn2 = pfn_to_mfn(page_to_pfn(vec2->bv_page));
+ unsigned long mfn1 = pfn_to_mfn(page_to_pfn(bvec_page(vec1)));
+ unsigned long mfn2 = pfn_to_mfn(page_to_pfn(bvec_page(vec2)));

return __BIOVEC_PHYS_MERGEABLE(vec1, vec2) &&
((mfn1 == mfn2) || ((mfn1+1) == mfn2));
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index e9e04376c52c..14b65a2c0d99 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -171,7 +171,7 @@ static int v9fs_vfs_writepage_locked(struct page *page)
else
len = PAGE_CACHE_SIZE;

- bvec.bv_page = page;
+ bvec_set_page(&bvec, page);
bvec.bv_offset = 0;
bvec.bv_len = len;
iov_iter_bvec(&from, ITER_BVEC | WRITE, &bvec, 1, len);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index ce7dec88f4b8..bf6fec07b276 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2997,11 +2997,11 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
cur_bytenr = dev_bytenr;
for (i = 0; i < bio->bi_vcnt; i++) {
BUG_ON(bio->bi_io_vec[i].bv_len != PAGE_CACHE_SIZE);
- mapped_datav[i] = kmap(bio->bi_io_vec[i].bv_page);
+ mapped_datav[i] = kmap(bvec_page(&bio->bi_io_vec[i]));
if (!mapped_datav[i]) {
while (i > 0) {
i--;
- kunmap(bio->bi_io_vec[i].bv_page);
+ kunmap(bvec_page(&bio->bi_io_vec[i]));
}
kfree(mapped_datav);
goto leave;
@@ -3020,7 +3020,7 @@ static void __btrfsic_submit_bio(int rw, struct bio *bio)
NULL, rw);
while (i > 0) {
i--;
- kunmap(bio->bi_io_vec[i].bv_page);
+ kunmap(bvec_page(&bio->bi_io_vec[i]));
}
kfree(mapped_datav);
} else if (NULL != dev_state && (rw & REQ_FLUSH)) {
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ce62324c78e7..8573fed0e8cb 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -208,7 +208,7 @@ csum_failed:
* checked so the end_io handlers know about it
*/
bio_for_each_segment_all(bvec, cb->orig_bio, i)
- SetPageChecked(bvec->bv_page);
+ SetPageChecked(bvec_page(bvec));

bio_endio(cb->orig_bio, 0);
}
@@ -459,7 +459,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
u64 end;
int misses = 0;

- page = cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1].bv_page;
+ page = bvec_page(&cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1]);
last_offset = (page_offset(page) + PAGE_CACHE_SIZE);
em_tree = &BTRFS_I(inode)->extent_tree;
tree = &BTRFS_I(inode)->io_tree;
@@ -592,7 +592,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
/* we need the actual starting offset of this extent in the file */
read_lock(&em_tree->lock);
em = lookup_extent_mapping(em_tree,
- page_offset(bio->bi_io_vec->bv_page),
+ page_offset(bvec_page(bio->bi_io_vec)),
PAGE_CACHE_SIZE);
read_unlock(&em_tree->lock);
if (!em)
@@ -986,7 +986,7 @@ int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
unsigned long working_bytes = total_out - buf_start;
unsigned long bytes;
char *kaddr;
- struct page *page_out = bvec[*pg_index].bv_page;
+ struct page *page_out = bvec_page(&bvec[*pg_index]);

/*
* start byte is the first byte of the page we're currently
@@ -1031,7 +1031,7 @@ int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
if (*pg_index >= vcnt)
return 0;

- page_out = bvec[*pg_index].bv_page;
+ page_out = bvec_page(&bvec[*pg_index]);
*pg_offset = 0;
start_byte = page_offset(page_out) - disk_start;

@@ -1071,7 +1071,7 @@ void btrfs_clear_biovec_end(struct bio_vec *bvec, int vcnt,
unsigned long pg_offset)
{
while (pg_index < vcnt) {
- struct page *page = bvec[pg_index].bv_page;
+ struct page *page = bvec_page(&bvec[pg_index]);
unsigned long off = bvec[pg_index].bv_offset;
unsigned long len = bvec[pg_index].bv_len;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b72d06..a9ec0c6cfb81 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -876,8 +876,9 @@ static int btree_csum_one_bio(struct bio *bio)
int i, ret = 0;

bio_for_each_segment_all(bvec, bio, i) {
- root = BTRFS_I(bvec->bv_page->mapping->host)->root;
- ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
+ root = BTRFS_I(bvec_page(bvec)->mapping->host)->root;
+ ret = csum_dirty_buffer(root->fs_info,
+ bvec_page(bvec));
if (ret)
break;
}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 43af5a61ad25..9d5062f298c6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2489,7 +2489,7 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
int i;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

/* We always issue full-page reads, but if some block
* in a page fails to read, blk_update_request() will
@@ -2563,7 +2563,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
uptodate = 0;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);
struct inode *inode = page->mapping->host;

pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
@@ -2751,7 +2751,7 @@ static int __must_check submit_one_bio(int rw, struct bio *bio,
{
int ret = 0;
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);
struct extent_io_tree *tree = bio->bi_private;
u64 start;

@@ -3700,7 +3700,7 @@ static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
int i, done;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

eb = (struct extent_buffer *)page->private;
BUG_ON(!eb);
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 58ece6558430..835284e230d4 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -222,7 +222,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
offset = logical_offset;
while (bio_index < bio->bi_vcnt) {
if (!dio)
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bvec_page(bvec)) + bvec->bv_offset;
count = btrfs_find_ordered_sum(inode, offset, disk_bytenr,
(u32 *)csum, nblocks);
if (count)
@@ -448,7 +448,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
if (contig)
offset = file_start;
else
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bvec_page(bvec)) + bvec->bv_offset;

ordered = btrfs_lookup_ordered_extent(inode, offset);
BUG_ON(!ordered); /* Logic error */
@@ -457,7 +457,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,

while (bio_index < bio->bi_vcnt) {
if (!contig)
- offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+ offset = page_offset(bvec_page(bvec)) + bvec->bv_offset;

if (offset >= ordered->file_offset + ordered->len ||
offset < ordered->file_offset) {
@@ -480,7 +480,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
index = 0;
}

- data = kmap_atomic(bvec->bv_page);
+ data = kmap_atomic(bvec_page(bvec));
sums->sums[index] = ~(u32)0;
sums->sums[index] = btrfs_csum_data(data + bvec->bv_offset,
sums->sums[index],
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8bb013672aee..89f2dc525859 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7682,7 +7682,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio, int err)

done->uptodate = 1;
bio_for_each_segment_all(bvec, bio, i)
- clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
+ clean_io_failure(done->inode, done->start, bvec_page(bvec), 0);
end:
complete(&done->done);
bio_put(bio);
@@ -7706,7 +7706,9 @@ try_again:
done.start = start;
init_completion(&done.done);

- ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+ ret = dio_read_error(inode, &io_bio->bio,
+ bvec_page(bvec),
+ start,
start + bvec->bv_len - 1,
io_bio->mirror_num,
btrfs_retry_endio_nocsum, &done);
@@ -7741,11 +7743,11 @@ static void btrfs_retry_endio(struct bio *bio, int err)
uptodate = 1;
bio_for_each_segment_all(bvec, bio, i) {
ret = __readpage_endio_check(done->inode, io_bio, i,
- bvec->bv_page, 0,
+ bvec_page(bvec), 0,
done->start, bvec->bv_len);
if (!ret)
clean_io_failure(done->inode, done->start,
- bvec->bv_page, 0);
+ bvec_page(bvec), 0);
else
uptodate = 0;
}
@@ -7771,7 +7773,8 @@ static int __btrfs_subio_endio_read(struct inode *inode,
done.inode = inode;

bio_for_each_segment_all(bvec, &io_bio->bio, i) {
- ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
+ ret = __readpage_endio_check(inode, io_bio, i,
+ bvec_page(bvec),
0, start, bvec->bv_len);
if (likely(!ret))
goto next;
@@ -7780,7 +7783,9 @@ try_again:
done.start = start;
init_completion(&done.done);

- ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
+ ret = dio_read_error(inode, &io_bio->bio,
+ bvec_page(bvec),
+ start,
start + bvec->bv_len - 1,
io_bio->mirror_num,
btrfs_retry_endio, &done);
@@ -8076,7 +8081,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,

while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
if (map_length < submit_len + bvec->bv_len ||
- bio_add_page(bio, bvec->bv_page, bvec->bv_len,
+ bio_add_page(bio, bvec_page(bvec), bvec->bv_len,
bvec->bv_offset) < bvec->bv_len) {
/*
* inc the count before we submit the bio so
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index fa72068bd256..fc94998fee9b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1147,7 +1147,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
page_index = stripe_offset >> PAGE_CACHE_SHIFT;

for (i = 0; i < bio->bi_vcnt; i++) {
- p = bio->bi_io_vec[i].bv_page;
+ p = bvec_page(&bio->bi_io_vec[i]);
rbio->bio_pages[page_index + i] = p;
}
}
@@ -1428,7 +1428,7 @@ static void set_bio_pages_uptodate(struct bio *bio)
struct page *p;

for (i = 0; i < bio->bi_vcnt; i++) {
- p = bio->bi_io_vec[i].bv_page;
+ p = bvec_page(&bio->bi_io_vec[i]);
SetPageUptodate(p);
}
}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 96aebf3bcd5b..ed579c40d0e5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5791,7 +5791,7 @@ again:
return -ENOMEM;

while (bvec <= (first_bio->bi_io_vec + first_bio->bi_vcnt - 1)) {
- if (bio_add_page(bio, bvec->bv_page, bvec->bv_len,
+ if (bio_add_page(bio, bvec_page(bvec), bvec->bv_len,
bvec->bv_offset) < bvec->bv_len) {
u64 len = bio->bi_iter.bi_size;

diff --git a/fs/buffer.c b/fs/buffer.c
index c7a5602d01ee..e691107060f9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2992,7 +2992,7 @@ void guard_bio_eod(int rw, struct bio *bio)

/* ..and clear the end of the buffer for reads */
if ((rw & RW_MASK) == READ) {
- zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+ zero_user(bvec_page(bvec), bvec->bv_offset + bvec->bv_len,
truncated_bytes);
}
}
@@ -3022,7 +3022,7 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)

bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
- bio->bi_io_vec[0].bv_page = bh->b_page;
+ bvec_set_page(&bio->bi_io_vec[0], bh->b_page);
bio->bi_io_vec[0].bv_len = bh->b_size;
bio->bi_io_vec[0].bv_offset = bh_offset(bh);

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 745d2342651a..6c0e8c2b8217 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -468,7 +468,7 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

if (dio->rw == READ && !PageCompound(page))
set_page_dirty_lock(page);
diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 7bd8ac8dfb28..4bd44bfed847 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -411,9 +411,9 @@ static void _clear_bio(struct bio *bio)
unsigned this_count = bv->bv_len;

if (likely(PAGE_SIZE == this_count))
- clear_highpage(bv->bv_page);
+ clear_highpage(bvec_page(bv));
else
- zero_user(bv->bv_page, bv->bv_offset, this_count);
+ zero_user(bvec_page(bv), bv->bv_offset, this_count);
}
}

diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index 27cbdb697649..da76728824e6 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -438,7 +438,7 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
continue;

bio_for_each_segment_all(bv, bio, i) {
- struct page *page = bv->bv_page;
+ struct page *page = bvec_page(bv);

SetPageUptodate(page);
if (PageError(page))
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 5765f88b3904..1951399f54ec 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -65,7 +65,7 @@ static void ext4_finish_bio(struct bio *bio)
struct bio_vec *bvec;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);
#ifdef CONFIG_EXT4_FS_ENCRYPTION
struct page *data_page = NULL;
struct ext4_crypto_ctx *ctx = NULL;
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 171b9ac4b45e..9b58ce079f7d 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -60,7 +60,7 @@ static void completion_pages(struct work_struct *work)
int i;

bio_for_each_segment_all(bv, bio, i) {
- struct page *page = bv->bv_page;
+ struct page *page = bvec_page(bv);

int ret = ext4_decrypt(ctx, page);
if (ret) {
@@ -116,7 +116,7 @@ static void mpage_end_io(struct bio *bio, int err)
}
}
bio_for_each_segment_all(bv, bio, i) {
- struct page *page = bv->bv_page;
+ struct page *page = bvec_page(bv);

if (!err) {
SetPageUptodate(page);
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b91b0e10678e..8aef4873c5d7 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -34,7 +34,7 @@ static void f2fs_read_end_io(struct bio *bio, int err)
int i;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

if (!err) {
SetPageUptodate(page);
@@ -54,7 +54,7 @@ static void f2fs_write_end_io(struct bio *bio, int err)
int i;

bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);

if (unlikely(err)) {
set_page_dirty(page);
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index f939660941bb..3ae89adeb346 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1314,7 +1314,7 @@ static inline bool is_merged_page(struct f2fs_sb_info *sbi,
goto out;

bio_for_each_segment_all(bvec, io->bio, i) {
- if (page == bvec->bv_page) {
+ if (page == bvec_page(bvec)) {
up_read(&io->io_rwsem);
return true;
}
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 2c1ae861dc94..2c1e14ca5971 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -173,7 +173,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
int error)
{
struct buffer_head *bh, *next;
- struct page *page = bvec->bv_page;
+ struct page *page = bvec_page(bvec);
unsigned size;

bh = page_buffers(page);
@@ -215,7 +215,7 @@ static void gfs2_end_log_write(struct bio *bio, int error)
}

bio_for_each_segment_all(bvec, bio, i) {
- page = bvec->bv_page;
+ page = bvec_page(bvec);
if (page_has_buffers(page))
gfs2_end_log_write_bh(sdp, bvec, error);
else
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index bc462dcd7a40..4effe870b5aa 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -1999,7 +1999,7 @@ static int lbmRead(struct jfs_log * log, int pn, struct lbuf ** bpp)

bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
- bio->bi_io_vec[0].bv_page = bp->l_page;
+ bvec_set_page(&bio->bi_io_vec[0], bp->l_page);
bio->bi_io_vec[0].bv_len = LOGPSIZE;
bio->bi_io_vec[0].bv_offset = bp->l_offset;

@@ -2145,7 +2145,7 @@ static void lbmStartIO(struct lbuf * bp)
bio = bio_alloc(GFP_NOFS, 1);
bio->bi_iter.bi_sector = bp->l_blkno << (log->l2bsize - 9);
bio->bi_bdev = log->bdev;
- bio->bi_io_vec[0].bv_page = bp->l_page;
+ bvec_set_page(&bio->bi_io_vec[0], bp->l_page);
bio->bi_io_vec[0].bv_len = LOGPSIZE;
bio->bi_io_vec[0].bv_offset = bp->l_offset;

diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 76279e11982d..7daa0e336fdf 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -22,7 +22,7 @@ static int sync_request(struct page *page, struct block_device *bdev, int rw)
bio_init(&bio);
bio.bi_max_vecs = 1;
bio.bi_io_vec = &bio_vec;
- bio_vec.bv_page = page;
+ bvec_set_page(&bio_vec, page);
bio_vec.bv_len = PAGE_SIZE;
bio_vec.bv_offset = 0;
bio.bi_vcnt = 1;
@@ -65,8 +65,8 @@ static void writeseg_end_io(struct bio *bio, int err)
BUG_ON(err);

bio_for_each_segment_all(bvec, bio, i) {
- end_page_writeback(bvec->bv_page);
- page_cache_release(bvec->bv_page);
+ end_page_writeback(bvec_page(bvec));
+ page_cache_release(bvec_page(bvec));
}
bio_put(bio);
if (atomic_dec_and_test(&super->s_pending_writes))
@@ -110,7 +110,7 @@ static int __bdev_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
}
page = find_lock_page(mapping, index + i);
BUG_ON(!page);
- bio->bi_io_vec[i].bv_page = page;
+ bvec_set_page(&bio->bi_io_vec[i], page);
bio->bi_io_vec[i].bv_len = PAGE_SIZE;
bio->bi_io_vec[i].bv_offset = 0;

@@ -200,7 +200,7 @@ static int do_erase(struct super_block *sb, u64 ofs, pgoff_t index,
bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio);
}
- bio->bi_io_vec[i].bv_page = super->s_erase_page;
+ bvec_set_page(&bio->bi_io_vec[i], super->s_erase_page);
bio->bi_io_vec[i].bv_len = PAGE_SIZE;
bio->bi_io_vec[i].bv_offset = 0;
}
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220babac..c570a63e0913 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,7 +48,7 @@ static void mpage_end_io(struct bio *bio, int err)
int i;

bio_for_each_segment_all(bv, bio, i) {
- struct page *page = bv->bv_page;
+ struct page *page = bvec_page(bv);
page_endio(page, bio_data_dir(bio), err);
}

diff --git a/fs/splice.c b/fs/splice.c
index 476024bb6546..b627c2c55047 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1000,7 +1000,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
goto done;
}

- array[n].bv_page = buf->page;
+ bvec_set_page(&array[n], buf->page);
array[n].bv_len = this_len;
array[n].bv_offset = buf->offset;
left -= this_len;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1b25e35ea5f..d7167b50299f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -26,6 +26,16 @@ struct bio_vec {
unsigned int bv_offset;
};

+static inline struct page *bvec_page(const struct bio_vec *bvec)
+{
+ return bvec->bv_page;
+}
+
+static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
+{
+ bvec->bv_page = page;
+}
+
#ifdef CONFIG_BLOCK

struct bvec_iter {
diff --git a/kernel/power/block_io.c b/kernel/power/block_io.c
index 9a58bc258810..f2824bacb84d 100644
--- a/kernel/power/block_io.c
+++ b/kernel/power/block_io.c
@@ -90,7 +90,7 @@ int hib_wait_on_bio_chain(struct bio **bio_chain)
struct page *page;

next_bio = bio->bi_private;
- page = bio->bi_io_vec[0].bv_page;
+ page = bvec_page(&bio->bi_io_vec[0]);
wait_on_page_locked(page);
if (!PageUptodate(page) || PageError(page))
ret = -EIO;
diff --git a/mm/page_io.c b/mm/page_io.c
index 6424869e275e..75738896b691 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -33,7 +33,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
if (bio) {
bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
- bio->bi_io_vec[0].bv_page = page;
+ bvec_set_page(&bio->bi_io_vec[0], page);
bio->bi_io_vec[0].bv_len = PAGE_SIZE;
bio->bi_io_vec[0].bv_offset = 0;
bio->bi_vcnt = 1;
@@ -46,7 +46,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
void end_swap_bio_write(struct bio *bio, int err)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
- struct page *page = bio->bi_io_vec[0].bv_page;
+ struct page *page = bvec_page(&bio->bi_io_vec[0]);

if (!uptodate) {
SetPageError(page);
@@ -72,7 +72,7 @@ void end_swap_bio_write(struct bio *bio, int err)
void end_swap_bio_read(struct bio *bio, int err)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
- struct page *page = bio->bi_io_vec[0].bv_page;
+ struct page *page = bvec_page(&bio->bi_io_vec[0]);

if (!uptodate) {
SetPageError(page);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 967080a9f043..41e77e7813a2 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -842,7 +842,7 @@ static struct page *ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor,
BUG_ON(*length > cursor->resid);
BUG_ON(*page_offset + *length > PAGE_SIZE);

- return bio_vec.bv_page;
+ return bvec_page(&bio_vec);
}

static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,

2015-05-06 20:07:58

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 03/10] block: convert .bv_page to .bv_pfn bio_vec

Carry an __pfn_t in a bio_vec rather than a 'struct page *' so that a
bio can reference unmapped (not struct page backed) persistent memory.

This also fixes up the macros and static initializers that were not
automatically converted by the Coccinelle script that introduced the
bvec_page() and bvec_set_page() helpers.

With CONFIG_PMEM_IO=n this is functionally equivalent to the status
quo, since the __pfn_t helpers can assume that a __pfn_t always has a
corresponding struct page.
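
For illustration only (not part of the patch): a minimal sketch of a bio
completion handler written against the bvec_page() accessor, assuming
page-backed memory; page-less (pfn-only) bvecs are instead handled by the
kmap_atomic_pfn_t() helper added later in the series.

	#include <linux/bio.h>
	#include <linux/highmem.h>

	/* Illustrative sketch: walk a completed bio via the new accessor. */
	static void example_end_io(struct bio *bio, int err)
	{
		struct bio_vec *bvec;
		int i;

		bio_for_each_segment_all(bvec, bio, i) {
			/* resolve the stored __pfn_t back to a struct page */
			struct page *page = bvec_page(bvec);
			void *addr = kmap_atomic(page);

			/* ... consume addr + bvec->bv_offset for bvec->bv_len bytes ... */
			kunmap_atomic(addr);
		}
		bio_put(bio);
	}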

Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Julia Lawall <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
block/blk-integrity.c | 4 ++--
block/blk-merge.c | 6 +++---
block/bounce.c | 2 +-
drivers/md/bcache/btree.c | 2 +-
include/linux/bio.h | 24 +++++++++++++-----------
include/linux/blk_types.h | 13 ++++++++++---
lib/iov_iter.c | 22 +++++++++++-----------
mm/page_io.c | 4 ++--
8 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 0458f31f075a..351198fbda3c 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -43,7 +43,7 @@ static const char *bi_unsupported_name = "unsupported";
*/
int blk_rq_count_integrity_sg(struct request_queue *q, struct bio *bio)
{
- struct bio_vec iv, ivprv = { NULL };
+ struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
unsigned int segments = 0;
unsigned int seg_size = 0;
struct bvec_iter iter;
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(blk_rq_count_integrity_sg);
int blk_rq_map_integrity_sg(struct request_queue *q, struct bio *bio,
struct scatterlist *sglist)
{
- struct bio_vec iv, ivprv = { NULL };
+ struct bio_vec iv, ivprv = BIO_VEC_INIT(ivprv);
struct scatterlist *sg = NULL;
unsigned int segments = 0;
struct bvec_iter iter;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 47ceefacd320..218ad1e57a49 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -13,7 +13,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
struct bio *bio,
bool no_sg_merge)
{
- struct bio_vec bv, bvprv = { NULL };
+ struct bio_vec bv, bvprv = BIO_VEC_INIT(bvprv);
int cluster, high, highprv = 1;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
@@ -123,7 +123,7 @@ EXPORT_SYMBOL(blk_recount_segments);
static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
struct bio *nxt)
{
- struct bio_vec end_bv = { NULL }, nxt_bv;
+ struct bio_vec end_bv = BIO_VEC_INIT(end_bv), nxt_bv;
struct bvec_iter iter;

if (!blk_queue_cluster(q))
@@ -202,7 +202,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
struct scatterlist *sglist,
struct scatterlist **sg)
{
- struct bio_vec bvec, bvprv = { NULL };
+ struct bio_vec bvec, bvprv = BIO_VEC_INIT(bvprv);
struct bvec_iter iter;
int nsegs, cluster;

diff --git a/block/bounce.c b/block/bounce.c
index 0390e44d6e1b..4a3098067c81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -64,7 +64,7 @@ static void bounce_copy_vec(struct bio_vec *to, unsigned char *vfrom)
#else /* CONFIG_HIGHMEM */

#define bounce_copy_vec(to, vfrom) \
- memcpy(page_address((to)->bv_page) + (to)->bv_offset, vfrom, (to)->bv_len)
+ memcpy(page_address(bvec_page(to)) + (to)->bv_offset, vfrom, (to)->bv_len)

#endif /* CONFIG_HIGHMEM */

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2e76e8b62902..36bbe29a806b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -426,7 +426,7 @@ static void do_btree_node_write(struct btree *b)
void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));

bio_for_each_segment_all(bv, b->bio, j)
- memcpy(page_address(bv->bv_page),
+ memcpy(page_address(bvec_page(bv)),
base + j * PAGE_SIZE, PAGE_SIZE);

bch_submit_bbio(b->bio, b->c, &k.key, 0);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index da3a127c9958..a59d97cbfe13 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,8 +63,8 @@
*/
#define __bvec_iter_bvec(bvec, iter) (&(bvec)[(iter).bi_idx])

-#define bvec_iter_page(bvec, iter) \
- (__bvec_iter_bvec((bvec), (iter))->bv_page)
+#define bvec_iter_pfn(bvec, iter) \
+ (__bvec_iter_bvec((bvec), (iter))->bv_pfn)

#define bvec_iter_len(bvec, iter) \
min((iter).bi_size, \
@@ -75,7 +75,7 @@

#define bvec_iter_bvec(bvec, iter) \
((struct bio_vec) { \
- .bv_page = bvec_iter_page((bvec), (iter)), \
+ .bv_pfn = bvec_iter_pfn((bvec), (iter)), \
.bv_len = bvec_iter_len((bvec), (iter)), \
.bv_offset = bvec_iter_offset((bvec), (iter)), \
})
@@ -83,14 +83,16 @@
#define bio_iter_iovec(bio, iter) \
bvec_iter_bvec((bio)->bi_io_vec, (iter))

-#define bio_iter_page(bio, iter) \
- bvec_iter_page((bio)->bi_io_vec, (iter))
+#define bio_iter_pfn(bio, iter) \
+ bvec_iter_pfn((bio)->bi_io_vec, (iter))
#define bio_iter_len(bio, iter) \
bvec_iter_len((bio)->bi_io_vec, (iter))
#define bio_iter_offset(bio, iter) \
bvec_iter_offset((bio)->bi_io_vec, (iter))

-#define bio_page(bio) bio_iter_page((bio), (bio)->bi_iter)
+#define bio_page(bio) \
+ pfn_to_page((bio_iter_pfn((bio), (bio)->bi_iter)).pfn)
+#define bio_pfn(bio) bio_iter_pfn((bio), (bio)->bi_iter)
#define bio_offset(bio) bio_iter_offset((bio), (bio)->bi_iter)
#define bio_iovec(bio) bio_iter_iovec((bio), (bio)->bi_iter)

@@ -150,8 +152,8 @@ static inline void *bio_data(struct bio *bio)
/*
* will die
*/
-#define bio_to_phys(bio) (page_to_phys(bio_page((bio))) + (unsigned long) bio_offset((bio)))
-#define bvec_to_phys(bv) (page_to_phys((bv)->bv_page) + (unsigned long) (bv)->bv_offset)
+#define bio_to_phys(bio) (__pfn_t_to_phys(bio_pfn((bio))) + (unsigned long) bio_offset((bio)))
+#define bvec_to_phys(bv) (__pfn_t_to_phys((bv)->bv_pfn) + (unsigned long) (bv)->bv_offset)

/*
* queues that have highmem support enabled may still need to revert to
@@ -160,7 +162,7 @@ static inline void *bio_data(struct bio *bio)
* I/O completely on that queue (see ide-dma for example)
*/
#define __bio_kmap_atomic(bio, iter) \
- (kmap_atomic(bio_iter_iovec((bio), (iter)).bv_page) + \
+ (kmap_atomic(bvec_page(&bio_iter_iovec((bio), (iter)))) + \
bio_iter_iovec((bio), (iter)).bv_offset)

#define __bio_kunmap_atomic(addr) kunmap_atomic(addr)
@@ -490,7 +492,7 @@ static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
* balancing is a lot nicer this way
*/
local_irq_save(*flags);
- addr = (unsigned long) kmap_atomic(bvec->bv_page);
+ addr = (unsigned long) kmap_atomic(bvec_page(bvec));

BUG_ON(addr & ~PAGE_MASK);

@@ -508,7 +510,7 @@ static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
#else
static inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
{
- return page_address(bvec->bv_page) + bvec->bv_offset;
+ return page_address(bvec_page(bvec)) + bvec->bv_offset;
}

static inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d7167b50299f..2a15c4943db6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -6,6 +6,7 @@
#define __LINUX_BLK_TYPES_H

#include <linux/types.h>
+#include <linux/mm.h>

struct bio_set;
struct bio;
@@ -21,19 +22,25 @@ typedef void (bio_destructor_t) (struct bio *);
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
struct bio_vec {
- struct page *bv_page;
+ __pfn_t bv_pfn;
unsigned int bv_len;
unsigned int bv_offset;
};

+#define BIO_VEC_INIT(name) { .bv_pfn = { .pfn = 0 }, .bv_len = 0, \
+ .bv_offset = 0 }
+
+#define BIO_VEC(name) \
+ struct bio_vec name = BIO_VEC_INIT(name)
+
static inline struct page *bvec_page(const struct bio_vec *bvec)
{
- return bvec->bv_page;
+ return __pfn_t_to_page(bvec->bv_pfn);
}

static inline void bvec_set_page(struct bio_vec *bvec, struct page *page)
{
- bvec->bv_page = page;
+ bvec->bv_pfn = page_to_pfn_t(page);
}

#ifdef CONFIG_BLOCK
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 75232ad0a5e7..4276f6d1826c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -61,7 +61,7 @@
__p = i->bvec; \
__v.bv_len = min_t(size_t, n, __p->bv_len - skip); \
if (likely(__v.bv_len)) { \
- __v.bv_page = __p->bv_page; \
+ __v.bv_pfn = __p->bv_pfn; \
__v.bv_offset = __p->bv_offset + skip; \
(void)(STEP); \
skip += __v.bv_len; \
@@ -72,7 +72,7 @@
__v.bv_len = min_t(size_t, n, __p->bv_len); \
if (unlikely(!__v.bv_len)) \
continue; \
- __v.bv_page = __p->bv_page; \
+ __v.bv_pfn = __p->bv_pfn; \
__v.bv_offset = __p->bv_offset; \
(void)(STEP); \
skip = __v.bv_len; \
@@ -395,7 +395,7 @@ size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
iterate_and_advance(i, bytes, v,
__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
v.iov_len),
- memcpy_to_page(v.bv_page, v.bv_offset,
+ memcpy_to_page(bvec_page(&v), v.bv_offset,
(from += v.bv_len) - v.bv_len, v.bv_len),
memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
)
@@ -416,7 +416,7 @@ size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
iterate_and_advance(i, bytes, v,
__copy_from_user((to += v.iov_len) - v.iov_len, v.iov_base,
v.iov_len),
- memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+ memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
v.bv_offset, v.bv_len),
memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
)
@@ -437,7 +437,7 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
iterate_and_advance(i, bytes, v,
__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
v.iov_base, v.iov_len),
- memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
+ memcpy_from_page((to += v.bv_len) - v.bv_len, bvec_page(&v),
v.bv_offset, v.bv_len),
memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
)
@@ -482,7 +482,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)

iterate_and_advance(i, bytes, v,
__clear_user(v.iov_base, v.iov_len),
- memzero_page(v.bv_page, v.bv_offset, v.bv_len),
+ memzero_page(bvec_page(&v), v.bv_offset, v.bv_len),
memset(v.iov_base, 0, v.iov_len)
)

@@ -497,7 +497,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
iterate_all_kinds(i, bytes, v,
__copy_from_user_inatomic((p += v.iov_len) - v.iov_len,
v.iov_base, v.iov_len),
- memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
+ memcpy_from_page((p += v.bv_len) - v.bv_len, bvec_page(&v),
v.bv_offset, v.bv_len),
memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
)
@@ -596,7 +596,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
0;}),({
/* can't be more than PAGE_SIZE */
*start = v.bv_offset;
- get_page(*pages = v.bv_page);
+ get_page(*pages = bvec_page(&v));
return v.bv_len;
}),({
return -EFAULT;
@@ -650,7 +650,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
*pages = p = get_pages_array(1);
if (!p)
return -ENOMEM;
- get_page(*p = v.bv_page);
+ get_page(*p = bvec_page(&v));
return v.bv_len;
}),({
return -EFAULT;
@@ -684,7 +684,7 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
}
err ? v.iov_len : 0;
}), ({
- char *p = kmap_atomic(v.bv_page);
+ char *p = kmap_atomic(bvec_page(&v));
next = csum_partial_copy_nocheck(p + v.bv_offset,
(to += v.bv_len) - v.bv_len,
v.bv_len, 0);
@@ -728,7 +728,7 @@ size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum,
}
err ? v.iov_len : 0;
}), ({
- char *p = kmap_atomic(v.bv_page);
+ char *p = kmap_atomic(bvec_page(&v));
next = csum_partial_copy_nocheck((from += v.bv_len) - v.bv_len,
p + v.bv_offset,
v.bv_len, 0);
diff --git a/mm/page_io.c b/mm/page_io.c
index 75738896b691..33c248527fd2 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -265,8 +265,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
struct bio_vec bv = {
- .bv_page = page,
- .bv_len = PAGE_SIZE,
+ .bv_pfn = page_to_pfn_t(page),
+ .bv_len = PAGE_SIZE,
.bv_offset = 0
};
struct iov_iter from;

2015-05-06 20:08:06

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 04/10] dma-mapping: allow archs to optionally specify a ->map_pfn() operation

This is in support of enabling block device drivers to perform DMA
to/from persistent memory, which may not have a backing struct page
entry.
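
As a rough usage sketch (assuming CONFIG_HAVE_DMA_PFN and the
dma_map_pfn() helper added below; 'dev', 'pfn', 'offset' and 'len' are
placeholders supplied by the caller), a driver already holding a __pfn_t
can map it for DMA without materializing a struct page:

	#include <linux/dma-mapping.h>

	/* Illustrative sketch: DMA-map a pfn that may lack a struct page. */
	static int example_map_for_dma(struct device *dev, __pfn_t pfn,
				       size_t offset, size_t len)
	{
		dma_addr_t dma = dma_map_pfn(dev, pfn, offset, len,
					     DMA_TO_DEVICE);

		if (dma_mapping_error(dev, dma))
			return -EIO;

		/* ... hand 'dma' to the device, then once the I/O completes: */
		dma_unmap_page(dev, dma, len, DMA_TO_DEVICE);
		return 0;
	}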

Signed-off-by: Dan Williams <[email protected]>
---
arch/Kconfig | 3 +++
include/asm-generic/dma-mapping-common.h | 30 ++++++++++++++++++++++++++++++
include/asm-generic/pfn.h | 9 +++++++++
include/linux/dma-debug.h | 23 +++++++++++++++++++----
include/linux/dma-mapping.h | 8 +++++++-
lib/dma-debug.c | 10 ++++++----
6 files changed, 74 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb24997..f7f800860c00 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,9 @@ config HAVE_DMA_ATTRS
config HAVE_DMA_CONTIGUOUS
bool

+config HAVE_DMA_PFN
+ bool
+
config GENERIC_SMP_IDLE_THREAD
bool

diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 940d5ec122c9..7305efb1bac6 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -17,9 +17,15 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,

kmemcheck_mark_initialized(ptr, size);
BUG_ON(!valid_dma_direction(dir));
+#ifdef CONFIG_HAVE_DMA_PFN
+ addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+ (unsigned long)ptr & ~PAGE_MASK, size,
+ dir, attrs);
+#else
addr = ops->map_page(dev, virt_to_page(ptr),
(unsigned long)ptr & ~PAGE_MASK, size,
dir, attrs);
+#endif
debug_dma_map_page(dev, virt_to_page(ptr),
(unsigned long)ptr & ~PAGE_MASK, size,
dir, addr, true);
@@ -73,6 +79,29 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
ops->unmap_sg(dev, sg, nents, dir, attrs);
}

+#ifdef CONFIG_HAVE_DMA_PFN
+static inline dma_addr_t dma_map_pfn(struct device *dev, __pfn_t pfn,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+ struct dma_map_ops *ops = get_dma_ops(dev);
+ dma_addr_t addr;
+
+ BUG_ON(!valid_dma_direction(dir));
+ addr = ops->map_pfn(dev, pfn, offset, size, dir, NULL);
+ debug_dma_map_pfn(dev, pfn, offset, size, dir, addr, false);
+
+ return addr;
+}
+
+static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ enum dma_data_direction dir)
+{
+ kmemcheck_mark_initialized(page_address(page) + offset, size);
+ return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+}
+#else
static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
size_t offset, size_t size,
enum dma_data_direction dir)
@@ -87,6 +116,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,

return addr;
}
+#endif /* CONFIG_HAVE_DMA_PFN */

static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir)
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
index 91171e0285d9..c1fdf41fb726 100644
--- a/include/asm-generic/pfn.h
+++ b/include/asm-generic/pfn.h
@@ -48,4 +48,13 @@ static inline __pfn_t page_to_pfn_t(struct page *page)

return pfn;
}
+
+static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
+{
+#if IS_ENABLED(CONFIG_PMEM_IO)
+ if (pfn.pfn < PAGE_OFFSET)
+ return pfn.pfn;
+#endif
+ return page_to_pfn(__pfn_t_to_page(pfn));
+}
#endif /* __ASM_PFN_H */
diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
index fe8cb610deac..a3b4c8c0cd68 100644
--- a/include/linux/dma-debug.h
+++ b/include/linux/dma-debug.h
@@ -34,10 +34,18 @@ extern void dma_debug_init(u32 num_entries);

extern int dma_debug_resize_entries(u32 num_entries);

-extern void debug_dma_map_page(struct device *dev, struct page *page,
- size_t offset, size_t size,
- int direction, dma_addr_t dma_addr,
- bool map_single);
+extern void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
+ size_t size, int direction, dma_addr_t dma_addr,
+ bool map_single);
+
+static inline void debug_dma_map_page(struct device *dev, struct page *page,
+ size_t offset, size_t size,
+ int direction, dma_addr_t dma_addr,
+ bool map_single)
+{
+ return debug_dma_map_pfn(dev, page_to_pfn_t(page), offset, size,
+ direction, dma_addr, map_single);
+}

extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);

@@ -109,6 +117,13 @@ static inline void debug_dma_map_page(struct device *dev, struct page *page,
{
}

+static inline void debug_dma_map_pfn(struct device *dev, __pfn_t pfn,
+ size_t offset, size_t size,
+ int direction, dma_addr_t dma_addr,
+ bool map_single)
+{
+}
+
static inline void debug_dma_mapping_error(struct device *dev,
dma_addr_t dma_addr)
{
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ac07ff090919..d6437b493300 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -26,11 +26,17 @@ struct dma_map_ops {

int (*get_sgtable)(struct device *dev, struct sg_table *sgt, void *,
dma_addr_t, size_t, struct dma_attrs *attrs);
-
+#ifdef CONFIG_HAVE_DMA_PFN
+ dma_addr_t (*map_pfn)(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs);
+#else
dma_addr_t (*map_page)(struct device *dev, struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction dir,
struct dma_attrs *attrs);
+#endif
void (*unmap_page)(struct device *dev, dma_addr_t dma_handle,
size_t size, enum dma_data_direction dir,
struct dma_attrs *attrs);
diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index ae4b65e17e64..c24de1cd8f81 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -1250,11 +1250,12 @@ out:
put_hash_bucket(bucket, &flags);
}

-void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
+void debug_dma_map_pfn(struct device *dev, __pfn_t pfn, size_t offset,
size_t size, int direction, dma_addr_t dma_addr,
bool map_single)
{
struct dma_debug_entry *entry;
+ struct page *page;

if (unlikely(dma_debug_disabled()))
return;
@@ -1268,7 +1269,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,

entry->dev = dev;
entry->type = dma_debug_page;
- entry->pfn = page_to_pfn(page);
+ entry->pfn = __pfn_t_to_pfn(pfn);
entry->offset = offset,
entry->dev_addr = dma_addr;
entry->size = size;
@@ -1278,7 +1279,8 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,
if (map_single)
entry->type = dma_debug_single;

- if (!PageHighMem(page)) {
+ page = __pfn_t_to_page(pfn);
+ if (page && !PageHighMem(page)) {
void *addr = page_address(page) + offset;

check_for_stack(dev, addr);
@@ -1287,7 +1289,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset,

add_dma_entry(entry);
}
-EXPORT_SYMBOL(debug_dma_map_page);
+EXPORT_SYMBOL(debug_dma_map_pfn);

void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
{

2015-05-06 20:08:10

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 05/10] scatterlist: use sg_phys()

A Coccinelle cleanup to replace open-coded scatterlist-to-physical-address
translations with sg_phys(). This is in preparation for introducing
scatterlists that reference pfn(s) without a backing struct page.

// sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
// usage: make coccicheck COCCI=sg_phys.cocci MODE=patch

virtual patch
virtual report
virtual org

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg)) + sg->offset
+ sg_phys(sg)

@@
struct scatterlist *sg;
@@

- page_to_phys(sg_page(sg))
+ sg_phys(sg) - sg->offset
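
For reference, a sketch (not part of the patch; the helper names other than
sg_phys() are illustrative) of what the conversion amounts to at a typical
call site:

#include <linux/scatterlist.h>

/* before: open-coded translation, requires a backing struct page */
static phys_addr_t example_sg_to_phys_open_coded(struct scatterlist *sg)
{
        return page_to_phys(sg_page(sg)) + sg->offset;
}

/* after: same result via the existing sg_phys() helper */
static phys_addr_t example_sg_to_phys(struct scatterlist *sg)
{
        return sg_phys(sg);
}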

Cc: Julia Lawall <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/mm/dma-mapping.c | 2 +-
arch/microblaze/kernel/dma.c | 2 +-
drivers/iommu/intel-iommu.c | 4 ++--
drivers/iommu/iommu.c | 2 +-
drivers/staging/android/ion/ion_chunk_heap.c | 4 ++--
5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 09c5fe3d30c2..43cc6a8fdacc 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1502,7 +1502,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
return -ENOMEM;

for (count = 0, s = sg; count < (size >> PAGE_SHIFT); s = sg_next(s)) {
- phys_addr_t phys = page_to_phys(sg_page(s));
+ phys_addr_t phys = sg_phys(s) - s->offset;
unsigned int len = PAGE_ALIGN(s->offset + s->length);

if (!is_coherent &&
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ed7ba8a11822..dcb3c594d626 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,7 +61,7 @@ static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
/* FIXME this part of code is untested */
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg);
- __dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
+ __dma_sync(sg_phys(sg),
sg->length, direction);
}

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 68d43beccb7e..9b9ada71e0d3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1998,7 +1998,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
sg_res = aligned_nrpages(sg->offset, sg->length);
sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
sg->dma_length = sg->length;
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (sg_phys(sg) - sg->offset) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}

@@ -3302,7 +3302,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,

for_each_sg(sglist, sg, nelems, i) {
BUG_ON(!sg_page(sg));
- sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
+ sg->dma_address = sg_phys(sg);
sg->dma_length = sg->length;
}
return nelems;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d4f527e56679..59808fc9110d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1147,7 +1147,7 @@ size_t default_iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);

for_each_sg(sg, s, nents, i) {
- phys_addr_t phys = page_to_phys(sg_page(s)) + s->offset;
+ phys_addr_t phys = sg_phys(s);

/*
* We are mapping on IOMMU page boundaries, so offset within
diff --git a/drivers/staging/android/ion/ion_chunk_heap.c b/drivers/staging/android/ion/ion_chunk_heap.c
index 3e6ec2ee6802..b7da5d142aa9 100644
--- a/drivers/staging/android/ion/ion_chunk_heap.c
+++ b/drivers/staging/android/ion/ion_chunk_heap.c
@@ -81,7 +81,7 @@ static int ion_chunk_heap_allocate(struct ion_heap *heap,
err:
sg = table->sgl;
for (i -= 1; i >= 0; i--) {
- gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+ gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
sg->length);
sg = sg_next(sg);
}
@@ -109,7 +109,7 @@ static void ion_chunk_heap_free(struct ion_buffer *buffer)
DMA_BIDIRECTIONAL);

for_each_sg(table->sgl, sg, table->nents, i) {
- gen_pool_free(chunk_heap->pool, page_to_phys(sg_page(sg)),
+ gen_pool_free(chunk_heap->pool, sg_phys(sg) - sg->offset,
sg->length);
}
chunk_heap->allocated -= allocated_size;

2015-05-06 20:08:18

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 06/10] scatterlist: support "page-less" (__pfn_t only) entries

From: Matthew Wilcox <[email protected]>

Given that an offset will never be more than PAGE_SIZE, steal the unused
bits of the offset to implement a flags field. Move the existing "this
is a sg_chain() entry" flag to the new flags field, and add a new flag
(SG_FLAGS_PAGE) to indicate that there is a struct page backing for the
entry.
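
A usage sketch (assumes CONFIG_HAVE_DMA_PFN=y; the wrapper name is made up
for illustration) of building a single-entry, pfn-only scatterlist. Such an
entry must be consumed via sg_phys()/the DMA API rather than sg_page():

#include <linux/scatterlist.h>

static void example_sg_init_one_pfn(struct scatterlist *sg, __pfn_t pfn,
                unsigned int len)
{
        sg_init_table(sg, 1);
        sg_set_pfn(sg, pfn, len, 0);    /* leaves SG_FLAGS_PAGE clear */
        sg_mark_end(sg);                /* sg_set_pfn() cleared the flags */
}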

Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
block/blk-merge.c | 2 -
drivers/dma/ste_dma40.c | 5 --
drivers/mmc/card/queue.c | 4 +-
include/asm-generic/scatterlist.h | 9 ++++
include/crypto/scatterwalk.h | 10 ++++
include/linux/scatterlist.h | 91 +++++++++++++++++++++++++++++++++----
6 files changed, 105 insertions(+), 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 218ad1e57a49..82a688551b72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -267,7 +267,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
if (rq->cmd_flags & REQ_WRITE)
memset(q->dma_drain_buffer, 0, q->dma_drain_size);

- sg->page_link &= ~0x02;
+ sg_unmark_end(sg);
sg = sg_next(sg);
sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
q->dma_drain_size,
diff --git a/drivers/dma/ste_dma40.c b/drivers/dma/ste_dma40.c
index 3c10f034d4b9..e8c00642cacb 100644
--- a/drivers/dma/ste_dma40.c
+++ b/drivers/dma/ste_dma40.c
@@ -2562,10 +2562,7 @@ dma40_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t dma_addr,
dma_addr += period_len;
}

- sg[periods].offset = 0;
- sg_dma_len(&sg[periods]) = 0;
- sg[periods].page_link =
- ((unsigned long)sg | 0x01) & ~0x02;
+ sg_chain(sg, periods + 1, sg);

txd = d40_prep_sg(chan, sg, sg, periods, direction,
DMA_PREP_INTERRUPT);
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index 236d194c2883..127f76294e71 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -469,7 +469,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
sg_set_buf(__sg, buf + offset, len);
offset += len;
remain -= len;
- (__sg++)->page_link &= ~0x02;
+ sg_unmark_end(__sg++);
sg_len++;
} while (remain);
}
@@ -477,7 +477,7 @@ static unsigned int mmc_queue_packed_map_sg(struct mmc_queue *mq,
list_for_each_entry(req, &packed->list, queuelist) {
sg_len += blk_rq_map_sg(mq->queue, req, __sg);
__sg = sg + (sg_len - 1);
- (__sg++)->page_link &= ~0x02;
+ sg_unmark_end(__sg++);
}
sg_mark_end(sg + (sg_len - 1));
return sg_len;
diff --git a/include/asm-generic/scatterlist.h b/include/asm-generic/scatterlist.h
index 5de07355fad4..959f51572a8e 100644
--- a/include/asm-generic/scatterlist.h
+++ b/include/asm-generic/scatterlist.h
@@ -7,8 +7,17 @@ struct scatterlist {
#ifdef CONFIG_DEBUG_SG
unsigned long sg_magic;
#endif
+#ifdef CONFIG_HAVE_DMA_PFN
+ union {
+ __pfn_t pfn;
+ struct scatterlist *next;
+ };
+ unsigned short offset;
+ unsigned short sg_flags;
+#else
unsigned long page_link;
unsigned int offset;
+#endif
unsigned int length;
dma_addr_t dma_address;
#ifdef CONFIG_NEED_SG_DMA_LENGTH
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 20e4226a2e14..7296d89a50b2 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -25,6 +25,15 @@
#include <linux/scatterlist.h>
#include <linux/sched.h>

+#ifdef CONFIG_HAVE_DMA_PFN
+/*
+ * If we're using PFNs, the architecture must also have been converted to
+ * support SG_CHAIN. So we can use the generic code instead of custom
+ * code.
+ */
+#define scatterwalk_sg_chain(prv, num, sgl) sg_chain(prv, num, sgl)
+#define scatterwalk_sg_next(sgl) sg_next(sgl)
+#else
static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
struct scatterlist *sg2)
{
@@ -32,6 +41,7 @@ static inline void scatterwalk_sg_chain(struct scatterlist *sg1, int num,
sg1[num - 1].page_link &= ~0x02;
sg1[num - 1].page_link |= 0x01;
}
+#endif

static inline void scatterwalk_crypto_chain(struct scatterlist *head,
struct scatterlist *sg,
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ed8f9e70df9b..9d423e559bdb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
#include <linux/bug.h>
#include <linux/mm.h>

+#include <asm/page.h>
#include <asm/types.h>
#include <asm/scatterlist.h>
#include <asm/io.h>
@@ -18,8 +19,14 @@ struct sg_table {
/*
* Notes on SG table design.
*
- * Architectures must provide an unsigned long page_link field in the
- * scatterlist struct. We use that to place the page pointer AND encode
+ * Architectures may define CONFIG_HAVE_DMA_PFN to indicate that they wish
+ * to support SGLs that point to pages which do not have a struct page to
+ * describe them. If so, they should provide an sg_flags field in their
+ * scatterlist struct (see asm-generic for an example) as well as a pfn
+ * field.
+ *
+ * Otherwise, architectures must provide an unsigned long page_link field in
+ * the scatterlist struct. We use that to place the page pointer AND encode
* information about the sg table as well. The two lower bits are reserved
* for this information.
*
@@ -33,16 +40,25 @@ struct sg_table {
*/

#define SG_MAGIC 0x87654321
-
+#define SG_FLAGS_CHAIN 0x0001
+#define SG_FLAGS_LAST 0x0002
+#define SG_FLAGS_PAGE 0x0004
+
+#ifdef CONFIG_HAVE_DMA_PFN
+#define sg_is_chain(sg) ((sg)->sg_flags & SG_FLAGS_CHAIN)
+#define sg_is_last(sg) ((sg)->sg_flags & SG_FLAGS_LAST)
+#define sg_chain_ptr(sg) ((sg)->next)
+#else /* !CONFIG_HAVE_DMA_PFN */
/*
* We overload the LSB of the page pointer to indicate whether it's
* a valid sg entry, or whether it points to the start of a new scatterlist.
* Those low bits are there for everyone! (thanks mason :-)
*/
-#define sg_is_chain(sg) ((sg)->page_link & 0x01)
-#define sg_is_last(sg) ((sg)->page_link & 0x02)
+#define sg_is_chain(sg) ((sg)->page_link & SG_FLAGS_CHAIN)
+#define sg_is_last(sg) ((sg)->page_link & SG_FLAGS_LAST)
#define sg_chain_ptr(sg) \
((struct scatterlist *) ((sg)->page_link & ~0x03))
+#endif /* !CONFIG_HAVE_DMA_PFN */

/**
* sg_assign_page - Assign a given page to an SG entry
@@ -56,6 +72,14 @@ struct sg_table {
**/
static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
+#ifdef CONFIG_HAVE_DMA_PFN
+#ifdef CONFIG_DEBUG_SG
+ BUG_ON(sg->sg_magic != SG_MAGIC);
+ BUG_ON(sg_is_chain(sg));
+#endif
+ sg->pfn = page_to_pfn_t(page);
+ sg->sg_flags |= SG_FLAGS_PAGE;
+#else /* !CONFIG_HAVE_DMA_PFN */
unsigned long page_link = sg->page_link & 0x3;

/*
@@ -68,6 +92,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
BUG_ON(sg_is_chain(sg));
#endif
sg->page_link = page_link | (unsigned long) page;
+#endif /* !CONFIG_HAVE_DMA_PFN */
}

/**
@@ -88,17 +113,39 @@ static inline void sg_set_page(struct scatterlist *sg, struct page *page,
unsigned int len, unsigned int offset)
{
sg_assign_page(sg, page);
+ BUG_ON(offset > 65535);
sg->offset = offset;
sg->length = len;
}

+#ifdef CONFIG_HAVE_DMA_PFN
+static inline void sg_set_pfn(struct scatterlist *sg, __pfn_t pfn,
+ unsigned int len, unsigned int offset)
+{
+#ifdef CONFIG_DEBUG_SG
+ BUG_ON(sg->sg_magic != SG_MAGIC);
+ BUG_ON(sg_is_chain(sg));
+#endif
+ sg->pfn = pfn;
+ BUG_ON(offset > 65535);
+ sg->offset = offset;
+ sg->sg_flags = 0;
+ sg->length = len;
+}
+#endif
+
static inline struct page *sg_page(struct scatterlist *sg)
{
#ifdef CONFIG_DEBUG_SG
BUG_ON(sg->sg_magic != SG_MAGIC);
BUG_ON(sg_is_chain(sg));
#endif
+#ifdef CONFIG_HAVE_DMA_PFN
+ BUG_ON(!(sg->sg_flags & SG_FLAGS_PAGE));
+ return __pfn_t_to_page(sg->pfn);
+#else
return (struct page *)((sg)->page_link & ~0x3);
+#endif
}

/**
@@ -150,7 +197,12 @@ static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
* Set lowest bit to indicate a link pointer, and make sure to clear
* the termination bit if it happens to be set.
*/
+#ifdef CONFIG_HAVE_DMA_PFN
+ prv[prv_nents - 1].next = sgl;
+ prv[prv_nents - 1].sg_flags = SG_FLAGS_CHAIN;
+#else
prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
+#endif
}

/**
@@ -170,8 +222,13 @@ static inline void sg_mark_end(struct scatterlist *sg)
/*
* Set termination bit, clear potential chain bit
*/
- sg->page_link |= 0x02;
- sg->page_link &= ~0x01;
+#ifdef CONFIG_HAVE_DMA_PFN
+ sg->sg_flags |= SG_FLAGS_LAST;
+ sg->sg_flags &= ~SG_FLAGS_CHAIN;
+#else
+ sg->page_link |= SG_FLAGS_LAST;
+ sg->page_link &= ~SG_FLAGS_CHAIN;
+#endif
}

/**
@@ -187,7 +244,11 @@ static inline void sg_unmark_end(struct scatterlist *sg)
#ifdef CONFIG_DEBUG_SG
BUG_ON(sg->sg_magic != SG_MAGIC);
#endif
- sg->page_link &= ~0x02;
+#ifdef CONFIG_HAVE_DMA_PFN
+ sg->sg_flags &= ~SG_FLAGS_LAST;
+#else
+ sg->page_link &= ~SG_FLAGS_LAST;
+#endif
}

/**
@@ -202,7 +263,11 @@ static inline void sg_unmark_end(struct scatterlist *sg)
**/
static inline dma_addr_t sg_phys(struct scatterlist *sg)
{
+#ifdef CONFIG_HAVE_DMA_PFN
+ return __pfn_t_to_phys(sg->pfn) + sg->offset;
+#else
return page_to_phys(sg_page(sg)) + sg->offset;
+#endif
}

/**
@@ -217,7 +282,15 @@ static inline dma_addr_t sg_phys(struct scatterlist *sg)
**/
static inline void *sg_virt(struct scatterlist *sg)
{
- return page_address(sg_page(sg)) + sg->offset;
+ struct page *page;
+
+#ifdef CONFIG_HAVE_DMA_PFN
+ page = __pfn_t_to_page(sg->pfn);
+ BUG_ON(!page); /* don't use sg_virt() on unmapped memory */
+#else
+ page = sg_page(sg);
+#endif
+ return page_address(page) + sg->offset;
}

int sg_nents(struct scatterlist *sg);

2015-05-06 20:09:29

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 07/10] x86: support dma_map_pfn()

Fix up x86 dma_map_ops to allow pfn-only mappings.

As long as a dma_map_sg() implementation uses the generic sg_phys()
helpers it can support scatterlists that use __pfn_t instead of struct
page.
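
For illustration (a sketch, not part of the patch; assumes the arch
provides phys_to_dma()), a dma_map_sg() implementation that only needs bus
addresses stays struct-page-agnostic by going through sg_phys():

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int example_map_sg(struct device *dev, struct scatterlist *sgl,
                int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                /* no sg_page() here, so pfn-only entries are fine */
                sg->dma_address = phys_to_dma(dev, sg_phys(sg));
                sg_dma_len(sg) = sg->length;
        }
        return nents;
}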

Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/Kconfig | 5 +++++
arch/x86/kernel/amd_gart_64.c | 22 +++++++++++++++++-----
arch/x86/kernel/pci-nommu.c | 22 +++++++++++++++++-----
arch/x86/kernel/pci-swiotlb.c | 4 ++++
arch/x86/pci/sta2x11-fixup.c | 4 ++++
arch/x86/xen/pci-swiotlb-xen.c | 4 ++++
drivers/iommu/amd_iommu.c | 21 ++++++++++++++++-----
drivers/iommu/intel-iommu.c | 22 +++++++++++++++++-----
drivers/xen/swiotlb-xen.c | 29 +++++++++++++++++++----------
include/asm-generic/dma-mapping-common.h | 4 ++--
include/asm-generic/scatterlist.h | 1 +
include/linux/swiotlb.h | 4 ++++
lib/swiotlb.c | 20 +++++++++++++++-----
13 files changed, 125 insertions(+), 37 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..1fae5e842423 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -796,6 +796,7 @@ config CALGARY_IOMMU
bool "IBM Calgary IOMMU support"
select SWIOTLB
depends on X86_64 && PCI
+ depends on !HAVE_DMA_PFN
---help---
Support for hardware IOMMUs in IBM's xSeries x366 and x460
systems. Needed to run systems with more than 3GB of memory
@@ -1432,6 +1433,10 @@ config X86_PMEM_LEGACY

Say Y if unsure.

+config X86_PMEM_DMA
+ def_bool PMEM_IO
+ select HAVE_DMA_PFN
+
config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
depends on HIGHMEM
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 8e3842fc8bea..8fad83c8dfd2 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -239,13 +239,13 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
}

/* Map a single area into the IOMMU */
-static dma_addr_t gart_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t gart_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
unsigned long bus;
- phys_addr_t paddr = page_to_phys(page) + offset;
+ phys_addr_t paddr = __pfn_t_to_phys(pfn) + offset;

if (!dev)
dev = &x86_dma_fallback_dev;
@@ -259,6 +259,14 @@ static dma_addr_t gart_map_page(struct device *dev, struct page *page,
return bus;
}

+static __maybe_unused dma_addr_t gart_map_page(struct device *dev,
+ struct page *page, unsigned long offset, size_t size,
+ enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+ return gart_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
+
/*
* Free a DMA mapping.
*/
@@ -699,7 +707,11 @@ static __init int init_amd_gatt(struct agp_kern_info *info)
static struct dma_map_ops gart_dma_ops = {
.map_sg = gart_map_sg,
.unmap_sg = gart_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = gart_map_pfn,
+#else
.map_page = gart_map_page,
+#endif
.unmap_page = gart_unmap_page,
.alloc = gart_alloc_coherent,
.free = gart_free_coherent,
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index da15918d1c81..876dacfbabf6 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -25,12 +25,12 @@ check_addr(char *name, struct device *hwdev, dma_addr_t bus, size_t size)
return 1;
}

-static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t nommu_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
- dma_addr_t bus = page_to_phys(page) + offset;
+ dma_addr_t bus = __pfn_t_to_phys(pfn) + offset;
WARN_ON(size == 0);
if (!check_addr("map_single", dev, bus, size))
return DMA_ERROR_CODE;
@@ -38,6 +38,14 @@ static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
return bus;
}

+static __maybe_unused dma_addr_t nommu_map_page(struct device *dev,
+ struct page *page, unsigned long offset, size_t size,
+ enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+ return nommu_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
+
/* Map a set of buffers described by scatterlist in streaming
* mode for DMA. This is the scatter-gather version of the
* above pci_map_single interface. Here the scatter gather list
@@ -92,7 +100,11 @@ struct dma_map_ops nommu_dma_ops = {
.alloc = dma_generic_alloc_coherent,
.free = dma_generic_free_coherent,
.map_sg = nommu_map_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = nommu_map_pfn,
+#else
.map_page = nommu_map_page,
+#endif
.sync_single_for_device = nommu_sync_single_for_device,
.sync_sg_for_device = nommu_sync_sg_for_device,
.is_phys = 1,
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index 77dd0ad58be4..5351eb8c8f7f 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -48,7 +48,11 @@ static struct dma_map_ops swiotlb_dma_ops = {
.sync_sg_for_device = swiotlb_sync_sg_for_device,
.map_sg = swiotlb_map_sg_attrs,
.unmap_sg = swiotlb_unmap_sg_attrs,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = swiotlb_map_pfn,
+#else
.map_page = swiotlb_map_page,
+#endif
.unmap_page = swiotlb_unmap_page,
.dma_supported = NULL,
};
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 5ceda85b8687..d1c6e3808bb5 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -182,7 +182,11 @@ static void *sta2x11_swiotlb_alloc_coherent(struct device *dev,
static struct dma_map_ops sta2x11_dma_ops = {
.alloc = sta2x11_swiotlb_alloc_coherent,
.free = x86_swiotlb_free_coherent,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = swiotlb_map_pfn,
+#else
.map_page = swiotlb_map_page,
+#endif
.unmap_page = swiotlb_unmap_page,
.map_sg = swiotlb_map_sg_attrs,
.unmap_sg = swiotlb_unmap_sg_attrs,
diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 0e98e5d241d0..e65ea48d7aed 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -28,7 +28,11 @@ static struct dma_map_ops xen_swiotlb_dma_ops = {
.sync_sg_for_device = xen_swiotlb_sync_sg_for_device,
.map_sg = xen_swiotlb_map_sg_attrs,
.unmap_sg = xen_swiotlb_unmap_sg_attrs,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = xen_swiotlb_map_pfn,
+#else
.map_page = xen_swiotlb_map_page,
+#endif
.unmap_page = xen_swiotlb_unmap_page,
.dma_supported = xen_swiotlb_dma_supported,
};
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index e43d48956dea..ee8f70224b73 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2754,16 +2754,15 @@ static void __unmap_single(struct dma_ops_domain *dma_dom,
/*
* The exported map_single function for dma_ops.
*/
-static dma_addr_t map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t map_pfn(struct device *dev, __pfn_t pfn, unsigned long offset,
+ size_t size, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
unsigned long flags;
struct protection_domain *domain;
dma_addr_t addr;
u64 dma_mask;
- phys_addr_t paddr = page_to_phys(page) + offset;
+ phys_addr_t paddr = __pfn_t_to_phys(pfn) + offset;

INC_STATS_COUNTER(cnt_map_single);

@@ -2788,6 +2787,14 @@ out:
spin_unlock_irqrestore(&domain->lock, flags);

return addr;
+
+}
+
+static __maybe_unused dma_addr_t map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ return map_pfn(dev, page_to_pfn_t(page), offset, size, dir, attrs);
}

/*
@@ -3062,7 +3069,11 @@ static void __init prealloc_protection_domains(void)
static struct dma_map_ops amd_iommu_dma_ops = {
.alloc = alloc_coherent,
.free = free_coherent,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = map_pfn,
+#else
.map_page = map_page,
+#endif
.unmap_page = unmap_page,
.map_sg = map_sg,
.unmap_sg = unmap_sg,
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 9b9ada71e0d3..6d9a0f85b827 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3086,15 +3086,23 @@ error:
return 0;
}

-static dma_addr_t intel_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t intel_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
- return __intel_map_single(dev, page_to_phys(page) + offset, size,
+ return __intel_map_single(dev, __pfn_t_to_phys(pfn) + offset, size,
dir, *dev->dma_mask);
}

+static __maybe_unused dma_addr_t intel_map_page(struct device *dev,
+ struct page *page, unsigned long offset, size_t size,
+ enum dma_data_direction dir, struct dma_attrs *attrs)
+{
+ return intel_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
+
static void flush_unmaps(void)
{
int i, j;
@@ -3380,7 +3388,11 @@ struct dma_map_ops intel_dma_ops = {
.free = intel_free_coherent,
.map_sg = intel_map_sg,
.unmap_sg = intel_unmap_sg,
+#ifdef CONFIG_HAVE_DMA_PFN
+ .map_pfn = intel_map_pfn,
+#else
.map_page = intel_map_page,
+#endif
.unmap_page = intel_unmap_page,
.mapping_error = intel_mapping_error,
};
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 810ad419e34c..bd29d09bbacc 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -382,10 +382,10 @@ EXPORT_SYMBOL_GPL(xen_swiotlb_free_coherent);
* Once the device is given the dma address, the device owns this memory until
* either xen_swiotlb_unmap_page or xen_swiotlb_dma_sync_single is performed.
*/
-dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+dma_addr_t xen_swiotlb_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
- phys_addr_t map, phys = page_to_phys(page) + offset;
+ phys_addr_t map, phys = __pfn_t_to_phys(pfn) + offset;
dma_addr_t dev_addr = xen_phys_to_bus(phys);
@@ -429,6 +429,16 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
}
return dev_addr;
}
+EXPORT_SYMBOL_GPL(xen_swiotlb_map_pfn);
+
+dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ return xen_swiotlb_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);

/*
@@ -582,15 +592,14 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
attrs);
sg->dma_address = xen_phys_to_bus(map);
} else {
+ __pfn_t pfn = { .pfn = paddr >> PAGE_SHIFT };
+ unsigned long offset = paddr & ~PAGE_MASK;
+
/* we are not interested in the dma_addr returned by
* xen_dma_map_page, only in the potential cache flushes executed
* by the function. */
- xen_dma_map_page(hwdev, pfn_to_page(paddr >> PAGE_SHIFT),
- dev_addr,
- paddr & ~PAGE_MASK,
- sg->length,
- dir,
- attrs);
+ xen_dma_map_pfn(hwdev, pfn, dev_addr, offset,
+ sg->length, attrs);
sg->dma_address = dev_addr;
}
sg_dma_len(sg) = sg->length;
diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index 7305efb1bac6..e031b079ce4e 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -18,7 +18,7 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
kmemcheck_mark_initialized(ptr, size);
BUG_ON(!valid_dma_direction(dir));
#ifdef CONFIG_HAVE_DMA_PFN
- addr = ops->map_pfn(dev, page_to_pfn_typed(virt_to_page(ptr)),
+ addr = ops->map_pfn(dev, page_to_pfn_t(virt_to_page(ptr)),
(unsigned long)ptr & ~PAGE_MASK, size,
dir, attrs);
#else
@@ -99,7 +99,7 @@ static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
enum dma_data_direction dir)
{
kmemcheck_mark_initialized(page_address(page) + offset, size);
- return dma_map_pfn(dev, page_to_pfn_typed(page), offset, size, dir);
+ return dma_map_pfn(dev, page_to_pfn_t(page), offset, size, dir);
}
#else
static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
diff --git a/include/asm-generic/scatterlist.h b/include/asm-generic/scatterlist.h
index 959f51572a8e..ddc743513e55 100644
--- a/include/asm-generic/scatterlist.h
+++ b/include/asm-generic/scatterlist.h
@@ -2,6 +2,7 @@
#define __ASM_GENERIC_SCATTERLIST_H

#include <linux/types.h>
+#include <linux/mm.h>

struct scatterlist {
#ifdef CONFIG_DEBUG_SG
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e7a018eaf3a2..5093fc8d2825 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -66,6 +66,10 @@ extern dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction dir,
struct dma_attrs *attrs);
+extern dma_addr_t swiotlb_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs);
extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
size_t size, enum dma_data_direction dir,
struct dma_attrs *attrs);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 4abda074ea45..4183d691bf2e 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -727,12 +727,12 @@ swiotlb_full(struct device *dev, size_t size, enum dma_data_direction dir,
* Once the device is given the dma address, the device owns this memory until
* either swiotlb_unmap_page or swiotlb_dma_sync_single is performed.
*/
-dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+dma_addr_t swiotlb_map_pfn(struct device *dev, __pfn_t pfn,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
{
- phys_addr_t map, phys = page_to_phys(page) + offset;
+ phys_addr_t map, phys = __pfn_t_to_phys(pfn) + offset;
dma_addr_t dev_addr = phys_to_dma(dev, phys);

BUG_ON(dir == DMA_NONE);
@@ -763,6 +763,16 @@ dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,

return dev_addr;
}
+EXPORT_SYMBOL_GPL(swiotlb_map_pfn);
+
+dma_addr_t swiotlb_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ return swiotlb_map_pfn(dev, page_to_pfn_t(page), offset, size, dir,
+ attrs);
+}
EXPORT_SYMBOL_GPL(swiotlb_map_page);

/*

2015-05-06 20:08:26

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 08/10] x86: support kmap_atomic_pfn_t() for persistent memory

It would be unfortunate if the kmap infrastructure escaped its current
32-bit/HIGHMEM bonds and leaked into 64-bit code. Instead, if the user
has enabled CONFIG_PMEM_IO we direct the kmap_atomic_pfn_t()
implementation to scan a list of pre-mapped persistent memory address
ranges inserted by the pmem driver.

The __pfn_t-to-resource lookup is indeed an inefficient walk of a linked
list, but there are two mitigating factors:

1/ The number of persistent memory ranges is bounded by the number of
DIMMs which is on the order of 10s of DIMMs, not hundreds.

2/ The lookup yields the entire range; if it becomes inefficient to do a
kmap_atomic_pfn_t() one PAGE_SIZE at a time, the caller can amortize a
single lookup across all the kmap operations it needs to perform in that
range (see the usage sketch below).
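
The usage sketch referenced above (illustrative only; the helper shown here
is not part of the patch):

#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/string.h>

static void example_zero_one_page(__pfn_t pfn, unsigned int len)
{
        void *addr = kmap_atomic_pfn_t(pfn);

        if (!addr)
                return; /* not page backed and not a registered pmem range */
        memset(addr, 0, min_t(unsigned int, len, PAGE_SIZE));
        kunmap_atomic_pfn_t(addr);
}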

Signed-off-by: Dan Williams <[email protected]>
---
arch/Kconfig | 3 +
arch/x86/Kconfig | 2 +
arch/x86/kernel/Makefile | 1
arch/x86/kernel/kmap.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/block/pmem.c | 6 +++
include/linux/highmem.h | 23 +++++++++++
6 files changed, 130 insertions(+)
create mode 100644 arch/x86/kernel/kmap.c

diff --git a/arch/Kconfig b/arch/Kconfig
index f7f800860c00..69d3a3fa21af 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -206,6 +206,9 @@ config HAVE_DMA_CONTIGUOUS
config HAVE_DMA_PFN
bool

+config HAVE_KMAP_PFN
+ bool
+
config GENERIC_SMP_IDLE_THREAD
bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1fae5e842423..eddaea839500 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1434,7 +1434,9 @@ config X86_PMEM_LEGACY
Say Y if unsure.

config X86_PMEM_DMA
+ depends on !HIGHMEM
def_bool PMEM_IO
+ select HAVE_KMAP_PFN
select HAVE_DMA_PFN

config HIGHPTE
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9bcd0b56ca17..44c323342996 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_X86_PMEM_LEGACY) += pmem.o
+obj-$(CONFIG_X86_PMEM_DMA) += kmap.o

obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o

diff --git a/arch/x86/kernel/kmap.c b/arch/x86/kernel/kmap.c
new file mode 100644
index 000000000000..d597c475377b
--- /dev/null
+++ b/arch/x86/kernel/kmap.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/rcupdate.h>
+#include <linux/rculist.h>
+#include <linux/highmem.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+static LIST_HEAD(ranges);
+
+struct kmap {
+ struct list_head list;
+ struct resource *res;
+ struct device *dev;
+ void *base;
+};
+
+static void teardown_kmap(void *data)
+{
+ struct kmap *kmap = data;
+
+ dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
+ list_del_rcu(&kmap->list);
+ synchronize_rcu();
+ kfree(kmap);
+}
+
+int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
+ void *base)
+{
+ struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
+ int rc;
+
+ if (!kmap)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&kmap->list);
+ kmap->res = res;
+ kmap->base = base;
+ kmap->dev = dev;
+ rc = devm_add_action(dev, teardown_kmap, kmap);
+ if (rc) {
+ kfree(kmap);
+ return rc;
+ }
+ dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
+ list_add_rcu(&kmap->list, &ranges);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
+
+void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+ struct page *page = __pfn_t_to_page(pfn);
+ resource_size_t addr;
+ struct kmap *kmap;
+
+ if (page)
+ return kmap_atomic(page);
+ addr = __pfn_t_to_phys(pfn);
+ rcu_read_lock();
+ list_for_each_entry_rcu(kmap, &ranges, list)
+ if (addr >= kmap->res->start && addr <= kmap->res->end)
+ return kmap->base + addr - kmap->res->start;
+
+ /* only unlock in the error case */
+ rcu_read_unlock();
+ return NULL;
+}
+EXPORT_SYMBOL(kmap_atomic_pfn_t);
+
+void kunmap_atomic_pfn_t(void *addr)
+{
+ rcu_read_unlock();
+
+ /*
+ * If the original __pfn_t had an entry in the memmap then
+ * 'addr' will be outside of vmalloc space i.e. it came from
+ * page_address()
+ */
+ if (!is_vmalloc_addr(addr))
+ kunmap_atomic(addr);
+}
+EXPORT_SYMBOL(kunmap_atomic_pfn_t);
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 41bb424533e6..2a847651f8de 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -23,6 +23,7 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/slab.h>
+#include <linux/highmem.h>

#define PMEM_MINORS 16

@@ -147,6 +148,11 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
if (!pmem->virt_addr)
goto out_release_region;

+ err = devm_register_kmap_pfn_range(dev, res, pmem->virt_addr);
+ if (err)
+ goto out_unmap;
+
+ err = -ENOMEM;
pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
if (!pmem->pmem_queue)
goto out_unmap;
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 9286a46b7d69..85fd52d43a9a 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -83,6 +83,29 @@ static inline void __kunmap_atomic(void *addr)

#endif /* CONFIG_HIGHMEM */

+#ifdef CONFIG_HAVE_KMAP_PFN
+extern void *kmap_atomic_pfn_t(__pfn_t pfn);
+extern void kunmap_atomic_pfn_t(void *addr);
+extern int devm_register_kmap_pfn_range(struct device *dev,
+ struct resource *res, void *base);
+#else
+static inline void *kmap_atomic_pfn_t(__pfn_t pfn)
+{
+ return kmap_atomic(__pfn_t_to_page(pfn));
+}
+
+static inline void kunmap_atomic_pfn_t(void *addr)
+{
+ __kunmap_atomic(addr);
+}
+
+static inline int devm_register_kmap_pfn_range(struct device *dev,
+ struct resource *res, void *base)
+{
+ return 0;
+}
+#endif /* CONFIG_HAVE_KMAP_PFN */
+
#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)

DECLARE_PER_CPU(int, __kmap_atomic_idx);

2015-05-06 20:08:29

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 09/10] dax: convert to __pfn_t

The primary way for non-page-backed page frames to enter the system is via
the pmem driver's ->direct_access() method. The pfns returned by the
top-level bdev_direct_access() may be passed to any other subsystem in the
kernel, and those subsystems either need to assume that the pfn is page
backed (CONFIG_PMEM_IO=n) or be prepared to handle the non-page-backed case
(CONFIG_PMEM_IO=y). Currently the pfns returned by ->direct_access() are
only ever used by vm_insert_mixed(), which does not care whether the pfn is
mapped. As we go on to add more usages of these pfns, add the type safety
of __pfn_t.
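
For illustration (a sketch, not part of the patch), a caller of the updated
bdev_direct_access() receives a __pfn_t and only converts to a raw pfn at
the point where one is unavoidable:

#include <linux/blkdev.h>
#include <linux/printk.h>

static long example_direct_probe(struct block_device *bdev, sector_t sector)
{
        void *addr;
        __pfn_t pfn;
        long avail;

        avail = bdev_direct_access(bdev, sector, &addr, &pfn, PAGE_SIZE);
        if (avail < 0)
                return avail;
        /* e.g. vm_insert_mixed() still wants an unsigned long pfn */
        pr_debug("pfn %#lx, %ld bytes available\n",
                        __pfn_t_to_pfn(pfn), avail);
        return avail;
}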

Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 4 ++--
drivers/block/brd.c | 4 ++--
drivers/block/pmem.c | 8 +++++---
drivers/s390/block/dcssblk.c | 6 +++---
fs/block_dev.c | 2 +-
fs/dax.c | 9 +++++----
include/asm-generic/pfn.h | 7 +++++++
include/linux/blkdev.h | 4 ++--
8 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 9bb5da7f2c0c..069cb5285f18 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,13 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
*/
static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+ void **kaddr, __pfn_t *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

*kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+ *pfn = phys_to_pfn_t(virt_to_phys(*kaddr));

return bank->size - offset;
}
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 115c6cf9cb43..57f4cd787ea2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,

#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+ void **kaddr, __pfn_t *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -382,7 +382,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
if (!page)
return -ENOSPC;
*kaddr = page_address(page);
- *pfn = page_to_pfn(page);
+ *pfn = page_to_pfn_t(page);

/*
* TODO: If size > PAGE_SIZE, we could look to see if the next page in
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 2a847651f8de..18edb48e405e 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -98,8 +98,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
return 0;
}

-static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn, long size)
+static long __maybe_unused pmem_direct_access(struct block_device *bdev,
+ sector_t sector, void **kaddr, __pfn_t *pfn, long size)
{
struct pmem_device *pmem = bdev->bd_disk->private_data;
size_t offset = sector << 9;
@@ -108,7 +108,7 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
return -ENODEV;

*kaddr = pmem->virt_addr + offset;
- *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+ *pfn = phys_to_pfn_t(pmem->phys_addr + offset);

return pmem->size - offset;
}
@@ -116,7 +116,9 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
static const struct block_device_operations pmem_fops = {
.owner = THIS_MODULE,
.rw_page = pmem_rw_page,
+#if IS_ENABLED(CONFIG_PMEM_IO)
.direct_access = pmem_direct_access,
+#endif
};

static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5da8515b8fb9..8616c1d33786 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn, long size);
+ void **kaddr, __pfn_t *pfn, long size);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -879,7 +879,7 @@ fail:

static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn, long size)
+ void **kaddr, __pfn_t *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;
@@ -890,7 +890,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
dev_sz = dev_info->end - dev_info->start;
offset = secnum * 512;
*kaddr = (void *) (dev_info->start + offset);
- *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
+ *pfn = phys_to_pfn_t(virt_to_phys(*kaddr));

return dev_sz - offset;
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7e4163ede87..7285c31f7e30 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -437,7 +437,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
* accessible at this address.
*/
long bdev_direct_access(struct block_device *bdev, sector_t sector,
- void **addr, unsigned long *pfn, long size)
+ void **addr, __pfn_t *pfn, long size)
{
long avail;
const struct block_device_operations *ops = bdev->bd_disk->fops;
diff --git a/fs/dax.c b/fs/dax.c
index 6f65f00e58ec..198bd0e4b5ae 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -35,7 +35,7 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
might_sleep();
do {
void *addr;
- unsigned long pfn;
+ __pfn_t pfn;
long count;

count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
@@ -65,7 +65,8 @@ EXPORT_SYMBOL_GPL(dax_clear_blocks);

static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
{
- unsigned long pfn;
+ __pfn_t pfn;
+
sector_t sector = bh->b_blocknr << (blkbits - 9);
return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
}
@@ -274,7 +275,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
unsigned long vaddr = (unsigned long)vmf->virtual_address;
void *addr;
- unsigned long pfn;
+ __pfn_t pfn;
pgoff_t size;
int error;

@@ -304,7 +305,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
if (buffer_unwritten(bh) || buffer_new(bh))
clear_page(addr);

- error = vm_insert_mixed(vma, vaddr, pfn);
+ error = vm_insert_mixed(vma, vaddr, __pfn_t_to_pfn(pfn));

out:
i_mmap_unlock_read(mapping);
diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
index c1fdf41fb726..af219dc96792 100644
--- a/include/asm-generic/pfn.h
+++ b/include/asm-generic/pfn.h
@@ -49,6 +49,13 @@ static inline __pfn_t page_to_pfn_t(struct page *page)
return pfn;
}

+static inline __pfn_t phys_to_pfn_t(phys_addr_t addr)
+{
+ __pfn_t pfn = { .pfn = addr >> PAGE_SHIFT };
+
+ return pfn;
+}
+
static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
{
#if IS_ENABLED(CONFIG_PMEM_IO)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516f24de..2692d3936f5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1605,7 +1605,7 @@ struct block_device_operations {
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
long (*direct_access)(struct block_device *, sector_t,
- void **, unsigned long *pfn, long size);
+ void **, __pfn_t *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1624,7 +1624,7 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
- unsigned long *pfn, long size);
+ __pfn_t *pfn, long size);
#else /* CONFIG_BLOCK */

struct block_device;

2015-05-06 20:08:38

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 10/10] block: base support for pfn i/o

Allow block device drivers to opt in to receiving bio(s) whose bio_vec(s)
point to memory that is not backed by struct page entries. When a driver
opts in, it asserts that it will use the __pfn_t versions of the
dma_map/kmap/scatterlist APIs in its bio submission path.
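
A minimal sketch (identifiers other than bio_add_pfn() and QUEUE_FLAG_PFN
are illustrative) of a driver opting in and submitting a pfn-based write:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * At queue setup time the driver asserts that it handles pfn-only
 * bio_vecs: queue_flag_set_unlocked(QUEUE_FLAG_PFN, q);
 */
static int example_submit_pfn_write(struct block_device *bdev, __pfn_t pfn,
                unsigned int len, sector_t sector)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);

        if (!bio)
                return -ENOMEM;
        bio->bi_bdev = bdev;
        bio->bi_iter.bi_sector = sector;
        if (!bio_add_pfn(bio, pfn, len, 0)) {
                bio_put(bio);
                return -EOPNOTSUPP; /* the queue did not opt in */
        }
        submit_bio(WRITE, bio);
        return 0;
}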

Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
block/bio.c | 48 ++++++++++++++++++++++++++++++++++++++-------
block/blk-core.c | 9 ++++++++
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 2 ++
4 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7100fd6d5898..9c506dd6a093 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -567,6 +567,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio->bi_rw = bio_src->bi_rw;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
+ bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
}
EXPORT_SYMBOL(__bio_clone_fast);

@@ -658,6 +659,8 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
goto integrity_clone;
}

+ bio->bi_flags |= bio_src->bi_flags & (1 << BIO_PFN);
+
bio_for_each_segment(bv, bio_src, iter)
bio->bi_io_vec[bio->bi_vcnt++] = bv;

@@ -699,9 +702,9 @@ int bio_get_nr_vecs(struct block_device *bdev)
}
EXPORT_SYMBOL(bio_get_nr_vecs);

-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned int max_sectors)
+static int __bio_add_pfn(struct request_queue *q, struct bio *bio,
+ __pfn_t pfn, unsigned int len, unsigned int offset,
+ unsigned int max_sectors)
{
int retried_segments = 0;
struct bio_vec *bvec;
@@ -723,7 +726,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
if (bio->bi_vcnt > 0) {
struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];

- if (page == bvec_page(prev) &&
+ if (pfn.pfn == prev->bv_pfn.pfn &&
offset == prev->bv_offset + prev->bv_len) {
unsigned int prev_bv_len = prev->bv_len;
prev->bv_len += len;
@@ -768,7 +771,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
* cannot add the page
*/
bvec = &bio->bi_io_vec[bio->bi_vcnt];
- bvec_set_page(bvec, page);
+ bvec->bv_pfn = pfn;
bvec->bv_len = len;
bvec->bv_offset = offset;
bio->bi_vcnt++;
@@ -818,7 +821,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
return len;

failed:
- bvec_set_page(bvec, NULL);
+ bvec->bv_pfn.pfn = 0;
bvec->bv_len = 0;
bvec->bv_offset = 0;
bio->bi_vcnt--;
@@ -845,7 +848,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
unsigned int len, unsigned int offset)
{
- return __bio_add_page(q, bio, page, len, offset,
+ return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
queue_max_hw_sectors(q));
}
EXPORT_SYMBOL(bio_add_pc_page);
@@ -872,10 +875,39 @@ int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
max_sectors = len >> 9;

- return __bio_add_page(q, bio, page, len, offset, max_sectors);
+ return __bio_add_pfn(q, bio, page_to_pfn_t(page), len, offset,
+ max_sectors);
}
EXPORT_SYMBOL(bio_add_page);

+/**
+ * bio_add_pfn - attempt to add pfn to bio
+ * @bio: destination bio
+ * @pfn: pfn to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Identical to bio_add_page() except this variant flags the bio as
+ * not having struct page backing. A given request_queue must assert
+ * that it is prepared to handle this constraint before bio(s)
+ * flagged in this manner can be passed.
+ */
+int bio_add_pfn(struct bio *bio, __pfn_t pfn, unsigned int len,
+ unsigned int offset)
+{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ unsigned int max_sectors;
+
+ if (!blk_queue_pfn(q))
+ return 0;
+ set_bit(BIO_PFN, &bio->bi_flags);
+ max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+ if ((max_sectors < (len >> 9)) && !bio->bi_iter.bi_size)
+ max_sectors = len >> 9;
+
+ return __bio_add_pfn(q, bio, pfn, len, offset, max_sectors);
+}
+
struct submit_bio_ret {
struct completion event;
int error;
diff --git a/block/blk-core.c b/block/blk-core.c
index 94d2c6ccf801..4eefff363986 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1856,6 +1856,15 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}

+ if (bio_flagged(bio, BIO_PFN)) {
+ if (IS_ENABLED(CONFIG_PMEM_IO) && blk_queue_pfn(q))
+ /* pass */;
+ else {
+ err = -EOPNOTSUPP;
+ goto end_io;
+ }
+ }
+
/*
* Various block parts want %current->io_context and lazy ioc
* allocation ends up trading a lot of pain for a small amount of
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2a15c4943db6..5716c7edc4c4 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -139,6 +139,7 @@ struct bio {
#define BIO_NULL_MAPPED 8 /* contains invalid user pages */
#define BIO_QUIET 9 /* Make BIO Quiet */
#define BIO_SNAP_STABLE 10 /* bio data must be snapshotted during write */
+#define BIO_PFN 11 /* bio_vec references memory without struct page */

/*
* Flags starting here get preserved by bio_reset() - this includes
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2692d3936f5f..e6adcfeb72db 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -513,6 +513,7 @@ struct request_queue {
#define QUEUE_FLAG_INIT_DONE 20 /* queue is initialized */
#define QUEUE_FLAG_NO_SG_MERGE 21 /* don't attempt to merge SG segments*/
#define QUEUE_FLAG_SG_GAPS 22 /* queue doesn't support SG gaps */
+#define QUEUE_FLAG_PFN 23 /* queue supports pfn-only bio_vec(s) */

#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
@@ -594,6 +595,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
#define blk_queue_noxmerges(q) \
test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
#define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
+#define blk_queue_pfn(q) test_bit(QUEUE_FLAG_PFN, &(q)->queue_flags)
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
#define blk_queue_stackable(q) \

2015-05-06 20:20:39

by Dan Williams

[permalink] [raw]
Subject: Re: [Linux-nvdimm] [PATCH v2 08/10] x86: support kmap_atomic_pfn_t() for persistent memory

On Wed, May 6, 2015 at 1:05 PM, Dan Williams <[email protected]> wrote:
> [..]
> +void kunmap_atomic_pfn_t(void *addr)
> +{
> + rcu_read_unlock();
> +
> + /*
> + * If the original __pfn_t had an entry in the memmap then
> + * 'addr' will be outside of vmalloc space i.e. it came from
> + * page_address()
> + */
> + if (!is_vmalloc_addr(addr))
> + kunmap_atomic(addr);

rcu_read_unlock() should move here.
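
I.e. a sketch of the corrected ordering, with the unlock after the
conditional kunmap:

void kunmap_atomic_pfn_t(void *addr)
{
        /*
         * If the original __pfn_t had an entry in the memmap then
         * 'addr' will be outside of vmalloc space i.e. it came from
         * page_address()
         */
        if (!is_vmalloc_addr(addr))
                kunmap_atomic(addr);
        rcu_read_unlock();
}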

2015-05-06 20:50:54

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Wed, May 06, 2015 at 04:04:53PM -0400, Dan Williams wrote:
> Changes since v1 [1]:
>
> 1/ added include/asm-generic/pfn.h for the __pfn_t definition and helpers.
>
> 2/ added kmap_atomic_pfn_t()
>
> 3/ rebased on v4.1-rc2
>
> [1]: http://marc.info/?l=linux-kernel&m=142653770511970&w=2
>
> ---
>
> A lead in note, this looks scarier than it is. Most of the code thrash
> is automated via Coccinelle. Also the subtle differences behind an
> 'unsigned long pfn' and a '__pfn_t' are mitigated by type-safety and a
> Kconfig option (default disabled CONFIG_PMEM_IO) that globally controls
> whether a pfn and a __pfn_t are equivalent.
>
> The motivation for this change is persistent memory and the desire to
> use it not only via the pmem driver, but also as a memory target for I/O
> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel. Aside
> from the pmem driver and DAX, persistent memory is not able to be used
> in these I/O scenarios due to the lack of a backing struct page, i.e.
> persistent memory is not part of the memmap. This patchset takes the
> position that the solution is to teach I/O paths that want to operate on
> persistent memory to do so by referencing a __pfn_t. The alternatives
> are discussed in the changelog for "[PATCH v2 01/10] arch: introduce
> __pfn_t for persistent memory i/o", copied here:
>
> Alternatives:
>
> 1/ Provide struct page coverage for persistent memory in
> DRAM. The expectation is that persistent memory capacities make
> this untenable in the long term.
>
> 2/ Provide struct page coverage for persistent memory with
> persistent memory. While persistent memory may have near DRAM
> performance characteristics it may not have the same
> write-endurance of DRAM. Given the update frequency of struct
> page objects it may not be suitable for persistent memory.
>
> 3/ Dynamically allocate struct page. This appears to be on
> the order of the complexity of converting code paths to use
> __pfn_t references instead of struct page, and the amount of
> setup required to establish a valid struct page reference is
> mostly wasted when the only usage in the block stack is to
> perform a page_to_pfn() conversion for dma-mapping. Instances
> of kmap() / kmap_atomic() usage appear to be the only occasions
> in the block stack where struct page is non-trivially used. A
> new kmap_atomic_pfn_t() is proposed to handle those cases.

*grumble*

What are you going to do with things like iov_iter_get_pages()? Long-term,
that is, after you go for "this pfn has no struct page for it"...

2015-05-06 22:10:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Wed, May 6, 2015 at 1:04 PM, Dan Williams <[email protected]> wrote:
>
> The motivation for this change is persistent memory and the desire to
> use it not only via the pmem driver, but also as a memory target for I/O
> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.

I detest this approach.

I'd much rather go exactly the other way around, and do the dynamic
"struct page" instead.

Add a flag to "struct page" to mark it as a fake entry and teach
"page_to_pfn()" to look up the actual pfn some way (that union tha
contains "index" looks like a good target to also contain 'pfn', for
example).
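
(A hedged sketch, for illustration only, of what that could look like -- the
PG_fake_pfn name, the PG_private_2 stand-in bit and the helpers below are
hypothetical, not anything posted in this thread:)

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/bitops.h>

#define PG_fake_pfn PG_private_2	/* stand-in flag bit, illustration only */

static void init_fake_page(struct page *page, unsigned long pfn)
{
	__set_bit(PG_fake_pfn, &page->flags);
	page->index = pfn;		/* stash the pfn in the 'index' union slot */
}

static unsigned long fake_page_to_pfn(struct page *page)
{
	if (test_bit(PG_fake_pfn, &page->flags))
		return page->index;	/* fake entry: pfn comes from 'index' */
	return page_to_pfn(page);	/* real memmap entry */
}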

Especially if this is mainly for persistent storage, we'll never have
issues with worrying about writing it back under memory pressure, so
allocating a "struct page" for these things shouldn't be a problem.
There's likely only a few paths that actually generate IO for those
things.

In other words, I'd really like our basic infrastructure to be for the
*normal* case, and the "struct page" is about so much more than just
"what's the target for IO". For normal IO, "struct page" is also what
serializes the IO so that you have a consistent view of the end
result, and there's obviously the reference count there too. So I
really *really* think that "struct page" is the better entity for
describing the actual IO, because it's the common and the generic
thing, while a "pfn" is not actually *enough* for IO in general, and
you now end up having to look up the "struct page" for the locking and
refcounting etc.

If you go the other way, and instead generate a "struct page" from the
pfn for the few cases that need it, you put the onus on odd behavior
where it belongs.

Yes, it might not be any simpler in the end, but I think it would be
conceptually much better.

Linus

2015-05-06 23:47:20

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, May 6, 2015 at 1:04 PM, Dan Williams <[email protected]> wrote:
>>
>> The motivation for this change is persistent memory and the desire to
>> use it not only via the pmem driver, but also as a memory target for I/O
>> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.
>
> I detest this approach.
>

Hmm, yes, I can't argue against "put the onus on odd behavior where it
belongs."...

> I'd much rather go exactly the other way around, and do the dynamic
> "struct page" instead.
>
> Add a flag to "struct page"

Ok, given I had already precluded 32-bit systems in this __pfn_t
approach, we should have flag space for this on 64-bit.

> to mark it as a fake entry and teach
> "page_to_pfn()" to look up the actual pfn some way (that union tha
> contains "index" looks like a good target to also contain 'pfn', for
> example).
>
> Especially if this is mainly for persistent storage, we'll never have
> issues with worrying about writing it back under memory pressure, so
> allocating a "struct page" for these things shouldn't be a problem.
> There's likely only a few paths that actually generate IO for those
> things.
>
> In other words, I'd really like our basic infrastructure to be for the
> *normal* case, and the "struct page" is about so much more than just
> "what's the target for IO". For normal IO, "struct page" is also what
> serializes the IO so that you have a consistent view of the end
> result, and there's obviously the reference count there too. So I
> really *really* think that "struct page" is the better entity for
> describing the actual IO, because it's the common and the generic
> thing, while a "pfn" is not actually *enough* for IO in general, and
> you now end up having to look up the "struct page" for the locking and
> refcounting etc.
>
> If you go the other way, and instead generate a "struct page" from the
> pfn for the few cases that need it, you put the onus on odd behavior
> where it belongs.
>
> Yes, it might not be any simpler in the end, but I think it would be
> conceptually much better.

Conceptually better, but certainly more difficult to audit if the fake
struct page is initialized in a subtle way that breaks when/if it
leaks to some unwitting context. The one benefit I may need to
concede is a mechanism for the few paths that know what they are doing
to opt in to handling these fake pages. That was easy with __pfn_t, but
a struct page can go silently almost anywhere. Certainly nothing is
prepared for a given struct page pointer to change the pfn it points
to on the fly, which I think is what we would end up doing for
something like a raid cache. Keep a pool of struct pages around and
point them at persistent memory pfns while I/O is in flight.

2015-05-07 00:19:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Wed, May 6, 2015 at 4:47 PM, Dan Williams <[email protected]> wrote:
>
> Conceptually better, but certainly more difficult to audit if the fake
> struct page is initialized in a subtle way that breaks when/if it
> leaks to some unwitting context.

Maybe. It could go either way, though. In particular, with the
"dynamically allocated struct page" approach, if somebody uses it past
the supposed lifetime of the use, things like poisoning the temporary
"struct page" could be fairly effective. You can't really poison the
pfn - it's just a number, and if somebody uses it later than you think
(and you have re-used that physical memory for something else), you'll
never ever know.

I'd *assume* that most users of the dynamic "struct page" allocation
have very clear lifetime rules. Those things would presumably normally
get looked-up by some extended version of "get_user_pages()", and
there's a clear use of the result, with no longer lifetime. Also, you
do need to have some higher-level locking when you do this, to make
sure that the persistent pages don't magically get re-assigned. We're
presumably talking about having a filesystem in that persistent
memory, so we cannot be doing IO to the pages (from some other source
- whether RDMA or some special zero-copy model) while the underlying
filesystem is reassigning the storage because somebody deleted the
file.

IOW, there had better be other external rules about when - and how
long - you can use a particular persistent page. No? So the whole
"when/how to allocate the temporary 'struct page'" is just another
detail in that whole thing.

And yes, some uses may not ever actually see that. If the whole of
persistent memory is just assigned to a database or something, and the
DB just wants to do a "flush this range of persistent memory to
long-term disk storage", then there may not be much of a "lifetime"
issue for the persistent memory. But even then you're going to have IO
completion callbacks etc to let the DB know that it has hit the disk,
so..

What is the primary thing that is driving this need? Do we have a very
concrete example?

Linus

2015-05-07 02:36:19

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Wed, May 6, 2015 at 5:19 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, May 6, 2015 at 4:47 PM, Dan Williams <[email protected]> wrote:
>>
>> Conceptually better, but certainly more difficult to audit if the fake
>> struct page is initialized in a subtle way that breaks when/if it
>> leaks to some unwitting context.
>
> Maybe. It could go either way, though. In particular, with the
> "dynamically allocated struct page" approach, if somebody uses it past
> the supposed lifetime of the use, things like poisoning the temporary
> "struct page" could be fairly effective. You can't really poison the
> pfn - it's just a number, and if somebody uses it later than you think
> (and you have re-used that physical memory for something else), you'll
> never ever know.

True, but there's little need to poison a __pfn_t because it's
permanent once discovered via ->direct_access() on the hosting struct
block_device. Sure, kmap_atomic_pfn_t() may fail when the pmem driver
unbinds from a device, but the __pfn_t is still valid. Obviously, we
can only support atomic kmap(s) with this property, and it would be
nice to fault if someone continued to use the __pfn_t after the
hosting device was disabled. To be clear, DAX has this same problem
today. Nothing stops whoever called ->direct_access() from continuing
to use the pfn after the backing device has been disabled.

> I'd *assume* that most users of the dynamic "struct page" allocation
> have very clear lifetime rules. Those things would presumably normally
> get looked-up by some extended version of "get_user_pages()", and
> there's a clear use of the result, with no longer lifetime. Also, you
> do need to have some higher-level locking when you do this, to make
> sure that the persistent pages don't magically get re-assigned. We're
> presumably talking about having a filesystem in that persistent
> memory, so we cannot be doing IO to the pages (from some other source
> - whether RDMA or some special zero-copy model) while the underlying
> filesystem is reassigning the storage because somebody deleted the
> file.
>
> IOW, there had better be other external rules about when - and how
> long - you can use a particular persistent page. No? So the whole
> "when/how to allocate the temporary 'struct page'" is just another
> detail in that whole thing.
>
> And yes, some uses may not ever actually see that. If the whole of
> persistent memory is just assigned to a database or something, and the
> DB just wants to do a "flush this range of persistent memory to
> long-term disk storage", then there may not be much of a "lifetime"
> issue for the persistent memory. But even then you're going to have IO
> completion callbacks etc to let the DB know that it has hit the disk,
> so..
>
> What is the primary thing that is driving this need? Do we have a very
> concrete example?

My pet concrete example is covered by __pfn_t. Referencing persistent
memory in an md/dm hierarchical storage configuration. Setting aside
the thrash to get existing block users to do "bvec_set_page(page)"
instead of "bvec->page = page" the onus is on that md/dm
implementation and backing storage device driver to operate on
__pfn_t. That use case is simple because there is no use of page
locking or refcounting in that path, just dma_map_page() and
kmap_atomic(). The more difficult use case is precisely what Al
picked up on, O_DIRECT and RDMA. This patchset does nothing to
address those use cases outside of not needing a struct page when they
eventually craft a bio.

I know Matthew Wilcox has explored the idea of "get_user_sg()" and let
the scatterlist hold the reference count and locks, but I'll let him
speak to that.

I still see __pfn_t as generally useful for the simple in-kernel
stacked-block-i/o use case.

2015-05-07 09:02:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Dan Williams <[email protected]> wrote:

> > What is the primary thing that is driving this need? Do we have a
> > very concrete example?
>
> My pet concrete example is covered by __pfn_t. Referencing
> persistent memory in an md/dm hierarchical storage configuration.
> Setting aside the thrash to get existing block users to do
> "bvec_set_page(page)" instead of "bvec->page = page" the onus is on
> that md/dm implementation and backing storage device driver to
> operate on __pfn_t. That use case is simple because there is no use
> of page locking or refcounting in that path, just dma_map_page() and
> kmap_atomic(). The more difficult use case is precisely what Al
> picked up on, O_DIRECT and RDMA. This patchset does nothing to
> address those use cases outside of not needing a struct page when
> they eventually craft a bio.

So why not do a dual approach?

There are code paths where the 'pfn' of a persistent device is mostly
used as a sector_t equivalent of terabytes of storage, not as an index
of a memory object.

It's not an address to a cache, it's an index into a huge storage
space - which happens to be (flash) RAM. For them using pfn_t seems
natural and using struct page * is a strained (not to mention
expensive) model.

For more complex facilities, where persistent memory is used as a
memory object, especially where the underlying device is true,
infinitely writable RAM (not flash), treating it as a memory zone, or
setting up dynamic struct page would be the natural approach. (with
the inevitable cost of setup/teardown in the latter case)

I'd say that for anything where the dynamic struct page is torn down
unconditionally after completion of only a single use, the natural API
is probably pfn_t, not struct page. Any synchronization is already
handled at the block request layer, and it's storage op
synchronization, not memory access synchronization really.

For anything more complex, that maps any of this storage to
user-space, or exposes it to higher level struct page based APIs,
etc., where references matter and it's more of a cache with
potentially multiple users, not an IO space, the natural API is struct
page.

I'd say that this particular series mostly addresses the 'pfn as
sector_t' side of the equation, where persistent memory is IO space,
not memory space, and as such it is the more natural and thus also the
cheaper/faster approach.

Linus probably disagrees? :-)

Thanks,

Ingo

2015-05-07 14:42:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Ingo Molnar <[email protected]> wrote:

> [...]
>
> For anything more complex, that maps any of this storage to
> user-space, or exposes it to higher level struct page based APIs,
> etc., where references matter and it's more of a cache with
> potentially multiple users, not an IO space, the natural API is
> struct page.

Let me walk back on this:

> I'd say that this particular series mostly addresses the 'pfn as
> sector_t' side of the equation, where persistent memory is IO space,
> not memory space, and as such it is the more natural and thus also
> the cheaper/faster approach.

... but that does not appear to be the case: this series replaces a
'struct page' interface with a pure pfn interface for the express
purpose of being able to DMA to/from 'memory areas' that are not
struct page backed.

> Linus probably disagrees? :-)

[ and he'd disagree rightfully ;-) ]

So what this patch set tries to achieve is (sector_t -> sector_t) IO
between storage devices (i.e. a rare and somewhat weird usecase), and
does it by squeezing one device's storage address into our formerly
struct page backed descriptor, via a pfn.

That looks like a layering violation and a mistake to me. If we want
to do direct (sector_t -> sector_t) IO, with no serialization worries,
it should have its own (simple) API - which things like hierarchical
RAID or RDMA APIs could use.

If what we want to do is to support say an mmap() of a file on
persistent storage, and then read() into that file from another device
via DMA, then I think we should have allocated struct page backing at
mmap() time already, and all regular syscall APIs would 'just work'
from that point on - far above what page-less, pfn-based APIs can do.

The temporary struct page backing can then be freed at munmap() time.

And if the usage is pure fd based, we don't really have fd-to-fd APIs
beyond the rarely used splice variants (and even those don't do pure
cross-IO, they use a pipe as an intermediary), so there's no problem
to solve I suspect.

Thanks,

Ingo

2015-05-07 14:55:47

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o

Hi Dan,

On Wed, 06 May 2015 16:04:59 -0400 Dan Williams <[email protected]> wrote:
>
> diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
> new file mode 100644
> index 000000000000..91171e0285d9
> --- /dev/null
> +++ b/include/asm-generic/pfn.h
> @@ -0,0 +1,51 @@
> +#ifndef __ASM_PFN_H
> +#define __ASM_PFN_H
> +
> +#ifndef __pfn_to_phys
> +#define __pfn_to_phys(pfn) ((dma_addr_t)(pfn) << PAGE_SHIFT)

Why dma_addr_t and not phys_addr_t? i.e. it could use a comment if it
is correct.
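
(I.e. the suggestion amounts to something like the following, shown here
purely as an illustration:)

#ifndef __pfn_to_phys
#define __pfn_to_phys(pfn)	((phys_addr_t)(pfn) << PAGE_SHIFT)
#endif
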
--
Cheers,
Stephen Rothwell [email protected]



2015-05-07 15:00:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Wed, May 6, 2015 at 7:36 PM, Dan Williams <[email protected]> wrote:
>
> My pet concrete example is covered by __pfn_t. Referencing persistent
> memory in an md/dm hierarchical storage configuration. Setting aside
> the thrash to get existing block users to do "bvec_set_page(page)"
> instead of "bvec->page = page" the onus is on that md/dm
> implementation and backing storage device driver to operate on
> __pfn_t. That use case is simple because there is no use of page
> locking or refcounting in that path, just dma_map_page() and
> kmap_atomic().

So clarify for me: are you trying to make the IO stack in general be
able to use the persistent memory as a source (or destination) for IO
to _other_ devices, or are you talking about just internally shuffling
things around for something like RAID on top of persistent memory?

Because I think those are two very different things.

For example, one of the things I worry about is for people doing IO
from persistent memory directly to some "slow stable storage" (aka
disk). That was what I thought you were aiming for: infrastructure so
that you can make a bio for a *disk* device contain a page list that
is the persistent memory.

And I think that is a very dangerous operation to do, because the
persistent memory itself is going to have some filesystem on it, so
anything that looks up the persistent memory pages is *not* going to
have a stable pfn: the pfn will point to a fixed part of the
persistent memory, but the file that was there may be deleted and the
memory reassigned to something else.

That's the kind of thing that "struct page" helps with for normal IO
devices. It's both a source of serialization and indirection, so that
when somebody does a "truncate()" on a file, we don't end up doing IO
to random stale locations on the disk that got reassigned to another
file.

So "struct page" is very fundamental. It's *not* just a "this is the
physical source/drain of the data you are doing IO on".

So if you are looking at some kind of "zero-copy IO", where you can do
IO from a filesystem on persistent storage to *another* filesystem on
(say, a big rotational disk used for long-term storage) by just doing
a bio that targets the disk, but has the persistent memory as the
source memory, I really want to understand how you are going to
serialize this.

So *that* is what I meant by "What is the primary thing that is
driving this need? Do we have a very concrete example?"

I absolutely do *not* want to teach the bio subsystem to just
randomly be able to take the source/destination of the IO as being
some random pfn without knowing what the actual uses are and how these
IO's are generated in the first place.

I was assuming that you wanted to do something where you mmap() the
persistent memory, and then write it out to another device (possibly
using aio_write()). But that really does require some kind of
serialization at a higher level, because you can't just look up the
pfn's in the page table and assume they are stable: they are *not*
stable.

Linus

2015-05-07 15:40:37

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 8:00 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, May 6, 2015 at 7:36 PM, Dan Williams <[email protected]> wrote:
>>
>> My pet concrete example is covered by __pfn_t. Referencing persistent
>> memory in an md/dm hierarchical storage configuration. Setting aside
>> the thrash to get existing block users to do "bvec_set_page(page)"
>> instead of "bvec->page = page" the onus is on that md/dm
>> implementation and backing storage device driver to operate on
>> __pfn_t. That use case is simple because there is no use of page
>> locking or refcounting in that path, just dma_map_page() and
>> kmap_atomic().
>
> So clarify for me: are you trying to make the IO stack in general be
> able to use the persistent memory as a source (or destination) for IO
> to _other_ devices, or are you talking about just internally shuffling
> things around for something like RAID on top of persistent memory?
>
> Because I think those are two very different things.

Yes, they are, and I am referring to the former, persistent memory as
a source/destination to other devices.

> For example, one of the things I worry about is for people doing IO
> from persistent memory directly to some "slow stable storage" (aka
> disk). That was what I thought you were aiming for: infrastructure so
> that you can make a bio for a *disk* device contain a page list that
> is the persistent memory.
>
> And I think that is a very dangerous operation to do, because the
> persistent memory itself is going to have some filesystem on it, so
> anything that looks up the persistent memory pages is *not* going to
> have a stable pfn: the pfn will point to a fixed part of the
> persistent memory, but the file that was there may be deleted and the
> memory reassigned to something else.

Indeed, truncate() in the absence of struct page has been a major
hurdle for persistent memory enabling. But it does not impact this
specific md/dm use case. md/dm will have taken an exclusive claim on
an entire pmem block device (or partition), so there will be no
competing with a filesystem.

> That's the kind of thing that "struct page" helps with for normal IO
> devices. It's both a source of serialization and indirection, so that
> when somebody does a "truncate()" on a file, we don't end up doing IO
> to random stale locations on the disk that got reassigned to another
> file.
>
> So "struct page" is very fundamental. It's *not* just a "this is the
> physical source/drain of the data you are doing IO on".
>
> So if you are looking at some kind of "zero-copy IO", where you can do
> IO from a filesystem on persistent storage to *another* filesystem on
> (say, a big rotational disk used for long-term storage) by just doing
> a bo that targets the disk, but has the persistent memory as the
> source memory, I really want to understand how you are going to
> serialize this.
>
> So *that* is what I meant by "What is the primary thing that is
> driving this need? Do we have a very concrete example?"
>
> I abvsolutely do *not* want to teach the bio subsystem to just
> randomly be able to take the source/destination of the IO as being
> some random pfn without knowing what the actual uses are and how these
> IO's are generated in the first place.

blkdev_get(FMODE_EXCL) is the protection in this case.
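
(For reference, a sketch of that exclusive claim, assuming the
blkdev_get_by_path() interface -- illustrative only, not code from this
series:)

#include <linux/fs.h>
#include <linux/blkdev.h>

/* 'holder' remains the exclusive owner of the device until blkdev_put() */
static struct block_device *claim_pmem_member(const char *path, void *holder)
{
	return blkdev_get_by_path(path,
			FMODE_READ | FMODE_WRITE | FMODE_EXCL, holder);
}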

> I was assuming that you wanted to do something where you mmap() the
> persistent memory, and then write it out to another device (possibly
> using aio_write()). But that really does require some kind of
> serialization at a higher level, because you can't just look up the
> pfn's in the page table and assume they are stable: they are *not*
> stable.

We want to get there eventually, but this patchset does not address that case.

2015-05-07 15:52:20

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 7:42 AM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> [...]
>>
>> For anything more complex, that maps any of this storage to
>> user-space, or exposes it to higher level struct page based APIs,
>> etc., where references matter and it's more of a cache with
>> potentially multiple users, not an IO space, the natural API is
>> struct page.
>
> Let me walk back on this:
>
>> I'd say that this particular series mostly addresses the 'pfn as
>> sector_t' side of the equation, where persistent memory is IO space,
>> not memory space, and as such it is the more natural and thus also
>> the cheaper/faster approach.
>
> ... but that does not appear to be the case: this series replaces a
> 'struct page' interface with a pure pfn interface for the express
> purpose of being able to DMA to/from 'memory areas' that are not
> struct page backed.
>
>> Linus probably disagrees? :-)
>
> [ and he'd disagree rightfully ;-) ]
>
> So what this patch set tries to achieve is (sector_t -> sector_t) IO
> between storage devices (i.e. a rare and somewhat weird usecase), and
> does it by squeezing one device's storage address into our formerly
> struct page backed descriptor, via a pfn.
>
> That looks like a layering violation and a mistake to me. If we want
> to do direct (sector_t -> sector_t) IO, with no serialization worries,
> it should have its own (simple) API - which things like hierarchical
> RAID or RDMA APIs could use.

I'm wrapped around the idea that __pfn_t *is* that simple api for the
tiered storage driver use case. For RDMA I think we need struct page
because I assume that would be coordinated through a filesystem and
truncate() is back in play.

What does an alternative API look like?

> If what we want to do is to support say an mmap() of a file on
> persistent storage, and then read() into that file from another device
> via DMA, then I think we should have allocated struct page backing at
> mmap() time already, and all regular syscall APIs would 'just work'
> from that point on - far above what page-less, pfn-based APIs can do.
>
> The temporary struct page backing can then be freed at munmap() time.

Yes, passing around mmap()'d (DAX) persistent memory will need more
than a __pfn_t.

2015-05-07 15:58:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 8:40 AM, Dan Williams <[email protected]> wrote:
>
> blkdev_get(FMODE_EXCL) is the protection in this case.

Ugh. That looks like a horrible nasty big hammer that will bite us
badly some day. Since you'd have to hold it for the whole IO. But I
guess it at least works.

Anyway, I did want to say that while I may not be convinced about the
approach, I think the patches themselves don't look horrible. I
actually like your "__pfn_t". So while I (very obviously) have some
doubts about this approach, it may be that the most convincing
argument is just in the code.

Linus

2015-05-07 16:03:33

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 8:58 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 7, 2015 at 8:40 AM, Dan Williams <[email protected]> wrote:
>>
>> blkdev_get(FMODE_EXCL) is the protection in this case.
>
> Ugh. That looks like a horrible nasty big hammer that will bite us
> badly some day. Since you'd have to hold it for the whole IO. But I
> guess it at least works.

Oh no, that wouldn't be per-I/O; it would be permanent at
configuration setup time, just like a raid member device.

Something like:
mdadm --create /dev/md0 --cache=/dev/pmem0p1 --storage=/dev/sda

> Anyway, I did want to say that while I may not be convinced about the
> approach, I think the patches themselves don't look horrible. I
> actually like your "__pfn_t". So while I (very obviously) have some
> doubts about this approach, it may be that the most convincing
> argument is just in the code.

Ok, I'll keep thinking about this and come back when we have a better
story about passing mmap'd persistent memory around in userspace.

2015-05-07 16:18:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
> What is the primary thing that is driving this need? Do we have a very
> concrete example?

FYI, I plan to implement RAID acceleration using nvdimms, and I plan to
use pages for that. The code just merged for 4.1 can easily support page
backing, and I plan to use that for now. This still leaves out support
for the gigantic intel nvdimms discovered over EFI, but given that
I don't have access to them, and I don't know of any publicly available,
there's little I can do for now. But adding on-demand allocated struct
pages for them seems like the easiest way forward. Boaz already has
code to allocate pages for them, although not on demand but at boot / plug-in
time.

2015-05-07 16:41:59

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig <[email protected]> wrote:
> On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
>> What is the primary thing that is driving this need? Do we have a very
>> concrete example?
>
> FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
> ue pages for that. The code just merge for 4.1 can easily support page
> backing, and I plan to use that for now. This still leaves support
> for the gigantic intel nvdimms discovered over EFI out, but given that
> I don't have access to them, and I dont know of any publically available
> there's little I can do for now. But adding on demand allocate struct
> pages for the seems like the easiest way forward. Boaz already has
> code to allocate pages for them, although not on demand but at boot / plug in
> time.

Hmmm, the capacities of persistent memory that would be assigned for a
raid accelerator would be limited by diminishing returns. I.e. there
seems to be no point to assign more than 8GB or so to the cache? If
that's the case the capacity argument loses some teeth, just
"blk_get(FMODE_EXCL) + memory_hotplug a small capacity" and be done.

2015-05-07 17:31:10

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 07, 2015 at 06:18:07PM +0200, Christoph Hellwig wrote:
> On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
> > What is the primary thing that is driving this need? Do we have a very
> > concrete example?
>
> FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
> ue pages for that. The code just merge for 4.1 can easily support page
> backing, and I plan to use that for now. This still leaves support
> for the gigantic intel nvdimms discovered over EFI out, but given that
> I don't have access to them, and I dont know of any publically available
> there's little I can do for now. But adding on demand allocate struct
> pages for the seems like the easiest way forward. Boaz already has
> code to allocate pages for them, although not on demand but at boot / plug in
> time.

I think other folks might be interested here, so I am cc'ing Paul. But for GPUs
we are facing a similar issue of trying to present the GPU memory to the kernel
in a coherent way (coherent from the design and Linux kernel concept POV).

For this, dynamically allocated struct page might effectively be a solution that
could be shared between the persistent memory and GPU folks. We can even enforce
things like VMEMMAP and have a special region carveout where we can dynamically
map/unmap backing pages for ranges of device pfns. This would also allow us to
catch people trying to access such a page: we could add a set of new helpers
like get_page_dev()/put_page_dev(), and only the _dev versions would work on
this new kind of memory; regular get_page()/put_page() would throw an error.
This should make sure only legitimate users are referencing such a page.

One issue might be that we can run out of kernel address space with 48 bits, but
if such monstrous computers ever see the light of day they might consider using
CPUs with more bits.

Another issue is that we might care about 32-bit platforms too, but that's
solvable at a small cost.

Cheers,
Jérôme

2015-05-07 17:36:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Dan Williams <[email protected]> wrote:

> > Anyway, I did want to say that while I may not be convinced about
> > the approach, I think the patches themselves don't look horrible.
> > I actually like your "__pfn_t". So while I (very obviously) have
> > some doubts about this approach, it may be that the most
> > convincing argument is just in the code.
>
> Ok, I'll keep thinking about this and come back when we have a
> better story about passing mmap'd persistent memory around in
> userspace.

So is there anything fundamentally wrong about creating struct page
backing at mmap() time (and making sure aliased mmaps share struct
page arrays)?

Because if that is done, then the DMA agent won't even know about the
memory being persistent RAM. It's just a regular struct page, that
happens to point to persistent RAM. Same goes for all the high level
VM APIs, futexes, etc. Everything will Just Work.

It will also be relatively fast: mmap() is a slowpath, comparatively.

As far as RAID is concerned: that's a relatively easy situation, as
there's only a single user of the devices, the RAID context that
manages all component devices exclusively. Device to device DMA can
use the block layer directly, i.e. most of the patches you've got here
in this series, except:

74287 C May 06 Dan Williams ( 232) ├─>[PATCH v2 09/10] dax: convert to __pfn_t

I think DAX mmap()s need struct page backing.

I think there's a simple rule: if a page is visible to user-space via
the MMU then it needs struct page backing. If it's "hidden", like
behind a RAID abstraction, it probably doesn't.

With the remaining patches a high level RAID driver ought to be able
to send pfn-to-sector and sector-to-pfn requests to other block
drivers, without any unnecessary struct page allocation overhead,
right?

As long as the pfn concept remains a clever way to reuse our
ram<->sector interfaces to implement sector<->sector IO, in the cases
where the IO has no serialization or MMU concerns, not using struct
page and using pfn_t looks natural.

The moment it starts reaching user space APIs, like in the DAX case,
and especially if it becomes user-MMU visible, it's a mistake to not
have struct page backing, I think.

(In that sense the current DAX mmap() code is already a partial
mistake.)

Thanks,

Ingo

2015-05-07 17:42:54

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar <[email protected]> wrote:
>
> * Dan Williams <[email protected]> wrote:
>
>> > Anyway, I did want to say that while I may not be convinced about
>> > the approach, I think the patches themselves don't look horrible.
>> > I actually like your "__pfn_t". So while I (very obviously) have
>> > some doubts about this approach, it may be that the most
>> > convincing argument is just in the code.
>>
>> Ok, I'll keep thinking about this and come back when we have a
>> better story about passing mmap'd persistent memory around in
>> userspace.
>
> So is there anything fundamentally wrong about creating struct page
> backing at mmap() time (and making sure aliased mmaps share struct
> page arrays)?

Something like "get_user_pages() triggers memory hotplug for
persistent memory", so they are actual real struct pages? Can we do
memory hotplug at that granularity?

2015-05-07 17:43:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 9:03 AM, Dan Williams <[email protected]> wrote:
>
> Ok, I'll keep thinking about this and come back when we have a better
> story about passing mmap'd persistent memory around in userspace.

Ok. And if we do decide to go with your kind of "__pfn" type, I'd
probably prefer that we encode the type in the low bits of the word
rather than compare against PAGE_OFFSET. On some architectures
PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
but even on x86 it's a *lot* cheaper to test the low bit than it is to
compare against a big constant.

We know "struct page *" is supposed to be at least aligned to at least
"unsigned long", so you'd have two bits of type information (and we
could easily make it three). With "0" being a real pointer, so that
you can use the pointer itself without masking.

And the "hide type in low bits of pointer" is something we've done
quite a lot, so it's more "kernel coding style" anyway.

Linus

2015-05-07 17:52:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Dan Williams <[email protected]> wrote:

> > That looks like a layering violation and a mistake to me. If we
> > want to do direct (sector_t -> sector_t) IO, with no serialization
> > worries, it should have its own (simple) API - which things like
> > hierarchical RAID or RDMA APIs could use.
>
> I'm wrapped around the idea that __pfn_t *is* that simple api for
> the tiered storage driver use case. [...]

I agree. (see my previous mail)

> [...] For RDMA I think we need struct page because I assume that
> would be coordinated through a filesystem an truncate() is back in
> play.

So I don't think RDMA is necessarily special, it's just a weirdly
programmed DMA request:

- If it is used internally by an exclusively managed complex storage
driver, then it can use low level block APIs and pfn_t.

- If RDMA is exposed all the way to user-space (do we have such
APIs?), allowing users to initiate RDMA IO into user buffers, then
(the user visible) buffer needs struct page backing. (which in turn
will then at some lower level convert to pfns.)

That's true for both regular RAM pages and mmap()-ed persistent RAM
pages as well.

Thanks,

Ingo

2015-05-07 17:56:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On 05/07/2015 10:42 AM, Dan Williams wrote:
> On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar <[email protected]> wrote:
>> * Dan Williams <[email protected]> wrote:
>> So is there anything fundamentally wrong about creating struct page
>> backing at mmap() time (and making sure aliased mmaps share struct
>> page arrays)?
>
> Something like "get_user_pages() triggers memory hotplug for
> persistent memory", so they are actual real struct pages? Can we do
> memory hotplug at that granularity?

We've traditionally limited them to SECTION_SIZE granularity, which is
128MB IIRC. There are also assumptions in places that you can do page++
within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.

But, in all practicality, a lot of those places are in code like the
buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're not
ever expecting these fake 'struct page's to hit these code paths, it
probably doesn't matter.

You can probably get away with just allocating PAGE_SIZE worth of
'struct page' (which is 64) and mapping it in to vmemmap[]. The worst
case is that you'll eat 1 page of space for each outstanding page of
I/O. That's a lot better than 2MB of temporary 'struct page' space per
page of I/O that it would take with a traditional hotplug operation.

2015-05-07 18:40:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Dan Williams <[email protected]> wrote:

> On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig <[email protected]> wrote:
> > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
> >> What is the primary thing that is driving this need? Do we have a very
> >> concrete example?
> >
> > FYI, I plan to to implement RAID acceleration using nvdimms, and I
> > plan to ue pages for that. The code just merge for 4.1 can easily
> > support page backing, and I plan to use that for now. This still
> > leaves support for the gigantic intel nvdimms discovered over EFI
> > out, but given that I don't have access to them, and I dont know
> > of any publically available there's little I can do for now. But
> > adding on demand allocate struct pages for the seems like the
> > easiest way forward. Boaz already has code to allocate pages for
> > them, although not on demand but at boot / plug in time.
>
> Hmmm, the capacities of persistent memory that would be assigned for
> a raid accelerator would be limited by diminishing returns. I.e.
> there seems to be no point to assign more than 8GB or so to the
> cache? [...]

Why would that be the case?

If it's not a temporary cache but a persistent cache that hosts all
the data even after writeback completes then going to huge sizes will
bring similar benefits to using a large, fast SSD disk on your
desktop... The larger, the better. And it also persists across
reboots.

It could also host the RAID write intent bitmap (the dirty
stripes/chunks bitmap) for extra speedups. (This bitmap is pretty
small, but important to speed up resyncs after crashes or power loss.)

Thanks,

Ingo

2015-05-07 19:11:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Dave Hansen <[email protected]> wrote:

> On 05/07/2015 10:42 AM, Dan Williams wrote:
> > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar <[email protected]> wrote:
> >> * Dan Williams <[email protected]> wrote:
> >>
> >> So is there anything fundamentally wrong about creating struct
> >> page backing at mmap() time (and making sure aliased mmaps share
> >> struct page arrays)?
> >
> > Something like "get_user_pages() triggers memory hotplug for
> > persistent memory", so they are actual real struct pages? Can we
> > do memory hotplug at that granularity?
>
> We've traditionally limited them to SECTION_SIZE granularity, which
> is 128MB IIRC. There are also assumptions in places that you can do
> page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.

I really don't think that's very practical: memory hotplug is slow,
it's really not on the same abstraction level as mmap(), and the zone
data structures are also fundamentally very coarse: not just because
RAM ranges are huge, but also so that the pfn->page transformation
stays relatively simple and fast.

> But, in all practicality, a lot of those places are in code like the
> buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're
> not ever expecting these fake 'struct page's to hit these code
> paths, it probably doesn't matter.
>
> You can probably get away with just allocating PAGE_SIZE worth of
> 'struct page' (which is 64) and mapping it in to vmemmap[]. The
> worst case is that you'll eat 1 page of space for each outstanding
> page of I/O. That's a lot better than 2MB of temporary 'struct
> page' space per page of I/O that it would take with a traditional
> hotplug operation.

So I think the main value of struct page is if everyone on the system
sees the same struct page for the same pfn - not just the temporary IO
instance.

The idea of having very temporary struct page arrays misses the point
I think: if struct page is used as essentially an IO sglist then most
of the synchronization properties are lost: then we might as well use
the real deal in that case and skip the dynamic allocation and use
pfns directly and avoid the dynamic allocation overhead.

Stable, global page-struct descriptors are a given for real RAM, where
we allocate a struct page for every page in nice, large, mostly linear
arrays.

We'd really need that for pmem too, to get the full power of struct
page: and that means allocating them in nice, large, predictable
places - such as on the device itself ...

It might even be 'scattered' across the device: with a 64-byte struct
page size we can pack 64 descriptors into a single page, so every 65
pages we could have a page-struct page.

Finding a pmem page's struct page would thus involve rounding it
modulo 65 and reading that page.
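
A rough sketch of that arithmetic, purely for illustration (4K pages,
64-byte struct page; the helper name and the "descriptor page first"
layout are assumptions):

/* groups of 65 device pages: one page of descriptors, then 64 data pages */
static struct page *pmem_pfn_to_page(void *pmem_base, unsigned long dev_pfn)
{
	unsigned long group = dev_pfn / 65;
	unsigned long slot = dev_pfn % 65;
	struct page *map = pmem_base + group * 65 * PAGE_SIZE;

	if (slot == 0)
		return NULL;		/* the descriptor page itself */
	return &map[slot - 1];		/* descriptors cover slots 1..64 */
}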

The problem with that is fourfold:

- that we now turn a very kernel internal API and data structure into
an ABI. If struct page grows beyond 64 bytes it's a problem.

- on bootup (or device discovery time) we'd have to initialize all
the page structs. We could probably do this in a hierarchical way,
by dividing continuous pmem ranges into power-of-two groups of
blocks, and organizing them like the buddy allocator does.

- 1.5% of storage space lost.

- will wear-leveling properly migrate these 'hot' pages around?

The alternative would be some global interval-rbtree of struct page
backed pmem ranges.

Beyond the synchronization problems of such a data structure (which
looks like a nightmare) I don't think it's even feasible: especially
if there's a filesystem on the pmem device then the block allocations
could be physically fragmented (and there's no fundamental reason why
they couldn't be fragmented), so a continuous mmap() of a file on it
will yield wildly fragmented device-pfn ranges, exploding the rbtree.
Think 1 million node interval-rbtree with an average depth of 20:
cachemiss country for even simple lookups - not to mention the
freeing/recycling complexity of unused struct pages to not allow it to
grow too large.

I might be wrong though about all this :)

Thanks,

Ingo

2015-05-07 19:36:48

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 07, 2015 at 09:11:07PM +0200, Ingo Molnar wrote:
>
> * Dave Hansen <[email protected]> wrote:
>
> > On 05/07/2015 10:42 AM, Dan Williams wrote:
> > > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar <[email protected]> wrote:
> > >> * Dan Williams <[email protected]> wrote:
> > >>
> > >> So is there anything fundamentally wrong about creating struct
> > >> page backing at mmap() time (and making sure aliased mmaps share
> > >> struct page arrays)?
> > >
> > > Something like "get_user_pages() triggers memory hotplug for
> > > persistent memory", so they are actual real struct pages? Can we
> > > do memory hotplug at that granularity?
> >
> > We've traditionally limited them to SECTION_SIZE granularity, which
> > is 128MB IIRC. There are also assumptions in places that you can do
> > page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.
>
> I really don't think that's very practical: memory hotplug is slow,
> it's really not on the same abstraction level as mmap(), and the zone
> data structures are also fundamentally very coarse: not just because
> RAM ranges are huge, but also so that the pfn->page transformation
> stays relatively simple and fast.
>
> > But, in all practicality, a lot of those places are in code like the
> > buddy allocator. If your PTEs all have _PAGE_SPECIAL set and we're
> > not ever expecting these fake 'struct page's to hit these code
> > paths, it probably doesn't matter.
> >
> > You can probably get away with just allocating PAGE_SIZE worth of
> > 'struct page' (which is 64) and mapping it in to vmemmap[]. The
> > worst case is that you'll eat 1 page of space for each outstanding
> > page of I/O. That's a lot better than 2MB of temporary 'struct
> > page' space per page of I/O that it would take with a traditional
> > hotplug operation.
>
> So I think the main value of struct page is if everyone on the system
> sees the same struct page for the same pfn - not just the temporary IO
> instance.
>
> The idea of having very temporary struct page arrays misses the point
> I think: if struct page is used as essentially an IO sglist then most
> of the synchronization properties are lost: then we might as well use
> the real deal in that case and skip the dynamic allocation and use
> pfns directly and avoid the dynamic allocation overhead.
>
> Stable, global page-struct descriptors are a given for real RAM, where
> we allocate a struct page for every page in nice, large, mostly linear
> arrays.
>
> We'd really need that for pmem too, to get the full power of struct
> page: and that means allocating them in nice, large, predictable
> places - such as on the device itself ...

Is handling kernel pagefaults on the vmemmap completely out of the
picture? We would carve out a chunk of kernel address space for
those pfns, use it for vmemmap, and handle pagefaults on it.

Again, I think GPU folks would like a solution where they can
have a page struct, but it would not be PMEM, just device memory. So if
we can come up with something generic enough to serve both purposes,
that would be better in my view.

Cheers,
Jérôme

2015-05-07 19:44:21

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 11:40 AM, Ingo Molnar <[email protected]> wrote:
>
> * Dan Williams <[email protected]> wrote:
>
>> On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig <[email protected]> wrote:
>> > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
>> >> What is the primary thing that is driving this need? Do we have a very
>> >> concrete example?
>> >
>> > FYI, I plan to to implement RAID acceleration using nvdimms, and I
>> > plan to ue pages for that. The code just merge for 4.1 can easily
>> > support page backing, and I plan to use that for now. This still
>> > leaves support for the gigantic intel nvdimms discovered over EFI
>> > out, but given that I don't have access to them, and I dont know
>> > of any publically available there's little I can do for now. But
>> > adding on demand allocate struct pages for the seems like the
>> > easiest way forward. Boaz already has code to allocate pages for
>> > them, although not on demand but at boot / plug in time.
>>
>> Hmmm, the capacities of persistent memory that would be assigned for
>> a raid accelerator would be limited by diminishing returns. I.e.
>> there seems to be no point to assign more than 8GB or so to the
>> cache? [...]
>
> Why would that be the case?
>
> If it's not a temporary cache but a persistent cache that hosts all
> the data even after writeback completes then going to huge sizes will
> bring similar benefits to using a large, fast SSD disk on your
> desktop... The larger, the better. And it also persists across
> reboots.

True, that's more "dm-cache" than "RAID accelerator", but point taken.

2015-05-07 19:48:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Jerome Glisse <[email protected]> wrote:

> > So I think the main value of struct page is if everyone on the
> > system sees the same struct page for the same pfn - not just the
> > temporary IO instance.
> >
> > The idea of having very temporary struct page arrays misses the
> > point I think: if struct page is used as essentially an IO sglist
> > then most of the synchronization properties are lost: then we
> > might as well use the real deal in that case and skip the dynamic
> > allocation and use pfns directly and avoid the dynamic allocation
> > overhead.
> >
> > Stable, global page-struct descriptors are a given for real RAM,
> > where we allocate a struct page for every page in nice, large,
> > mostly linear arrays.
> >
> > We'd really need that for pmem too, to get the full power of
> > struct page: and that means allocating them in nice, large,
> > predictable places - such as on the device itself ...
>
> Is handling kernel pagefault on the vmemmap completely out of the
> picture ? So we would carveout a chunck of kernel address space for
> those pfn and use it for vmemmap and handle pagefault on it.

That's pretty clever. The page fault doesn't even have to do remote
TLB shootdown, because it only establishes mappings - so it's pretty
atomic, a bit like the minor vmalloc() area faults we are doing.

Some sort of LRA (least recently allocated) scheme could unmap the
area in chunks if it's beyond a certain size, to keep a limit on size.
Done from the same context and would use remote TLB shootdown.

The only limitation I can see is that such faults would have to be
able to sleep, to do the allocation. So pfn_to_page() could not be
used in arbitrary contexts.

> Again here i think that GPU folks would like a solution where they
> can have a page struct but it would not be PMEM just device memory.
> So if we can come up with something generic enough to server both
> purpose that would be better in my view.

Yes.

Thanks,

Ingo

2015-05-07 19:53:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Ingo Molnar <[email protected]> wrote:

> > Is handling kernel pagefault on the vmemmap completely out of the
> > picture ? So we would carveout a chunck of kernel address space
> > for those pfn and use it for vmemmap and handle pagefault on it.
>
> That's pretty clever. The page fault doesn't even have to do remote
> TLB shootdown, because it only establishes mappings - so it's pretty
> atomic, a bit like the minor vmalloc() area faults we are doing.
>
> Some sort of LRA (least recently allocated) scheme could unmap the
> area in chunks if it's beyond a certain size, to keep a limit on
> size. Done from the same context and would use remote TLB shootdown.
>
> The only limitation I can see is that such faults would have to be
> able to sleep, to do the allocation. So pfn_to_page() could not be
> used in arbitrary contexts.

So another complication would be that we cannot just unmap such pages
when we want to recycle them, because the struct page in them might be
in use - so all struct page uses would have to refcount the underlying
page. We don't really do that today: code just looks up struct pages
and assumes they never go away.

Thanks,

Ingo

2015-05-07 20:06:56

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 7, 2015 at 10:43 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 7, 2015 at 9:03 AM, Dan Williams <[email protected]> wrote:
>>
>> Ok, I'll keep thinking about this and come back when we have a better
>> story about passing mmap'd persistent memory around in userspace.
>
> Ok. And if we do decide to go with your kind of "__pfn" type, I'd
> probably prefer that we encode the type in the low bits of the word
> rather than compare against PAGE_OFFSET. On some architectures
> PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
> but even on x86 it's a *lot* cheaper to test the low bit than it is to
> compare against a big constant.
>
> We know "struct page *" is supposed to be at least aligned to at least
> "unsigned long", so you'd have two bits of type information (and we
> could easily make it three). With "0" being a real pointer, so that
> you can use the pointer itself without masking.
>
> And the "hide type in low bits of pointer" is something we've done
> quite a lot, so it's more "kernel coding style" anyway.

Ok. Although __pfn_t also stores pfn values directly, which will
consume those 2 bits, so we'll need to shift pfns up when storing.
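
(A minimal sketch of that encoding, combining the low-bit type with the
shifted pfn storage -- the bit assignments and helper definitions below are
assumptions for illustration, not the posted implementation:)

#include <linux/mm.h>

typedef struct {
	unsigned long val;
} __pfn_t;

#define PFN_T_TYPE_MASK	3UL	/* low two bits carry the type */
#define PFN_T_PAGE	0UL	/* val is a bare struct page pointer */
#define PFN_T_PFN	1UL	/* val holds (pfn << 2) | PFN_T_PFN */

static inline __pfn_t pfn_to_pfn_t(unsigned long pfn)
{
	return (__pfn_t) { .val = (pfn << 2) | PFN_T_PFN };
}

static inline struct page *__pfn_t_to_page(__pfn_t p)
{
	if ((p.val & PFN_T_TYPE_MASK) != PFN_T_PAGE)
		return NULL;
	return (struct page *) p.val;	/* "0" type: pointer usable without masking */
}

static inline unsigned long __pfn_t_to_pfn(__pfn_t p)
{
	if ((p.val & PFN_T_TYPE_MASK) == PFN_T_PFN)
		return p.val >> 2;
	return page_to_pfn((struct page *) p.val);
}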

2015-05-07 20:18:30

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Thu, May 07, 2015 at 09:53:13PM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > > Is handling a kernel page fault on the vmemmap completely out of the
> > > picture? So we would carve out a chunk of kernel address space
> > > for those pfns and use it for the vmemmap and handle page faults on it.
> >
> > That's pretty clever. The page fault doesn't even have to do remote
> > TLB shootdown, because it only establishes mappings - so it's pretty
> > atomic, a bit like the minor vmalloc() area faults we are doing.
> >
> > Some sort of LRA (least recently allocated) scheme could unmap the
> > area in chunks if it's beyond a certain size, to keep a limit on
> > size. Done from the same context and would use remote TLB shootdown.
> >
> > The only limitation I can see is that such faults would have to be
> > able to sleep, to do the allocation. So pfn_to_page() could not be
> > used in arbitrary contexts.
>
> So another complication would be that we cannot just unmap such pages
> when we want to recycle them, because the struct page in them might be
> in use - so all struct page uses would have to refcount the underlying
> page. We don't really do that today: code just looks up struct pages
> and assumes they never go away.

I still think this is doable. Like I said in another email, I think we
should introduce a special pfn_to_page_dev|pmem|waffle|somethingyoulike()
for the places that are allowed to allocate the underlying struct page.

For instance we can use a default page to back all of this special vmemmap
range with some specially crafted struct pages that say it is invalid
memory (make this mapping read-only so all writes to these special
struct pages are forbidden).

Now once an authorized user comes along and needs a real struct page, it
triggers a page allocation that replaces the page full of fake invalid
struct pages with a page of correct, valid struct pages that can be
manipulated by other parts of the kernel.

So regular pfn_to_page() would test against the special vmemmap and, if
special, test the content of the struct page for some flag. If it's the
invalid-page flag it returns NULL.

But once a proper struct page is allocated, pfn_to_page() would return
the struct page as expected.

That way you will catch all invalid users of such pages, i.e. users that
use the page after its lifetime is done. You will also limit the creation
of the underlying proper struct page to only code that is legitimately
allowed to ask for a proper struct page for a given pfn.

Also you would get a kernel write fault on the page full of fake struct
pages, and that would allow catching further misuse.
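
A rough sketch of that flow; pfn_in_pmem_vmemmap(), PageFakeInvalid() and
pmem_instantiate_page() are made-up names for illustration:

/* Lookup for ordinary callers: no allocation, NULL for uninstantiated pfns. */
static struct page *pfn_to_page_checked(unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);	/* vmemmap lookup as usual */

	if (!pfn_in_pmem_vmemmap(pfn))		/* hypothetical range test */
		return page;

	if (PageFakeInvalid(page))		/* still the read-only template */
		return NULL;

	return page;
}

/* Lookup for code that legitimately owns the pfn: may allocate the real page. */
static struct page *pfn_to_page_dev(unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);

	if (PageFakeInvalid(page))
		page = pmem_instantiate_page(pfn);	/* replace the fake template */

	return page;
}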

Anyway this is how I envision this, and I think it would work for my
use case too (GPUs, in my case :))

Cheers,
Jérôme

2015-05-08 00:21:43

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] arch: introduce __pfn_t for persistent memory i/o

On Thu, May 7, 2015 at 7:55 AM, Stephen Rothwell <[email protected]> wrote:
> Hi Dan,
>
> On Wed, 06 May 2015 16:04:59 -0400 Dan Williams <[email protected]> wrote:
>>
>> diff --git a/include/asm-generic/pfn.h b/include/asm-generic/pfn.h
>> new file mode 100644
>> index 000000000000..91171e0285d9
>> --- /dev/null
>> +++ b/include/asm-generic/pfn.h
>> @@ -0,0 +1,51 @@
>> +#ifndef __ASM_PFN_H
>> +#define __ASM_PFN_H
>> +
>> +#ifndef __pfn_to_phys
>> +#define __pfn_to_phys(pfn) ((dma_addr_t)(pfn) << PAGE_SHIFT)
>
> Why dma_addr_t and not phys_addr_t? i.e. it could use a comment if it
> is correct.

Hmm, this was derived from:

#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)

in arch/x86/include/asm/io.h

The primary user of __pfn_to_phys() is dma_map_page(). I'll add a
comment to that effect.
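
Presumably something along these lines (the exact comment wording below
is a guess):

#ifndef __pfn_to_phys
/*
 * Returns dma_addr_t rather than phys_addr_t: the primary consumer is
 * dma_map_page()-style code, mirroring page_to_phys() on x86.
 */
#define __pfn_to_phys(pfn)	((dma_addr_t)(pfn) << PAGE_SHIFT)
#endif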

2015-05-08 05:38:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


Al,

I was wondering about the struct page rules of
iov_iter_get_pages_alloc(), used in various places. There's no
documentation whatsoever in lib/iov_iter.c, nor in
include/linux/uio.h, and the changelog that introduced it only says:

commit 91f79c43d1b54d7154b118860d81b39bad07dfff
Author: Al Viro <[email protected]>
Date: Fri Mar 21 04:58:33 2014 -0400

new helper: iov_iter_get_pages_alloc()

same as iov_iter_get_pages(), except that pages array is allocated
(kmalloc if possible, vmalloc if that fails) and left for caller to
free. Lustre and NFS ->direct_IO() switched to it.

Signed-off-by: Al Viro <[email protected]>

So if code does iov_iter_get_pages_alloc() on a user address that has
a real struct page behind it - and some other code does a regular
get_user_pages() on it, we'll have two sets of struct page
descriptors, the 'real' one, and a fake allocated one, right?

How does that work? Nobody else can ever discover these fake page
structs, so they don't really serve any 'real' synchronization purpose
other than the limited role of IO completion.

Thanks,

Ingo

2015-05-08 09:21:11

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:

> same as iov_iter_get_pages(), except that pages array is allocated
> (kmalloc if possible, vmalloc if that fails) and left for caller to
> free. Lustre and NFS ->direct_IO() switched to it.
>
> Signed-off-by: Al Viro <[email protected]>
>
> So if code does iov_iter_get_pages_alloc() on a user address that has
> a real struct page behind it - and some other code does a regular
> get_user_pages() on it, we'll have two sets of struct page
> descriptors, the 'real' one, and a fake allocated one, right?

Huh? iov_iter_get_pages() is given an array of pointers to struct page,
which it fills with what it finds. iov_iter_get_pages_alloc() *allocates*
such an array, fills that with what it finds and gives the allocated array
to caller.

We are not allocating any struct page instances in either of those.
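
In caller terms the difference is only who provides the array; a rough
sketch assuming the v4.1-era signatures, with error handling and
reference dropping elided:

#include <linux/uio.h>
#include <linux/mm.h>

/* Caller supplies the array of page pointers: */
static void fill_fixed(struct iov_iter *iter)
{
	struct page *pages[16];
	size_t start;
	ssize_t n;

	n = iov_iter_get_pages(iter, pages, 16 * PAGE_SIZE, 16, &start);
	/* on success, pages[0..] reference the existing struct pages in iter */
}

/* The helper allocates the array (kmalloc, falling back to vmalloc): */
static void fill_alloc(struct iov_iter *iter)
{
	struct page **pages;
	size_t start;
	ssize_t n;

	n = iov_iter_get_pages_alloc(iter, &pages, SIZE_MAX, &start);
	if (n > 0) {
		/* use the pages, put_page() each, then free the array itself */
		kvfree(pages);
	}
}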

2015-05-08 09:26:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Al Viro <[email protected]> wrote:

> On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
>
> > So if code does iov_iter_get_pages_alloc() on a user address that
> > has a real struct page behind it - and some other code does a
> > regular get_user_pages() on it, we'll have two sets of struct page
> > descriptors, the 'real' one, and a fake allocated one, right?
>
> Huh? iov_iter_get_pages() is given an array of pointers to struct
> page, which it fills with what it finds. iov_iter_get_pages_alloc()
> *allocates* such an array, fills that with what it finds and gives
> the allocated array to caller.
>
> We are not allocating any struct page instances in either of those.

Ah, stupid me - thanks for the explanation!

Ingo

2015-05-08 10:02:00

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Fri, May 08, 2015 at 11:26:01AM +0200, Ingo Molnar wrote:
>
> * Al Viro <[email protected]> wrote:
>
> > On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
> >
> > > So if code does iov_iter_get_pages_alloc() on a user address that
> > > has a real struct page behind it - and some other code does a
> > > regular get_user_pages() on it, we'll have two sets of struct page
> > > descriptors, the 'real' one, and a fake allocated one, right?
> >
> > Huh? iov_iter_get_pages() is given an array of pointers to struct
> > page, which it fills with what it finds. iov_iter_get_pages_alloc()
> > *allocates* such an array, fills that with what it finds and gives
> > the allocated array to caller.
> >
> > We are not allocating any struct page instances in either of those.
>
> Ah, stupid me - thanks for the explanation!

My fault, actually - this "pages array" should've been either
"'pages' array" or "array of pointers to struct page".

2015-05-08 13:46:15

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On 05/07/2015 03:11 PM, Ingo Molnar wrote:

> Stable, global page-struct descriptors are a given for real RAM, where
> we allocate a struct page for every page in nice, large, mostly linear
> arrays.
>
> We'd really need that for pmem too, to get the full power of struct
> page: and that means allocating them in nice, large, predictable
> places - such as on the device itself ...
>
> It might even be 'scattered' across the device, with 64 byte struct
> page size we can pack 64 descriptors into a single page, so every 65
> pages we could have a page-struct page.
>
> Finding a pmem page's struct page would thus involve rounding it
> modulo 65 and reading that page.
>
> The problem with that is fourfold:
>
> - that we now turn a very kernel internal API and data structure into
> an ABI. If struct page grows beyond 64 bytes it's a problem.
>
> - on bootup (or device discovery time) we'd have to initialize all
> the page structs. We could probably do this in a hierarchical way,
> by dividing continuous pmem ranges into power-of-two groups of
> blocks, and organizing them like the buddy allocator does.
>
> - 1.5% of storage space lost.
>
> - will wear-leveling properly migrate these 'hot' pages around?

MST and I have been doing some thinking about how to address some of
the issues above.

One way could be to invert the PG_compound logic we have today, by
allocating one struct page for every PMD / THP sized area (2MB on
x86), and dynamically allocating struct pages for the 4kB pages
inside only if the area gets split. They can be freed again when
the area is not being accessed in 4kB chunks.

That way we would always look at the struct page for the 2MB area
first, and if the PG_split bit is set, we look at the array of
dynamically allocated struct pages for this area.
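
Roughly - PG_split and the layout below are hypothetical, just to make
the extra indirection concrete:

/* One always-present descriptor per 2MB area; 4kB descriptors only if split. */
struct pmd_page {
	struct page head;		/* describes the whole 2MB area */
	struct page *subpages;		/* 512 entries, allocated on split */
};

static struct page *area_lookup_page(struct pmd_page *area, unsigned long pfn)
{
	/* fast path: the area is used as a single 2MB unit */
	if (!test_bit(PG_split, &area->head.flags))
		return &area->head;

	/* split: one extra indirection to the dynamically allocated array */
	return &area->subpages[pfn & (PTRS_PER_PTE - 1)];
}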

The advantages are obvious: boot time memory overhead and
initialization time are reduced by a factor of 512. CPUs could also
take a whole 2MB area in order to do CPU-local 4kB allocations,
defragmentation policies may become a little clearer, etc...

The disadvantage is pretty obvious too: 4kB pages would no longer
be the fast case, with an indirection. I do not know how much of
an issue that would be, or whether it even makes sense for 4kB
pages to continue being the fast case going forward.

Memory trends point in one direction, file size trends in another.

For persistent memory, we would not need 4kB page struct pages unless
memory from a particular area was in small files AND those files were
being actively accessed. Large files (mapped in 2MB chunks) or inactive
small files would not need the 4kB page structs around.

--
All rights reversed

2015-05-08 14:06:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t


* Rik van Riel <[email protected]> wrote:

> The disadvantage is pretty obvious too: 4kB pages would no longer be
> the fast case, with an indirection. I do not know how much of an
> issue that would be, or whether it even makes sense for 4kB pages to
> continue being the fast case going forward.

I strongly disagree that 4kB does not matter as much: it is _the_
bread and butter of 99% of Linux usecases. 4kB isn't going away
anytime soon - THP might look nice in benchmarks, but it does not
matter nearly as much in practice and for filesystems and IO it's
absolutely crazy to think about 2MB granularity.

Having said that, I don't think a single jump of indirection is a big
issue - except for the present case where all the pmem IO space is
mapped non-cacheable. Write-through caching patches are in the works
though, and that should make it plenty fast.

> Memory trends point in one direction, file size trends in another.
>
> For persistent memory, we would not need 4kB page struct pages
> unless memory from a particular area was in small files AND those
> files were being actively accessed. [...]

Average file size on my system's /usr is 12.5K:

triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") | sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n" | wc -l; ) | bc
12502

> [...] Large files (mapped in 2MB chunks) or inactive small files
> would not need the 4kB page structs around.

... they are the utter uncommon case. 4K is here to stay, and for a
very long time - until humans use computers I suspect.

But I don't think the 2MB metadata chunking is wrong per se.

Thanks,

Ingo

2015-05-08 14:45:36

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

>>>>> "Ingo" == Ingo Molnar <[email protected]> writes:

Ingo> * Rik van Riel <[email protected]> wrote:

>> The disadvantage is pretty obvious too: 4kB pages would no longer be
>> the fast case, with an indirection. I do not know how much of an
>> issue that would be, or whether it even makes sense for 4kB pages to
>> continue being the fast case going forward.

Ingo> I strongly disagree that 4kB does not matter as much: it is _the_
Ingo> bread and butter of 99% of Linux usecases. 4kB isn't going away
Ingo> anytime soon - THP might look nice in benchmarks, but it does not
Ingo> matter nearly as much in practice and for filesystems and IO it's
Ingo> absolutely crazy to think about 2MB granularity.

Ingo> Having said that, I don't think a single jump of indirection is a big
Ingo> issue - except for the present case where all the pmem IO space is
Ingo> mapped non-cacheable. Write-through caching patches are in the works
Ingo> though, and that should make it plenty fast.

>> Memory trends point in one direction, file size trends in another.
>>
>> For persistent memory, we would not need 4kB page struct pages
>> unless memory from a particular area was in small files AND those
>> files were being actively accessed. [...]

Ingo> Average file size on my system's /usr is 12.5K:

Ingo> triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") |
Ingo> sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n"
Ingo> | wc -l; ) | bc 12502

Now go and look at your /home or /data/ or /work areas, where the
endusers are actually keeping their day to day work. Photos, mp3,
design files, source code, object code littered around, etc.

Now I also have 12Tb filesystems with 30+ million files in them, which
just *suck* for backup, esp incrementals. I have one monster with 85+
million files (time to get beat on users again ...) which needs to be
pruned.

So I'm not arguing against you, I'm just saying you need better, more
representative numbers across more day-to-day work. Running this
exact same command against my home directory gives:

528989

So I'm not arguing one way or another... just providing numbers.

2015-05-08 14:54:29

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On 05/08/2015 10:05 AM, Ingo Molnar wrote:
> * Rik van Riel <[email protected]> wrote:

>> Memory trends point in one direction, file size trends in another.
>>
>> For persistent memory, we would not need 4kB page struct pages
>> unless memory from a particular area was in small files AND those
>> files were being actively accessed. [...]
>
> Average file size on my system's /usr is 12.5K:
>
> triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") | sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n" | wc -l; ) | bc
> 12502
>
>> [...] Large files (mapped in 2MB chunks) or inactive small files
>> would not need the 4kB page structs around.
>
> ... they are the utter uncommon case. 4K is here to stay, and for a
> very long time - until humans use computers I suspect.

There's a bit of an 80/20 thing going on, though.

The average file size may be small, but most data is used by
large files.

Additionally, a 2MB pmem area that has no small files on it that
are currently open will also not need 4kB page structs.

A system with 2TB of pmem might still only have a few thousand
small files open at any point in time. The rest of the memory
is either in large files, or in small files that have not been
opened recently. We can reclaim the struct pages of 4kB pages
that are not currently in use.

--
All rights reversed

2015-05-08 15:54:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
>
> Now go and look at your /home or /data/ or /work areas, where the
> endusers are actually keeping their day to day work. Photos, mp3,
> design files, source code, object code littered around, etc.

However, the big files in that list are almost immaterial from a
caching standpoint.

Caching source code is a big deal - just try not doing it and you'll
figure it out. And the kernel C source files used to have a median
size around 4k.

The big files in your home directory? Let me make an educated guess.
Very few to *none* of them are actually in your page cache right now.
And you'd never even care if they ever made it into your page cache
*at*all*. Much less whether you could ever cache them using large
pages using some very fancy cache.

There are big files that care about caches, but they tend to be
binaries, and for other reasons (things like randomization) you would
never want to use largepages for those anyway.

So from a page cache standpoint, I think the 4kB size still matters. A
*lot*. largepages are a complete red herring, and will continue to be
so pretty much forever (anonymous largepages perhaps less so).

Linus

2015-05-08 15:59:13

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] block: add helpers for accessing a bio_vec page

On Wed, May 6, 2015 at 1:05 PM, Dan Williams <[email protected]> wrote:
> In preparation for converting struct bio_vec to carry a __pfn_t instead
> of struct page.
>
> This change is prompted by the desire to add in-kernel DMA support
> (O_DIRECT, hierarchical storage, RDMA, etc) for persistent memory which
> lacks struct page coverage.
>
> Alternatives:
>
> 1/ Provide struct page coverage for persistent memory in DRAM. The
> expectation is that persistent memory capacities make this untenable
> in the long term.
>
> 2/ Provide struct page coverage for persistent memory with persistent
> memory. While persistent memory may have near DRAM performance
> characteristics it may not have the same write-endurance of DRAM.
> Given the update frequency of struct page objects it may not be
> suitable for persistent memory.
>
> 3/ Dynamically allocate struct page. This appears to be on the order
> of the complexity of converting code paths to use __pfn_t references
> instead of struct page, and the amount of setup required to establish
> a valid struct page reference is mostly wasted when the only usage in
> the block stack is to perform a page_to_pfn() conversion for
> dma-mapping. Instances of kmap() / kmap_atomic() usage appear to be
> the only occasions in the block stack where struct page is
> non-trivially used. A new kmap_atomic_pfn_t() is proposed to handle
> those cases.
>
> Generated with the following semantic patch:
>
> // bv_page.cocci: convert usage of ->bv_page to use set/get helpers
> // usage: make coccicheck COCCI=bv_page.cocci MODE=patch

Now that it looks like this patchset can move forward, what to do about
this one? Run the Coccinelle script late in the merge window to catch
all the new bv_page usages targeted for 4.2-rc1?

2015-05-08 16:28:36

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Fri, May 08, 2015 at 08:54:06AM -0700, Linus Torvalds wrote:
> However, the big files in that list are almost immaterial from a
> caching standpoint.

.git/objects/pack/* caching matters a lot, though...

2015-05-08 16:59:54

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
>>
>> Now go and look at your /home or /data/ or /work areas, where the
>> endusers are actually keeping their day to day work. Photos, mp3,
>> design files, source code, object code littered around, etc.
>
> However, the big files in that list are almost immaterial from a
> caching standpoint.

> The big files in your home directory? Let me make an educated guess.
> Very few to *none* of them are actually in your page cache right now.
> And you'd never even care if they ever made it into your page cache
> *at*all*. Much less whether you could ever cache them using large
> pages using some very fancy cache.

However, for persistent memory, all of the files will be "in memory".

Not instantiating the 4kB struct pages for 2MB areas that are not
currently being accessed with small files may make a difference.
For dynamically allocated 4kB page structs, we need some way to
discover where they are. It may make sense, from a simplicity point
of view, to have one mechanism that works both for pmem and for
normal system memory.

I agree that 4kB granularity needs to continue to work pretty much
forever, though. As long as people continue creating text files,
they will just not be very large.

--
All rights reversed

2015-05-08 20:42:17

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

>>>>> "Linus" == Linus Torvalds <[email protected]> writes:

Linus> On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
>>
>> Now go and look at your /home or /data/ or /work areas, where the
>> endusers are actually keeping their day to day work. Photos, mp3,
>> design files, source code, object code littered around, etc.

Linus> However, the big files in that list are almost immaterial from a
Linus> caching standpoint.

Linus> Caching source code is a big deal - just try not doing it and
Linus> you'll figure it out. And the kernel C source files used to
Linus> have a median size around 4k.

Caching any files is a big deal, and if I'm doing batch edits of large
jpegs, won't they get cached as well?

Linus> The big files in your home directory? Let me make an educated
Linus> guess. Very few to *none* of them are actually in your page
Linus> cache right now. And you'd never even care if they ever made
Linus> it into your page cache *at*all*. Much less whether you could
Linus> ever cache them using large pages using some very fancy cache.

Hmm... probably not, honestly, since I'm not at home and not using the
system actively right now. But I can see situations where being able
to mix different page sizes efficiently might be a good thing.

Linus> There are big files that care about caches, but they tend to be
Linus> binaries, and for other reasons (things like randomization) you
Linus> would never want to use largepages for those anyway.

Or large design files, like my users at $WORK use, which can be 4GB in
size for a large design - ASIC chip layout work. So I'm a
little bit in the minority there.

And yes, I do have other users with millions of itty-bitty files as
well.

Linus> So from a page cache standpoint, I think the 4kB size still
Linus> matters. A *lot*. largepages are a complete red herring, and
Linus> will continue to be so pretty much forever (anonymous
Linus> largepages perhaps less so).

I think in the future, being able to efficiently mix page sizes will
become useful, if only to lower the memory overhead of keeping track
of large numbers of pages.

John

2015-05-09 01:14:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Fri, May 8, 2015 at 9:59 AM, Rik van Riel <[email protected]> wrote:
>
> However, for persistent memory, all of the files will be "in memory".

Yes. However, I doubt you will find a very sane rw filesystem that
then also makes them contiguous and aligns them at 2MB boundaries.

Anything is possible, I guess, but things like that are *hard*. The
fragmentation issues etc cause it to be a really challenging thing.

And if they aren't aligned big contiguous allocations, then they
aren't relevant from any largepage cases. You'll still have to map
them 4k at a time etc.

Linus

2015-05-09 03:02:48

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On 05/08/2015 09:14 PM, Linus Torvalds wrote:
> On Fri, May 8, 2015 at 9:59 AM, Rik van Riel <[email protected]> wrote:
>>
>> However, for persistent memory, all of the files will be "in memory".
>
> Yes. However, I doubt you will find a very sane rw filesystem that
> then also makes them contiguous and aligns them at 2MB boundaries.
>
> Anything is possible, I guess, but things like that are *hard*. The
> fragmentation issues etc cause it to be a really challenging thing.

The TLB performance bonus of accessing the large files with
large pages may make it worthwhile to solve that hard problem.

> And if they aren't aligned big contiguous allocations, then they
> aren't relevant from any largepage cases. You'll still have to map
> them 4k at a time etc.

Absolutely, but we only need the 4k struct pages when the
files are mapped. I suspect a lot of the files will just
sit around idle, without being used.

I am not convinced that the idea I wrote down earlier in
this thread is worthwhile now, but it may turn out to be
at some point in the future. It all depends on how much
data people store on DAX filesystems, and how many files
they have open at once.

--
All rights reversed

2015-05-09 03:52:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Fri, May 8, 2015 at 8:02 PM, Rik van Riel <[email protected]> wrote:
>
> The TLB performance bonus of accessing the large files with
> large pages may make it worthwhile to solve that hard problem.

Very few people can actually measure that TLB advantage on systems
with good TLB's.

It's largely a myth, fed by some truly crappy TLB fill systems
(particularly sw-filled TLB's on some early RISC CPU's, but even
"modern" CPU's sometimes have glass jaws here because they cant'
prefetch TLB entries or do concurrent page table walks etc).

There are *very* few loads that actually have the kinds of access
patterns where TLB accesses dominate - or are even noticeable -
compared to the normal memory access costs.

That is doubly true with file-backed storage. The main reason you get
TLB costs to be noticeable is with very sparse access patterns, where
you hit as many TLB entries as you hit pages. That simply doesn't
happen with file mappings.

Really. The whole thing about TLB advantages of hugepages is this
almost entirely made-up stupid myth. You almost have to make up the
benchmark for it (_that_ part is easy) to even see it.

Linus

2015-05-09 08:45:25

by Ingo Molnar

[permalink] [raw]
Subject: "Directly mapped persistent memory page cache"


* Rik van Riel <[email protected]> wrote:

> On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> > On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
> >>
> >> Now go and look at your /home or /data/ or /work areas, where the
> >> endusers are actually keeping their day to day work. Photos, mp3,
> >> design files, source code, object code littered around, etc.
> >
> > However, the big files in that list are almost immaterial from a
> > caching standpoint.
>
> > The big files in your home directory? Let me make an educated guess.
> > Very few to *none* of them are actually in your page cache right now.
> > And you'd never even care if they ever made it into your page cache
> > *at*all*. Much less whether you could ever cache them using large
> > pages using some very fancy cache.
>
> However, for persistent memory, all of the files will be "in
> memory".
>
> Not instantiating the 4kB struct pages for 2MB areas that are not
> currently being accessed with small files may make a difference.
>
> For dynamically allocated 4kB page structs, we need some way to
> discover where they are. It may make sense, from a simplicity point
> of view, to have one mechanism that works both for pmem and for
> normal system memory.

I don't think we need to or want to allocate page structs dynamically,
which makes the model really simple and robust.

If we 'think big', we can create something very exciting IMHO, that
also gets rid of most of the complications with DIO, DAX, etc:

"Directly mapped pmem integrated into the page cache":
------------------------------------------------------

- The pmem filesystem is mapped directly in all cases, it has device
side struct page arrays, and its struct pages are directly in the
page cache, write-through cached. (See further below about how we
can do this.)

Note that this is radically different from the current approach
that tries to use DIO and DAX to provide specialized "direct
access" APIs.

With the 'directly mapped' approach we have numerous advantages:

- no double buffering to main RAM: the device pages represent
file content.

- no bdflush, no VM pressure, no writeback pressure, no
swapping: this is a very simple VM model where the device is
RAM and we don't have much dirty state. The primary kernel
cache is the dcache and the directly mapped page cache, which
is not a writeback cache in this case but essentially a
logical->physical index cache of filesystem indexing
metadata.

- every binary mmap()ed would be XIP mapped in essence

- every read() would be equivalent to a DIO read, without the
complexity of DIO.

- every read() or write() done into a data mmap() area would
allow device-to-device zero copy DMA.

- main RAM caching would still be available and would work in
many cases by default: as most apps use file processing
buffers in anonymous memory into which they read() data.

We can achieve this by statically allocating all page structs on the
device, in the following way:

- For every 128MB of pmem data we allocate 2MB of struct-page
descriptors, 64 bytes each, that describes that 128MB data range
in a 4K granular way. We never have to allocate page structs as
they are always there.

- Filesystems don't directly see the preallocated page arrays, they
still get a 'logical block space' presented to them that looks
like a continuous block device (which is 1.5% smaller than the
true size of the device): this allows arbitrary filesystems to be
put into such pmem devices, fsck will just work, etc.

I.e. no special pmem filesystem: the full range of existing block
device based Linux filesystems can be used.

- These page structs are initialized in three layers:

- a single bit at 128MB data granularity: the first struct page
of the 2MB large array (32,768 struct page array members)
represents the initialization state of all of them.

- a single bit at 2MB data granularity: the first struct page
of every 32K array within the 2MB array represents the whole
2MB data area. There are 64 such bits per 2MB array.

- a single bit at 4K data granularity: the whole page array.

A page marked uninitialized at a higher layer means all lower
layer struct pages are in their initial state.

This is a variant of your suggestion: one that keeps everything
2MB aligned, so that a single kernel side 2MB TLB covers a
continuous chunk of the page array. This allows us to create a
linear VMAP physical memory model to simplify index mapping.

- Looking up such a struct page (from a pfn) involves two simple,
easily computable indirections. With locality of access
present, 'hot' struct pages will be in the CPU cache. Them being
64 bytes each will help this. The on-device format is so simple
and so temporary that no fsck is needed for it.

- 2MB mappings, where desired, are 'natural' in such a layout:
everything's 2MB aligned both for kernel and user space use, while
4K granularity is still a first class citizen as well.

- For TB range storage we could make it 1GB granular: We'd allocate
a 1GB array for every 64 GB of data. This would also allow gbpage
TLBs to be taken advantage of: especially on the kernel side
(vmapping the 1GB page array) this might be useful, even if all
actual file usage is 4KB granular. The last block would be allowed
to be smaller than 64GB, but size would still be rounded to 1GB to
keep the mapping simple.
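
To make the 128MB : 2MB arithmetic above concrete, a sketch of the pfn
lookup under such a layout (the names, the separate reserved region for
the arrays, and the 64-byte struct page size are assumptions of this
illustration):

#define PMEM_CHUNK_DATA_SIZE	(128UL << 20)			/* 128MB of data... */
#define PAGES_PER_CHUNK		(PMEM_CHUNK_DATA_SIZE / PAGE_SIZE) /* ...32,768 pages */

/*
 * One 2MB descriptor array (32,768 x 64-byte struct page) per 128MB of
 * data, all arrays living in a reserved region of the device that is
 * linearly vmapped at 'pmem_page_arrays'.  Roughly 1.5% of the device
 * is consumed this way, matching the "1.5% smaller" block space above.
 */
static struct page (*pmem_page_arrays)[PAGES_PER_CHUNK];
static unsigned long pmem_base_pfn;	/* first data pfn of the device */

/* pfn -> struct page: two computable indirections, no allocation needed. */
static struct page *pmem_pfn_to_page(unsigned long pfn)
{
	unsigned long idx = pfn - pmem_base_pfn;

	return &pmem_page_arrays[idx / PAGES_PER_CHUNK][idx % PAGES_PER_CHUNK];
}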

What do you think?

Thanks,

Ingo

2015-05-09 15:56:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

Ingo Molnar <[email protected]> writes:

> * Rik van Riel <[email protected]> wrote:
>
>> On 05/08/2015 11:54 AM, Linus Torvalds wrote:
>> > On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
>> >>
>> >> Now go and look at your /home or /data/ or /work areas, where the
>> >> endusers are actually keeping their day to day work. Photos, mp3,
>> >> design files, source code, object code littered around, etc.
>> >
>> > However, the big files in that list are almost immaterial from a
>> > caching standpoint.
>>
>> > The big files in your home directory? Let me make an educated guess.
>> > Very few to *none* of them are actually in your page cache right now.
>> > And you'd never even care if they ever made it into your page cache
>> > *at*all*. Much less whether you could ever cache them using large
>> > pages using some very fancy cache.
>>
>> However, for persistent memory, all of the files will be "in
>> memory".
>>
>> Not instantiating the 4kB struct pages for 2MB areas that are not
>> currently being accessed with small files may make a difference.
>>
>> For dynamically allocated 4kB page structs, we need some way to
>> discover where they are. It may make sense, from a simplicity point
>> of view, to have one mechanism that works both for pmem and for
>> normal system memory.
>
> I don't think we need to or want to allocate page structs dynamically,
> which makes the model really simple and robust.
>
> If we 'think big', we can create something very exciting IMHO, that
> also gets rid of most of the complications with DIO, DAX, etc:
>
> "Directly mapped pmem integrated into the page cache":
> ------------------------------------------------------
>
> - The pmem filesystem is mapped directly in all cases, it has device
> side struct page arrays, and its struct pages are directly in the
> page cache, write-through cached. (See further below about how we
> can do this.)
>
> Note that this is radically different from the current approach
> that tries to use DIO and DAX to provide specialized "direct
> access" APIs.
>
> With the 'directly mapped' approach we have numerous advantages:
>
> - no double buffering to main RAM: the device pages represent
> file content.
>
> - no bdflush, no VM pressure, no writeback pressure, no
> swapping: this is a very simple VM model where the device is
> RAM and we don't have much dirty state. The primary kernel
> cache is the dcache and the directly mapped page cache, which
> is not a writeback cache in this case but essentially a
> logical->physical index cache of filesystem indexing
> metadata.
>
> - every binary mmap()ed would be XIP mapped in essence
>
> - every read() would be equivalent to a DIO read, without the
> complexity of DIO.
>
> - every read() or write() done into a data mmap() area would
> allow device-to-device zero copy DMA.
>
> - main RAM caching would still be available and would work in
> many cases by default: as most apps use file processing
> buffers in anonymous memory into which they read() data.
>
> We can achieve this by statically allocating all page structs on the
> device, in the following way:
>
> - For every 128MB of pmem data we allocate 2MB of struct-page
> descriptors, 64 bytes each, that describes that 128MB data range
> in a 4K granular way. We never have to allocate page structs as
> they are always there.
>
> - Filesystems don't directly see the preallocated page arrays, they
> still get a 'logical block space' presented to them that looks
> like a continuous block device (which is 1.5% smaller than the
> true size of the device): this allows arbitrary filesystems to be
> put into such pmem devices, fsck will just work, etc.
>
> I.e. no special pmem filesystem: the full range of existing block
> device based Linux filesystems can be used.
>
> - These page structs are initialized in three layers:
>
> - a single bit at 128MB data granularity: the first struct page
> of the 2MB large array (32,768 struct page array members)
> represents the initialization state of all of them.
>
> - a single bit at 2MB data granularity: the first struct page
> of every 32K array within the 2MB array represents the whole
> 2MB data area. There are 64 such bits per 2MB array.
>
> - a single bit at 4K data granularity: the whole page array.
>
> A page marked uninitialized at a higher layer means all lower
> layer struct pages are in their initial state.
>
> This is a variant of your suggestion: one that keeps everything
> 2MB aligned, so that a single kernel side 2MB TLB covers a
> continuous chunk of the page array. This allows us to create a
> linear VMAP physical memory model to simplify index mapping.
>
> - Looking up such a struct page (from a pfn) involves two simple,
> easily computable indirections. With locality of access
> present, 'hot' struct pages will be in the CPU cache. Them being
> 64 bytes each will help this. The on-device format is so simple
> and so temporary that no fsck is needed for it.
>
> - 2MB mappings, where desired, are 'natural' in such a layout:
> everything's 2MB aligned both for kernel and user space use, while
> 4K granularity is still a first class citizen as well.
>
> - For TB range storage we could make it 1GB granular: We'd allocate
> a 1GB array for every 64 GB of data. This would also allow gbpage
> TLBs to be taken advantage of: especially on the kernel side
> (vmapping the 1GB page array) this might be useful, even if all
> actual file usage is 4KB granular. The last block would be allowed
> to be smaller than 64GB, but size would still be rounded to 1GB to
> keep the mapping simple.
>
> What do you think?

The tricky bit is what happens when you reboot and run a different
version of the kernel, especially a kernel with things debugging
features like kmemcheck that increase the size of struct page.

I think we could reserve space for struct page entries in the persistent
memory, and 64 bytes appears to be a reasonable size. But it would have
to be something that we initialize on mount or initialize on demand.

I don't think we could have persistent struct page entries, as the exact
contents of the struct page entries is too volatile and too different
between architectures. Especially architecture changes that a pmem
store is likely to see such as switching between a 32bit and a 64bit
kernel.

Further I think where in the persistent memory the struct page arrays
live is something we could leave up to the filesystem. We could have
some reasonable constraints to make it fast but I think whoever decides
where things live on the persistent memory can make that choice.

For small persistent memories it probably make sense to allocate the
struct page array describing them out of ordinary ram. For small
memories I don't think we are talking enough memory to worry about.
For TB+ persistent memories where you need 16GiB per TiB it makes sense
to allocate one or several regions to store your struct page arrays,
as you can't count on ordinary ram having enough capacity, and you may
not even be talking about a system that actually has ordinary ram at
that point.

Eric

2015-05-09 18:24:15

by Dan Williams

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Sat, May 9, 2015 at 1:45 AM, Ingo Molnar <[email protected]> wrote:
>
> * Rik van Riel <[email protected]> wrote:
>
>> On 05/08/2015 11:54 AM, Linus Torvalds wrote:
>> > On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
>> >>
>> >> Now go and look at your /home or /data/ or /work areas, where the
>> >> endusers are actually keeping their day to day work. Photos, mp3,
>> >> design files, source code, object code littered around, etc.
>> >
>> > However, the big files in that list are almost immaterial from a
>> > caching standpoint.
>>
>> > The big files in your home directory? Let me make an educated guess.
>> > Very few to *none* of them are actually in your page cache right now.
>> > And you'd never even care if they ever made it into your page cache
>> > *at*all*. Much less whether you could ever cache them using large
>> > pages using some very fancy cache.
>>
>> However, for persistent memory, all of the files will be "in
>> memory".
>>
>> Not instantiating the 4kB struct pages for 2MB areas that are not
>> currently being accessed with small files may make a difference.
>>
>> For dynamically allocated 4kB page structs, we need some way to
>> discover where they are. It may make sense, from a simplicity point
>> of view, to have one mechanism that works both for pmem and for
>> normal system memory.
>
> I don't think we need to or want to allocate page structs dynamically,
> which makes the model really simple and robust.
>
> If we 'think big', we can create something very exciting IMHO, that
> also gets rid of most of the complications with DIO, DAX, etc:
>
> "Directly mapped pmem integrated into the page cache":
> ------------------------------------------------------
>
> - The pmem filesystem is mapped directly in all cases, it has device
> side struct page arrays, and its struct pages are directly in the
> page cache, write-through cached. (See further below about how we
> can do this.)
>
> Note that this is radically different from the current approach
> that tries to use DIO and DAX to provide specialized "direct
> access" APIs.
>
> With the 'directly mapped' approach we have numerous advantages:
>
> - no double buffering to main RAM: the device pages represent
> file content.
>
> - no bdflush, no VM pressure, no writeback pressure, no
> swapping: this is a very simple VM model where the device is
> RAM and we don't have much dirty state. The primary kernel
> cache is the dcache and the directly mapped page cache, which
> is not a writeback cache in this case but essentially a
> logical->physical index cache of filesystem indexing
> metadata.
>
> - every binary mmap()ed would be XIP mapped in essence
>
> - every read() would be equivalent to a DIO read, without the
> complexity of DIO.
>
> - every read() or write() done into a data mmap() area would
> allow device-to-device zero copy DMA.
>
> - main RAM caching would still be available and would work in
> many cases by default: as most apps use file processing
> buffers in anonymous memory into which they read() data.
>
> We can achieve this by statically allocating all page structs on the
> device, in the following way:
>
> - For every 128MB of pmem data we allocate 2MB of struct-page
> descriptors, 64 bytes each, that describes that 128MB data range
> in a 4K granular way. We never have to allocate page structs as
> they are always there.
>
> - Filesystems don't directly see the preallocated page arrays, they
> still get a 'logical block space' presented to them that looks
> like a continuous block device (which is 1.5% smaller than the
> true size of the device): this allows arbitrary filesystems to be
> put into such pmem devices, fsck will just work, etc.
>
> I.e. no special pmem filesystem: the full range of existing block
> device based Linux filesystems can be used.
>
> - These page structs are initialized in three layers:
>
> - a single bit at 128MB data granularity: the first struct page
> of the 2MB large array (32,768 struct page array members)
> represents the initialization state of all of them.
>
> - a single bit at 2MB data granularity: the first struct page
> of every 32K array within the 2MB array represents the whole
> 2MB data area. There are 64 such bits per 2MB array.
>
> - a single bit at 4K data granularity: the whole page array.
>
> A page marked uninitialized at a higher layer means all lower
> layer struct pages are in their initial state.
>
> This is a variant of your suggestion: one that keeps everything
> 2MB aligned, so that a single kernel side 2MB TLB covers a
> continuous chunk of the page array. This allows us to create a
> linear VMAP physical memory model to simplify index mapping.
>
> - Looking up such a struct page (from a pfn) involves two simple,
> easily computable indirections. With locality of access
> present, 'hot' struct pages will be in the CPU cache. Them being
> 64 bytes each will help this. The on-device format is so simple
> and so temporary that no fsck is needed for it.
>
> - 2MB mappings, where desired, are 'natural' in such a layout:
> everything's 2MB aligned both for kernel and user space use, while
> 4K granularity is still a first class citizen as well.
>
> - For TB range storage we could make it 1GB granular: We'd allocate
> a 1GB array for every 64 GB of data. This would also allow gbpage
> TLBs to be taken advantage of: especially on the kernel side
> (vmapping the 1GB page array) this might be useful, even if all
> actual file usage is 4KB granular. The last block would be allowed
> to be smaller than 64GB, but size would still be rounded to 1GB to
> keep the mapping simple.
>
> What do you think?

Nice, I think it makes sense as an area that gets reserved at file
system creation time. You are not proposing that this gets
automatically reserved at the device level, right? For the use cases
of persistent memory in the absence of a file system (in-kernel
managed hierarchical storage driver, or mmap() the pmem block device
directly) there's still a need for pfn-based DAX. In other words,
simple usages skip the overhead. They assume pfns only, discover
those pmem pfns via ->direct_access(), and arrange for exclusive block
device ownership. Everything else that requires struct page in turn
requires a file system to have opt-ed into this reservation and
provide pmem-struct-page infrastructure, pmem-aware-DIO etc.

2015-05-09 21:57:04

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

On Fri, May 08, 2015 at 11:02:28PM -0400, Rik van Riel wrote:
> On 05/08/2015 09:14 PM, Linus Torvalds wrote:
> > On Fri, May 8, 2015 at 9:59 AM, Rik van Riel <[email protected]> wrote:
> >>
> >> However, for persistent memory, all of the files will be "in memory".
> >
> > Yes. However, I doubt you will find a very sane rw filesystem that
> > then also makes them contiguous and aligns them at 2MB boundaries.
> >
> > Anything is possible, I guess, but things like that are *hard*. The
> > fragmentation issues etc cause it to be a really challenging thing.
>
> The TLB performance bonus of accessing the large files with
> large pages may make it worthwhile to solve that hard problem.

FWIW, for DAX the filesystem allocation side is already mostly
solved - this is just an allocation alignment hint, analogous to
RAID stripe alignment. We don't need to reinvent the wheel here.
i.e. On XFS, use a 2MB stripe unit for the fs, a 2MB extent size
hint for files you want to use large pages on and you'll get 2MB
sized and aligned allocations from the filesystem for as long as
there are such freespace regions available.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-05-10 09:46:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"


* Dan Williams <[email protected]> wrote:

> > "Directly mapped pmem integrated into the page cache":
> > ------------------------------------------------------

> Nice, I think it makes sense as an area that gets reserved at file
> system creation time. You are not proposing that this gets
> automatically reserved at the device level, right? [...]

Well, it's most practical if the device does it automatically (the
layout is determined prior to filesystem creation), and the filesystem
does not necessarily have to be aware of it - but obviously as a user
opt-in.

> [...] For the use cases of persistent memory in the absence of a
> file system (in-kernel managed hierarchical storage driver, or
> mmap() the pmem block device directly) there's still a need for
> pfn-based DAX. [...]

Yes, but knowing that there's a sane model we can take a hard stance
against pfn proliferation and say: 'we let pfns this far and no
farther'.

> [...] In other words, simple usages skip the overhead. They assume
> pfns only, discover those pmem pfns via ->direct_access(), and
> arrange for exclusive block device ownership. Everything else that
> requires struct page in turn requires a file system to have opt-ed
> into this reservation and provide pmem-struct-page infrastructure,
> pmem-aware-DIO etc.

Yes.

Thanks,

Ingo

2015-05-10 10:07:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"


* Eric W. Biederman <[email protected]> wrote:

> > What do you think?
>
> The tricky bit is what happens when you reboot and run a different
> version of the kernel, especially a kernel with debugging
> features like kmemcheck that increase the size of struct page.

Yes - but I think that's relatively easy to handle, as most 'weird'
page struct usages can be cordoned off:

I.e. we could define a 64-bit "core" struct page, denote it with a
single PG_ flag and stick with it: the only ABI is its size
essentially, as we (lazily) re-initialize it after every bootup.

The 'extended' (often debug) part of a struct page, such as
page->shadow on kmemcheck, can simply be handled in a special way
based on the PG_ flag:

- for example in the kmemcheck case no page->shadow means no leak
tracking: that's perfectly fine as these pages aren't part of the
buddy allocator and kmalloc() anyway.

- or NUMA_BALANCING's page->_last_cpupid can be 0 as well, as these
pages aren't (normally) NUMA-migrated.

The extended fields would have to be accessed via small wrappers,
which return 0 if the extended part is not present, but that's pretty
much all.
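
For instance, something like the following, with PageCoreOnly() standing
in for the single PG_ flag mentioned above and kmemcheck's page->shadow
as the example extended field:

/* Wrapper for an 'extended' field; returns a null value for core-only pages. */
static inline void *page_kmemcheck_shadow(struct page *page)
{
	if (PageCoreOnly(page))		/* hypothetical PG_ flag: 64-byte core only */
		return NULL;		/* no extended part, no leak tracking */
	return page->shadow;		/* full struct page: extended field present */
}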

> I don't think we could have persistent struct page entries, as the
> exact contents of the struct page entries is too volatile and too
> different between architectures. [...]

Especially with the 2MB (and 1GB) granular lazy initialization
approach persisting them across reboots does not seem necessary
either.

Even main RAM is already doing lazy initialization: Mel's patches that
do that just went into -mm.

> [...] Especially architecture changes that a pmem store is likely to
> see such as switching between a 32bit and a 64bit kernel.

We'd not want to ABI-restrict the layout of struct page. But to say
that there's a core 64-byte descriptor per 4K page is not an overly
strict promise to keep.

> Further I think where in the persistent memory the struct page
> arrays live is something we could leave up to the filesystem. We
> could have some reasonable constraints to make it fast but I think
> whoever decides where things live on the persistent memory can make
> that choice.

So the beauty of the scheme is that in its initial incarnation it's
filesystem independent: you can create any filesystem on top of it
seamlessly, the filesystem simply sees a linear block device that is
1.5% smaller than the underlying storage. It won't even (normally)
have access to the struct page areas. This kind of data space
separation also protects against filesystem-originated data
corruption.

Now in theory a filesystem might be aware of it, but I think it's far
more important to keep this scheme simple, robust, fast and
predictable.

> For small persistent memories it probably make sense to allocate the
> struct page array describing them out of ordinary ram. For small
> memories I don't think we are talking enough memory to worry about.
> For TB+ persistent memories where you need 16GiB per TiB it makes
> sense to allocate a one or several regions to store your struct page
> arrays, as you can't count on ordinary ram having enough capacity,
> and you may not even be talking about a system that actually has
> ordinary ram at that point.

Correct - if there's no ordinary page cache in main DRAM then for many
appliances ordinary RAM could be something like SRAM: really fast and
not wasted on dirty state and IO caches - a huge, directly mapped L4
or L5 CPU cache in essence.

Thanks,

Ingo

2015-05-10 17:29:05

by Dan Williams

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Sun, May 10, 2015 at 2:46 AM, Ingo Molnar <[email protected]> wrote:
>
> * Dan Williams <[email protected]> wrote:
>
>> > "Directly mapped pmem integrated into the page cache":
>> > ------------------------------------------------------
>
>> Nice, I think it makes sense as an area that gets reserved at file
>> system creation time. You are not proposing that this gets
>> automatically reserved at the device level, right? [...]
>
> Well, it's most practical if the device does it automatically (the
> layout is determined prior to filesystem creation), and the filesystem
> does not necessarily have to be aware of it - but obviously as a user
> opt-in.
>

Hmm, my only hesitation is that the raw size of a pmem device is
visible outside of Linux (UEFI/BIOS, other OSes, etc). What about a
simple layered block-device that fronts a raw pmem device? It can
store a small superblock signature and then reserve / init the struct
page space. This is where we can use the __pfn_t flags that Linus
suggested. Whereas the raw device ->direct_access() returns __pfn_t
values with a 'PFN_DEV' flag indicating originating from
device-memory, a ->direct_access() on this struct-page-provider-device
yields __pfn_t's with 'PFN_DEV | PFN_MAPPED' indicating that
__pfn_t_to_page() can attempt to lookup the page in a device-specific
manner (similar to how kmap_atomic_pfn_t is implemented in the
'evacuate struct page from the block layer' patch set).
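
Roughly, in terms of the flags described above - assuming a __pfn_t that
keeps its flags in the low bits of a single word, with
pmem_dev_lookup_page() as a hypothetical stand-in for the
device-specific lookup:

/* Encoding details and helper names below are illustrative. */
#define PFN_FLAG_BITS	2
#define PFN_DEV		(1UL << 0)	/* pfn originates from device memory */
#define PFN_MAPPED	(1UL << 1)	/* the device also provides struct pages */

static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
{
	return pfn.val >> PFN_FLAG_BITS;
}

static inline struct page *__pfn_t_to_page(__pfn_t pfn)
{
	if (!(pfn.val & PFN_DEV))		/* ordinary RAM: use the memmap */
		return pfn_to_page(__pfn_t_to_pfn(pfn));
	if (pfn.val & PFN_MAPPED)		/* struct-page-provider device */
		return pmem_dev_lookup_page(__pfn_t_to_pfn(pfn));
	return NULL;				/* raw device pfn: no struct page */
}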

2015-05-11 08:25:47

by Dave Chinner

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Sat, May 09, 2015 at 10:45:10AM +0200, Ingo Molnar wrote:
>
> * Rik van Riel <[email protected]> wrote:
>
> > On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> > > On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
> > >>
> > >> Now go and look at your /home or /data/ or /work areas, where the
> > >> endusers are actually keeping their day to day work. Photos, mp3,
> > >> design files, source code, object code littered around, etc.
> > >
> > > However, the big files in that list are almost immaterial from a
> > > caching standpoint.
> >
> > > The big files in your home directory? Let me make an educated guess.
> > > Very few to *none* of them are actually in your page cache right now.
> > > And you'd never even care if they ever made it into your page cache
> > > *at*all*. Much less whether you could ever cache them using large
> > > pages using some very fancy cache.
> >
> > However, for persistent memory, all of the files will be "in
> > memory".
> >
> > Not instantiating the 4kB struct pages for 2MB areas that are not
> > currently being accessed with small files may make a difference.
> >
> > For dynamically allocated 4kB page structs, we need some way to
> > discover where they are. It may make sense, from a simplicity point
> > of view, to have one mechanism that works both for pmem and for
> > normal system memory.
>
> I don't think we need to or want to allocate page structs dynamically,
> which makes the model really simple and robust.
>
> If we 'think big', we can create something very exciting IMHO, that
> also gets rid of most of the complications with DIO, DAX, etc:
>
> "Directly mapped pmem integrated into the page cache":
> ------------------------------------------------------
>
> - The pmem filesystem is mapped directly in all cases, it has device
> side struct page arrays, and its struct pages are directly in the
> page cache, write-through cached. (See further below about how we
> can do this.)
>
> Note that this is radically different from the current approach
> that tries to use DIO and DAX to provide specialized "direct
> access" APIs.
>
> With the 'directly mapped' approach we have numerous advantages:
>
> - no double buffering to main RAM: the device pages represent
> file content.
>
> - no bdflush, no VM pressure, no writeback pressure, no
> swapping: this is a very simple VM model where the device is

But, OTOH, no encryption, no compression, no
mirroring/redundancy/repair, etc. i.e. it's a model where it is
impossible to do data transformations in the IO path....

> - every read() would be equivalent to a DIO read, without the
> complexity of DIO.

Sure, it is replaced with the complexity of the buffered read path.
Swings and roundabouts.

> - every read() or write() done into a data mmap() area would
> allow device-to-device zero copy DMA.
>
> - main RAM caching would still be available and would work in
> many cases by default: as most apps use file processing
> buffers in anonymous memory into which they read() data.
>
> We can achieve this by statically allocating all page structs on the
> device, in the following way:
>
> - For every 128MB of pmem data we allocate 2MB of struct-page
> descriptors, 64 bytes each, that describes that 128MB data range
> in a 4K granular way. We never have to allocate page structs as
> they are always there.

Who allocates them, when do they get allocated, what happens when
they get corrupted?

> - Filesystems don't directly see the preallocated page arrays, they
> still get a 'logical block space' presented to them that looks
> like a continuous block device (which is 1.5% smaller than the
> true size of the device): this allows arbitrary filesystems to be
> put into such pmem devices, fsck will just work, etc.

Again, what happens when the page arrays get corrupted? You can't
just reboot to make the corruption go away.

i.e. what's the architecture of the supporting userspace utilities
that are needed to manage this persistent page array area?

> I.e. no special pmem filesystem: the full range of existing block
> device based Linux filesystems can be used.
>
> - These page structs are initialized in three layers:
>
> - a single bit at 128MB data granularity: the first struct page
> of the 2MB large array (32,768 struct page array members)
> represents the initialization state of all of them.
>
> - a single bit at 2MB data granularity: the first struct page
> of every 32K array within the 2MB array represents the whole
> 2MB data area. There are 64 such bits per 2MB array.
>
> - a single bit at 4K data granularity: the whole page array.

Why wouldn't you just initialise them for the whole device in one
go? If they are transparent to the filesystem address space, then
you have to reserve space for the entire pmem range up front, so
why wouldn't you just initialise them when you reserve the space?

> A page marked uninitialized at a higher layer means all lower
> layer struct pages are in their initial state.
>
> This is a variant of your suggestion: one that keeps everything
> 2MB aligned, so that a single kernel side 2MB TLB covers a
> continuous chunk of the page array. This allows us to create a
> linear VMAP physical memory model to simplify index mapping.

What is doing this aligned allocation of the persistent memory
extents? The filesystem, right?

All this talk about page arrays and aligned allocation of pages
for mapping as large pages has to come from the filesystem
allocating large aligned extents. IOWs, the only way we can get
large page mappings in the VM for persistent memory is if the
filesystem managing the persistent memory /does the right thing/.

And, of course, different platforms have different page sizes, so
designing page array structures to be optimal for x86-64 is just a
wee bit premature.

What we need to do is work out how we are going to tell the
filesystem that is managing the persistent memory what the alignment
constraints it needs to work under are.

> - For TB range storage we could make it 1GB granular: We'd allocate
> a 1GB array for every 64 GB of data. This would also allow gbpage
> TLBs to be taken advantage of: especially on the kernel side

A properly designed extent allocator will understand this second
level of alignment.
                              VM               FS
Minimum unit of allocation:   PAGE_SIZE        block size
First unit of alignment:      Large Page Size  stripe unit
Second unit of alignment:     Giant Page Size  stripe width

i.e. this is the information the pmem device needs to feed
the filesystem mkfs program to start it down the correct path.
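
One way to feed that information to mkfs without inventing new
interfaces would be the existing block queue topology hints, which
mkfs.xfs already consults when picking stripe unit/width. Whether the
tools would interpret the hints exactly this way is an assumption, and
the sizes below are illustrative.

static void pmem_advertise_alignment(struct request_queue *q)
{
        /* minimum unit of allocation: the VM page size */
        blk_queue_physical_block_size(q, PAGE_SIZE);
        /* first unit of alignment: large (PMD) pages -> io_min / stripe unit */
        blk_queue_io_min(q, 2UL << 20);
        /* second unit of alignment: giant (PUD) pages -> io_opt / stripe width */
        blk_queue_io_opt(q, 1UL << 30);
}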

Next, you need a hint for each file to tell the filesystem what
alignment it should try to allocate with. XFS has extent size hints
for this, and for a 4k page/block size this allows up to 4GB hints
to be set. XFS allocates these as unwritten extents, so if the VM
can only map it as PAGE_SIZE mappings, then everything will still
just work - the dirtied pages will get converted to written, and
everything else will appear as zeros because they remain unwritten.

Map the page as a single 2MB chunk, and then the fs will have to
zero the entire chunk on the first write page fault so it can mark
the entire extent as written data.
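
For reference, setting such an extent size hint from userspace goes
through the XFS fsxattr ioctls; a rough sketch, with error handling
trimmed and 2MB chosen only as an example value:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>

/* Ask XFS to allocate space for 'path' in multiples of 'bytes'. */
static int set_extsize_hint(const char *path, unsigned int bytes)
{
        struct fsxattr fsx;
        int ret = -1;
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;
        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) == 0) {
                fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;
                fsx.fsx_extsize = bytes;        /* e.g. 2 * 1024 * 1024 */
                ret = ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
        }
        close(fd);
        return ret;
}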

IOWs the initialisation state of the struct pages is actually a
property of the filesystem space usage, not a property of the
virtual mappings that are currently active. If the space is in use,
then the struct pages must be initialised; if the pages are
free space then we don't care what their contents are as nobody can
be accessing them. Further, we cannot validate that the page array
structures are valid in isolation (we must be able to independently
validate them if they are persistent) and hence we need to know
whether the pages are referenced by the filesystem or not to
determine whether their state is correct.

Which comes back to my original question: if the struct page arrays
are outside the visibility of the filesystem, how do we manage them
in a safe and consistent manner? How do we verify they are correct and
coherent with the filesystem using the device when the filesystem
knows nothing about page mapping space, and the page mapping space
knows nothing about the contents of the pmem device? Indeed, how do we
do transactionally safe updates to the page arrays to mark them
initialised so that they are atomic w.r.t. the associated filesystem
free space state changes? And dare I say "truncate"?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-05-11 09:18:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"


* Dave Chinner <[email protected]> wrote:

> On Sat, May 09, 2015 at 10:45:10AM +0200, Ingo Molnar wrote:
> >
> > * Rik van Riel <[email protected]> wrote:
> >
> > > On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> > > > On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
> > > >>
> > > >> Now go and look at your /home or /data/ or /work areas, where the
> > > >> endusers are actually keeping their day to day work. Photos, mp3,
> > > >> design files, source code, object code littered around, etc.
> > > >
> > > > However, the big files in that list are almost immaterial from a
> > > > caching standpoint.
> > >
> > > > The big files in your home directory? Let me make an educated guess.
> > > > Very few to *none* of them are actually in your page cache right now.
> > > > And you'd never even care if they ever made it into your page cache
> > > > *at*all*. Much less whether you could ever cache them using large
> > > > pages using some very fancy cache.
> > >
> > > However, for persistent memory, all of the files will be "in
> > > memory".
> > >
> > > Not instantiating the 4kB struct pages for 2MB areas that are not
> > > currently being accessed with small files may make a difference.
> > >
> > > For dynamically allocated 4kB page structs, we need some way to
> > > discover where they are. It may make sense, from a simplicity point
> > > of view, to have one mechanism that works both for pmem and for
> > > normal system memory.
> >
> > I don't think we need to or want to allocate page structs dynamically,
> > which makes the model really simple and robust.
> >
> > If we 'think big', we can create something very exciting IMHO, that
> > also gets rid of most of the complications with DIO, DAX, etc:
> >
> > "Directly mapped pmem integrated into the page cache":
> > ------------------------------------------------------
> >
> > - The pmem filesystem is mapped directly in all cases, it has device
> > side struct page arrays, and its struct pages are directly in the
> > page cache, write-through cached. (See further below about how we
> > can do this.)
> >
> > Note that this is radically different from the current approach
> > that tries to use DIO and DAX to provide specialized "direct
> > access" APIs.
> >
> > With the 'directly mapped' approach we have numerous advantages:
> >
> > - no double buffering to main RAM: the device pages represent
> > file content.
> >
> > - no bdflush, no VM pressure, no writeback pressure, no
> > swapping: this is a very simple VM model where the device is
>
> But, OTOH, no encryption, no compression, no
> mirroring/redundancy/repair, etc. [...]

mirroring/redundancy/repair should be relatively easy to add without
hurting the simplicity of the scheme - but it can also be part of
the filesystem.

Compression and encryption are not able to directly represent content
in pram anyway. You could still do per file encryption and
compression, if the filesystem supports it. Any block based filesystem
can be used.

> [...] i.e. it's a model where it is impossible to do data
> transformations in the IO path....

So the limitation is to not do destructive data transformations, so
that we can map 'storage content' to 'user memory' directly. (FWIMBW)

But you are wrong about mirroring/redundancy/repair: these concepts do
not require destructive data (content) transformation: they mostly
work by transforming addresses (or at most adding extra metadata),
they don't destroy the original content.

> > - every read() would be equivalent to a DIO read, without the
> > complexity of DIO.
>
> Sure, it is replaced with the complexity of the buffered read path.
> Swings and roundabouts.

So you say this as if it was a bad thing, while the regular read()
path is Linux's main VFS and IO path. So I'm not sure what your point
is here.

> > - every read() or write() done into a data mmap() area would
> > allow device-to-device zero copy DMA.
> >
> > - main RAM caching would still be available and would work in
> > many cases by default: as most apps use file processing
> > buffers in anonymous memory into which they read() data.
> >
> > We can achieve this by statically allocating all page structs on the
> > device, in the following way:
> >
> > - For every 128MB of pmem data we allocate 2MB of struct-page
> > descriptors, 64 bytes each, that describes that 128MB data range
> > in a 4K granular way. We never have to allocate page structs as
> > they are always there.
>
> Who allocates them, when do they get allocated, [...]

Multiple models can be used for that: the simplest would be at device
creation time with some exceedingly simple tooling that just sets a
superblock to make it easy to autodetect. (Should the superblock get
corrupted, it can be re-created with the same parameters,
non-destructively, etc.)

There's nothing unusual here, there are no extra tradeoffs that I can
see.
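
To make that "exceedingly simple tooling" concrete, a hypothetical
on-media superblock for the reserved page-array area might look like
the sketch below. Every field name and the magic value are
illustrative; nothing like this exists today.

/* types as in <linux/types.h>; layout purely illustrative */
struct pmem_pgmap_sb {
        __le64  magic;          /* e.g. "PMEMPGMP", for autodetection        */
        __le32  version;
        __le32  page_size;      /* data granularity, 4096 by default         */
        __le64  data_offset;    /* start of the logical block space          */
        __le64  data_size;      /* size presented to the filesystem          */
        __le64  pgmap_offset;   /* start of the reserved struct page area    */
        __le64  pgmap_size;     /* ~1.5% of data_size for 64B descriptors    */
};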

> [...] what happens when they get corrupted?

Nothing unexpected should happen, they get reinitialized on every
reboot, see the lazy initialization scheme I describe later in the
proposal.

> > - Filesystems don't directly see the preallocated page arrays, they
> > still get a 'logical block space' presented to them that looks
> > like a continuous block device (which is 1.5% smaller than the
> > true size of the device): this allows arbitrary filesystems to be
> > put into such pmem devices, fsck will just work, etc.
>
> Again, what happens when the page arrays get corrupted? You can't
> just reboot to make the corruption go away.

That's exactly what you can do - just like what you do when the
regular DRAM page array gets corrupted.

> i.e. what's the architecture of the supporting userspace utilities
> that are needed to manage this persistent page array area?

The structure is so simple and is essentially lazy initialized again
from scratch on bootup (like regular RAM page arrays) so that no
utilities are needed for the kernel to make use of them.

> > I.e. no special pmem filesystem: the full range of existing block
> > device based Linux filesystems can be used.
> >
> > - These page structs are initialized in three layers:
> >
> > - a single bit at 128MB data granularity: the first struct page
> > of the 2MB large array (32,768 struct page array members)
> > represents the initialization state of all of them.
> >
> > - a single bit at 2MB data granularity: the first struct page
> > of every 32K array within the 2MB array represents the whole
> > 2MB data area. There are 64 such bits per 2MB array.
> >
> > - a single bit at 4K data granularity: the whole page array.
>
> Why wouldn't you just initialise them for the whole device in one
> go? If they are transparent to the filesystem address space, then
> you have to reserve space for the entire pmem range up front, so why
> wouldn't you just initialise them when you reserve the space?

Partly because we don't want to make the contents of struct page an
ABI, and also because this fits the regular 'memory zone' model
better.
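
For concreteness, one plausible reading of the three-layer scheme
quoted above, written as a predicate. PG_pmem_init is a stand-in for a
hypothetical page flag and pmem_pages stands for the base of the
reserved on-device struct page array; both are illustrative.

#define PG_pmem_init    PG_owner_priv_1         /* stand-in bit, illustrative */
#define PAGES_PER_128M  ((128UL << 20) >> PAGE_SHIFT)   /* 32768 */
#define PAGES_PER_2M    ((2UL << 20) >> PAGE_SHIFT)     /*   512 */

static bool pmem_page_initialized(struct page *pmem_pages, unsigned long idx)
{
        struct page *l128 = &pmem_pages[round_down(idx, PAGES_PER_128M)];
        struct page *l2   = &pmem_pages[round_down(idx, PAGES_PER_2M)];

        /* a clear bit at a higher layer means every struct page below
         * it is still in its initial (never written) state */
        return test_bit(PG_pmem_init, &l128->flags) &&
               test_bit(PG_pmem_init, &l2->flags) &&
               test_bit(PG_pmem_init, &pmem_pages[idx].flags);
}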

> > A page marked uninitialized at a higher layer means all lower
> > layer struct pages are in their initial state.
> >
> > This is a variant of your suggestion: one that keeps everything
> > 2MB aligned, so that a single kernel side 2MB TLB covers a
> > continuous chunk of the page array. This allows us to create a
> > linear VMAP physical memory model to simplify index mapping.
>
> What is doing this aligned allocation of the persistent memory
> extents? The filesystem, right?

No, it happens at the (block) device level, the filesystem does not
see anything from this, it's transparent.

> All this talk about page arrays and aligned allocation of pages for
> mapping as large pages has to come from the filesystem allocating
> large aligned extents. IOWs, the only way we can get large page
> mappings in the VM for persistent memory is if the filesystem
> managing the persistent memory /does the right thing/.

No, it does not come from the filesystem, in my suggested scheme it's
allocated at the pmem device level.

> And, of course, different platforms have different page sizes, so
> designing page array structures to be optimal for x86-64 is just a
> wee bit premature.

4K is the smallest one on x86 and ARM, and it's also IMHO a pretty
sane default from a human workflow point of view.

But oddball configs with larger page sizes could also be supported at
device creation time (via a simple superblock structure).

> What we need to do is work out how we are going to tell the
> filesystem that is managing the persistent memory what the alignment
> constraints it needs to work under are.

The filesystem does not need to know about any of this: it sees a
linear, continuous range of storage space - the page arrays are hidden
from it.

> [...]
>
> Which comes back to my original question: if the struct page arrays
> are outside the visibility of the filesystem, how do we manage them
> in a safe and consistent manner? How do we verify they are correct and
> coherent with the filesystem using the device when the filesystem
> knows nothing about page mapping space, and the page mapping space
> knows nothing about the contents of the pmem device?

The page arrays are outside the filesystem's visibility, just like the
management of regular main RAM page arrays is outside the
filesystem's visibility.

> [...] Indeed, how do we do transactionally safe updates to the page
> arrays to mark them initialised so that they are atomic w.r.t. the
> associated filesystem free space state changes?

We don't need transaction safe updates of the 'initialized' bits, as
the highest level is marked to zero at bootup; we only need them to be
SMP coherent - which the regular page flag ops guarantee.
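
A minimal sketch of that kind of update: SMP coherent via an atomic
page-flag style bit op, but with no crash transactionality.
PG_pmem_init and pmem_init_2M_group() are illustrative names, and a
real implementation would also have to wait for the race winner to
finish (a second 'done' bit or a lock), which is omitted here.

#define PG_pmem_init    PG_owner_priv_1         /* stand-in bit, illustrative */

void pmem_init_2M_group(struct page *group_head);       /* placeholder */

static void pmem_ensure_2M_group(struct page *group_head)
{
        /* regular atomic page-flag style op: coherent across CPUs */
        if (!test_and_set_bit(PG_pmem_init, &group_head->flags))
                pmem_init_2M_group(group_head); /* first caller initializes */
}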

> [...] And dare I say "truncate"?

truncate has no relation to this: the filesystem manages its free
space like it did previously.

Really, I'd be blind to not notice your hostility and I'd like to
understand its source. What's the problem?

Thanks,

Ingo

2015-05-11 10:13:22

by Zuckerman, Boris

[permalink] [raw]
Subject: RE: "Directly mapped persistent memory page cache"

Hi,

Data transformation (EC, encryption, etc) is commonly done by storage
systems today. But let's think about other less common existing and
PM-specific upcoming features like data sharing between multiple
consumers (computers, for example), support for atomicity (to avoid
journaling in PM space), etc.

Support for such features really calls for more advanced run-time
handling of memory resources in the OS. In my mind that naturally
calls today for dynamic struct page allocation, but may need to go
even beyond that into understanding what's persistent and what's
volatile, extending and shrinking memory, etc...

Boris

2015-05-11 10:38:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"


* Zuckerman, Boris <[email protected]> wrote:

> Hi,
>
> Data transformation (EC, encryption, etc) is commonly done by
> storage systems today. [...]

That's a strawman argument: if you do encryption/compression in the
storage space then you don't need complex struct page descriptors for
the storage space, as the resulting content won't be mappable nor
DMA-able from high level APIs...

My proposal adds a RAM-integrated usage model for devices that are
directly mapped in physical RAM space (such as persistent memory),
where integration with high level Linux APIs is possible and
desirable.

If pmem is used as a front-side cache for a larger storage system
behind, then the disk side can still be encrypted/compressed/etc.

( Also note that if the pmem hardware itself adds an encryption pass
then actual stored content might still be encrypted. )

> [...] But let's think about other less common existing and PM
> specific upcoming features like data sharing between multiple
> consumers (computers for example), [...]

RDMA should work fine if the pmem hardware exposes it.

> [...] support for atomicity (to avoid journaling in PM space), etc.

This too should work fine, by way of the SMP coherency protocol, if
atomic instructions are used on the relevant metadata.

Thanks,

Ingo

2015-05-11 14:31:09

by Matthew Wilcox

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Sat, May 09, 2015 at 10:45:10AM +0200, Ingo Molnar wrote:
> If we 'think big', we can create something very exciting IMHO, that
> also gets rid of most of the complications with DIO, DAX, etc:
>
> "Directly mapped pmem integrated into the page cache":
> ------------------------------------------------------
>
> - The pmem filesystem is mapped directly in all cases, it has device
> side struct page arrays, and its struct pages are directly in the
> page cache, write-through cached. (See further below about how we
> can do this.)
>
> Note that this is radically different from the current approach
> that tries to use DIO and DAX to provide specialized "direct
> access" APIs.
>
> With the 'directly mapped' approach we have numerous advantages:
>
> - no double buffering to main RAM: the device pages represent
> file content.
>
> - no bdflush, no VM pressure, no writeback pressure, no
> swapping: this is a very simple VM model where the device is
> RAM and we don't have much dirty state. The primary kernel
> cache is the dcache and the directly mapped page cache, which
> is not a writeback cache in this case but essentially a
> logical->physical index cache of filesystem indexing
> metadata.
>
> - every binary mmap()ed would be XIP mapped in essence
>
> - every read() would be equivalent to a DIO read, without the
> complexity of DIO.
>
> - every read() or write() done into a data mmap() area would
> allow device-to-device zero copy DMA.
>
> - main RAM caching would still be available and would work in
> many cases by default: as most apps use file processing
> buffers in anonymous memory into which they read() data.

I admire your big vision, but I think there are problems that it doesn't
solve.

1. The difference in lifetimes between filesystem blocks and page cache
pages that represent them. Existing filesystems have their own block
allocators which have their own notions of when blocks are available for
reallocation which may differ from when a page in the page cache can be
reused for caching another block.

Concrete example: A mapped page of a file is used as the source or target
of a direct I/O. That file is simultaneously truncated, which in our
current paths calls the filesystem to free the block, while leaving the
page cache page in place in order to be the source or destination of
the I/O. Once the I/O completes, the page's reference count drops to
zero and the page can be freed.

If we do not modify the filesystem, that page/block may end up referring
to a block in a different file, with the usual security & integrity
problems.

2. Some of the media which currently exist (not exactly supported
well by the current DAX framework either) have great read properties,
but abysmal write properties. For example, they may have only a small
number of write cycles, or they may take milliseconds to absorb a write.
These media might work well for mapping some read-mostly files directly,
but be poor choices for putting things like struct page in, which contains
cachelines which are frequently modified.

> We can achieve this by statically allocating all page structs on the
> device, in the following way:
>
> - For every 128MB of pmem data we allocate 2MB of struct-page
> descriptors, 64 bytes each, that describes that 128MB data range
> in a 4K granular way. We never have to allocate page structs as
> they are always there.
>
> - Filesystems don't directly see the preallocated page arrays, they
> still get a 'logical block space' presented to them that looks
> like a continuous block device (which is 1.5% smaller than the
> true size of the device): this allows arbitrary filesystems to be
> put into such pmem devices, fsck will just work, etc.
>
> I.e. no special pmem filesystem: the full range of existing block
> device based Linux filesystems can be used.

I think the goal of "use any Linux filesystem" is laudable, but
impractical. Since we're modifying filesystems anyway, is there an
advantage to doing this in the block device instead of just allocating the
struct pages in a special file in the filesystem (like modern filesystems
do for various structures)?

2015-05-11 14:52:01

by Jeff Moyer

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

Ingo Molnar <[email protected]> writes:

>> [...] support for atomicity (to avoid journaling in PM space), etc.
>
> This too should work fine, by way of the SMP coherency protocol, if
> atomic instructions are used on the relevant metadata.

This isn't true. Visibility and durability are two very different
things. That's what pcommit is all about. Search for it in this
document, if you aren't already familiar with it:
https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference
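
To make the distinction concrete, a rough sketch with hypothetical
clwb_range() / pcommit() helpers standing in for the CLWB and PCOMMIT
instructions documented there; the helper names are placeholders.

/* hypothetical stand-ins for the CLWB and PCOMMIT instructions */
void clwb_range(void *addr, size_t len);
void pcommit(void);

static void pmem_store_visible(u64 *slot, u64 val)
{
        WRITE_ONCE(*slot, val); /* other CPUs see it via cache coherency   */
        smp_wmb();              /* ordered for other CPUs, not yet durable */
}

static void pmem_store_durable(u64 *slot, u64 val)
{
        WRITE_ONCE(*slot, val);
        clwb_range(slot, sizeof(*slot)); /* push the line out of the caches */
        wmb();                           /* order the flush before commit   */
        pcommit();                       /* commit queued writes to media   */
        wmb();                           /* fence the commit itself         */
}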

However, in the context of your page structures that are re-initialized
every boot, durability doesn't matter. So maybe that's what you were
saying?

Cheers,
Jeff

2015-05-11 20:02:06

by Jerome Glisse

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Mon, May 11, 2015 at 10:31:14AM -0400, Matthew Wilcox wrote:
> On Sat, May 09, 2015 at 10:45:10AM +0200, Ingo Molnar wrote:
> > If we 'think big', we can create something very exciting IMHO, that
> > also gets rid of most of the complications with DIO, DAX, etc:
> >
> > "Directly mapped pmem integrated into the page cache":
> > ------------------------------------------------------
> >
> > - The pmem filesystem is mapped directly in all cases, it has device
> > side struct page arrays, and its struct pages are directly in the
> > page cache, write-through cached. (See further below about how we
> > can do this.)
> >
> > Note that this is radically different from the current approach
> > that tries to use DIO and DAX to provide specialized "direct
> > access" APIs.
> >
> > With the 'directly mapped' approach we have numerous advantages:
> >
> > - no double buffering to main RAM: the device pages represent
> > file content.
> >
> > - no bdflush, no VM pressure, no writeback pressure, no
> > swapping: this is a very simple VM model where the device is
> > RAM and we don't have much dirty state. The primary kernel
> > cache is the dcache and the directly mapped page cache, which
> > is not a writeback cache in this case but essentially a
> > logical->physical index cache of filesystem indexing
> > metadata.
> >
> > - every binary mmap()ed would be XIP mapped in essence
> >
> > - every read() would be equivalent to a DIO read, without the
> > complexity of DIO.
> >
> > - every read() or write() done into a data mmap() area would
> > allow device-to-device zero copy DMA.
> >
> > - main RAM caching would still be available and would work in
> > many cases by default: as most apps use file processing
> > buffers in anonymous memory into which they read() data.
>
> I admire your big vision, but I think there are problems that it doesn't
> solve.
>
> 1. The difference in lifetimes between filesystem blocks and page cache
> pages that represent them. Existing filesystems have their own block
> allocators which have their own notions of when blocks are available for
> reallocation which may differ from when a page in the page cache can be
> reused for caching another block.
>
> Concrete example: A mapped page of a file is used as the source or target
> of a direct I/O. That file is simultaneously truncated, which in our
> current paths calls the filesystem to free the block, while leaving the
> page cache page in place in order to be the source or destination of
> the I/O. Once the I/O completes, the page's reference count drops to
> zero and the page can be freed.
>
> If we do not modify the filesystem, that page/block may end up referring
> to a block in a different file, with the usual security & integrity
> problems.
>
> 2. Some of the media which currently exist (not exactly supported
> well by the current DAX framework either) have great read properties,
> but abysmal write properties. For example, they may have only a small
> number of write cycles, or they may take milliseconds to absorb a write.
> These media might work well for mapping some read-mostly files directly,
> but be poor choices for putting things like struct page in, which contains
> cachelines which are frequently modified.

I also would like to stress that such a solution would not work for me.
In my case the device memory might not even be mappable by the CPU. I
admit that it is an odd case, but nonetheless there are hardware
limitations (PCIe BAR size, though we could resize them).

Even in the case where we can map the device memory to the CPU, I would
rather not have any struct page on the device memory. Any kind of small
read to device memory will simply break PCIe bandwidth. Each read and
write becomes a PCIe transaction with a minimum payload (128 bytes
IIRC), and thus all these small accesses will congest the bus,
effectively crippling PCIe bandwidth.

I think the scheme I proposed is easier and can serve not only my case
but also the PMEM folks. Use the zero page, mapped read only, for all
struct pages that are not yet allocated. When one of the filesystem
layers (the GPU driver in my case) needs to expose a struct page, it
allocates a page, initializes a valid struct page, and then replaces
the zero page with that page for the given pfn range.

The only performance-hurting change is to pfn_to_page(), which would
need to test a flag inside the struct page, returning NULL if it is
not set or the struct page otherwise. But I think even that should be
fine, as anyway someone asking for the struct page from a pfn is
likely to access the struct page, so if pfn_to_page() already
dereferences it then it becomes hot in cache, like prefetching for the
caller.
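
A sketch of that pfn_to_page() change, under the assumption that the
struct page slot always exists (zero-page backed until a real page is
installed); PG_backed is an illustrative flag name.

#define PG_backed       PG_owner_priv_1         /* stand-in bit, illustrative */

static inline struct page *pfn_to_page_checked(unsigned long pfn)
{
        struct page *page = __pfn_to_page(pfn); /* memmap/vmemmap lookup */

        /* dereferencing here also warms the cacheline for the caller */
        if (!test_bit(PG_backed, &page->flags))
                return NULL;    /* still the shared zero-page backing */
        return page;
}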

We can likely avoid a TLB flush here (I assume that a write fault
triggers the CPU to try to update its TLB entry by rewalking the page
table instead of triggering the page fault vector).

Moreover, I think only very few places will want to allocate the
underlying struct page; they should be high level in the filesystem
stack and thus can more easily cope with memory starvation. Not to
mention we can design a whole new cache for such allocations.

Cheers,
Jérôme

2015-05-12 00:53:57

by Dave Chinner

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Mon, May 11, 2015 at 11:18:36AM +0200, Ingo Molnar wrote:
>
> * Dave Chinner <[email protected]> wrote:
>
> > On Sat, May 09, 2015 at 10:45:10AM +0200, Ingo Molnar wrote:
> > >
> > > * Rik van Riel <[email protected]> wrote:
> > >
> > > > On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> > > > > On Fri, May 8, 2015 at 7:40 AM, John Stoffel <[email protected]> wrote:
> > > > >>
> > > > >> Now go and look at your /home or /data/ or /work areas, where the
> > > > >> endusers are actually keeping their day to day work. Photos, mp3,
> > > > >> design files, source code, object code littered around, etc.
> > > > >
> > > > > However, the big files in that list are almost immaterial from a
> > > > > caching standpoint.
> > > >
> > > > > The big files in your home directory? Let me make an educated guess.
> > > > > Very few to *none* of them are actually in your page cache right now.
> > > > > And you'd never even care if they ever made it into your page cache
> > > > > *at*all*. Much less whether you could ever cache them using large
> > > > > pages using some very fancy cache.
> > > >
> > > > However, for persistent memory, all of the files will be "in
> > > > memory".
> > > >
> > > > Not instantiating the 4kB struct pages for 2MB areas that are not
> > > > currently being accessed with small files may make a difference.
> > > >
> > > > For dynamically allocated 4kB page structs, we need some way to
> > > > discover where they are. It may make sense, from a simplicity point
> > > > of view, to have one mechanism that works both for pmem and for
> > > > normal system memory.
> > >
> > > I don't think we need to or want to allocate page structs dynamically,
> > > which makes the model really simple and robust.
> > >
> > > If we 'think big', we can create something very exciting IMHO, that
> > > also gets rid of most of the complications with DIO, DAX, etc:
> > >
> > > "Directly mapped pmem integrated into the page cache":
> > > ------------------------------------------------------
> > >
> > > - The pmem filesystem is mapped directly in all cases, it has device
> > > side struct page arrays, and its struct pages are directly in the
> > > page cache, write-through cached. (See further below about how we
> > > can do this.)
> > >
> > > Note that this is radically different from the current approach
> > > that tries to use DIO and DAX to provide specialized "direct
> > > access" APIs.
> > >
> > > With the 'directly mapped' approach we have numerous advantages:
> > >
> > > - no double buffering to main RAM: the device pages represent
> > > file content.
> > >
> > > - no bdflush, no VM pressure, no writeback pressure, no
> > > swapping: this is a very simple VM model where the device is
> >
> > But, OTOH, no encryption, no compression, no
> > mirroring/redundancy/repair, etc. [...]
>
> mirroring/redundancy/repair should be relatively easy to add without
> hurting the simplicity of the scheme - but it can also be part of
> the filesystem.

We already have it in the filesystems and block layer, but the
persistent page cache infrastructure you are proposing makes it
impossible for the existing infrastructure to be used for this
purpose.

> Compression and encryption are not able to directly represent content
> in pram anyway. You could still do per file encryption and
> compression, if the filesystem supports it. Any block based filesystem
> can be used.

Right, but they require a buffered IO path through volatile RAM,
which means treating it just like a normal storage device. IOWs,
if we add persistent page cache paths, the filesystem now will have
to support 3 different IO paths for persistent memory - a) direct
map page cache, b) buffered page cache with readahead and writeback,
and c) direct IO bypassing the page cache.

IOWs, it's not anywhere near as simple as you are implying it will
be. One of the main reasons we chose to use direct IO for DAX was so
we didn't need to add a third IO path to filesystems that wanted to
make use of DAX....

> But you are wrong about mirroring/redundancy/repair: these concepts do
> not require destructive data (content) transformation: they mostly
> work by transforming addresses (or at most adding extra metadata),
> they don't destroy the original content.

You're missing the fact that such data transformations all require
synchronisation of some kind at the IO level - it's way more complex
than just writing to RAM. e.g. parity/erasure codes need to be
calculated before any update hits the persistent storage, otherwise
the existing codes on disk are invalidated and incorrect. Hence you
cannot use direct mapped page cache (or DAX, for that matter) if the
storage path requires synchronised data updates to multiple locations
to be done.

> > > - every read() would be equivalent to a DIO read, without the
> > > complexity of DIO.
> >
> > Sure, it is replaced with the complexity of the buffered read path.
> > Swings and roundabouts.
>
> So you say this as if it was a bad thing, while the regular read()
> path is Linux's main VFS and IO path. So I'm not sure what your point
> is here.

Just pointing out that the VFS read path is not as simple and fast
as you are implying it is, especially the fact that it is not
designed for low latency, high bandwidth storage.

e.g. the VFS page IO paths are designed completely around hiding the
latency of slow, low bandwidth storage. All that readahead cruft,
dirty page throttling, writeback tracking, etc are all there to hide
crappy storage performance. In comparison, the direct IO paths have
very little overhead, are optimised for high IOPS and high bandwidth
storage, and are already known to scale to the limits of any storage
subsystem we put under it. The DIO path is currently a much better
match to the characteristics of persistent memory storage than the
VFS page IO path.

Also, the page IO has significant issues with large pages - no
persistent filesystem actually supports the use of large pages in
the page IO path. i.e all are dependent on PAGE_CACHE_SIZE struct
pages in this path, and that is not easy to change to be dynamic.

IOWs the VFS IO paths will require a fair bit of change to work
well with PRAM class storage, whereas we've only had to make minor
tweaks to the DIO paths to do the same thing...

(And I haven't even mentioned the problems related to filesystems
dependent on bufferheads in the page IO paths!)

> > > - every read() or write() done into a data mmap() area would
> > > allow device-to-device zero copy DMA.
> > >
> > > - main RAM caching would still be available and would work in
> > > many cases by default: as most apps use file processing
> > > buffers in anonymous memory into which they read() data.
> > >
> > > We can achieve this by statically allocating all page structs on the
> > > device, in the following way:
> > >
> > > - For every 128MB of pmem data we allocate 2MB of struct-page
> > > descriptors, 64 bytes each, that describes that 128MB data range
> > > in a 4K granular way. We never have to allocate page structs as
> > > they are always there.
> >
> > Who allocates them, when do they get allocated, [...]
>
> Multiple models can be used for that: the simplest would be at device
> creation time with some exceedingly simple tooling that just sets a
> superblock to make it easy to autodetect. (Should the superblock get
> corrupted, it can be re-created with the same parameters,
> non-destructively, etc.)

OK, if there's persistent metadata then there's a need for mkfs,
fsck, init tooling, persistent formatting with versioning,
configuration information, etc. Seeing as it will require userspace
tools to manage, it will need a block device to be presented - it's
effectively a special partition. That means libblkid will need to
know about it so various programs won't allow users to accidentally
overwrite that partition...

That's kind of my point - you're glossing over this as "simple", but
history and experience tells me that people who think persistent
device management is "simple" get it badly wrong.

> > [...] what happens when they get corrupted?
>
> Nothing unexpected should happen, they get reinitialized on every
> reboot, see the lazy initialization scheme I describe later in the
> proposal.

That was not clear at all from your proposal. "lazy initialisation"
of structures in preallocated persistent storage areas does not mean
"structures are volatile" to anyone who deals with persistent
storage on a day to day basis. Case in point: ext4 lazy inode table
initialisation.

Anyway, I think others have covered the fact that "PRAM as RAM" is
not desirable from a write latency and endurance POV. That's another
one of the main reasons we didn't go down the persistent page cache
path with DAX ~2 years ago...

> > And, of course, different platforms have different page sizes, so
> > designing page array structures to be optimal for x86-64 is just a
> > wee bit premature.
>
> 4K is the smallest one on x86 and ARM, and it's also IMHO a pretty
> sane default from a human workflow point of view.
>
> But oddball configs with larger page sizes could also be supported at
> device creation time (via a simple superblock structure).

Ok, so now I know it's volatile, why do we need a persistent
superblock? Why is *anything* persistent required? And why would
page size matter if the reserved area is volatile?

And if it is volatile, then the kernel is effectively doing dynamic
allocation and initialisation of the struct pages, so why wouldn't
we just do dynamic allocation out of a slab cache in RAM and free
them when the last reference to the page goes away? Applications
aren't going to be able to reference every page in persistent
memory at the same time...

Keep in mind we need to design for tens of TB of PRAM at minimum
(400GB NVDIMMS and tens of them in a single machine are not that far
away), so static arrays of structures that index 4k blocks is not a
design that scales to these sizes - it's like using 1980s filesystem
algorithms for a new filesystem designed for tens of terabytes of
storage - it can be made to work, but it's just not efficient or
scalable in the long term.

As an example, look at the current problems with scaling the
initialisation for struct pages for large memory machines - 16TB
machines are taking 10 minutes just to initialise the struct page
arrays on startup. That's the scale of overhead that static page
arrays will have for PRAM, whether they are lazily initialised or
not. IOWs, static page arrays are not scalable, and hence aren't a
viable long term solution to the PRAM problem.
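
For scale, the arithmetic behind that concern, assuming 4K pages and a
64-byte struct page (both assumptions carried over from the proposal):

/*
 *     capacity    struct pages     array size
 *     1 TB        256M  (2^28)       16 GB
 *     16 TB       4G    (2^32)      256 GB
 *     any size    capacity/4096     ~1.56% of capacity
 */
static unsigned long long page_array_bytes(unsigned long long capacity)
{
        return (capacity / 4096) * 64;  /* == capacity / 64 */
}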

IMO, we need to be designing around the concept that the filesystem
manages the pmem space, and the MM subsystem simply uses the block
mapping information provided to it from the filesystem to decide how
it references and maps the regions into the user's address space or
for DMA. The mm subsystem does not manage the pmem space, its
alignment or how it is allocated to user files. Hence page mappings
can only be - at best - reactive to what the filesystem does with
its free space. The mm subsystem already has to query the block
layer to get mappings on page faults, so it's only a small stretch
to enhance the DAX mapping request to ask for a large page mapping
rather than a 4k mapping. If the fs can't do a large page mapping,
you'll get a 4k aligned mapping back.
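
A hypothetical sketch of that negotiation; none of these structures or
helpers exist, they only illustrate the fault handler asking for a
large mapping and falling back to 4k if the filesystem cannot provide
a suitably sized, suitably aligned extent.

struct dax_map_request {
        sector_t      sector;   /* file offset resolved to a device sector   */
        size_t        want;     /* e.g. PMD_SIZE if a 2M mapping is wanted   */
        size_t        got;      /* what the fs/device could actually provide */
        void          *kaddr;
        unsigned long pfn;
};

int insert_pmd_mapping(struct vm_area_struct *vma, struct dax_map_request *req);
int insert_pte_mapping(struct vm_area_struct *vma, struct dax_map_request *req);

static int dax_fault_map(struct vm_area_struct *vma, struct dax_map_request *req)
{
        /* take the large mapping only if the fs handed back enough
         * space, suitably aligned; otherwise fall back to 4k */
        if (req->got >= PMD_SIZE &&
            IS_ALIGNED(req->pfn, PMD_SIZE >> PAGE_SHIFT))
                return insert_pmd_mapping(vma, req);
        return insert_pte_mapping(vma, req);
}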

What I'm trying to say is that the mapping behaviour needs to be
designed with the way filesystems and the mm subsystem interact in
mind, not from a pre-formed "direct IO is bad, we must use the page
cache" point of view. The filesystem and the mm subsystem must
co-operate to allow things like large page mappings to be made and
hence looking at the problem purely from a mm<->pmem device
perspective as you are ignores an important chunk of the system:
the part that actually manages the pmem space...

> Really, I'd be blind to not notice your hostility and I'd like to
> understand its source. What's the problem?

Hostile? Take a chill pill, please, Ingo, you've got entirely the
wrong impression.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-05-12 14:48:08

by Jerome Glisse

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Tue, May 12, 2015 at 10:53:47AM +1000, Dave Chinner wrote:
> On Mon, May 11, 2015 at 11:18:36AM +0200, Ingo Molnar wrote:
>
> > > And, of course, different platforms have different page sizes, so
> > > designing page array structures to be optimal for x86-64 is just a
> > > wee bit premature.
> >
> > 4K is the smallest one on x86 and ARM, and it's also a IMHO pretty
> > sane default from a human workflow point of view.
> >
> > But oddball configs with larger page sizes could also be supported at
> > device creation time (via a simple superblock structure).
>
> Ok, so now I know it's volatile, why do we need a persistent
> superblock? Why is *anything* persistent required? And why would
> page size matter if the reserved area is volatile?
>
> And if it is volatile, then the kernel is effectively doing dynamic
> allocation and initialisation of the struct pages, so why wouldn't
> we just do dynamic allocation out of a slab cache in RAM and free
> them when the last reference to the page goes away? Applications
> aren't going to be able to reference every page in persistent
> memory at the same time...
>
> Keep in mind we need to design for tens of TB of PRAM at minimum
> (400GB NVDIMMS and tens of them in a single machine are not that far
> away), so static arrays of structures that index 4k blocks is not a
> design that scales to these sizes - it's like using 1980s filesystem
> algorithms for a new filesystem designed for tens of terabytes of
> storage - it can be made to work, but it's just not efficient or
> scalable in the long term.

On having an easy pfn<->struct page relation I would agree with Ingo;
I think it is important. For instance, in my case, when migrating
system memory to device memory I store a pfn in a special swap entry.
While right now I use my own ad hoc structure, I would rather directly
use a struct page that I can easily find back from the pfn.

In the scheme I proposed you only need to allocate the PUD & PMD
directories and map a huge zero page read only for the whole array at
boot time. When you need a struct page for a given pfn you allocate 2
pages, one for the PMD directory and one for the struct page array for
the given range of pfns. Once the struct page is no longer needed you
free both pages and turn back to the zero huge page.

So you get dynamic allocation and keep the nice pfn<->struct page
mapping working.

>
> As an example, look at the current problems with scaling the
> initialisation for struct pages for large memory machines - 16TB
> machines are taking 10 minutes just to initialise the struct page
> arrays on startup. That's the scale of overhead that static page
> arrays will have for PRAM, whether they are lazily initialised or
> not. IOWs, static page arrays are not scalable, and hence aren't a
> viable long term solution to the PRAM problem.

With the solution I describe above, all you need to initialize is the
PUD & PMD directories to point to a zero huge page. I would think this
should be fast enough even for 1TB: 2^(40 - 12 - 9 - 9) = 2^10, so you
need 1024 PUD and 512K PMD (4M of PUD and 256M of PMD). You can even
directly share the PMD and then have to dynamically allocate 3 pages
(1 for the PMD level, 1 for the PTE level, 1 for the struct page
array), effectively reducing to a static 4M allocation for all PUD.
The rest is dynamically allocated/freed upon usage.

> IMO, we need to be designing around the concept that the filesystem
> manages the pmem space, and the MM subsystem simply uses the block
> mapping information provided to it from the filesystem to decide how
> it references and maps the regions into the user's address space or
> for DMA. The mm subsystem does not manage the pmem space, its
> alignment or how it is allocated to user files. Hence page mappings
> can only be - at best - reactive to what the filesystem does with
> its free space. The mm subsystem already has to query the block
> layer to get mappings on page faults, so it's only a small stretch
> to enhance the DAX mapping request to ask for a large page mapping
> rather than a 4k mapping. If the fs can't do a large page mapping,
> you'll get a 4k aligned mapping back.
>
> What I'm trying to say is that the mapping behaviour needs to be
> designed with the way filesystems and the mm subsystem interact in
> mind, not from a pre-formed "direct IO is bad, we must use the page
> cache" point of view. The filesystem and the mm subsystem must
> co-operate to allow things like large page mappings to be made and
> hence looking at the problem purely from a mm<->pmem device
> perspective as you are ignores an important chunk of the system:
> the part that actually manages the pmem space...

I am all for letting the filesystem manage pmem, but I think having
struct page exposed to the mm allows the mm side to stay ignorant of
what is really behind it. Also, if I could share more code with others
I would be happier :)

Cheers,
Jérôme

2015-06-05 05:44:04

by Dan Williams

[permalink] [raw]
Subject: Re: "Directly mapped persistent memory page cache"

On Tue, May 12, 2015 at 7:47 AM, Jerome Glisse <[email protected]> wrote:
> On Tue, May 12, 2015 at 10:53:47AM +1000, Dave Chinner wrote:
>> On Mon, May 11, 2015 at 11:18:36AM +0200, Ingo Molnar wrote:
>> IMO, we need to be designing around the concept that the filesystem
>> manages the pmem space, and the MM subsystem simply uses the block
>> mapping information provided to it from the filesystem to decide how
>> it references and maps the regions into the user's address space or
>> for DMA. The mm subsystem does not manage the pmem space, its
>> alignment or how it is allocated to user files. Hence page mappings
>> can only be - at best - reactive to what the filesystem does with
>> its free space. The mm subsystem already has to query the block
>> layer to get mappings on page faults, so it's only a small stretch
>> to enhance the DAX mapping request to ask for a large page mapping
>> rather than a 4k mapping. If the fs can't do a large page mapping,
>> you'll get a 4k aligned mapping back.
>>
>> What I'm trying to say is that the mapping behaviour needs to be
>> designed with the way filesystems and the mm subsystem interact in
>> mind, not from a pre-formed "direct IO is bad, we must use the page
>> cache" point of view. The filesystem and the mm subsystem must
>> co-operate to allow things like large page mappings to be made and
>> hence looking at the problem purely from a mm<->pmem device
>> perspective as you are ignores an important chunk of the system:
>> the part that actually manages the pmem space...
>
> I am all for letting the filesystem manage pmem, but I think having
> struct page exposed to the mm allows the mm side to stay ignorant of
> what is really behind it. Also, if I could share more code with others
> I would be happier :)
>

As this thread is directly referencing one of the topics listed for
the Persistent Memory microconference, I do not think it is
unreasonable to shamelessly hijack it to promote Linux Plumbers 2015.
Tomorrow is the deadline for earlybird registration, and the topic
submission tool is now open for submission of this or any other
persistent memory topic.

https://linuxplumbersconf.org/2015/attend/
https://linuxplumbersconf.org/2015/how-to-submit-microconference-discussions-topics/