This version of the series addresses the review comments received,
particularly from Christof.
Thanks to all for review and testing.
The patch adding mm/swap.h got a minor conflict when I rebaesd on
5.17-rc1, suggestion that it could easily get more conflicts in the
future. It might be good if it could land before 5.17 comes out, to
avoid (some of) those conflicts.
I think (Though haven't checked) that all the NFS patch patches
except
NFS: rename nfs_direct_IO and use as ->swap_rw
NFS: swap IO handling is slightly different for O_DIRECT IO
can land independently of the MM patches, and can be moved to the
end of the series. Maybe they could be held until after 5.18-rc1 if we
agree to proceed with these in the next merge window.
Intro from previous series is below.
Thanks,
NeilBrown
swap-over-NFS currently has a variety of problems.
swap writes call generic_write_checks(), which always fails on a swap
file, so it completely fails.
Even without this, various deadlocks are possible - largely due to
improvements in NFS memory allocation (using NOFS instead of ATOMIC)
which weren't tested against swap-out.
NFS is the only filesystem that has supported fs-based swap IO, and it
hasn't worked for several releases, so now is a convenient time to clean
up the swap-via-filesystem interfaces - we cannot break anything !
So the first few patches here clean up and improve various parts of the
swap-via-filesystem code. ->activate_swap() is given a cleaner
interface, a new ->swap_rw is introduced instead of burdening
->direct_IO, etc.
Current swap-to-filesystem code only ever submits single-page reads and
writes. These patches change that to allow multi-page IO when adjacent
requests are submitted. Writes are also changed to be async rather than
sync. This substantially speeds up write throughput for swap-over-NFS.
Some of the NFS patches can land independently of the MM patches. A few
require the MM patches to land first.
---
NeilBrown (23):
MM: create new mm/swap.h header file.
MM: extend block-plugging to cover all swap reads with read-ahead
MM: drop swap_set_page_dirty
MM: move responsibility for setting SWP_FS_OPS to ->swap_activate
MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
MM: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space
MM: perform async writes to SWP_FS_OPS swap-space using ->swap_rw
DOC: update documentation for swap_activate and swap_rw
MM: submit multipage reads for SWP_FS_OPS swap-space
MM: submit multipage write for SWP_FS_OPS swap-space
VFS: Add FMODE_CAN_ODIRECT file flag
NFS: remove IS_SWAPFILE hack
NFS: rename nfs_direct_IO and use as ->swap_rw
NFS: swap IO handling is slightly different for O_DIRECT IO
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
SUNRPC/auth: async tasks mustn't block waiting for memory
SUNRPC/xprt: async tasks mustn't block waiting for memory
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
NFSv4: keep state manager thread active if swap is enabled
NFS: swap-out must always use STABLE writes.
SUNRPC: lock against ->sock changing during sysfs read
Documentation/filesystems/locking.rst | 18 ++-
Documentation/filesystems/vfs.rst | 17 ++-
drivers/block/loop.c | 4 +-
fs/cifs/file.c | 7 +-
fs/fcntl.c | 9 +-
fs/nfs/direct.c | 56 ++++---
fs/nfs/file.c | 39 +++--
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 +++
fs/nfs/nfs4state.c | 39 ++++-
fs/nfs/read.c | 4 -
fs/nfs/write.c | 2 +
fs/open.c | 9 +-
fs/overlayfs/file.c | 13 +-
include/linux/fs.h | 4 +
include/linux/nfs_fs.h | 11 +-
include/linux/nfs_xdr.h | 2 +
include/linux/sunrpc/auth.h | 1 +
include/linux/sunrpc/sched.h | 1 -
include/linux/swap.h | 7 +-
include/linux/writeback.h | 7 +
include/trace/events/sunrpc.h | 1 -
mm/madvise.c | 8 +-
mm/memory.c | 2 +-
mm/page_io.c | 210 ++++++++++++++++++++------
mm/swap.h | 26 +++-
mm/swap_state.c | 33 ++--
mm/swapfile.c | 13 +-
mm/vmscan.c | 38 +++--
net/sunrpc/auth.c | 8 +-
net/sunrpc/auth_gss/auth_gss.c | 6 +-
net/sunrpc/auth_unix.c | 10 +-
net/sunrpc/clnt.c | 7 +-
net/sunrpc/sched.c | 29 ++--
net/sunrpc/sysfs.c | 5 +-
net/sunrpc/xprt.c | 19 +--
net/sunrpc/xprtrdma/transport.c | 10 +-
net/sunrpc/xprtsock.c | 15 +-
38 files changed, 505 insertions(+), 206 deletions(-)
--
Signature
Code that does swap read-ahead uses blk_start_plug() and
blk_finish_plug() to allow lower levels to combine multiple read-ahead
pages into a single request, but calls blk_finish_plug() *before*
submitting the original (non-ahead) read request.
This missed an opportunity to combine read requests.
This patch moves the blk_finish_plug to *after* all the reads.
This will likely combine the primary read with some of the "ahead"
reads, and that may slightly increase the latency of that read, but it
should more than make up for this by making more efficient use of the
storage path.
The patch mostly makes the code look more consistent. Performance
change is unlikely to be noticeable.
Fixes-no-auto-backport: 3fb5c298b04e ("swap: allow swap readahead to be merged")
Signed-off-by: NeilBrown <[email protected]>
---
mm/swap_state.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bb38453425c7..093ecf864200 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -625,6 +625,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address;
+ blk_start_plug(&plug);
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
@@ -638,7 +639,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (end_offset >= si->max)
end_offset = si->max - 1;
- blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = __read_swap_cache_async(
@@ -655,11 +655,12 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
}
put_page(page);
}
- blk_finish_plug(&plug);
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
- return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+ page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+ blk_finish_plug(&plug);
+ return page;
}
int init_swap_address_space(unsigned int type, unsigned long nr_pages)
@@ -800,11 +801,11 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
.win = 1,
};
+ blk_start_plug(&plug);
swap_ra_info(vmf, &ra_info);
if (ra_info.win == 1)
goto skip;
- blk_start_plug(&plug);
for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
i++, pte++) {
pentry = *pte;
@@ -828,11 +829,12 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
}
put_page(page);
}
- blk_finish_plug(&plug);
lru_add_drain();
skip:
- return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
+ page = read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
ra_info.win == 1);
+ blk_finish_plug(&plug);
+ return page;
}
/**
This patch switches swap-out to SWP_FS_OPS swap-spaces to use ->swap_rw
and makes the writes asynchronous, like they are for other swap spaces.
To make it async we need to allocate the kiocb struct from a mempool.
This may block, but won't block as long as waiting for the write to
complete. At most it will wait for some previous swap IO to complete.
Signed-off-by: NeilBrown <[email protected]>
---
mm/page_io.c | 93 +++++++++++++++++++++++++++++++++-------------------------
1 file changed, 53 insertions(+), 40 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index e90a3231f225..6e32ca35d9b6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -303,6 +303,57 @@ int sio_pool_init(void)
return 0;
}
+static void sio_write_complete(struct kiocb *iocb, long ret)
+{
+ struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+ struct page *page = sio->bvec.bv_page;
+
+ if (ret != 0 && ret != PAGE_SIZE) {
+ /*
+ * In the case of swap-over-nfs, this can be a
+ * temporary failure if the system has limited
+ * memory for allocating transmit buffers.
+ * Mark the page dirty and avoid
+ * folio_rotate_reclaimable but rate-limit the
+ * messages but do not flag PageError like
+ * the normal direct-to-bio case as it could
+ * be temporary.
+ */
+ set_page_dirty(page);
+ ClearPageReclaim(page);
+ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
+ ret, page_file_offset(page));
+ } else
+ count_vm_event(PSWPOUT);
+ end_page_writeback(page);
+ mempool_free(sio, sio_pool);
+}
+
+static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
+{
+ struct swap_iocb *sio;
+ struct swap_info_struct *sis = page_swap_info(page);
+ struct file *swap_file = sis->swap_file;
+ struct address_space *mapping = swap_file->f_mapping;
+ struct iov_iter from;
+ int ret;
+
+ set_page_writeback(page);
+ unlock_page(page);
+ sio = mempool_alloc(sio_pool, GFP_NOIO);
+ init_sync_kiocb(&sio->iocb, swap_file);
+ sio->iocb.ki_complete = sio_write_complete;
+ sio->iocb.ki_pos = page_file_offset(page);
+ sio->bvec.bv_page = page;
+ sio->bvec.bv_len = PAGE_SIZE;
+ sio->bvec.bv_offset = 0;
+ iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+ ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+ if (ret != -EIOCBQUEUED)
+ sio_write_complete(&sio->iocb, ret);
+ return ret;
+}
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
bio_end_io_t end_write_func)
{
@@ -311,46 +362,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
struct swap_info_struct *sis = page_swap_info(page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
- if (data_race(sis->flags & SWP_FS_OPS)) {
- struct kiocb kiocb;
- struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
- struct bio_vec bv = {
- .bv_page = page,
- .bv_len = PAGE_SIZE,
- .bv_offset = 0
- };
- struct iov_iter from;
-
- iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
- init_sync_kiocb(&kiocb, swap_file);
- kiocb.ki_pos = page_file_offset(page);
-
- set_page_writeback(page);
- unlock_page(page);
- ret = mapping->a_ops->direct_IO(&kiocb, &from);
- if (ret == PAGE_SIZE) {
- count_vm_event(PSWPOUT);
- ret = 0;
- } else {
- /*
- * In the case of swap-over-nfs, this can be a
- * temporary failure if the system has limited
- * memory for allocating transmit buffers.
- * Mark the page dirty and avoid
- * folio_rotate_reclaimable but rate-limit the
- * messages but do not flag PageError like
- * the normal direct-to-bio case as it could
- * be temporary.
- */
- set_page_dirty(page);
- ClearPageReclaim(page);
- pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
- page_file_offset(page));
- }
- end_page_writeback(page);
- return ret;
- }
+ if (data_race(sis->flags & SWP_FS_OPS))
+ return swap_writepage_fs(page, wbc);
ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
if (!ret) {
swap_readpage() is given one page at a time, but maybe called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y. Consequently all swap reads over NFS
are single page reads.
With this patch we pass in a pointer-to-pointer when swap_readpage can
store state between calls - much like the effect of blk_plug. After
calling swap_readpage() some number of times, the state will be passed
to swap_read_unplug() which can submit the combined request.
Some caller currently call blk_finish_plug() *before* the final call to
swap_readpage(), so the last page cannot be included. This patch moves
blk_finish_plug() to after the last call, and calls swap_read_unplug()
there too.
Signed-off-by: NeilBrown <[email protected]>
---
mm/madvise.c | 8 +++-
mm/memory.c | 2 +
mm/page_io.c | 102 +++++++++++++++++++++++++++++++++++--------------------
mm/swap.h | 16 +++++++--
mm/swap_state.c | 19 +++++++---
5 files changed, 98 insertions(+), 49 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 1ee4b7583379..2b1ab30af141 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -225,6 +225,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
pte_t *orig_pte;
struct vm_area_struct *vma = walk->private;
unsigned long index;
+ struct swap_iocb *splug = NULL;
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
return 0;
@@ -246,10 +247,11 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
continue;
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index, false);
+ vma, index, false, &splug);
if (page)
put_page(page);
}
+ swap_read_unplug(splug);
return 0;
}
@@ -265,6 +267,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
struct page *page;
+ struct swap_iocb *splug = NULL;
rcu_read_lock();
xas_for_each(&xas, page, end_index) {
@@ -277,13 +280,14 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
swap = radix_to_swp_entry(page);
page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0, false);
+ NULL, 0, false, &splug);
if (page)
put_page(page);
rcu_read_lock();
}
rcu_read_unlock();
+ swap_read_unplug(splug);
lru_add_drain(); /* Push any new pages onto the LRU now */
}
diff --git a/mm/memory.c b/mm/memory.c
index d25372340107..8bd18c54eaa4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3559,7 +3559,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* To provide entry to swap_readpage() */
set_page_private(page, entry.val);
- swap_readpage(page, true);
+ swap_readpage(page, true, NULL);
set_page_private(page, 0);
}
} else {
diff --git a/mm/page_io.c b/mm/page_io.c
index 6e32ca35d9b6..bcf655d650c8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -286,7 +286,8 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
struct swap_iocb {
struct kiocb iocb;
- struct bio_vec bvec;
+ struct bio_vec bvec[SWAP_CLUSTER_MAX];
+ int pages;
};
static mempool_t *sio_pool;
@@ -306,7 +307,7 @@ int sio_pool_init(void)
static void sio_write_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
- struct page *page = sio->bvec.bv_page;
+ struct page *page = sio->bvec[0].bv_page;
if (ret != 0 && ret != PAGE_SIZE) {
/*
@@ -344,10 +345,10 @@ static int swap_writepage_fs(struct page *page, struct writeback_control *wbc)
init_sync_kiocb(&sio->iocb, swap_file);
sio->iocb.ki_complete = sio_write_complete;
sio->iocb.ki_pos = page_file_offset(page);
- sio->bvec.bv_page = page;
- sio->bvec.bv_len = PAGE_SIZE;
- sio->bvec.bv_offset = 0;
- iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+ sio->bvec[0].bv_page = page;
+ sio->bvec[0].bv_len = PAGE_SIZE;
+ sio->bvec[0].bv_offset = 0;
+ iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
if (ret != -EIOCBQUEUED)
sio_write_complete(&sio->iocb, ret);
@@ -390,46 +391,60 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
static void sio_read_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
- struct page *page = sio->bvec.bv_page;
-
- if (ret != 0 && ret != PAGE_SIZE) {
- SetPageError(page);
- ClearPageUptodate(page);
- pr_alert_ratelimited("Read-error on swap-device\n");
- } else {
- SetPageUptodate(page);
- count_vm_event(PSWPIN);
+ int p;
+
+ for (p = 0; p < sio->pages; p++) {
+ struct page *page = sio->bvec[p].bv_page;
+ if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
+ SetPageError(page);
+ ClearPageUptodate(page);
+ pr_alert_ratelimited("Read-error on swap-device\n");
+ } else {
+ SetPageUptodate(page);
+ count_vm_event(PSWPIN);
+ }
+ unlock_page(page);
}
- unlock_page(page);
mempool_free(sio, sio_pool);
}
-static int swap_readpage_fs(struct page *page)
+static void swap_readpage_fs(struct page *page,
+ struct swap_iocb **plug)
{
struct swap_info_struct *sis = page_swap_info(page);
- struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
- struct iov_iter from;
- struct swap_iocb *sio;
+ struct swap_iocb *sio = NULL;
loff_t pos = page_file_offset(page);
- int ret;
-
- sio = mempool_alloc(sio_pool, GFP_KERNEL);
- init_sync_kiocb(&sio->iocb, swap_file);
- sio->iocb.ki_pos = pos;
- sio->iocb.ki_complete = sio_read_complete;
- sio->bvec.bv_page = page;
- sio->bvec.bv_len = PAGE_SIZE;
- sio->bvec.bv_offset = 0;
- iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
- ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
- if (ret != -EIOCBQUEUED)
- sio_read_complete(&sio->iocb, ret);
- return ret;
+ if (*plug)
+ sio = *plug;
+ if (sio) {
+ if (sio->iocb.ki_filp != sis->swap_file ||
+ sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+ swap_read_unplug(sio);
+ sio = NULL;
+ }
+ }
+ if (!sio) {
+ sio = mempool_alloc(sio_pool, GFP_KERNEL);
+ init_sync_kiocb(&sio->iocb, sis->swap_file);
+ sio->iocb.ki_pos = pos;
+ sio->iocb.ki_complete = sio_read_complete;
+ sio->pages = 0;
+ }
+ sio->bvec[sio->pages].bv_page = page;
+ sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+ sio->bvec[sio->pages].bv_offset = 0;
+ sio->pages += 1;
+ if (sio->pages == ARRAY_SIZE(sio->bvec) || !plug) {
+ swap_read_unplug(sio);
+ sio = NULL;
+ }
+ if (plug)
+ *plug = sio;
}
-int swap_readpage(struct page *page, bool synchronous)
+int swap_readpage(struct page *page, bool synchronous,
+ struct swap_iocb **plug)
{
struct bio *bio;
int ret = 0;
@@ -455,7 +470,7 @@ int swap_readpage(struct page *page, bool synchronous)
}
if (data_race(sis->flags & SWP_FS_OPS)) {
- ret = swap_readpage_fs(page);
+ swap_readpage_fs(page, plug);
goto out;
}
@@ -507,3 +522,16 @@ int swap_readpage(struct page *page, bool synchronous)
delayacct_swapin_end();
return ret;
}
+
+void __swap_read_unplug(struct swap_iocb *sio)
+{
+ struct iov_iter from;
+ struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+ int ret;
+
+ iov_iter_bvec(&from, READ, sio->bvec, sio->pages,
+ PAGE_SIZE * sio->pages);
+ ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+ if (ret != -EIOCBQUEUED)
+ sio_read_complete(&sio->iocb, ret);
+}
diff --git a/mm/swap.h b/mm/swap.h
index e8ee995cf8d8..0c79b2478f3f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -4,7 +4,15 @@
/* linux/mm/page_io.c */
int sio_pool_init(void);
-int swap_readpage(struct page *page, bool do_poll);
+struct swap_iocb;
+int swap_readpage(struct page *page, bool do_poll,
+ struct swap_iocb **plug);
+void __swap_read_unplug(struct swap_iocb *plug);
+static inline void swap_read_unplug(struct swap_iocb *plug)
+{
+ if (unlikely(plug))
+ __swap_read_unplug(plug);
+}
int swap_writepage(struct page *page, struct writeback_control *wbc);
void end_swap_bio_write(struct bio *bio);
int __swap_writepage(struct page *page, struct writeback_control *wbc,
@@ -38,7 +46,8 @@ struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
struct page *read_swap_cache_async(swp_entry_t, gfp_t,
struct vm_area_struct *vma,
unsigned long addr,
- bool do_poll);
+ bool do_poll,
+ struct swap_iocb **plug);
struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
struct vm_area_struct *vma,
unsigned long addr,
@@ -53,7 +62,8 @@ static inline unsigned int page_swap_flags(struct page *page)
return page_swap_info(page)->flags;
}
#else /* CONFIG_SWAP */
-static inline int swap_readpage(struct page *page, bool do_poll)
+static inline int swap_readpage(struct page *page, bool do_poll,
+ struct swap_iocb **plug);
{
return 0;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d541594be1c3..5cb2c75fa247 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -520,14 +520,16 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* the swap entry is no longer in use.
*/
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_area_struct *vma, unsigned long addr, bool do_poll)
+ struct vm_area_struct *vma,
+ unsigned long addr, bool do_poll,
+ struct swap_iocb **plug)
{
bool page_was_allocated;
struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
vma, addr, &page_was_allocated);
if (page_was_allocated)
- swap_readpage(retpage, do_poll);
+ swap_readpage(retpage, do_poll, plug);
return retpage;
}
@@ -621,6 +623,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long mask;
struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
+ struct swap_iocb *splug = NULL;
bool do_poll = true, page_allocated;
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address;
@@ -647,7 +650,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!page)
continue;
if (page_allocated) {
- swap_readpage(page, false);
+ swap_readpage(page, false, &splug);
if (offset != entry_offset) {
SetPageReadahead(page);
count_vm_event(SWAP_RA);
@@ -658,8 +661,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
- page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+ page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll,
+ &splug);
blk_finish_plug(&plug);
+ swap_read_unplug(splug);
return page;
}
@@ -791,6 +796,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
struct vm_fault *vmf)
{
struct blk_plug plug;
+ struct swap_iocb *splug = NULL;
struct vm_area_struct *vma = vmf->vma;
struct page *page;
pte_t *pte, pentry;
@@ -821,7 +827,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
if (!page)
continue;
if (page_allocated) {
- swap_readpage(page, false);
+ swap_readpage(page, false, &splug);
if (i != ra_info.offset) {
SetPageReadahead(page);
count_vm_event(SWAP_RA);
@@ -832,8 +838,9 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
lru_add_drain();
skip:
page = read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
- ra_info.win == 1);
+ ra_info.win == 1, &splug);
blk_finish_plug(&plug);
+ swap_read_unplug(splug);
return page;
}
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.
mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running. If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.
rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.
Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/sched.c | 4 +++-
net/sunrpc/xprtrdma/transport.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index e2c835482791..d5b6e897f5a5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1023,8 +1023,10 @@ int rpc_malloc(struct rpc_task *task)
struct rpc_buffer *buf;
gfp_t gfp = GFP_NOFS;
+ if (RPC_IS_ASYNC(task))
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
if (RPC_IS_SWAPPER(task))
- gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ gfp |= __GFP_MEMALLOC;
size += sizeof(struct rpc_buffer);
if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 16e5696314a4..a52277115500 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -574,8 +574,10 @@ xprt_rdma_allocate(struct rpc_task *task)
gfp_t flags;
flags = RPCRDMA_DEF_GFP;
+ if (RPC_IS_ASYNC(task))
+ flags = GFP_NOWAIT | __GFP_NOWARN;
if (RPC_IS_SWAPPER(task))
- flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ flags |= __GFP_MEMALLOC;
if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
flags))
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything. So it
must not block waiting for memory.
mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running. If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.
lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks. So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory. If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again. HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.
Signed-off-by: NeilBrown <[email protected]>
---
include/linux/sunrpc/auth.h | 1 +
net/sunrpc/auth.c | 6 +++++-
net/sunrpc/auth_gss/auth_gss.c | 6 +++++-
net/sunrpc/auth_unix.c | 10 ++++++++--
net/sunrpc/clnt.c | 3 +++
5 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/include/linux/sunrpc/auth.h b/include/linux/sunrpc/auth.h
index 98da816b5fc2..3e6ce288a7fc 100644
--- a/include/linux/sunrpc/auth.h
+++ b/include/linux/sunrpc/auth.h
@@ -99,6 +99,7 @@ struct rpc_auth_create_args {
/* Flags for rpcauth_lookupcred() */
#define RPCAUTH_LOOKUP_NEW 0x01 /* Accept an uninitialised cred */
+#define RPCAUTH_LOOKUP_ASYNC 0x02 /* Don't block waiting for memory */
/*
* Client authentication ops
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index a9f0d17fdb0d..6bfa19f9fa6a 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -615,6 +615,8 @@ rpcauth_bind_root_cred(struct rpc_task *task, int lookupflags)
};
struct rpc_cred *ret;
+ if (RPC_IS_ASYNC(task))
+ lookupflags |= RPCAUTH_LOOKUP_ASYNC;
ret = auth->au_ops->lookup_cred(auth, &acred, lookupflags);
put_cred(acred.cred);
return ret;
@@ -631,6 +633,8 @@ rpcauth_bind_machine_cred(struct rpc_task *task, int lookupflags)
if (!acred.principal)
return NULL;
+ if (RPC_IS_ASYNC(task))
+ lookupflags |= RPCAUTH_LOOKUP_ASYNC;
return auth->au_ops->lookup_cred(auth, &acred, lookupflags);
}
@@ -654,7 +658,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
};
if (flags & RPC_TASK_ASYNC)
- lookupflags |= RPCAUTH_LOOKUP_NEW;
+ lookupflags |= RPCAUTH_LOOKUP_NEW | RPCAUTH_LOOKUP_ASYNC;
if (task->tk_op_cred)
/* Task must use exactly this rpc_cred */
new = get_rpccred(task->tk_op_cred);
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 5f42aa5fc612..df72d6301f78 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1341,7 +1341,11 @@ gss_hash_cred(struct auth_cred *acred, unsigned int hashbits)
static struct rpc_cred *
gss_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
{
- return rpcauth_lookup_credcache(auth, acred, flags, GFP_NOFS);
+ gfp_t gfp = GFP_NOFS;
+
+ if (flags & RPCAUTH_LOOKUP_ASYNC)
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
+ return rpcauth_lookup_credcache(auth, acred, flags, gfp);
}
static struct rpc_cred *
diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c
index e7df1f782b2e..e5819265dd1b 100644
--- a/net/sunrpc/auth_unix.c
+++ b/net/sunrpc/auth_unix.c
@@ -43,8 +43,14 @@ unx_destroy(struct rpc_auth *auth)
static struct rpc_cred *
unx_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
{
- struct rpc_cred *ret = mempool_alloc(unix_pool, GFP_NOFS);
-
+ gfp_t gfp = GFP_NOFS;
+ struct rpc_cred *ret;
+
+ if (flags & RPCAUTH_LOOKUP_ASYNC)
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
+ ret = mempool_alloc(unix_pool, gfp);
+ if (!ret)
+ return ERR_PTR(-ENOMEM);
rpcauth_init_cred(ret, acred, auth, &unix_credops);
ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE;
return ret;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index a312ea2bc440..238b2ef5491f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1745,6 +1745,9 @@ call_refreshresult(struct rpc_task *task)
task->tk_cred_retry--;
trace_rpc_retry_refresh_status(task);
return;
+ case -ENOMEM:
+ rpc_delay(task, HZ >> 4);
+ return;
}
trace_rpc_refresh_status(task);
rpc_call_rpcerror(task, status);
If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.
This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 15 ++++++++++++---
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 ++++++++++++++++++++
fs/nfs/nfs4state.c | 39 +++++++++++++++++++++++++++++++++------
include/linux/nfs_xdr.h | 2 ++
net/sunrpc/clnt.c | 2 ++
6 files changed, 70 insertions(+), 9 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 04ba56f223d3..ceacae8e7a38 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -486,8 +486,9 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
unsigned long blocks;
long long isize;
int ret;
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
@@ -508,14 +509,22 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
}
*span = sis->pages;
sis->flags |= SWP_FS_OPS;
+
+ if (cl->rpc_ops->enable_swap)
+ cl->rpc_ops->enable_swap(inode);
+
return ret;
}
static void nfs_swap_deactivate(struct file *file)
{
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;
rpc_clnt_swap_deactivate(clnt);
+ if (cl->rpc_ops->disable_swap)
+ cl->rpc_ops->disable_swap(file_inode(file));
}
const struct address_space_operations nfs_file_aops = {
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index ed5eaca6801e..8a9ce0f42efd 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -42,6 +42,7 @@ enum nfs4_client_state {
NFS4CLNT_LEASE_MOVED,
NFS4CLNT_DELEGATION_EXPIRED,
NFS4CLNT_RUN_MANAGER,
+ NFS4CLNT_MANAGER_AVAILABLE,
NFS4CLNT_RECALL_RUNNING,
NFS4CLNT_RECALL_ANY_LAYOUT_READ,
NFS4CLNT_RECALL_ANY_LAYOUT_RW,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ee3bc79f6ca3..ab6382f9cbf0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10347,6 +10347,24 @@ static ssize_t nfs4_listxattr(struct dentry *dentry, char *list, size_t size)
return error + error2 + error3;
}
+static void nfs4_enable_swap(struct inode *inode)
+{
+ /* The state manager thread must always be running.
+ * It will notice the client is a swapper, and stay put.
+ */
+ struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+
+ nfs4_schedule_state_manager(clp);
+}
+
+static void nfs4_disable_swap(struct inode *inode)
+{
+ /* The state manager thread will now exit once it is
+ * woken.
+ */
+ wake_up_var(&NFS_SERVER(inode)->nfs_client->cl_state);
+}
+
static const struct inode_operations nfs4_dir_inode_operations = {
.create = nfs_create,
.lookup = nfs_lookup,
@@ -10423,6 +10441,8 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
.free_client = nfs4_free_client,
.create_server = nfs4_create_server,
.clone_server = nfs_clone_server,
+ .enable_swap = nfs4_enable_swap,
+ .disable_swap = nfs4_disable_swap,
};
static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index d88b779f9dd0..05d53c8a2ddb 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1205,10 +1205,17 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
{
struct task_struct *task;
char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;
set_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state);
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+ if (test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state) != 0) {
+ wake_up_var(&clp->cl_state);
return;
+ }
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
__module_get(THIS_MODULE);
refcount_inc(&clp->cl_count);
@@ -1224,6 +1231,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
printk(KERN_ERR "%s: kthread_run: %ld\n",
__func__, PTR_ERR(task));
nfs4_clear_state_manager_bit(clp);
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
nfs_put_client(clp);
module_put(THIS_MODULE);
}
@@ -2665,11 +2673,8 @@ static void nfs4_state_manager(struct nfs_client *clp)
clear_bit(NFS4CLNT_RECALL_RUNNING, &clp->cl_state);
}
- /* Did we race with an attempt to give us more work? */
- if (!test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
- return;
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
- return;
+ return;
+
} while (refcount_read(&clp->cl_count) > 1 && !signalled());
goto out_drain;
@@ -2689,9 +2694,31 @@ static void nfs4_state_manager(struct nfs_client *clp)
static int nfs4_run_state_manager(void *ptr)
{
struct nfs_client *clp = ptr;
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;
allow_signal(SIGKILL);
+again:
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
nfs4_state_manager(clp);
+ if (atomic_read(&cl->cl_swapper)) {
+ wait_var_event_interruptible(&clp->cl_state,
+ test_bit(NFS4CLNT_RUN_MANAGER,
+ &clp->cl_state));
+ if (atomic_read(&cl->cl_swapper) &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
+ goto again;
+ /* Either no longer a swapper, or were signalled */
+ }
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
+
+ if (refcount_read(&clp->cl_count) > 1 && !signalled() &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state) &&
+ !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state))
+ goto again;
+
nfs_put_client(clp);
module_put_and_kthread_exit(0);
return 0;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 967a0098f0a9..04cf3a8fb949 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1795,6 +1795,8 @@ struct nfs_rpc_ops {
struct nfs_server *(*create_server)(struct fs_context *);
struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
struct nfs_fattr *, rpc_authflavor_t);
+ void (*enable_swap)(struct inode *inode);
+ void (*disable_swap)(struct inode *inode);
};
/*
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cb76fbea3ed5..4cb403a0f334 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -3066,6 +3066,8 @@ rpc_clnt_swap_activate_callback(struct rpc_clnt *clnt,
int
rpc_clnt_swap_activate(struct rpc_clnt *clnt)
{
+ while (clnt != clnt->cl_parent)
+ clnt = clnt->cl_parent;
if (atomic_inc_return(&clnt->cl_swapper) == 1)
return rpc_clnt_iterate_for_each_xprt(clnt,
rpc_clnt_swap_activate_callback, NULL);
On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> Code that does swap read-ahead uses blk_start_plug() and
> blk_finish_plug() to allow lower levels to combine multiple read-ahead
> pages into a single request, but calls blk_finish_plug() *before*
> submitting the original (non-ahead) read request.
> This missed an opportunity to combine read requests.
>
> This patch moves the blk_finish_plug to *after* all the reads.
> This will likely combine the primary read with some of the "ahead"
> reads, and that may slightly increase the latency of that read, but it
> should more than make up for this by making more efficient use of the
> storage path.
>
> The patch mostly makes the code look more consistent. Performance
> change is unlikely to be noticeable.
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
> Fixes-no-auto-backport: 3fb5c298b04e ("swap: allow swap readahead to be merged")
Is this really a thing?
On Mon, 24 Jan 2022 at 03:52, NeilBrown <[email protected]> wrote:
>
> swap_readpage() is given one page at a time, but maybe called repeatedly
> in succession.
> For block-device swapspace, the blk_plug functionality allows the
> multiple pages to be combined together at lower layers.
> That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> only active when CONFIG_BLOCK=y. Consequently all swap reads over NFS
> are single page reads.
>
> With this patch we pass in a pointer-to-pointer when swap_readpage can
> store state between calls - much like the effect of blk_plug. After
> calling swap_readpage() some number of times, the state will be passed
> to swap_read_unplug() which can submit the combined request.
>
> Some caller currently call blk_finish_plug() *before* the final call to
> swap_readpage(), so the last page cannot be included. This patch moves
> blk_finish_plug() to after the last call, and calls swap_read_unplug()
> there too.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> mm/madvise.c | 8 +++-
> mm/memory.c | 2 +
> mm/page_io.c | 102 +++++++++++++++++++++++++++++++++++--------------------
> mm/swap.h | 16 +++++++--
> mm/swap_state.c | 19 +++++++---
> 5 files changed, 98 insertions(+), 49 deletions(-)
>
...
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 6e32ca35d9b6..bcf655d650c8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -390,46 +391,60 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
> static void sio_read_complete(struct kiocb *iocb, long ret)
> {
> struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
> - struct page *page = sio->bvec.bv_page;
> -
> - if (ret != 0 && ret != PAGE_SIZE) {
> - SetPageError(page);
> - ClearPageUptodate(page);
> - pr_alert_ratelimited("Read-error on swap-device\n");
> - } else {
> - SetPageUptodate(page);
> - count_vm_event(PSWPIN);
> + int p;
> +
> + for (p = 0; p < sio->pages; p++) {
> + struct page *page = sio->bvec[p].bv_page;
> + if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
> + SetPageError(page);
> + ClearPageUptodate(page);
> + pr_alert_ratelimited("Read-error on swap-device\n");
> + } else {
> + SetPageUptodate(page);
> + count_vm_event(PSWPIN);
> + }
> + unlock_page(page);
> }
> - unlock_page(page);
> mempool_free(sio, sio_pool);
> }
Trivial: on success, could be single call to count_vm_events(PSWPIN,
sio->pages).
Similar comment for PSWPOUT in sio_write_complete()
>
> -static int swap_readpage_fs(struct page *page)
> +static void swap_readpage_fs(struct page *page,
> + struct swap_iocb **plug)
> {
> struct swap_info_struct *sis = page_swap_info(page);
> - struct file *swap_file = sis->swap_file;
> - struct address_space *mapping = swap_file->f_mapping;
> - struct iov_iter from;
> - struct swap_iocb *sio;
> + struct swap_iocb *sio = NULL;
> loff_t pos = page_file_offset(page);
> - int ret;
> -
> - sio = mempool_alloc(sio_pool, GFP_KERNEL);
> - init_sync_kiocb(&sio->iocb, swap_file);
> - sio->iocb.ki_pos = pos;
> - sio->iocb.ki_complete = sio_read_complete;
> - sio->bvec.bv_page = page;
> - sio->bvec.bv_len = PAGE_SIZE;
> - sio->bvec.bv_offset = 0;
>
> - iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
> - ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
> - if (ret != -EIOCBQUEUED)
> - sio_read_complete(&sio->iocb, ret);
> - return ret;
> + if (*plug)
> + sio = *plug;
'plug' can be NULL when called from do_swap_page();
if (plug && *plug)
Cheers,
Mark
On Thu, 27 Jan 2022, Hugh Dickins wrote:
> On Thu, 27 Jan 2022, NeilBrown wrote:
> > On Mon, 24 Jan 2022, Christoph Hellwig wrote:
> > > On Mon, Jan 24, 2022 at 02:48:32PM +1100, NeilBrown wrote:
> > > > Code that does swap read-ahead uses blk_start_plug() and
> > > > blk_finish_plug() to allow lower levels to combine multiple read-ahead
> > > > pages into a single request, but calls blk_finish_plug() *before*
> > > > submitting the original (non-ahead) read request.
> > > > This missed an opportunity to combine read requests.
>
> No, you're misunderstanding there. All the necessary reads are issued
> within the loop, between the plug and unplug: it does not skip over
> the target page in the loop, but issues its read along with the rest.
>
> But it has not kept any of those pages locked, nor even kept any
> refcounts raised: so at the end has to look up the target page again
> with the final read_swap_cache_async() (which also copes with the
> highly unlikely case that the page got swapped out again meanwhile).
>
....
>
> I don't suppose your patch does any actual harm (beyond propagating a
> misunderstanding), but it's certainly not a fix, and I think should
> simply be dropped from the series.
Thanks - I had missed that. The code is correct, but looks wrong (to
me).
I've dropped the patch, but added a comment when I add
"swap_read_unplug()" to explain while plugging isn't needed for that
final read_swap_cache_async().
>
> (But please don't expect any comment from me on the rest:
> SWP_FS_OPS has always been beyond my understanding.)
:-)
Thanks,
NeilBrown
Hi Neil,
On Mon, Jan 24, 2022 at 5:40 PM NeilBrown <[email protected]> wrote:
> swap-over-NFS currently has a variety of problems.
>
> swap writes call generic_write_checks(), which always fails on a swap
> file, so it completely fails.
> Even without this, various deadlocks are possible - largely due to
> improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> which weren't tested against swap-out.
>
> NFS is the only filesystem that has supported fs-based swap IO, and it
> hasn't worked for several releases, so now is a convenient time to clean
> up the swap-via-filesystem interfaces - we cannot break anything !
>
> So the first few patches here clean up and improve various parts of the
> swap-via-filesystem code. ->activate_swap() is given a cleaner
> interface, a new ->swap_rw is introduced instead of burdening
> ->direct_IO, etc.
>
> Current swap-to-filesystem code only ever submits single-page reads and
> writes. These patches change that to allow multi-page IO when adjacent
> requests are submitted. Writes are also changed to be async rather than
> sync. This substantially speeds up write throughput for swap-over-NFS.
>
> Some of the NFS patches can land independently of the MM patches. A few
> require the MM patches to land first.
Thanks for your series!
Swap over NFS was indeed broken last time I tried[1], but with your
series, it's working again on arm32 (RZ/A1 with 32 MiB of RAM, 100Mbps
Ethernet and Debian 9 nfsroot). My system was exercised using "apt
update", and the subsequent "apt upgrade" is still running, though
(it took more than 6 hours to build the apt dependency tree, now it's
trying hard to create a list of packages...).
Tested-by: Geert Uytterhoeven <[email protected]>
BTW, I think you do want to run scripts/checkpatch.pl on your series,
and improve it by fixing a few of the reported warnings (function
definition arguments should also have an identifier name, missing
data_race() comment, missing SPDX-License-Identifier tag).
[1] https://lore.kernel.org/all/[email protected]/
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds