2021-12-16 23:52:17

by NeilBrown

Subject: [PATCH 00/18 V2] Repair SWAP-over-NFS

swap-over-NFS currently has a variety of problems.

Swap writes call generic_write_checks(), which always fails on a swap
file, so swap-out fails completely.
Even without this, various deadlocks are possible - largely due to
improvements in NFS memory allocation (using NOFS instead of ATOMIC)
which weren't tested against swap-out.

NFS is the only filesystem that has supported fs-based swap IO, and it
hasn't worked for several releases, so now is a convenient time to clean
up the swap-via-filesystem interfaces - we cannot break anything!

So the first few patches here clean up and improve various parts of the
swap-via-filesystem code. ->swap_activate() is given a cleaner
interface, a new ->swap_rw is introduced instead of burdening
->direct_IO, etc.
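
For reference, the reshaped hooks as they appear in the include/linux/fs.h
hunk of patch 01 - shown here only for orientation:

    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                         sector_t *span);
    void (*swap_deactivate)(struct file *file);
    int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);

Unlike ->direct_IO, ->swap_rw is only ever called for swap IO, so it can
assume no allocation is needed and ignore coherence with the page cache.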

Current swap-to-filesystem code only ever submits single-page reads and
writes. These patches change that to allow multi-page IO when adjacent
requests are submitted. Writes are also changed to be async rather than
sync. This substantially speeds up write throughput for swap-over-NFS.

Some of the NFS patches can land independently of the MM patches. A few
require the MM patches to land first.

Thanks,
NeilBrown


---

NeilBrown (18):
Structural cleanup for filesystem-based swap
MM: create new mm/swap.h header file.
MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
MM: perform async writes to SWP_FS_OPS swap-space
MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
MM: submit multipage reads for SWP_FS_OPS swap-space
MM: submit multipage write for SWP_FS_OPS swap-space
MM: Add AS_CAN_DIO mapping flag
NFS: rename nfs_direct_IO and use as ->swap_rw
NFS: swap IO handling is slightly different for O_DIRECT IO
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
SUNRPC/auth: async tasks mustn't block waiting for memory
SUNRPC/xprt: async tasks mustn't block waiting for memory
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
NFSv4: keep state manager thread active if swap is enabled
NFS: swap-out must always use STABLE writes.


drivers/block/loop.c | 4 +-
fs/fcntl.c | 5 +-
fs/inode.c | 3 +
fs/nfs/direct.c | 56 ++++++----
fs/nfs/file.c | 25 +++--
fs/nfs/inode.c | 1 +
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 ++++
fs/nfs/nfs4state.c | 39 ++++++-
fs/nfs/read.c | 4 -
fs/nfs/write.c | 2 +
fs/open.c | 2 +-
fs/overlayfs/file.c | 10 +-
include/linux/fs.h | 2 +-
include/linux/nfs_fs.h | 11 +-
include/linux/nfs_xdr.h | 2 +
include/linux/pagemap.h | 3 +-
include/linux/sunrpc/auth.h | 1 +
include/linux/sunrpc/sched.h | 1 -
include/linux/swap.h | 121 --------------------
include/linux/writeback.h | 7 ++
include/trace/events/sunrpc.h | 1 -
mm/madvise.c | 9 +-
mm/memory.c | 3 +-
mm/mincore.c | 1 +
mm/page_alloc.c | 1 +
mm/page_io.c | 189 ++++++++++++++++++++++++++------
mm/shmem.c | 1 +
mm/swap.h | 140 +++++++++++++++++++++++
mm/swap_state.c | 32 ++++--
mm/swapfile.c | 6 +
mm/util.c | 1 +
mm/vmscan.c | 31 +++++-
net/sunrpc/auth.c | 8 +-
net/sunrpc/auth_gss/auth_gss.c | 6 +-
net/sunrpc/auth_unix.c | 10 +-
net/sunrpc/clnt.c | 7 +-
net/sunrpc/sched.c | 29 +++--
net/sunrpc/xprt.c | 19 ++--
net/sunrpc/xprtrdma/transport.c | 10 +-
net/sunrpc/xprtsock.c | 8 ++
41 files changed, 558 insertions(+), 274 deletions(-)
create mode 100644 mm/swap.h




2021-12-16 23:52:23

by NeilBrown

Subject: [PATCH 01/18] Structural cleanup for filesystem-based swap

Linux primarily uses IO to block devices for swap, but can send the IO
requests to a filesystem. This has only ever worked for NFS, and that
hasn't worked for a while due to a lack of testing. This seems like a
good time for some tidy-up before restoring swap-over-NFS functionality.

This patch:
- updates the documentation (both copies!) for swap_activate which
is woefully out-of-date
- introduces a new address_space operation "swap_rw" for swap IO.
The code currently uses ->readpage for reads and ->direct_IO for
writes. The former imposes a limit of one page at a time; the
latter means that direct writes and swap writes are encouraged to
use the same path. While similar, swap can often be simpler as
it can assume that no allocation is needed, and coherence with the
page cache is irrelevant.
- moves the responsibility for setting SWP_FS_OPS to ->swap_activate()
and also requires it to always call add_swap_extent(). This makes
it much easier to find filesystems that require SWP_FS_OPS.
- drops the call to the filesystem for ->set_page_dirty(). These
pages do not belong to the filesystem, and it has no interest
in the dirty status.

Writeout is switched to ->swap_rw, but read-in is not, as that requires
too much change for this patch.

Both cifs and nfs set SWP_FS_OPS but neither provides a swap_rw, so both
will now fail to activate swap. cifs never really tried to provide swap
support as ->direct_IO always returns an error. NFS will be fixed up
in following patches.
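
To make the new contract concrete, here is a minimal sketch of a
->swap_activate() under this patch, modelled on the fs/nfs/file.c hunk below
(the function name and the trimmed error handling are illustrative only):

    static int foo_swap_activate(struct swap_info_struct *sis,
                                 struct file *file, sector_t *span)
    {
            int ret;

            /* refuse to activate swap without a ->swap_rw to do the IO */
            if (!file->f_mapping->a_ops->swap_rw)
                    return -EINVAL;

            /* one extent covering the whole file; returns extent count */
            ret = add_swap_extent(sis, 0, sis->max, 0);
            if (ret < 0)
                    return ret;

            *span = sis->pages;
            sis->flags |= SWP_FS_OPS;       /* route IO via ->swap_rw() */
            return ret;
    }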

Signed-off-by: NeilBrown <[email protected]>
---
Documentation/filesystems/locking.rst | 18 ++++++++++++------
Documentation/filesystems/vfs.rst | 17 ++++++++++++-----
fs/cifs/file.c | 7 ++++++-
fs/nfs/file.c | 17 +++++++++++++++--
include/linux/fs.h | 1 +
include/linux/swap.h | 1 -
mm/page_io.c | 26 ++++++--------------------
mm/swap_state.c | 2 +-
mm/swapfile.c | 10 +++-------
9 files changed, 56 insertions(+), 43 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d36fe79167b3..c2bb753bf688 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -265,8 +265,9 @@ prototypes::
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
- int (*swap_activate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);

locking rules:
All except set_page_dirty and freepage may block
@@ -295,6 +296,7 @@ is_partially_uptodate: yes
error_remove_page: yes
swap_activate: no
swap_deactivate: no
+swap_rw: yes, unlocks
====================== ======================== ========= ===============

->write_begin(), ->write_end() and ->readpage() may be called from
@@ -397,15 +399,19 @@ cleaned, or an error value if not. Note that in order to prevent the page
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.

-->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+->swap_activate() will be called to prepare the given file for swap. It
+should perform any validation and preparation necessary to ensure that
+writes can be performed with minimal memory allocation. It should call
+add_swap_extent(), or the helper iomap_swapfile_activate(), and return
+the number of extents added. If IO should be submitted through
+->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
+directly to the block device ``sis->bdev``.

->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.

+->swap_rw will be called for swap IO if ->swap_activate() set SWP_FS_OPS.
+
file_lock_operations
====================

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..70d7ce335565 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -751,8 +751,9 @@ cache in your filesystem. The following members are defined:
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
- int (*swap_activate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};

``writepage``
@@ -959,15 +960,21 @@ cache in your filesystem. The following members are defined:
unless you have them locked or reference counts increased.

``swap_activate``
- Called when swapon is used on a file to allocate space if
- necessary and pin the block lookup information in memory. A
- return value of zero indicates success, in which case this file
- can be used to back swapspace.
+
+ Called to prepare the given file for swap. It should perform
+ any validation and preparation necessary to ensure that writes
+ can be performed with minimal memory allocation. It should call
+ add_swap_extent(), or the helper iomap_swapfile_activate(), and
+ return the number of extents added. If IO should be submitted
+ through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
+ be submitted directly to the block device ``sis->bdev``.

``swap_deactivate``
Called during swapoff on files where swap_activate was
successful.

+``swap_rw``
+ Called to read or write swap pages when swap_activate() set SWP_FS_OPS.

The File Object
===============
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 9fee3af83a73..50bebf5f15cc 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4943,6 +4943,10 @@ static int cifs_swap_activate(struct swap_info_struct *sis,

cifs_dbg(FYI, "swap activate\n");

+ if (!swap_file->f_mapping->a_ops->swap_rw)
+ /* Cannot support swap */
+ return -EINVAL;
+
spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
isize = inode->i_size;
@@ -4971,7 +4975,8 @@ static int cifs_swap_activate(struct swap_info_struct *sis,
* from reading or writing the file
*/

- return 0;
+ sis->flags |= SWP_FS_OPS;
+ return add_swap_extent(sis, 0, sis->max, 0);
}

static void cifs_swap_deactivate(struct file *file)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 24e7dccce355..0d33c95eefb6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -489,9 +489,14 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
{
unsigned long blocks;
long long isize;
+ int ret;
struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
struct inode *inode = file->f_mapping->host;

+ if (!file->f_mapping->a_ops->swap_rw)
+ /* Cannot support swap */
+ return -EINVAL;
+
spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
isize = inode->i_size;
@@ -501,9 +506,17 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
return -EINVAL;
}

+ ret = rpc_clnt_swap_activate(clnt);
+ if (ret)
+ return ret;
+ ret = add_swap_extent(sis, 0, sis->max, 0);
+ if (ret < 0) {
+ rpc_clnt_swap_deactivate(clnt);
+ return ret;
+ }
*span = sis->pages;
-
- return rpc_clnt_swap_activate(clnt);
+ sis->flags |= SWP_FS_OPS;
+ return ret;
}

static void nfs_swap_deactivate(struct file *file)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbf812ce89a8..deaaf359cc49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -415,6 +415,7 @@ struct address_space_operations {
int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
sector_t *span);
void (*swap_deactivate)(struct file *file);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};

extern const struct address_space_operations empty_aops;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d1ea44b31f19..10b2a92c1aa1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -427,7 +427,6 @@ extern int swap_writepage(struct page *page, struct writeback_control *wbc);
extern void end_swap_bio_write(struct bio *bio);
extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
bio_end_io_t end_write_func);
-extern int swap_set_page_dirty(struct page *page);

int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
unsigned long nr_pages, sector_t start_block);
diff --git a/mm/page_io.c b/mm/page_io.c
index 9725c7e1eeea..cb617a4f59df 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -307,10 +307,9 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

set_page_writeback(page);
unlock_page(page);
- ret = mapping->a_ops->direct_IO(&kiocb, &from);
- if (ret == PAGE_SIZE) {
+ ret = mapping->a_ops->swap_rw(&kiocb, &from);
+ if (ret == 0) {
count_vm_event(PSWPOUT);
- ret = 0;
} else {
/*
* In the case of swap-over-nfs, this can be a
@@ -378,10 +377,11 @@ int swap_readpage(struct page *page, bool synchronous)
}

if (data_race(sis->flags & SWP_FS_OPS)) {
- struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
+ //struct file *swap_file = sis->swap_file;
+ //struct address_space *mapping = swap_file->f_mapping;

- ret = mapping->a_ops->readpage(swap_file, page);
+ /* This needs to use ->swap_rw() */
+ ret = -EINVAL;
if (!ret)
count_vm_event(PSWPIN);
goto out;
@@ -434,17 +434,3 @@ int swap_readpage(struct page *page, bool synchronous)
psi_memstall_leave(&pflags);
return ret;
}
-
-int swap_set_page_dirty(struct page *page)
-{
- struct swap_info_struct *sis = page_swap_info(page);
-
- if (data_race(sis->flags & SWP_FS_OPS)) {
- struct address_space *mapping = sis->swap_file->f_mapping;
-
- VM_BUG_ON_PAGE(!PageSwapCache(page), page);
- return mapping->a_ops->set_page_dirty(page);
- } else {
- return __set_page_dirty_no_writeback(page);
- }
-}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..616eb1d75b35 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -30,7 +30,7 @@
*/
static const struct address_space_operations swap_aops = {
.writepage = swap_writepage,
- .set_page_dirty = swap_set_page_dirty,
+ .set_page_dirty = __set_page_dirty_no_writeback,
#ifdef CONFIG_MIGRATION
.migratepage = migrate_page,
#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e59e08ef46e1..419eacf474c5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2397,13 +2397,9 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)

if (mapping->a_ops->swap_activate) {
ret = mapping->a_ops->swap_activate(sis, swap_file, span);
- if (ret >= 0)
- sis->flags |= SWP_ACTIVATED;
- if (!ret) {
- sis->flags |= SWP_FS_OPS;
- ret = add_swap_extent(sis, 0, sis->max, 0);
- *span = sis->pages;
- }
+ if (ret < 0)
+ return ret;
+ sis->flags |= SWP_ACTIVATED;
return ret;
}




2021-12-16 23:52:30

by NeilBrown

Subject: [PATCH 02/18] MM: create new mm/swap.h header file.

Many functions declared in include/linux/swap.h are only used within mm/

Create a new "mm/swap.h" and move some of these declarations there.
Remove the redundant 'extern' from the function declarations.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/swap.h | 121 -----------------------------------------------
mm/madvise.c | 1
mm/memory.c | 1
mm/mincore.c | 1
mm/page_alloc.c | 1
mm/page_io.c | 1
mm/shmem.c | 1
mm/swap.h | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap_state.c | 1
mm/swapfile.c | 1
mm/util.c | 1
mm/vmscan.c | 1
12 files changed, 139 insertions(+), 121 deletions(-)
create mode 100644 mm/swap.h

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 10b2a92c1aa1..6a0c25c0bb95 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -419,61 +419,18 @@ extern void kswapd_stop(int nid);

#ifdef CONFIG_SWAP

-#include <linux/blk_types.h> /* for bio_end_io_t */
-
-/* linux/mm/page_io.c */
-extern int swap_readpage(struct page *page, bool do_poll);
-extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_write(struct bio *bio);
-extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
- bio_end_io_t end_write_func);
-
int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
unsigned long nr_pages, sector_t start_block);
int generic_swapfile_activate(struct swap_info_struct *, struct file *,
sector_t *);

-/* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT 14
-#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
-extern struct address_space *swapper_spaces[];
-#define swap_address_space(entry) \
- (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
- >> SWAP_ADDRESS_SPACE_SHIFT])
static inline unsigned long total_swapcache_pages(void)
{
return global_node_page_state(NR_SWAPCACHE);
}

-extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *page);
-extern void *get_shadow_from_swap_cache(swp_entry_t entry);
-extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp, void **shadowp);
-extern void __delete_from_swap_cache(struct page *page,
- swp_entry_t entry, void *shadow);
-extern void delete_from_swap_cache(struct page *);
-extern void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end);
-extern void free_swap_cache(struct page *);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
-extern struct page *lookup_swap_cache(swp_entry_t entry,
- struct vm_area_struct *vma,
- unsigned long addr);
-struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
-extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
- struct vm_area_struct *vma, unsigned long addr,
- bool do_poll);
-extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
- struct vm_area_struct *vma, unsigned long addr,
- bool *new_page_allocated);
-extern struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
- struct vm_fault *vmf);
-extern struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
- struct vm_fault *vmf);
-
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
@@ -527,12 +484,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
}

#else /* CONFIG_SWAP */
-
-static inline int swap_readpage(struct page *page, bool do_poll)
-{
- return 0;
-}
-
static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
{
return NULL;
@@ -547,11 +498,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
{
}

-static inline struct address_space *swap_address_space(swp_entry_t entry)
-{
- return NULL;
-}
-
#define get_nr_swap_pages() 0L
#define total_swap_pages 0L
#define total_swapcache_pages() 0UL
@@ -566,14 +512,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
#define free_pages_and_swap_cache(pages, nr) \
release_pages((pages), (nr));

-static inline void free_swap_cache(struct page *page)
-{
-}
-
-static inline void show_swap_cache_info(void)
-{
-}
-
/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
#define free_swap_and_cache(e) is_pfn_swap_entry(e)

@@ -599,65 +537,6 @@ static inline void put_swap_page(struct page *page, swp_entry_t swp)
{
}

-static inline struct page *swap_cluster_readahead(swp_entry_t entry,
- gfp_t gfp_mask, struct vm_fault *vmf)
-{
- return NULL;
-}
-
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
- struct vm_fault *vmf)
-{
- return NULL;
-}
-
-static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
-{
- return 0;
-}
-
-static inline struct page *lookup_swap_cache(swp_entry_t swp,
- struct vm_area_struct *vma,
- unsigned long addr)
-{
- return NULL;
-}
-
-static inline
-struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
-{
- return find_get_page(mapping, index);
-}
-
-static inline int add_to_swap(struct page *page)
-{
- return 0;
-}
-
-static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
-{
- return NULL;
-}
-
-static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp_mask, void **shadowp)
-{
- return -1;
-}
-
-static inline void __delete_from_swap_cache(struct page *page,
- swp_entry_t entry, void *shadow)
-{
-}
-
-static inline void delete_from_swap_cache(struct page *page)
-{
-}
-
-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end)
-{
-}

static inline int page_swapcount(struct page *page)
{
diff --git a/mm/madvise.c b/mm/madvise.c
index 8c927202bbe6..724470773582 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -33,6 +33,7 @@
#include <asm/tlb.h>

#include "internal.h"
+#include "swap.h"

struct madvise_walk_private {
struct mmu_gather *tlb;
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..80bbfd449b40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -85,6 +85,7 @@

#include "pgalloc-track.h"
#include "internal.h"
+#include "swap.h"

#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..f4f627325e12 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -20,6 +20,7 @@
#include <linux/pgtable.h>

#include <linux/uaccess.h>
+#include "swap.h"

static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
unsigned long end, struct mm_walk *walk)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..c0b7a3878801 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -78,6 +78,7 @@
#include "internal.h"
#include "shuffle.h"
#include "page_reporting.h"
+#include "swap.h"

/* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
typedef int __bitwise fpi_t;
diff --git a/mm/page_io.c b/mm/page_io.c
index cb617a4f59df..a9fe5de5dc32 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -25,6 +25,7 @@
#include <linux/psi.h>
#include <linux/uio.h>
#include <linux/sched/task.h>
+#include "swap.h"

void end_swap_bio_write(struct bio *bio)
{
diff --git a/mm/shmem.c b/mm/shmem.c
index 18f93c2d68f1..993b6b3ca20f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -39,6 +39,7 @@
#include <linux/frontswap.h>
#include <linux/fs_parser.h>
#include <linux/swapfile.h>
+#include "swap.h"

static struct vfsmount *shm_mnt;

diff --git a/mm/swap.h b/mm/swap.h
new file mode 100644
index 000000000000..13e72a5023aa
--- /dev/null
+++ b/mm/swap.h
@@ -0,0 +1,129 @@
+
+#ifdef CONFIG_SWAP
+#include <linux/blk_types.h> /* for bio_end_io_t */
+
+/* linux/mm/page_io.c */
+int swap_readpage(struct page *page, bool do_poll);
+int swap_writepage(struct page *page, struct writeback_control *wbc);
+void end_swap_bio_write(struct bio *bio);
+int __swap_writepage(struct page *page, struct writeback_control *wbc,
+ bio_end_io_t end_write_func);
+
+/* linux/mm/swap_state.c */
+/* One swap address space for each 64M swap space */
+#define SWAP_ADDRESS_SPACE_SHIFT 14
+#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
+extern struct address_space *swapper_spaces[];
+#define swap_address_space(entry) \
+ (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
+ >> SWAP_ADDRESS_SPACE_SHIFT])
+
+void show_swap_cache_info(void);
+int add_to_swap(struct page *page);
+void *get_shadow_from_swap_cache(swp_entry_t entry);
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp);
+void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow);
+void delete_from_swap_cache(struct page *);
+void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end);
+void free_swap_cache(struct page *);
+struct page *lookup_swap_cache(swp_entry_t entry,
+ struct vm_area_struct *vma,
+ unsigned long addr);
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
+
+struct page *read_swap_cache_async(swp_entry_t, gfp_t,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ bool do_poll);
+struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ bool *new_page_allocated);
+struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
+ struct vm_fault *vmf);
+struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
+ struct vm_fault *vmf);
+
+#else /* CONFIG_SWAP */
+static inline int swap_readpage(struct page *page, bool do_poll)
+{
+ return 0;
+}
+
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+ return NULL;
+}
+
+static inline void free_swap_cache(struct page *page)
+{
+}
+
+static inline void show_swap_cache_info(void)
+{
+}
+
+static inline struct page *swap_cluster_readahead(swp_entry_t entry,
+ gfp_t gfp_mask, struct vm_fault *vmf)
+{
+ return NULL;
+}
+
+static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
+ struct vm_fault *vmf)
+{
+ return NULL;
+}
+
+static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
+{
+ return 0;
+}
+
+static inline struct page *lookup_swap_cache(swp_entry_t swp,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ return NULL;
+}
+
+static inline
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
+{
+ return find_get_page(mapping, index);
+}
+
+static inline int add_to_swap(struct page *page)
+{
+ return 0;
+}
+
+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+ return NULL;
+}
+
+static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp_mask, void **shadowp)
+{
+ return -1;
+}
+
+static inline void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow)
+{
+}
+
+static inline void delete_from_swap_cache(struct page *page)
+{
+}
+
+static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end)
+{
+}
+
+#endif /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 616eb1d75b35..514b86b05488 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,7 @@
#include <linux/huge_mm.h>
#include <linux/shmem_fs.h>
#include "internal.h"
+#include "swap.h"

/*
* swapper_space is a fiction, retained to simplify the path through
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 419eacf474c5..f23d9ff21cf8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -44,6 +44,7 @@
#include <asm/tlbflush.h>
#include <linux/swapops.h>
#include <linux/swap_cgroup.h>
+#include "swap.h"

static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
diff --git a/mm/util.c b/mm/util.c
index 741ba32a43ac..7e26387be090 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -27,6 +27,7 @@
#include <linux/uaccess.h>

#include "internal.h"
+#include "swap.h"

/**
* kfree_const - conditionally free memory
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..969bcdb4ca80 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
#include <linux/balloon_compaction.h>

#include "internal.h"
+#include "swap.h"

#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>



2021-12-16 23:52:37

by NeilBrown

Subject: [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space

To submit an async read with ->swap_rw() we need to allocate
a structure to hold the kiocb and other details. swap_readpage() cannot
handle transient failure, so create a mempool to provide the structures.
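
(An aside on why a mempool fits here: once sio_pool holds a reserve of
SWAP_CLUSTER_MAX elements, a sleeping allocation from it waits for an element
to be returned rather than failing, so swap_readpage() never sees a transient
allocation failure. Illustrative sketch only - the real code is in the
mm/page_io.c hunk below:)

    struct swap_iocb *sio;

    /* GFP_KERNEL allows reclaim, so mempool_alloc() sleeps rather than
     * returning NULL while the pool is temporarily empty. */
    sio = mempool_alloc(sio_pool, GFP_KERNEL);
    init_sync_kiocb(&sio->iocb, swap_file);
    /* ... fill in sio->bvec, submit via ->swap_rw() ... */
    mempool_free(sio, sio_pool);    /* done from the completion handler */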

Signed-off-by: NeilBrown <[email protected]>
---
mm/page_io.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++------
mm/swap.h | 1 +
mm/swapfile.c | 5 +++++
3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index a9fe5de5dc32..47d7e7866e33 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,6 +283,23 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
#define bio_associate_blkg_from_page(bio, page) do { } while (0)
#endif /* CONFIG_MEMCG && CONFIG_BLK_CGROUP */

+struct swap_iocb {
+ struct kiocb iocb;
+ struct bio_vec bvec;
+};
+static mempool_t *sio_pool;
+
+int sio_pool_init(void)
+{
+ if (!sio_pool)
+ sio_pool = mempool_create_kmalloc_pool(
+ SWAP_CLUSTER_MAX, sizeof(struct swap_iocb));
+ if (sio_pool)
+ return 0;
+ else
+ return -ENOMEM;
+}
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
bio_end_io_t end_write_func)
{
@@ -353,6 +370,23 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return 0;
}

+static void sio_read_complete(struct kiocb *iocb, long ret)
+{
+ struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+ struct page *page = sio->bvec.bv_page;
+
+ if (ret != 0 && ret != PAGE_SIZE) {
+ SetPageError(page);
+ ClearPageUptodate(page);
+ pr_alert_ratelimited("Read-error on swap-device\n");
+ } else {
+ SetPageUptodate(page);
+ count_vm_event(PSWPIN);
+ }
+ unlock_page(page);
+ mempool_free(sio, sio_pool);
+}
+
int swap_readpage(struct page *page, bool synchronous)
{
struct bio *bio;
@@ -378,13 +412,25 @@ int swap_readpage(struct page *page, bool synchronous)
}

if (data_race(sis->flags & SWP_FS_OPS)) {
- //struct file *swap_file = sis->swap_file;
- //struct address_space *mapping = swap_file->f_mapping;
+ struct file *swap_file = sis->swap_file;
+ struct address_space *mapping = swap_file->f_mapping;
+ struct iov_iter from;
+ struct swap_iocb *sio;
+ loff_t pos = page_file_offset(page);
+
+ sio = mempool_alloc(sio_pool, GFP_KERNEL);
+ init_sync_kiocb(&sio->iocb, swap_file);
+ sio->iocb.ki_pos = pos;
+ sio->iocb.ki_complete = sio_read_complete;
+ sio->bvec.bv_page = page;
+ sio->bvec.bv_len = PAGE_SIZE;
+ sio->bvec.bv_offset = 0;
+
+ iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
+ ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+ if (ret != -EIOCBQUEUED)
+ sio_read_complete(&sio->iocb, ret);

- /* This needs to use ->swap_rw() */
- ret = -EINVAL;
- if (!ret)
- count_vm_event(PSWPIN);
goto out;
}

diff --git a/mm/swap.h b/mm/swap.h
index 13e72a5023aa..128a1d3e5558 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -3,6 +3,7 @@
#include <linux/blk_types.h> /* for bio_end_io_t */

/* linux/mm/page_io.c */
+int sio_pool_init(void);
int swap_readpage(struct page *page, bool do_poll);
int swap_writepage(struct page *page, struct writeback_control *wbc);
void end_swap_bio_write(struct bio *bio);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f23d9ff21cf8..43539be38e68 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2401,6 +2401,11 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
if (ret < 0)
return ret;
sis->flags |= SWP_ACTIVATED;
+ if ((sis->flags & SWP_FS_OPS) &&
+ sio_pool_init() != 0) {
+ destroy_swap_extents(sis);
+ return -ENOMEM;
+ }
return ret;
}




2021-12-16 23:52:44

by NeilBrown

Subject: [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space

Writes to SWP_FS_OPS swapspace are currently synchronous. To make them
async we need to allocate the struct holding the kiocb, which may block,
but never for as long as waiting for the write to complete would.

Signed-off-by: NeilBrown <[email protected]>
---
mm/page_io.c | 69 +++++++++++++++++++++++++++++++++-------------------------
1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 47d7e7866e33..84859132c9c6 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -300,6 +300,32 @@ int sio_pool_init(void)
return -ENOMEM;
}

+static void sio_write_complete(struct kiocb *iocb, long ret)
+{
+ struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
+ struct page *page = sio->bvec.bv_page;
+
+ if (ret != 0 && ret != PAGE_SIZE) {
+ /*
+ * In the case of swap-over-nfs, this can be a
+ * temporary failure if the system has limited
+ * memory for allocating transmit buffers.
+ * Mark the page dirty and avoid
+ * folio_rotate_reclaimable but rate-limit the
+ * messages but do not flag PageError like
+ * the normal direct-to-bio case as it could
+ * be temporary.
+ */
+ set_page_dirty(page);
+ ClearPageReclaim(page);
+ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
+ ret, page_file_offset(page));
+ } else
+ count_vm_event(PSWPOUT);
+ end_page_writeback(page);
+ mempool_free(sio, sio_pool);
+}
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
bio_end_io_t end_write_func)
{
@@ -309,42 +335,25 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

VM_BUG_ON_PAGE(!PageSwapCache(page), page);
if (data_race(sis->flags & SWP_FS_OPS)) {
- struct kiocb kiocb;
+ struct swap_iocb *sio;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
- struct bio_vec bv = {
- .bv_page = page,
- .bv_len = PAGE_SIZE,
- .bv_offset = 0
- };
struct iov_iter from;

- iov_iter_bvec(&from, WRITE, &bv, 1, PAGE_SIZE);
- init_sync_kiocb(&kiocb, swap_file);
- kiocb.ki_pos = page_file_offset(page);
-
set_page_writeback(page);
unlock_page(page);
- ret = mapping->a_ops->swap_rw(&kiocb, &from);
- if (ret == 0) {
- count_vm_event(PSWPOUT);
- } else {
- /*
- * In the case of swap-over-nfs, this can be a
- * temporary failure if the system has limited
- * memory for allocating transmit buffers.
- * Mark the page dirty and avoid
- * folio_rotate_reclaimable but rate-limit the
- * messages but do not flag PageError like
- * the normal direct-to-bio case as it could
- * be temporary.
- */
- set_page_dirty(page);
- ClearPageReclaim(page);
- pr_err_ratelimited("Write error on dio swapfile (%llu)\n",
- page_file_offset(page));
- }
- end_page_writeback(page);
+ sio = mempool_alloc(sio_pool, GFP_NOIO);
+ init_sync_kiocb(&sio->iocb, swap_file);
+ sio->iocb.ki_complete = sio_write_complete;
+ sio->iocb.ki_pos = page_file_offset(page);
+ sio->bvec.bv_page = page;
+ sio->bvec.bv_len = PAGE_SIZE;
+ sio->bvec.bv_offset = 0;
+ iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+ ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+ if (ret != -EIOCBQUEUED)
+ sio_write_complete(&sio->iocb, ret);
+
return ret;
}




2021-12-16 23:52:53

by NeilBrown

Subject: [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space

If swap-out is using filesystem operations (SWP_FS_OPS), then it is not
safe to enter the FS for reclaim.
So only downgrade the requirement for swap pages to __GFP_IO after
checking that SWP_FS_OPS is not being used.

Signed-off-by: NeilBrown <[email protected]>
---
mm/vmscan.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 969bcdb4ca80..5f460d174b1b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1465,6 +1465,21 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
return nr_succeeded;
}

+static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_FS)
+ return true;
+ if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
+ return false;
+ /* We can "enter_fs" for swap-cache with only __GFP_IO
+ * providing this isn't SWP_FS_OPS.
+ * ->flags can be updated non-atomicially (scan_swap_map_slots),
+ * but that will never affect SWP_FS_OPS, so the data_race
+ * is safe.
+ */
+ return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -1514,8 +1529,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

- may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
- (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+ may_enter_fs = test_may_enter_fs(page, sc->gfp_mask);

/*
* The number of dirty pages determines if a node is marked
@@ -1683,7 +1697,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
goto activate_locked_split;
}

- may_enter_fs = true;
+ may_enter_fs = test_may_enter_fs(page,
+ sc->gfp_mask);

/* Adding to swap updated mapping */
mapping = page_mapping(page);



2021-12-16 23:53:00

by NeilBrown

Subject: [PATCH 06/18] MM: submit multipage reads for SWP_FS_OPS swap-space

swap_readpage() is given one page at a time, but may be called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y. Consequently all swap reads over NFS
are single page reads.

With this patch we pass in a pointer-to-pointer in which swap_readpage()
can store state between calls - much like the effect of blk_plug. After
calling swap_readpage() some number of times, the state will be passed
to swap_read_unplug() which can submit the combined request.

Some callers currently call blk_finish_plug() *before* the final call to
swap_readpage(), so the last page cannot be included. This patch moves
blk_finish_plug() to after the last call, and calls swap_read_unplug()
there too.
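
The caller-side pattern this introduces looks roughly like the mm/madvise.c
hunk below; sketched here with the surrounding loop elided:

    struct swap_iocb *splug = NULL;
    struct page *page;

    /* adjacent reads get batched into splug instead of being submitted
     * one page at a time */
    page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
                                 vma, index, false, &splug);
    if (page)
            put_page(page);
    /* ... further entries ... */

    swap_read_unplug(splug);    /* submit anything still in the plug */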

Signed-off-by: NeilBrown <[email protected]>
---
mm/madvise.c | 8 +++--
mm/memory.c | 2 +
mm/page_io.c | 95 ++++++++++++++++++++++++++++++++++++-------------------
mm/swap.h | 13 ++++++--
mm/swap_state.c | 31 ++++++++++++------
5 files changed, 100 insertions(+), 49 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 724470773582..a90870c7a2df 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -191,6 +191,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
pte_t *orig_pte;
struct vm_area_struct *vma = walk->private;
unsigned long index;
+ struct swap_iocb *splug = NULL;

if (pmd_none_or_trans_huge_or_clear_bad(pmd))
return 0;
@@ -212,10 +213,11 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
continue;

page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index, false);
+ vma, index, false, &splug);
if (page)
put_page(page);
}
+ swap_read_unplug(splug);

return 0;
}
@@ -231,6 +233,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
struct page *page;
+ struct swap_iocb *splug = NULL;

rcu_read_lock();
xas_for_each(&xas, page, end_index) {
@@ -243,13 +246,14 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,

swap = radix_to_swp_entry(page);
page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0, false);
+ NULL, 0, false, &splug);
if (page)
put_page(page);

rcu_read_lock();
}
rcu_read_unlock();
+ swap_read_unplug(splug);

lru_add_drain(); /* Push any new pages onto the LRU now */
}
diff --git a/mm/memory.c b/mm/memory.c
index 80bbfd449b40..0ca00f2a6890 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3538,7 +3538,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

/* To provide entry to swap_readpage() */
set_page_private(page, entry.val);
- swap_readpage(page, true);
+ swap_readpage(page, true, NULL);
set_page_private(page, 0);
}
} else {
diff --git a/mm/page_io.c b/mm/page_io.c
index 84859132c9c6..03fbf9463081 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -285,7 +285,8 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)

struct swap_iocb {
struct kiocb iocb;
- struct bio_vec bvec;
+ struct bio_vec bvec[SWAP_CLUSTER_MAX];
+ int pages;
};
static mempool_t *sio_pool;

@@ -303,7 +304,7 @@ int sio_pool_init(void)
static void sio_write_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
- struct page *page = sio->bvec.bv_page;
+ struct page *page = sio->bvec[0].bv_page;

if (ret != 0 && ret != PAGE_SIZE) {
/*
@@ -346,10 +347,10 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
init_sync_kiocb(&sio->iocb, swap_file);
sio->iocb.ki_complete = sio_write_complete;
sio->iocb.ki_pos = page_file_offset(page);
- sio->bvec.bv_page = page;
- sio->bvec.bv_len = PAGE_SIZE;
- sio->bvec.bv_offset = 0;
- iov_iter_bvec(&from, WRITE, &sio->bvec, 1, PAGE_SIZE);
+ sio->bvec[0].bv_page = page;
+ sio->bvec[0].bv_len = PAGE_SIZE;
+ sio->bvec[0].bv_offset = 0;
+ iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
if (ret != -EIOCBQUEUED)
sio_write_complete(&sio->iocb, ret);
@@ -382,21 +383,25 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
static void sio_read_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
- struct page *page = sio->bvec.bv_page;
-
- if (ret != 0 && ret != PAGE_SIZE) {
- SetPageError(page);
- ClearPageUptodate(page);
- pr_alert_ratelimited("Read-error on swap-device\n");
- } else {
- SetPageUptodate(page);
- count_vm_event(PSWPIN);
+ int p;
+
+ for (p = 0; p < sio->pages; p++) {
+ struct page *page = sio->bvec[p].bv_page;
+ if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
+ SetPageError(page);
+ ClearPageUptodate(page);
+ pr_alert_ratelimited("Read-error on swap-device\n");
+ } else {
+ SetPageUptodate(page);
+ count_vm_event(PSWPIN);
+ }
+ unlock_page(page);
}
- unlock_page(page);
mempool_free(sio, sio_pool);
}

-int swap_readpage(struct page *page, bool synchronous)
+int swap_readpage(struct page *page, bool synchronous,
+ struct swap_iocb **plug)
{
struct bio *bio;
int ret = 0;
@@ -421,24 +426,35 @@ int swap_readpage(struct page *page, bool synchronous)
}

if (data_race(sis->flags & SWP_FS_OPS)) {
- struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
- struct iov_iter from;
- struct swap_iocb *sio;
+ struct swap_iocb *sio = NULL;
loff_t pos = page_file_offset(page);

- sio = mempool_alloc(sio_pool, GFP_KERNEL);
- init_sync_kiocb(&sio->iocb, swap_file);
- sio->iocb.ki_pos = pos;
- sio->iocb.ki_complete = sio_read_complete;
- sio->bvec.bv_page = page;
- sio->bvec.bv_len = PAGE_SIZE;
- sio->bvec.bv_offset = 0;
-
- iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
- ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
- if (ret != -EIOCBQUEUED)
- sio_read_complete(&sio->iocb, ret);
+ if (*plug)
+ sio = *plug;
+ if (sio) {
+ if (sio->iocb.ki_filp != sis->swap_file ||
+ sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+ swap_read_unplug(sio);
+ sio = NULL;
+ }
+ }
+ if (!sio) {
+ sio = mempool_alloc(sio_pool, GFP_KERNEL);
+ init_sync_kiocb(&sio->iocb, sis->swap_file);
+ sio->iocb.ki_pos = pos;
+ sio->iocb.ki_complete = sio_read_complete;
+ sio->pages = 0;
+ }
+ sio->bvec[sio->pages].bv_page = page;
+ sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+ sio->bvec[sio->pages].bv_offset = 0;
+ sio->pages += 1;
+ if (sio->pages == ARRAY_SIZE(sio->bvec) || !plug) {
+ swap_read_unplug(sio);
+ sio = NULL;
+ }
+ if (plug)
+ *plug = sio;

goto out;
}
@@ -490,3 +506,16 @@ int swap_readpage(struct page *page, bool synchronous)
psi_memstall_leave(&pflags);
return ret;
}
+
+void __swap_read_unplug(struct swap_iocb *sio)
+{
+ struct iov_iter from;
+ struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+ int ret;
+
+ iov_iter_bvec(&from, READ, sio->bvec, sio->pages,
+ PAGE_SIZE * sio->pages);
+ ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+ if (ret != -EIOCBQUEUED)
+ sio_read_complete(&sio->iocb, ret);
+}
diff --git a/mm/swap.h b/mm/swap.h
index 128a1d3e5558..ce967abc5f46 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -4,7 +4,15 @@

/* linux/mm/page_io.c */
int sio_pool_init(void);
-int swap_readpage(struct page *page, bool do_poll);
+struct swap_iocb;
+int swap_readpage(struct page *page, bool do_poll,
+ struct swap_iocb **plug);
+void __swap_read_unplug(struct swap_iocb *plug);
+static inline void swap_read_unplug(struct swap_iocb *plug)
+{
+ if (unlikely(plug))
+ __swap_read_unplug(plug);
+}
int swap_writepage(struct page *page, struct writeback_control *wbc);
void end_swap_bio_write(struct bio *bio);
int __swap_writepage(struct page *page, struct writeback_control *wbc,
@@ -38,7 +46,8 @@ struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
struct page *read_swap_cache_async(swp_entry_t, gfp_t,
struct vm_area_struct *vma,
unsigned long addr,
- bool do_poll);
+ bool do_poll,
+ struct swap_iocb **plug);
struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
struct vm_area_struct *vma,
unsigned long addr,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 514b86b05488..5cb2c75fa247 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -520,14 +520,16 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* the swap entry is no longer in use.
*/
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_area_struct *vma, unsigned long addr, bool do_poll)
+ struct vm_area_struct *vma,
+ unsigned long addr, bool do_poll,
+ struct swap_iocb **plug)
{
bool page_was_allocated;
struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
vma, addr, &page_was_allocated);

if (page_was_allocated)
- swap_readpage(retpage, do_poll);
+ swap_readpage(retpage, do_poll, plug);

return retpage;
}
@@ -621,10 +623,12 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long mask;
struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
+ struct swap_iocb *splug = NULL;
bool do_poll = true, page_allocated;
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address;

+ blk_start_plug(&plug);
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
@@ -638,7 +642,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (end_offset >= si->max)
end_offset = si->max - 1;

- blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
page = __read_swap_cache_async(
@@ -647,7 +650,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!page)
continue;
if (page_allocated) {
- swap_readpage(page, false);
+ swap_readpage(page, false, &splug);
if (offset != entry_offset) {
SetPageReadahead(page);
count_vm_event(SWAP_RA);
@@ -655,11 +658,14 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
}
put_page(page);
}
- blk_finish_plug(&plug);

lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
- return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
+ page = read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll,
+ &splug);
+ blk_finish_plug(&plug);
+ swap_read_unplug(splug);
+ return page;
}

int init_swap_address_space(unsigned int type, unsigned long nr_pages)
@@ -790,6 +796,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
struct vm_fault *vmf)
{
struct blk_plug plug;
+ struct swap_iocb *splug = NULL;
struct vm_area_struct *vma = vmf->vma;
struct page *page;
pte_t *pte, pentry;
@@ -800,11 +807,11 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
.win = 1,
};

+ blk_start_plug(&plug);
swap_ra_info(vmf, &ra_info);
if (ra_info.win == 1)
goto skip;

- blk_start_plug(&plug);
for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
i++, pte++) {
pentry = *pte;
@@ -820,7 +827,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
if (!page)
continue;
if (page_allocated) {
- swap_readpage(page, false);
+ swap_readpage(page, false, &splug);
if (i != ra_info.offset) {
SetPageReadahead(page);
count_vm_event(SWAP_RA);
@@ -828,11 +835,13 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
}
put_page(page);
}
- blk_finish_plug(&plug);
lru_add_drain();
skip:
- return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
- ra_info.win == 1);
+ page = read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
+ ra_info.win == 1, &splug);
+ blk_finish_plug(&plug);
+ swap_read_unplug(splug);
+ return page;
}

/**



2021-12-16 23:53:07

by NeilBrown

Subject: [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space

swap_writepage() is given one page at a time, but may be called repeatedly
in succession.
For block-device swapspace, the blk_plug functionality allows the
multiple pages to be combined together at lower layers.
That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
only active when CONFIG_BLOCK=y. Consequently all swap writes over NFS
are single-page writes.

With this patch we pass a pointer-to-pointer via the wbc, in which
swap_writepage() can store state between calls - much like the pointer
passed explicitly to swap_readpage. After calling swap_writepage() some
number of times, the state will be passed to swap_write_unplug() which
can submit the combined request.
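
On the write side the batching state travels in the writeback_control,
roughly as in the mm/vmscan.c hunk below (other wbc fields trimmed):

    struct swap_iocb *plug = NULL;
    struct writeback_control wbc = {
            .sync_mode   = WB_SYNC_NONE,
            .nr_to_write = SWAP_CLUSTER_MAX,
            .for_reclaim = 1,
            .plug        = &plug,   /* lets swap_writepage() batch pages */
    };

    /* call ->writepage(page, &wbc) for each dirty swap page, then flush
     * whatever is still batched: */
    if (plug)
            swap_write_unplug(plug);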

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/writeback.h | 7 +++
mm/page_io.c | 98 ++++++++++++++++++++++++++++++---------------
mm/swap.h | 1
mm/vmscan.c | 9 +++-
4 files changed, 80 insertions(+), 35 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3bfd487d1dd2..16f780b618d2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -79,6 +79,13 @@ struct writeback_control {

unsigned punt_to_cgroup:1; /* cgrp punting, see __REQ_CGROUP_PUNT */

+ /* To enable batching of swap writes to non-block-device backends,
+ * "plug" can be set to point to a 'struct swap_iocb *'. When all swap
+ * writes have been submitted, if the swap_iocb is not NULL,
+ * swap_write_unplug() should be called.
+ */
+ struct swap_iocb **plug;
+
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *wb; /* wb this writeback is issued under */
struct inode *inode; /* inode being written out */
diff --git a/mm/page_io.c b/mm/page_io.c
index 03fbf9463081..92a31df467a2 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -304,26 +304,30 @@ int sio_pool_init(void)
static void sio_write_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
- struct page *page = sio->bvec[0].bv_page;
+ int p;

- if (ret != 0 && ret != PAGE_SIZE) {
- /*
- * In the case of swap-over-nfs, this can be a
- * temporary failure if the system has limited
- * memory for allocating transmit buffers.
- * Mark the page dirty and avoid
- * folio_rotate_reclaimable but rate-limit the
- * messages but do not flag PageError like
- * the normal direct-to-bio case as it could
- * be temporary.
- */
- set_page_dirty(page);
- ClearPageReclaim(page);
- pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
- ret, page_file_offset(page));
- } else
- count_vm_event(PSWPOUT);
- end_page_writeback(page);
+ for (p = 0; p < sio->pages; p++) {
+ struct page *page = sio->bvec[p].bv_page;
+
+ if (ret != 0 && ret != PAGE_SIZE * sio->pages) {
+ /*
+ * In the case of swap-over-nfs, this can be a
+ * temporary failure if the system has limited
+ * memory for allocating transmit buffers.
+ * Mark the page dirty and avoid
+ * folio_rotate_reclaimable but rate-limit the
+ * messages but do not flag PageError like
+ * the normal direct-to-bio case as it could
+ * be temporary.
+ */
+ set_page_dirty(page);
+ ClearPageReclaim(page);
+ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
+ ret, page_file_offset(page));
+ } else
+ count_vm_event(PSWPOUT);
+ end_page_writeback(page);
+ }
mempool_free(sio, sio_pool);
}

@@ -336,24 +340,39 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

VM_BUG_ON_PAGE(!PageSwapCache(page), page);
if (data_race(sis->flags & SWP_FS_OPS)) {
- struct swap_iocb *sio;
+ struct swap_iocb *sio = NULL;
struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
- struct iov_iter from;
+ loff_t pos = page_file_offset(page);

set_page_writeback(page);
unlock_page(page);
- sio = mempool_alloc(sio_pool, GFP_NOIO);
- init_sync_kiocb(&sio->iocb, swap_file);
- sio->iocb.ki_complete = sio_write_complete;
- sio->iocb.ki_pos = page_file_offset(page);
- sio->bvec[0].bv_page = page;
- sio->bvec[0].bv_len = PAGE_SIZE;
- sio->bvec[0].bv_offset = 0;
- iov_iter_bvec(&from, WRITE, &sio->bvec[0], 1, PAGE_SIZE);
- ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
- if (ret != -EIOCBQUEUED)
- sio_write_complete(&sio->iocb, ret);
+
+ if (wbc->plug)
+ sio = *wbc->plug;
+ if (sio) {
+ if (sio->iocb.ki_filp != swap_file ||
+ sio->iocb.ki_pos + sio->pages * PAGE_SIZE != pos) {
+ swap_write_unplug(sio);
+ sio = NULL;
+ }
+ }
+ if (!sio) {
+ sio = mempool_alloc(sio_pool, GFP_NOIO);
+ init_sync_kiocb(&sio->iocb, swap_file);
+ sio->iocb.ki_complete = sio_write_complete;
+ sio->iocb.ki_pos = pos;
+ sio->pages = 0;
+ }
+ sio->bvec[sio->pages].bv_page = page;
+ sio->bvec[sio->pages].bv_len = PAGE_SIZE;
+ sio->bvec[sio->pages].bv_offset = 0;
+ sio->pages += 1;
+ if (sio->pages == ARRAY_SIZE(sio->bvec) || !wbc->plug) {
+ swap_write_unplug(sio);
+ sio = NULL;
+ }
+ if (wbc->plug)
+ *wbc->plug = sio;

return ret;
}
@@ -380,6 +399,19 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return 0;
}

+void swap_write_unplug(struct swap_iocb *sio)
+{
+ struct iov_iter from;
+ struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
+ int ret;
+
+ iov_iter_bvec(&from, WRITE, sio->bvec, sio->pages,
+ PAGE_SIZE * sio->pages);
+ ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
+ if (ret != -EIOCBQUEUED)
+ sio_write_complete(&sio->iocb, ret);
+}
+
static void sio_read_complete(struct kiocb *iocb, long ret)
{
struct swap_iocb *sio = container_of(iocb, struct swap_iocb, iocb);
diff --git a/mm/swap.h b/mm/swap.h
index ce967abc5f46..f4d0edda6e59 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -13,6 +13,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
if (unlikely(plug))
__swap_read_unplug(plug);
}
+void swap_write_unplug(struct swap_iocb *sio);
int swap_writepage(struct page *page, struct writeback_control *wbc);
void end_swap_bio_write(struct bio *bio);
int __swap_writepage(struct page *page, struct writeback_control *wbc,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5f460d174b1b..50a363e63102 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1123,7 +1123,8 @@ typedef enum {
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
*/
-static pageout_t pageout(struct page *page, struct address_space *mapping)
+static pageout_t pageout(struct page *page, struct address_space *mapping,
+ struct swap_iocb **plug)
{
/*
* If the page is dirty, only perform writeback if that write
@@ -1170,6 +1171,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping)
.range_start = 0,
.range_end = LLONG_MAX,
.for_reclaim = 1,
+ .plug = plug,
};

SetPageReclaim(page);
@@ -1495,6 +1497,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
+ struct swap_iocb *plug = NULL;

memset(stat, 0, sizeof(*stat));
cond_resched();
@@ -1780,7 +1783,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
* starts and then write it out here.
*/
try_to_unmap_flush_dirty();
- switch (pageout(page, mapping)) {
+ switch (pageout(page, mapping, &plug)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
@@ -1934,6 +1937,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);

+ if (plug)
+ swap_write_unplug(plug);
return nr_reclaimed;
}




2021-12-16 23:53:18

by NeilBrown

Subject: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag

Currently various places test if direct IO is possible on a file by
checking for the existence of the direct_IO address space operation.
This is a poor choice, as the direct_IO operation may not be used - it is
only used when the generic_file_*_iter functions are called for direct IO,
and some filesystems - particularly NFS - don't do this.

Instead, introduce a new mapping flag, AS_CAN_DIO, and change the various
places to check this (avoiding a pointer dereference).
unlock_new_inode() will set this flag if ->direct_IO is present, so
filesystems do not need to be changed.

NFS *is* changed, to set the flag explicitly and discard the direct_IO
entry in the address_space_operations for files.
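
Condensed from the hunks below, the two sides of the new flag are: generic
code tests the bit instead of peeking at a_ops, and a filesystem such as NFS,
which does direct IO without ->direct_IO, sets the bit itself when the inode
is set up:

    /* generic open path (fs/open.c) */
    if (f->f_flags & O_DIRECT) {
            if (!test_bit(AS_CAN_DIO, &f->f_mapping->flags))
                    return -EINVAL;
    }

    /* filesystem opting in explicitly (fs/nfs/inode.c) */
    inode->i_data.a_ops = &nfs_file_aops;
    set_bit(AS_CAN_DIO, &inode->i_data.flags);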

Signed-off-by: NeilBrown <[email protected]>
---
drivers/block/loop.c | 4 ++--
fs/fcntl.c | 5 +++--
fs/inode.c | 3 +++
fs/nfs/file.c | 1 -
fs/nfs/inode.c | 1 +
fs/open.c | 2 +-
fs/overlayfs/file.c | 10 ++++------
include/linux/fs.h | 2 +-
include/linux/pagemap.h | 3 ++-
9 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c3a36cfaa855..ab4dee6c0fc3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -184,8 +184,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
*/
if (dio) {
if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
- !(lo->lo_offset & dio_align) &&
- mapping->a_ops->direct_IO)
+ !(lo->lo_offset & dio_align) &&
+ test_bit(AS_CAN_DIO, &mapping->flags))
use_dio = true;
else
use_dio = false;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9c6c6a3e2de5..fcbf2dc44273 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -26,6 +26,7 @@
#include <linux/memfd.h>
#include <linux/compat.h>
#include <linux/mount.h>
+#include <linux/pagemap.h>

#include <linux/poll.h>
#include <asm/siginfo.h>
@@ -57,9 +58,9 @@ static int setfl(int fd, struct file * filp, unsigned long arg)

/* Pipe packetized mode is controlled by O_DIRECT flag */
if (!S_ISFIFO(inode->i_mode) && (arg & O_DIRECT)) {
- if (!filp->f_mapping || !filp->f_mapping->a_ops ||
- !filp->f_mapping->a_ops->direct_IO)
- return -EINVAL;
+ if (!filp->f_mapping ||
+ !test_bit(AS_CAN_DIO, &filp->f_mapping->flags))
+ return -EINVAL;
}

if (filp->f_op->check_flags)
diff --git a/fs/inode.c b/fs/inode.c
index 6b80a51129d5..bae65ccecdb1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1008,6 +1008,9 @@ EXPORT_SYMBOL(lockdep_annotate_inode_mutex_key);
void unlock_new_inode(struct inode *inode)
{
lockdep_annotate_inode_mutex_key(inode);
+ if (inode->i_mapping->a_ops &&
+ inode->i_mapping->a_ops->direct_IO)
+ set_bit(AS_CAN_DIO, &inode->i_mapping->flags);
spin_lock(&inode->i_lock);
WARN_ON(!(inode->i_state & I_NEW));
inode->i_state &= ~I_NEW & ~I_CREATING;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0d33c95eefb6..60842b774b56 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -536,7 +536,6 @@ const struct address_space_operations nfs_file_aops = {
.write_end = nfs_write_end,
.invalidatepage = nfs_invalidate_page,
.releasepage = nfs_release_page,
- .direct_IO = nfs_direct_IO,
#ifdef CONFIG_MIGRATION
.migratepage = nfs_migrate_page,
#endif
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index fda530d5e764..e9d1097170b1 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -496,6 +496,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
if (S_ISREG(inode->i_mode)) {
inode->i_fop = NFS_SB(sb)->nfs_client->rpc_ops->file_ops;
inode->i_data.a_ops = &nfs_file_aops;
+ set_bit(AS_CAN_DIO, &inode->i_data.flags);
nfs_inode_init_regular(nfsi);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
diff --git a/fs/open.c b/fs/open.c
index f732fb94600c..ff58874acd10 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -840,7 +840,7 @@ static int do_dentry_open(struct file *f,

/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
+ if (!test_bit(AS_CAN_DIO, &f->f_mapping->flags))
return -EINVAL;
}

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index fa125feed0ff..21754edf5b62 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -13,6 +13,7 @@
#include <linux/security.h>
#include <linux/mm.h>
#include <linux/fs.h>
+#include <linux/pagemap.h>
#include "overlayfs.h"

struct ovl_aio_req {
@@ -83,8 +84,7 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
return -EPERM;

if (flags & O_DIRECT) {
- if (!file->f_mapping->a_ops ||
- !file->f_mapping->a_ops->direct_IO)
+ if (!test_bit(AS_CAN_DIO, &file->f_mapping->flags))
return -EINVAL;
}

@@ -306,8 +306,7 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)

ret = -EINVAL;
if (iocb->ki_flags & IOCB_DIRECT &&
- (!real.file->f_mapping->a_ops ||
- !real.file->f_mapping->a_ops->direct_IO))
+ !test_bit(AS_CAN_DIO, &real.file->f_mapping->flags))
goto out_fdput;

old_cred = ovl_override_creds(file_inode(file)->i_sb);
@@ -367,8 +366,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)

ret = -EINVAL;
if (iocb->ki_flags & IOCB_DIRECT &&
- (!real.file->f_mapping->a_ops ||
- !real.file->f_mapping->a_ops->direct_IO))
+ !test_bit(AS_CAN_DIO, &real.file->f_mapping->flags))
goto out_fdput;

if (!ovl_should_sync(OVL_FS(inode->i_sb)))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index deaaf359cc49..1e954756b093 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -448,7 +448,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
* @nrpages: Number of page entries, protected by the i_pages lock.
* @writeback_index: Writeback starts here.
* @a_ops: Methods.
- * @flags: Error bits and flags (AS_*).
+ * @flags: Error bits and flags (AS_*). (enum mapping_flags)
* @wb_err: The most recent error which has occurred.
* @private_lock: For use by the owner of the address_space.
* @private_list: For use by the owner of the address_space.
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 605246452305..ceb599b6ba8b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -81,10 +81,11 @@ enum mapping_flags {
AS_ENOSPC = 1, /* ENOSPC on async write */
AS_MM_ALL_LOCKS = 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = 3, /* e.g., ramdisk, SHM_LOCK */
- AS_EXITING = 4, /* final truncate in progress */
+ AS_EXITING = 4, /* final truncate in progress */
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
+ AS_CAN_DIO = 7, /* DIO is supported */
};

/**



2021-12-16 23:53:26

by NeilBrown

[permalink] [raw]
Subject: [PATCH 09/18] NFS: rename nfs_direct_IO and use as ->swap_rw

nfs_direct_IO() exists to support swap IO, but hasn't worked for a
while. We now need a ->swap_rw function which behaves slightly
differently, returning zero for success rather than a byte count.

So modify nfs_direct_IO accordingly, rename it, and use it as the
->swap_rw function.

Note: it still won't work - that will be fixed in later patches.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/direct.c | 23 ++++++++++-------------
fs/nfs/file.c | 5 +----
include/linux/nfs_fs.h | 2 +-
3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 9cff8709c80a..f1e169f3050a 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -152,28 +152,25 @@ nfs_direct_count_bytes(struct nfs_direct_req *dreq,
}

/**
- * nfs_direct_IO - NFS address space operation for direct I/O
+ * nfs_swap_rw - NFS address space operation for swap I/O
* @iocb: target I/O control block
* @iter: I/O buffer
*
- * The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O. However, for most direct IO, we
- * shunt off direct read and write requests before the VFS gets them,
- * so this method is only ever called for swap.
+ * Perform IO to the swap-file. This is much like direct IO.
*/
-ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
{
- struct inode *inode = iocb->ki_filp->f_mapping->host;
-
- /* we only support swap file calling nfs_direct_IO */
- if (!IS_SWAPFILE(inode))
- return 0;
+ ssize_t ret;

VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);

if (iov_iter_rw(iter) == READ)
- return nfs_file_direct_read(iocb, iter);
- return nfs_file_direct_write(iocb, iter);
+ ret = nfs_file_direct_read(iocb, iter);
+ else
+ ret = nfs_file_direct_write(iocb, iter);
+ if (ret < 0)
+ return ret;
+ return 0;
}

static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 60842b774b56..b620fe697158 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -493,10 +493,6 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
struct inode *inode = file->f_mapping->host;

- if (!file->f_mapping->a_ops->swap_rw)
- /* Cannot support swap */
- return -EINVAL;
-
spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
isize = inode->i_size;
@@ -544,6 +540,7 @@ const struct address_space_operations nfs_file_aops = {
.error_remove_page = generic_error_remove_page,
.swap_activate = nfs_swap_activate,
.swap_deactivate = nfs_swap_deactivate,
+ .swap_rw = nfs_swap_rw,
};

/*
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 05f249f20f55..6329e6958718 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -510,7 +510,7 @@ static inline const struct cred *nfs_file_cred(struct file *file)
/*
* linux/fs/nfs/direct.c
*/
-extern ssize_t nfs_direct_IO(struct kiocb *, struct iov_iter *);
+extern int nfs_swap_rw(struct kiocb *, struct iov_iter *);
extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
struct iov_iter *iter);
extern ssize_t nfs_file_direct_write(struct kiocb *iocb,



2021-12-16 23:53:33

by NeilBrown

[permalink] [raw]
Subject: [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO

1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
possible deadlocks with "fs_reclaim". These deadlocks could, I believe,
eventuate if a buffered read on the swapfile was attempted.

We don't need coherence with the page cache for a swap file, and
buffered writes are forbidden anyway. There is no other need for
i_rwsem during direct IO. So never take it for swap_rw().

2/ generic_write_checks() explicitly forbids writes to swap, and
performs checks that are not needed for swap. So bypass it
for swap_rw().

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/direct.c | 30 +++++++++++++++++++++---------
fs/nfs/file.c | 4 ++--
include/linux/nfs_fs.h | 4 ++--
3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index f1e169f3050a..eeff1b4e1a7c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -165,9 +165,9 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);

if (iov_iter_rw(iter) == READ)
- ret = nfs_file_direct_read(iocb, iter);
+ ret = nfs_file_direct_read(iocb, iter, true);
else
- ret = nfs_file_direct_write(iocb, iter);
+ ret = nfs_file_direct_write(iocb, iter, true);
if (ret < 0)
return ret;
return 0;
@@ -421,6 +421,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
* nfs_file_direct_read - file direct read operation for NFS files
* @iocb: target I/O control block
* @iter: vector of user buffers into which to read data
+ * @swap: flag indicating this is swap IO, not O_DIRECT IO
*
* We use this function for direct reads instead of calling
* generic_file_aio_read() in order to avoid gfar's check to see if
@@ -436,7 +437,8 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
* client must read the updated atime from the server back into its
* cache.
*/
-ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
+ bool swap)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -478,12 +480,14 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter)
if (iter_is_iovec(iter))
dreq->flags = NFS_ODIRECT_SHOULD_DIRTY;

- nfs_start_io_direct(inode);
+ if (!swap)
+ nfs_start_io_direct(inode);

NFS_I(inode)->read_io += count;
requested = nfs_direct_read_schedule_iovec(dreq, iter, iocb->ki_pos);

- nfs_end_io_direct(inode);
+ if (!swap)
+ nfs_end_io_direct(inode);

if (requested > 0) {
result = nfs_direct_wait(dreq);
@@ -872,6 +876,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
* nfs_file_direct_write - file direct write operation for NFS files
* @iocb: target I/O control block
* @iter: vector of user buffers from which to write data
+ * @swap: flag indicating this is swap IO, not O_DIRECT IO
*
* We use this function for direct writes instead of calling
* generic_file_aio_write() in order to avoid taking the inode
@@ -888,7 +893,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
* Note that O_APPEND is not supported for NFS direct writes, as there
* is no atomic O_APPEND write facility in the NFS protocol.
*/
-ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
+ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+ bool swap)
{
ssize_t result, requested;
size_t count;
@@ -902,7 +908,11 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);

- result = generic_write_checks(iocb, iter);
+ if (!swap)
+ result = generic_write_checks(iocb, iter);
+ else
+ /* bypass generic checks */
+ result = iov_iter_count(iter);
if (result <= 0)
return result;
count = result;
@@ -933,7 +943,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
dreq->iocb = iocb;
pnfs_init_ds_commit_info_ops(&dreq->ds_cinfo, inode);

- nfs_start_io_direct(inode);
+ if (!swap)
+ nfs_start_io_direct(inode);

requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);

@@ -942,7 +953,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
pos >> PAGE_SHIFT, end);
}

- nfs_end_io_direct(inode);
+ if (!swap)
+ nfs_end_io_direct(inode);

if (requested > 0) {
result = nfs_direct_wait(dreq);
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index b620fe697158..996dfb3c74b2 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -161,7 +161,7 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
ssize_t result;

if (iocb->ki_flags & IOCB_DIRECT)
- return nfs_file_direct_read(iocb, to);
+ return nfs_file_direct_read(iocb, to, false);

dprintk("NFS: read(%pD2, %zu@%lu)\n",
iocb->ki_filp,
@@ -625,7 +625,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
return result;

if (iocb->ki_flags & IOCB_DIRECT)
- return nfs_file_direct_write(iocb, from);
+ return nfs_file_direct_write(iocb, from, false);

dprintk("NFS: write(%pD2, %zu@%Ld)\n",
file, iov_iter_count(from), (long long) iocb->ki_pos);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 6329e6958718..3a210478f665 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -512,9 +512,9 @@ static inline const struct cred *nfs_file_cred(struct file *file)
*/
extern int nfs_swap_rw(struct kiocb *, struct iov_iter *);
extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
- struct iov_iter *iter);
+ struct iov_iter *iter, bool swap);
extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
- struct iov_iter *iter);
+ struct iov_iter *iter, bool swap);

/*
* linux/fs/nfs/dir.c



2021-12-16 23:53:42

by NeilBrown

[permalink] [raw]
Subject: [PATCH 11/18] SUNRPC/call_alloc: async tasks mustn't block waiting for memory

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running. If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.

rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER(), to determine if
blocking is acceptable.
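
As a sketch, the resulting policy in rpc_malloc() amounts to the
following (xprt_rdma_allocate() does the same, starting from
RPCRDMA_DEF_GFP; the helper name here is only for illustration):

	static gfp_t rpc_task_gfp_mask(const struct rpc_task *task)
	{
		gfp_t gfp = GFP_NOFS;

		if (RPC_IS_ASYNC(task))		/* must not sleep */
			gfp = GFP_NOWAIT | __GFP_NOWARN;
		if (RPC_IS_SWAPPER(task))	/* may dip into reserves */
			gfp |= __GFP_MEMALLOC;
		return gfp;
	}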

Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/sched.c | 4 +++-
net/sunrpc/xprtrdma/transport.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index e2c835482791..d5b6e897f5a5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1023,8 +1023,10 @@ int rpc_malloc(struct rpc_task *task)
struct rpc_buffer *buf;
gfp_t gfp = GFP_NOFS;

+ if (RPC_IS_ASYNC(task))
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
if (RPC_IS_SWAPPER(task))
- gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ gfp |= __GFP_MEMALLOC;

size += sizeof(struct rpc_buffer);
if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 16e5696314a4..a52277115500 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -574,8 +574,10 @@ xprt_rdma_allocate(struct rpc_task *task)
gfp_t flags;

flags = RPCRDMA_DEF_GFP;
+ if (RPC_IS_ASYNC(task))
+ flags = GFP_NOWAIT | __GFP_NOWARN;
if (RPC_IS_SWAPPER(task))
- flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
+ flags |= __GFP_MEMALLOC;

if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
flags))



2021-12-16 23:53:50

by NeilBrown

[permalink] [raw]
Subject: [PATCH 12/18] SUNRPC/auth: async tasks mustn't block waiting for memory

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything. So it
must not block waiting for memory.

mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running. If all available workqueue
threads are waiting on the mempool, no thread is available to return
anything.

lookup_cred() can block on a mempool or kmalloc - and this can cause
deadlocks. So add a new RPCAUTH_LOOKUP flag for async lookups and don't
block on memory. If the -ENOMEM gets back to call_refreshresult(), wait
a short while and try again. HZ>>4 is chosen as it is used elsewhere
for -ENOMEM retries.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/sunrpc/auth.h | 1 +
net/sunrpc/auth.c | 6 +++++-
net/sunrpc/auth_gss/auth_gss.c | 6 +++++-
net/sunrpc/auth_unix.c | 10 ++++++++--
net/sunrpc/clnt.c | 3 +++
5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/sunrpc/auth.h b/include/linux/sunrpc/auth.h
index 98da816b5fc2..3e6ce288a7fc 100644
--- a/include/linux/sunrpc/auth.h
+++ b/include/linux/sunrpc/auth.h
@@ -99,6 +99,7 @@ struct rpc_auth_create_args {

/* Flags for rpcauth_lookupcred() */
#define RPCAUTH_LOOKUP_NEW 0x01 /* Accept an uninitialised cred */
+#define RPCAUTH_LOOKUP_ASYNC 0x02 /* Don't block waiting for memory */

/*
* Client authentication ops
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index a9f0d17fdb0d..6bfa19f9fa6a 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -615,6 +615,8 @@ rpcauth_bind_root_cred(struct rpc_task *task, int lookupflags)
};
struct rpc_cred *ret;

+ if (RPC_IS_ASYNC(task))
+ lookupflags |= RPCAUTH_LOOKUP_ASYNC;
ret = auth->au_ops->lookup_cred(auth, &acred, lookupflags);
put_cred(acred.cred);
return ret;
@@ -631,6 +633,8 @@ rpcauth_bind_machine_cred(struct rpc_task *task, int lookupflags)

if (!acred.principal)
return NULL;
+ if (RPC_IS_ASYNC(task))
+ lookupflags |= RPCAUTH_LOOKUP_ASYNC;
return auth->au_ops->lookup_cred(auth, &acred, lookupflags);
}

@@ -654,7 +658,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
};

if (flags & RPC_TASK_ASYNC)
- lookupflags |= RPCAUTH_LOOKUP_NEW;
+ lookupflags |= RPCAUTH_LOOKUP_NEW | RPCAUTH_LOOKUP_ASYNC;
if (task->tk_op_cred)
/* Task must use exactly this rpc_cred */
new = get_rpccred(task->tk_op_cred);
diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 5f42aa5fc612..df72d6301f78 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1341,7 +1341,11 @@ gss_hash_cred(struct auth_cred *acred, unsigned int hashbits)
static struct rpc_cred *
gss_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
{
- return rpcauth_lookup_credcache(auth, acred, flags, GFP_NOFS);
+ gfp_t gfp = GFP_NOFS;
+
+ if (flags & RPCAUTH_LOOKUP_ASYNC)
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
+ return rpcauth_lookup_credcache(auth, acred, flags, gfp);
}

static struct rpc_cred *
diff --git a/net/sunrpc/auth_unix.c b/net/sunrpc/auth_unix.c
index e7df1f782b2e..e5819265dd1b 100644
--- a/net/sunrpc/auth_unix.c
+++ b/net/sunrpc/auth_unix.c
@@ -43,8 +43,14 @@ unx_destroy(struct rpc_auth *auth)
static struct rpc_cred *
unx_lookup_cred(struct rpc_auth *auth, struct auth_cred *acred, int flags)
{
- struct rpc_cred *ret = mempool_alloc(unix_pool, GFP_NOFS);
-
+ gfp_t gfp = GFP_NOFS;
+ struct rpc_cred *ret;
+
+ if (flags & RPCAUTH_LOOKUP_ASYNC)
+ gfp = GFP_NOWAIT | __GFP_NOWARN;
+ ret = mempool_alloc(unix_pool, gfp);
+ if (!ret)
+ return ERR_PTR(-ENOMEM);
rpcauth_init_cred(ret, acred, auth, &unix_credops);
ret->cr_flags = 1UL << RPCAUTH_CRED_UPTODATE;
return ret;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index a312ea2bc440..238b2ef5491f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1745,6 +1745,9 @@ call_refreshresult(struct rpc_task *task)
task->tk_cred_retry--;
trace_rpc_retry_refresh_status(task);
return;
+ case -ENOMEM:
+ rpc_delay(task, HZ >> 4);
+ return;
}
trace_rpc_refresh_status(task);
rpc_call_rpcerror(task, status);



2021-12-16 23:53:58

by NeilBrown

[permalink] [raw]
Subject: [PATCH 13/18] SUNRPC/xprt: async tasks mustn't block waiting for memory

When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything. So it
must not block waiting for memory.

xprt_dynamic_alloc_slot can block indefinitely. This can tie up all
workqueue threads and NFS can deadlock. So when called from a
workqueue, set __GFP_NORETRY.

The rdma alloc_slot already does not block. However, it sets the error
to -EAGAIN, suggesting this will trigger a sleep. It does not. As we
can see in call_reserveresult(), only -ENOMEM causes a sleep; -EAGAIN
causes an immediate retry. So have it report -ENOMEM instead.

Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/xprt.c | 5 ++++-
net/sunrpc/xprtrdma/transport.c | 2 +-
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a02de2bddb28..47d207e416ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1687,12 +1687,15 @@ static bool xprt_throttle_congested(struct rpc_xprt *xprt, struct rpc_task *task
static struct rpc_rqst *xprt_dynamic_alloc_slot(struct rpc_xprt *xprt)
{
struct rpc_rqst *req = ERR_PTR(-EAGAIN);
+ gfp_t gfp_mask = GFP_NOFS;

if (xprt->num_reqs >= xprt->max_reqs)
goto out;
++xprt->num_reqs;
spin_unlock(&xprt->reserve_lock);
- req = kzalloc(sizeof(struct rpc_rqst), GFP_NOFS);
+ if (current->flags & PF_WQ_WORKER)
+ gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;
+ req = kzalloc(sizeof(struct rpc_rqst), gfp_mask);
spin_lock(&xprt->reserve_lock);
if (req != NULL)
goto out;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index a52277115500..32df23796747 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -521,7 +521,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
return;

out_sleep:
- task->tk_status = -EAGAIN;
+ task->tk_status = -ENOMEM;
xprt_add_backlog(xprt, task);
}




2021-12-16 23:54:04

by NeilBrown

[permalink] [raw]
Subject: [PATCH 14/18] SUNRPC: remove scheduling boost for "SWAPPER" tasks.

Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.

This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks. So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).

This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it. If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.

Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/sched.c | 7 -------
net/sunrpc/xprt.c | 11 -----------
2 files changed, 18 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d5b6e897f5a5..256302bf6557 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -186,11 +186,6 @@ static void __rpc_add_wait_queue_priority(struct rpc_wait_queue *queue,

/*
* Add new request to wait queue.
- *
- * Swapper tasks always get inserted at the head of the queue.
- * This should avoid many nasty memory deadlocks and hopefully
- * improve overall performance.
- * Everyone else gets appended to the queue to ensure proper FIFO behavior.
*/
static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
struct rpc_task *task,
@@ -199,8 +194,6 @@ static void __rpc_add_wait_queue(struct rpc_wait_queue *queue,
INIT_LIST_HEAD(&task->u.tk_wait.timer_list);
if (RPC_IS_PRIORITY(queue))
__rpc_add_wait_queue_priority(queue, task, queue_priority);
- else if (RPC_IS_SWAPPER(task))
- list_add(&task->u.tk_wait.list, &queue->tasks[0]);
else
list_add_tail(&task->u.tk_wait.list, &queue->tasks[0]);
task->tk_waitqueue = queue;
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 47d207e416ab..a0a2583fe941 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1354,17 +1354,6 @@ xprt_request_enqueue_transmit(struct rpc_task *task)
INIT_LIST_HEAD(&req->rq_xmit2);
goto out;
}
- } else if (RPC_IS_SWAPPER(task)) {
- list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
- if (pos->rq_cong || pos->rq_bytes_sent)
- continue;
- if (RPC_IS_SWAPPER(pos->rq_task))
- continue;
- /* Note: req is added _before_ pos */
- list_add_tail(&req->rq_xmit, &pos->rq_xmit);
- INIT_LIST_HEAD(&req->rq_xmit2);
- goto out;
- }
} else if (!req->rq_seqno) {
list_for_each_entry(pos, &xprt->xmit_queue, rq_xmit) {
if (pos->rq_task->tk_owner != task->tk_owner)



2021-12-16 23:54:12

by NeilBrown

[permalink] [raw]
Subject: [PATCH 15/18] NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS

NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests. This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.

RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.

So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.

Remove both.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/read.c | 4 ----
include/linux/nfs_fs.h | 5 -----
include/linux/sunrpc/sched.h | 1 -
include/trace/events/sunrpc.h | 1 -
net/sunrpc/auth.c | 2 +-
5 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index d11af2a9299c..a8f2b884306c 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -194,10 +194,6 @@ static void nfs_initiate_read(struct nfs_pgio_header *hdr,
const struct nfs_rpc_ops *rpc_ops,
struct rpc_task_setup *task_setup_data, int how)
{
- struct inode *inode = hdr->inode;
- int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
-
- task_setup_data->flags |= swap_flags;
rpc_ops->read_setup(hdr, msg);
trace_nfs_initiate_read(hdr);
}
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 3a210478f665..5dce9129fe64 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -45,11 +45,6 @@
*/
#define NFS_MAX_TRANSPORTS 16

-/*
- * These are the default flags for swap requests
- */
-#define NFS_RPC_SWAPFLAGS (RPC_TASK_SWAPPER|RPC_TASK_ROOTCREDS)
-
/*
* Size of the NFS directory verifier
*/
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index db964bb63912..56710f8056d3 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -124,7 +124,6 @@ struct rpc_task_setup {
#define RPC_TASK_MOVEABLE 0x0004 /* nfs4.1+ rpc tasks */
#define RPC_TASK_NULLCREDS 0x0010 /* Use AUTH_NULL credential */
#define RPC_CALL_MAJORSEEN 0x0020 /* major timeout seen */
-#define RPC_TASK_ROOTCREDS 0x0040 /* force root creds */
#define RPC_TASK_DYNAMIC 0x0080 /* task was kmalloc'ed */
#define RPC_TASK_NO_ROUND_ROBIN 0x0100 /* send requests on "main" xprt */
#define RPC_TASK_SOFT 0x0200 /* Use soft timeouts */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 3a99358c262b..141dc34a450e 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -311,7 +311,6 @@ TRACE_EVENT(rpc_request,
{ RPC_TASK_MOVEABLE, "MOVEABLE" }, \
{ RPC_TASK_NULLCREDS, "NULLCREDS" }, \
{ RPC_CALL_MAJORSEEN, "MAJORSEEN" }, \
- { RPC_TASK_ROOTCREDS, "ROOTCREDS" }, \
{ RPC_TASK_DYNAMIC, "DYNAMIC" }, \
{ RPC_TASK_NO_ROUND_ROBIN, "NO_ROUND_ROBIN" }, \
{ RPC_TASK_SOFT, "SOFT" }, \
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 6bfa19f9fa6a..682fcd24bf43 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -670,7 +670,7 @@ rpcauth_bindcred(struct rpc_task *task, const struct cred *cred, int flags)
/* If machine cred couldn't be bound, try a root cred */
if (new)
;
- else if (cred == &machine_cred || (flags & RPC_TASK_ROOTCREDS))
+ else if (cred == &machine_cred)
new = rpcauth_bind_root_cred(task, lookupflags);
else if (flags & RPC_TASK_NULLCREDS)
new = authnull_ops.lookup_cred(NULL, NULL, 0);



2021-12-16 23:54:22

by NeilBrown

[permalink] [raw]
Subject: [PATCH 16/18] SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC

rpc tasks can be marked as RPC_TASK_SWAPPER. This causes __GFP_MEMALLOC
to be used for some allocations. This is needed in some cases, but not
in all the cases where it is currently provided, and it is missing from
some cases where it is needed.

Currently *all* tasks associated with an rpc_clnt on which swap is
enabled get the flag, and hence some __GFP_MEMALLOC support.

__GFP_MEMALLOC is provided for ->buf_alloc(), but only swap writes need
it. However, xdr_alloc_bvec() does not get __GFP_MEMALLOC, though it
often does need it.

xdr_alloc_bvec() is called while XPRT_LOCKED is held. If this blocks,
then it blocks all other queued tasks. So this allocation needs
__GFP_MEMALLOC for *all* requests, not just writes, when the xprt is
used for any swap writes.

Similarly, if the transport is not connected, that will block all
requests, including swap writes, so memory allocations made while
connecting should get __GFP_MEMALLOC if swap writes are possible.

So with this patch:
1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
2/ __rpc_execute() sets PF_MEMALLOC while handling any task
   with RPC_TASK_SWAPPER set, or when handling any task that
   holds the XPRT_LOCKED lock on an xprt used for swap.
   This removes the need for the RPC_IS_SWAPPER() test
   in ->buf_alloc handlers.
3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
   any task to a swapper xprt. __rpc_execute() will clear it.
4/ PF_MEMALLOC is set for all the connect workers, following the
   pattern sketched below.
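
As a sketch, the pattern the connect workers use (distilled from the
hunks below, not new code) is:

	unsigned int pflags = current->flags;

	if (atomic_read(&xprt->swapper))
		current->flags |= PF_MEMALLOC;
	/* ... set up the connection; allocations may now use reserves ... */
	current_restore_flags(pflags, PF_MEMALLOC);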

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/write.c | 2 ++
net/sunrpc/clnt.c | 2 --
net/sunrpc/sched.c | 20 +++++++++++++++++---
net/sunrpc/xprt.c | 3 +++
net/sunrpc/xprtrdma/transport.c | 6 ++++--
net/sunrpc/xprtsock.c | 8 ++++++++
6 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 9b7619ce17a7..0c7a304c9e74 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1408,6 +1408,8 @@ static void nfs_initiate_write(struct nfs_pgio_header *hdr,
{
int priority = flush_task_priority(how);

+ if (IS_SWAPFILE(hdr->inode))
+ task_setup_data->flags |= RPC_TASK_SWAPPER;
task_setup_data->priority = priority;
rpc_ops->write_setup(hdr, msg, &task_setup_data->rpc_client);
trace_nfs_initiate_write(hdr);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 238b2ef5491f..cb76fbea3ed5 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1085,8 +1085,6 @@ void rpc_task_set_client(struct rpc_task *task, struct rpc_clnt *clnt)
task->tk_flags |= RPC_TASK_TIMEOUT;
if (clnt->cl_noretranstimeo)
task->tk_flags |= RPC_TASK_NO_RETRANS_TIMEOUT;
- if (atomic_read(&clnt->cl_swapper))
- task->tk_flags |= RPC_TASK_SWAPPER;
/* Add to the client's list of all tasks */
spin_lock(&clnt->cl_lock);
list_add_tail(&task->tk_task, &clnt->cl_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 256302bf6557..9020cedb7c95 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -869,6 +869,15 @@ void rpc_release_calldata(const struct rpc_call_ops *ops, void *calldata)
ops->rpc_release(calldata);
}

+static bool xprt_needs_memalloc(struct rpc_xprt *xprt, struct rpc_task *tk)
+{
+ if (!xprt)
+ return false;
+ if (!atomic_read(&xprt->swapper))
+ return false;
+ return test_bit(XPRT_LOCKED, &xprt->state) && xprt->snd_task == tk;
+}
+
/*
* This is the RPC `scheduler' (or rather, the finite state machine).
*/
@@ -877,6 +886,7 @@ static void __rpc_execute(struct rpc_task *task)
struct rpc_wait_queue *queue;
int task_is_async = RPC_IS_ASYNC(task);
int status = 0;
+ unsigned long pflags = current->flags;

WARN_ON_ONCE(RPC_IS_QUEUED(task));
if (RPC_IS_QUEUED(task))
@@ -899,6 +909,10 @@ static void __rpc_execute(struct rpc_task *task)
}
if (!do_action)
break;
+ if (RPC_IS_SWAPPER(task) ||
+ xprt_needs_memalloc(task->tk_xprt, task))
+ current->flags |= PF_MEMALLOC;
+
trace_rpc_task_run_action(task, do_action);
do_action(task);

@@ -936,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
rpc_clear_running(task);
spin_unlock(&queue->lock);
if (task_is_async)
- return;
+ goto out;

/* sync task: sleep here */
trace_rpc_task_sync_sleep(task, task->tk_action);
@@ -960,6 +974,8 @@ static void __rpc_execute(struct rpc_task *task)

/* Release all resources associated with the task */
rpc_release_task(task);
+out:
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/*
@@ -1018,8 +1034,6 @@ int rpc_malloc(struct rpc_task *task)

if (RPC_IS_ASYNC(task))
gfp = GFP_NOWAIT | __GFP_NOWARN;
- if (RPC_IS_SWAPPER(task))
- gfp |= __GFP_MEMALLOC;

size += sizeof(struct rpc_buffer);
if (size <= RPC_BUFFER_MAXSIZE)
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index a0a2583fe941..0614e7463d4b 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1492,6 +1492,9 @@ bool xprt_prepare_transmit(struct rpc_task *task)
return false;

}
+ if (atomic_read(&xprt->swapper))
+ /* This will be clear in __rpc_execute */
+ current->flags |= PF_MEMALLOC;
return true;
}

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 32df23796747..256b06a92391 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -239,8 +239,11 @@ xprt_rdma_connect_worker(struct work_struct *work)
struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
rx_connect_worker.work);
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+ unsigned int pflags = current->flags;
int rc;

+ if (atomic_read(&xprt->swapper))
+ current->flags |= PF_MEMALLOC;
rc = rpcrdma_xprt_connect(r_xprt);
xprt_clear_connecting(xprt);
if (!rc) {
@@ -254,6 +257,7 @@ xprt_rdma_connect_worker(struct work_struct *work)
rpcrdma_xprt_disconnect(r_xprt);
xprt_unlock_connect(xprt, r_xprt);
xprt_wake_pending_tasks(xprt, rc);
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/**
@@ -576,8 +580,6 @@ xprt_rdma_allocate(struct rpc_task *task)
flags = RPCRDMA_DEF_GFP;
if (RPC_IS_ASYNC(task))
flags = GFP_NOWAIT | __GFP_NOWARN;
- if (RPC_IS_SWAPPER(task))
- flags |= __GFP_MEMALLOC;

if (!rpcrdma_check_regbuf(r_xprt, req->rl_sendbuf, rqst->rq_callsize,
flags))
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d8ee06a9650a..9d34c71004fa 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2047,7 +2047,10 @@ static void xs_udp_setup_socket(struct work_struct *work)
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock;
int status = -EIO;
+ unsigned int pflags = current->flags;

+ if (atomic_read(&xprt->swapper))
+ current->flags |= PF_MEMALLOC;
sock = xs_create_sock(xprt, transport,
xs_addr(xprt)->sa_family, SOCK_DGRAM,
IPPROTO_UDP, false);
@@ -2067,6 +2070,7 @@ static void xs_udp_setup_socket(struct work_struct *work)
xprt_clear_connecting(xprt);
xprt_unlock_connect(xprt, transport);
xprt_wake_pending_tasks(xprt, status);
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/**
@@ -2226,7 +2230,10 @@ static void xs_tcp_setup_socket(struct work_struct *work)
struct socket *sock = transport->sock;
struct rpc_xprt *xprt = &transport->xprt;
int status;
+ unsigned int pflags = current->flags;

+ if (atomic_read(&xprt->swapper))
+ current->flags |= PF_MEMALLOC;
if (!sock) {
sock = xs_create_sock(xprt, transport,
xs_addr(xprt)->sa_family, SOCK_STREAM,
@@ -2291,6 +2298,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
xprt_clear_connecting(xprt);
out_unlock:
xprt_unlock_connect(xprt, transport);
+ current_restore_flags(pflags, PF_MEMALLOC);
}

/**



2021-12-16 23:54:34

by NeilBrown

[permalink] [raw]
Subject: [PATCH 17/18] NFSv4: keep state manager thread active if swap is enabled

If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.

This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 15 ++++++++++++---
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 ++++++++++++++++++++
fs/nfs/nfs4state.c | 39 +++++++++++++++++++++++++++++++++------
include/linux/nfs_xdr.h | 2 ++
net/sunrpc/clnt.c | 2 ++
6 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 996dfb3c74b2..6ad054b9bbd0 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -490,8 +490,9 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
unsigned long blocks;
long long isize;
int ret;
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;

spin_lock(&inode->i_lock);
blocks = inode->i_blocks;
@@ -512,14 +513,22 @@ static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
}
*span = sis->pages;
sis->flags |= SWP_FS_OPS;
+
+ if (cl->rpc_ops->enable_swap)
+ cl->rpc_ops->enable_swap(inode);
+
return ret;
}

static void nfs_swap_deactivate(struct file *file)
{
- struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
+ struct inode *inode = file_inode(file);
+ struct rpc_clnt *clnt = NFS_CLIENT(inode);
+ struct nfs_client *cl = NFS_SERVER(inode)->nfs_client;

rpc_clnt_swap_deactivate(clnt);
+ if (cl->rpc_ops->disable_swap)
+ cl->rpc_ops->disable_swap(file_inode(file));
}

const struct address_space_operations nfs_file_aops = {
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index ed5eaca6801e..8a9ce0f42efd 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -42,6 +42,7 @@ enum nfs4_client_state {
NFS4CLNT_LEASE_MOVED,
NFS4CLNT_DELEGATION_EXPIRED,
NFS4CLNT_RUN_MANAGER,
+ NFS4CLNT_MANAGER_AVAILABLE,
NFS4CLNT_RECALL_RUNNING,
NFS4CLNT_RECALL_ANY_LAYOUT_READ,
NFS4CLNT_RECALL_ANY_LAYOUT_RW,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ee3bc79f6ca3..ab6382f9cbf0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10347,6 +10347,24 @@ static ssize_t nfs4_listxattr(struct dentry *dentry, char *list, size_t size)
return error + error2 + error3;
}

+static void nfs4_enable_swap(struct inode *inode)
+{
+ /* The state manager thread must always be running.
+ * It will notice the client is a swapper, and stay put.
+ */
+ struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;
+
+ nfs4_schedule_state_manager(clp);
+}
+
+static void nfs4_disable_swap(struct inode *inode)
+{
+ /* The state manager thread will now exit once it is
+ * woken.
+ */
+ wake_up_var(&NFS_SERVER(inode)->nfs_client->cl_state);
+}
+
static const struct inode_operations nfs4_dir_inode_operations = {
.create = nfs_create,
.lookup = nfs_lookup,
@@ -10423,6 +10441,8 @@ const struct nfs_rpc_ops nfs_v4_clientops = {
.free_client = nfs4_free_client,
.create_server = nfs4_create_server,
.clone_server = nfs_clone_server,
+ .enable_swap = nfs4_enable_swap,
+ .disable_swap = nfs4_disable_swap,
};

static const struct xattr_handler nfs4_xattr_nfs4_acl_handler = {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index f63dfa01001c..ebe470e6aa8f 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1205,10 +1205,17 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
{
struct task_struct *task;
char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;

set_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state);
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+ if (test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state) != 0) {
+ wake_up_var(&clp->cl_state);
return;
+ }
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
__module_get(THIS_MODULE);
refcount_inc(&clp->cl_count);

@@ -1224,6 +1231,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
printk(KERN_ERR "%s: kthread_run: %ld\n",
__func__, PTR_ERR(task));
nfs4_clear_state_manager_bit(clp);
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
nfs_put_client(clp);
module_put(THIS_MODULE);
}
@@ -2665,11 +2673,8 @@ static void nfs4_state_manager(struct nfs_client *clp)
clear_bit(NFS4CLNT_RECALL_RUNNING, &clp->cl_state);
}

- /* Did we race with an attempt to give us more work? */
- if (!test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
- return;
- if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
- return;
+ return;
+
} while (refcount_read(&clp->cl_count) > 1 && !signalled());
goto out_drain;

@@ -2689,9 +2694,31 @@ static void nfs4_state_manager(struct nfs_client *clp)
static int nfs4_run_state_manager(void *ptr)
{
struct nfs_client *clp = ptr;
+ struct rpc_clnt *cl = clp->cl_rpcclient;
+
+ while (cl != cl->cl_parent)
+ cl = cl->cl_parent;

allow_signal(SIGKILL);
+again:
+ set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
nfs4_state_manager(clp);
+ if (atomic_read(&cl->cl_swapper)) {
+ wait_var_event_interruptible(&clp->cl_state,
+ test_bit(NFS4CLNT_RUN_MANAGER,
+ &clp->cl_state));
+ if (atomic_read(&cl->cl_swapper) &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state))
+ goto again;
+ /* Either no longer a swapper, or were signalled */
+ }
+ clear_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state);
+
+ if (refcount_read(&clp->cl_count) > 1 && !signalled() &&
+ test_bit(NFS4CLNT_RUN_MANAGER, &clp->cl_state) &&
+ !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE, &clp->cl_state))
+ goto again;
+
nfs_put_client(clp);
module_put_and_exit(0);
return 0;
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 967a0098f0a9..04cf3a8fb949 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1795,6 +1795,8 @@ struct nfs_rpc_ops {
struct nfs_server *(*create_server)(struct fs_context *);
struct nfs_server *(*clone_server)(struct nfs_server *, struct nfs_fh *,
struct nfs_fattr *, rpc_authflavor_t);
+ void (*enable_swap)(struct inode *inode);
+ void (*disable_swap)(struct inode *inode);
};

/*
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cb76fbea3ed5..4cb403a0f334 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -3066,6 +3066,8 @@ rpc_clnt_swap_activate_callback(struct rpc_clnt *clnt,
int
rpc_clnt_swap_activate(struct rpc_clnt *clnt)
{
+ while (clnt != clnt->cl_parent)
+ clnt = clnt->cl_parent;
if (atomic_inc_return(&clnt->cl_swapper) == 1)
return rpc_clnt_iterate_for_each_xprt(clnt,
rpc_clnt_swap_activate_callback, NULL);



2021-12-16 23:54:39

by NeilBrown

[permalink] [raw]
Subject: [PATCH 18/18] NFS: swap-out must always use STABLE writes.

The commit handling code is not safe against memory-pressure deadlocks
when writing to swap. In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.

swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does. However if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.

For safety, make it explicit in the code that direct writes used for swap
must always use FLUSH_STABLE.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/direct.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eeff1b4e1a7c..1317465150a6 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -790,7 +790,7 @@ static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops = {
*/
static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
struct iov_iter *iter,
- loff_t pos)
+ loff_t pos, int ioflags)
{
struct nfs_pageio_descriptor desc;
struct inode *inode = dreq->inode;
@@ -798,7 +798,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
size_t requested_bytes = 0;
size_t wsize = max_t(size_t, NFS_SERVER(inode)->wsize, PAGE_SIZE);

- nfs_pageio_init_write(&desc, inode, FLUSH_COND_STABLE, false,
+ nfs_pageio_init_write(&desc, inode, ioflags, false,
&nfs_direct_write_completion_ops);
desc.pg_dreq = dreq;
get_dreq(dreq);
@@ -904,6 +904,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
struct nfs_direct_req *dreq;
struct nfs_lock_context *l_ctx;
loff_t pos, end;
+ int ioflags = swap ? FLUSH_STABLE : FLUSH_COND_STABLE;

dfprintk(FILE, "NFS: direct write(%pD2, %zd@%Ld)\n",
file, iov_iter_count(iter), (long long) iocb->ki_pos);
@@ -946,7 +947,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
if (!swap)
nfs_start_io_direct(inode);

- requested = nfs_direct_write_schedule_iovec(dreq, iter, pos);
+ requested = nfs_direct_write_schedule_iovec(dreq, iter, pos, ioflags);

if (mapping->nrpages) {
invalidate_inode_pages2_range(mapping,



2021-12-17 07:10:05

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 06/18] MM: submit multipage reads for SWP_FS_OPS swap-space

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next rostedt-trace/for-next linus/master v5.16-rc5]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master mszeredi-vfs/overlayfs-next next-20211216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base: git://git.samba.org/sfrench/cifs-2.6.git for-next
config: nds32-allnoconfig (https://download.01.org/0day-ci/archive/20211217/[email protected]/config)
compiler: nds32le-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/d34716a962c31e9e0a6e40a702e581a02b7e29f7
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
git checkout d34716a962c31e9e0a6e40a702e581a02b7e29f7
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=nds32 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:3541:33: error: too many arguments to function 'swap_readpage'
3541 | swap_readpage(page, true, NULL);
| ^~~~~~~~~~~~~
In file included from mm/memory.c:88:
mm/swap.h:61:19: note: declared here
61 | static inline int swap_readpage(struct page *page, bool do_poll)
| ^~~~~~~~~~~~~
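
This config builds with CONFIG_SWAP disabled, so the two-argument
declaration at mm/swap.h:61 is presumably the !CONFIG_SWAP stub, which
was not updated to match the new three-argument callers. A plausible
fix (an assumption, not taken from the series) would be:

	/* mm/swap.h, !CONFIG_SWAP stub */
	static inline int swap_readpage(struct page *page, bool do_poll,
					struct swap_iocb **plug)
	{
		return 0;
	}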


vim +/swap_readpage +3541 mm/memory.c

3462
3463 /*
3464 * We enter with non-exclusive mmap_lock (to exclude vma changes,
3465 * but allow concurrent faults), and pte mapped but not yet locked.
3466 * We return with pte unmapped and unlocked.
3467 *
3468 * We return with the mmap_lock locked or unlocked in the same cases
3469 * as does filemap_fault().
3470 */
3471 vm_fault_t do_swap_page(struct vm_fault *vmf)
3472 {
3473 struct vm_area_struct *vma = vmf->vma;
3474 struct page *page = NULL, *swapcache;
3475 struct swap_info_struct *si = NULL;
3476 swp_entry_t entry;
3477 pte_t pte;
3478 int locked;
3479 int exclusive = 0;
3480 vm_fault_t ret = 0;
3481 void *shadow = NULL;
3482
3483 if (!pte_unmap_same(vmf))
3484 goto out;
3485
3486 entry = pte_to_swp_entry(vmf->orig_pte);
3487 if (unlikely(non_swap_entry(entry))) {
3488 if (is_migration_entry(entry)) {
3489 migration_entry_wait(vma->vm_mm, vmf->pmd,
3490 vmf->address);
3491 } else if (is_device_exclusive_entry(entry)) {
3492 vmf->page = pfn_swap_entry_to_page(entry);
3493 ret = remove_device_exclusive_entry(vmf);
3494 } else if (is_device_private_entry(entry)) {
3495 vmf->page = pfn_swap_entry_to_page(entry);
3496 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
3497 } else if (is_hwpoison_entry(entry)) {
3498 ret = VM_FAULT_HWPOISON;
3499 } else {
3500 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
3501 ret = VM_FAULT_SIGBUS;
3502 }
3503 goto out;
3504 }
3505
3506 /* Prevent swapoff from happening to us. */
3507 si = get_swap_device(entry);
3508 if (unlikely(!si))
3509 goto out;
3510
3511 delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
3512 page = lookup_swap_cache(entry, vma, vmf->address);
3513 swapcache = page;
3514
3515 if (!page) {
3516 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
3517 __swap_count(entry) == 1) {
3518 /* skip swapcache */
3519 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
3520 vmf->address);
3521 if (page) {
3522 __SetPageLocked(page);
3523 __SetPageSwapBacked(page);
3524
3525 if (mem_cgroup_swapin_charge_page(page,
3526 vma->vm_mm, GFP_KERNEL, entry)) {
3527 ret = VM_FAULT_OOM;
3528 goto out_page;
3529 }
3530 mem_cgroup_swapin_uncharge_swap(entry);
3531
3532 shadow = get_shadow_from_swap_cache(entry);
3533 if (shadow)
3534 workingset_refault(page_folio(page),
3535 shadow);
3536
3537 lru_cache_add(page);
3538
3539 /* To provide entry to swap_readpage() */
3540 set_page_private(page, entry.val);
> 3541 swap_readpage(page, true, NULL);
3542 set_page_private(page, 0);
3543 }
3544 } else {
3545 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
3546 vmf);
3547 swapcache = page;
3548 }
3549
3550 if (!page) {
3551 /*
3552 * Back out if somebody else faulted in this pte
3553 * while we released the pte lock.
3554 */
3555 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
3556 vmf->address, &vmf->ptl);
3557 if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
3558 ret = VM_FAULT_OOM;
3559 delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
3560 goto unlock;
3561 }
3562
3563 /* Had to read the page from swap area: Major fault */
3564 ret = VM_FAULT_MAJOR;
3565 count_vm_event(PGMAJFAULT);
3566 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
3567 } else if (PageHWPoison(page)) {
3568 /*
3569 * hwpoisoned dirty swapcache pages are kept for killing
3570 * owner processes (which may be unknown at hwpoison time)
3571 */
3572 ret = VM_FAULT_HWPOISON;
3573 delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
3574 goto out_release;
3575 }
3576
3577 locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
3578
3579 delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
3580 if (!locked) {
3581 ret |= VM_FAULT_RETRY;
3582 goto out_release;
3583 }
3584
3585 /*
3586 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
3587 * release the swapcache from under us. The page pin, and pte_same
3588 * test below, are not enough to exclude that. Even if it is still
3589 * swapcache, we need to check that the page's swap has not changed.
3590 */
3591 if (unlikely((!PageSwapCache(page) ||
3592 page_private(page) != entry.val)) && swapcache)
3593 goto out_page;
3594
3595 page = ksm_might_need_to_copy(page, vma, vmf->address);
3596 if (unlikely(!page)) {
3597 ret = VM_FAULT_OOM;
3598 page = swapcache;
3599 goto out_page;
3600 }
3601
3602 cgroup_throttle_swaprate(page, GFP_KERNEL);
3603
3604 /*
3605 * Back out if somebody else already faulted in this pte.
3606 */
3607 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
3608 &vmf->ptl);
3609 if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
3610 goto out_nomap;
3611
3612 if (unlikely(!PageUptodate(page))) {
3613 ret = VM_FAULT_SIGBUS;
3614 goto out_nomap;
3615 }
3616
3617 /*
3618 * The page isn't present yet, go ahead with the fault.
3619 *
3620 * Be careful about the sequence of operations here.
3621 * To get its accounting right, reuse_swap_page() must be called
3622 * while the page is counted on swap but not yet in mapcount i.e.
3623 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
3624 * must be called after the swap_free(), or it will never succeed.
3625 */
3626
3627 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
3628 dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
3629 pte = mk_pte(page, vma->vm_page_prot);
3630 if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
3631 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
3632 vmf->flags &= ~FAULT_FLAG_WRITE;
3633 ret |= VM_FAULT_WRITE;
3634 exclusive = RMAP_EXCLUSIVE;
3635 }
3636 flush_icache_page(vma, page);
3637 if (pte_swp_soft_dirty(vmf->orig_pte))
3638 pte = pte_mksoft_dirty(pte);
3639 if (pte_swp_uffd_wp(vmf->orig_pte)) {
3640 pte = pte_mkuffd_wp(pte);
3641 pte = pte_wrprotect(pte);
3642 }
3643 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
3644 arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
3645 vmf->orig_pte = pte;
3646
3647 /* ksm created a completely new copy */
3648 if (unlikely(page != swapcache && swapcache)) {
3649 page_add_new_anon_rmap(page, vma, vmf->address, false);
3650 lru_cache_add_inactive_or_unevictable(page, vma);
3651 } else {
3652 do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
3653 }
3654
3655 swap_free(entry);
3656 if (mem_cgroup_swap_full(page) ||
3657 (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
3658 try_to_free_swap(page);
3659 unlock_page(page);
3660 if (page != swapcache && swapcache) {
3661 /*
3662 * Hold the lock to avoid the swap entry to be reused
3663 * until we take the PT lock for the pte_same() check
3664 * (to avoid false positives from pte_same). For
3665 * further safety release the lock after the swap_free
3666 * so that the swap count won't change under a
3667 * parallel locked swapcache.
3668 */
3669 unlock_page(swapcache);
3670 put_page(swapcache);
3671 }
3672
3673 if (vmf->flags & FAULT_FLAG_WRITE) {
3674 ret |= do_wp_page(vmf);
3675 if (ret & VM_FAULT_ERROR)
3676 ret &= VM_FAULT_ERROR;
3677 goto out;
3678 }
3679
3680 /* No need to invalidate - it was non-present before */
3681 update_mmu_cache(vma, vmf->address, vmf->pte);
3682 unlock:
3683 pte_unmap_unlock(vmf->pte, vmf->ptl);
3684 out:
3685 if (si)
3686 put_swap_device(si);
3687 return ret;
3688 out_nomap:
3689 pte_unmap_unlock(vmf->pte, vmf->ptl);
3690 out_page:
3691 unlock_page(page);
3692 out_release:
3693 put_page(page);
3694 if (page != swapcache && swapcache) {
3695 unlock_page(swapcache);
3696 put_page(swapcache);
3697 }
3698 if (si)
3699 put_swap_device(si);
3700 return ret;
3701 }
3702

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

2021-12-17 08:52:08

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next rostedt-trace/for-next linus/master v5.16-rc5]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master mszeredi-vfs/overlayfs-next next-20211216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base: git://git.samba.org/sfrench/cifs-2.6.git for-next
config: arm-randconfig-r006-20211216 (https://download.01.org/0day-ci/archive/20211217/[email protected]/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9043c3d65b11b442226015acfbf8167684586cfa)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/0day-ci/linux/commit/a8e1b1ffec6ade1545df519d254eae0400b7ec37
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
git checkout a8e1b1ffec6ade1545df519d254eae0400b7ec37
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> mm/vmscan.c:1480:20: error: implicit declaration of function 'page_swap_info' [-Werror,-Wimplicit-function-declaration]
return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
^
mm/vmscan.c:1480:20: note: did you mean 'swp_swap_info'?
include/linux/swap.h:487:40: note: 'swp_swap_info' declared here
static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
^
>> mm/vmscan.c:1480:42: error: member reference type 'int' is not a pointer
return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
~~~~~~~~~~~~~~~~~~~~ ^
include/linux/compiler.h:216:28: note: expanded from macro 'data_race'
__unqual_scalar_typeof(({ expr; })) __v = ({ \
^~~~
include/linux/compiler_types.h:291:13: note: expanded from macro '__unqual_scalar_typeof'
_Generic((x), \
^
>> mm/vmscan.c:1480:20: error: implicit declaration of function 'page_swap_info' [-Werror,-Wimplicit-function-declaration]
return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
^
>> mm/vmscan.c:1480:42: error: member reference type 'int' is not a pointer
return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
~~~~~~~~~~~~~~~~~~~~ ^
include/linux/compiler.h:216:28: note: expanded from macro 'data_race'
__unqual_scalar_typeof(({ expr; })) __v = ({ \
^~~~
include/linux/compiler_types.h:298:15: note: expanded from macro '__unqual_scalar_typeof'
default: (x)))
^
>> mm/vmscan.c:1480:20: error: implicit declaration of function 'page_swap_info' [-Werror,-Wimplicit-function-declaration]
return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
^
>> mm/vmscan.c:1480:42: error: member reference type 'int' is not a pointer
return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
~~~~~~~~~~~~~~~~~~~~ ^
include/linux/compiler.h:218:3: note: expanded from macro 'data_race'
expr; \
^~~~
>> mm/vmscan.c:1480:9: error: invalid argument type 'void' to unary expression
return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7 errors generated.


vim +/page_swap_info +1480 mm/vmscan.c

1467
1468 static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
1469 {
1470 if (gfp_mask & __GFP_FS)
1471 return true;
1472 if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
1473 return false;
1474 /* We can "enter_fs" for swap-cache with only __GFP_IO
1475 * providing this isn't SWP_FS_OPS.
1476 * ->flags can be updated non-atomicially (scan_swap_map_slots),
1477 * but that will never affect SWP_FS_OPS, so the data_race
1478 * is safe.
1479 */
> 1480 return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);
1481 }
1482
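
One possible reading of this failure (an assumption, not confirmed by the report) is that the arm randconfig has CONFIG_SWAP disabled, so page_swap_info() has no visible declaration when mm/vmscan.c is built. A sketch of one way to address that, using a hypothetical page_swap_flags() wrapper in the new mm/swap.h rather than anything taken from the series itself:

#ifdef CONFIG_SWAP
static inline unsigned long page_swap_flags(struct page *page)
{
	/* flags of the swap device backing this swap-cache page */
	return page_swap_info(page)->flags;
}
#else
static inline unsigned long page_swap_flags(struct page *page)
{
	return 0;
}
#endif

test_may_enter_fs() could then end with
	return !data_race(page_swap_flags(page) & SWP_FS_OPS);
so that !CONFIG_SWAP builds never reference page_swap_info() directly.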

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

2021-12-17 10:03:28

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 02/18] MM: create new mm/swap.h header file.

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next rostedt-trace/for-next linus/master v5.16-rc5]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master next-20211216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base: git://git.samba.org/sfrench/cifs-2.6.git for-next
config: hexagon-randconfig-r045-20211216 (https://download.01.org/0day-ci/archive/20211217/[email protected]/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9043c3d65b11b442226015acfbf8167684586cfa)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/3dd9e64650d0340fd6469ba4f8abc183bb2eea15
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
git checkout 3dd9e64650d0340fd6469ba4f8abc183bb2eea15
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All error/warnings (new ones prefixed by >>):

>> mm/memcontrol.c:5532:9: error: implicit declaration of function 'find_get_incore_page' [-Werror,-Wimplicit-function-declaration]
return find_get_incore_page(vma->vm_file->f_mapping,
^
>> mm/memcontrol.c:5532:9: warning: incompatible integer to pointer conversion returning 'int' from a function with result type 'struct page *' [-Wint-conversion]
return find_get_incore_page(vma->vm_file->f_mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.


vim +/find_get_incore_page +5532 mm/memcontrol.c

90254a65833b67 Daisuke Nishimura 2010-05-26 5521
87946a72283be3 Daisuke Nishimura 2010-05-26 5522 static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
48384b0b76f366 Peter Xu 2021-11-05 5523 unsigned long addr, pte_t ptent)
87946a72283be3 Daisuke Nishimura 2010-05-26 5524 {
87946a72283be3 Daisuke Nishimura 2010-05-26 5525 if (!vma->vm_file) /* anonymous vma */
87946a72283be3 Daisuke Nishimura 2010-05-26 5526 return NULL;
1dfab5abcdd404 Johannes Weiner 2015-02-11 5527 if (!(mc.flags & MOVE_FILE))
87946a72283be3 Daisuke Nishimura 2010-05-26 5528 return NULL;
87946a72283be3 Daisuke Nishimura 2010-05-26 5529
87946a72283be3 Daisuke Nishimura 2010-05-26 5530 /* page is moved even if it's not RSS of this task(page-faulted). */
aa3b189551ad8e Hugh Dickins 2011-08-03 5531 /* shmem/tmpfs may report page out on swap: account for that too. */
f5df8635c5a3c9 Matthew Wilcox (Oracle 2020-10-13 @5532) return find_get_incore_page(vma->vm_file->f_mapping,
f5df8635c5a3c9 Matthew Wilcox (Oracle 2020-10-13 5533) linear_page_index(vma, addr));
87946a72283be3 Daisuke Nishimura 2010-05-26 5534 }
87946a72283be3 Daisuke Nishimura 2010-05-26 5535
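
The likely cause here (an assumption, though consistent with Christoph's later remark about "the extra includes that the buildbot asks for") is that find_get_incore_page() is now declared only in the new mm/swap.h, which mm/memcontrol.c does not yet include. A minimal sketch of the fix:

/* mm/memcontrol.c: pick up declarations that moved out of <linux/swap.h> */
#include "swap.h"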

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

2021-12-17 10:34:15

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 01/18] Structural cleanup for filesystem-based swap

Hi NeilBrown,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on cifs/for-next]
[also build test ERROR on axboe-block/for-next mszeredi-vfs/overlayfs-next rostedt-trace/for-next linus/master v5.16-rc5 next-20211216]
[cannot apply to trondmy-nfs/linux-next hnaz-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/NeilBrown/Repair-SWAP-over-NFS/20211217-075659
base: git://git.samba.org/sfrench/cifs-2.6.git for-next
config: arm-randconfig-r005-20211216 (https://download.01.org/0day-ci/archive/20211217/[email protected]/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 9043c3d65b11b442226015acfbf8167684586cfa)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/0day-ci/linux/commit/6443c9d01129c1a1c19f3df4a594b01e3772e6bd
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review NeilBrown/Repair-SWAP-over-NFS/20211217-075659
git checkout 6443c9d01129c1a1c19f3df4a594b01e3772e6bd
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/nfs/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> fs/nfs/file.c:512:8: error: implicit declaration of function 'add_swap_extent' [-Werror,-Wimplicit-function-declaration]
ret = add_swap_extent(sis, 0, sis->max, 0);
^
1 error generated.


vim +/add_swap_extent +512 fs/nfs/file.c

486
487 static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
488 sector_t *span)
489 {
490 unsigned long blocks;
491 long long isize;
492 int ret;
493 struct rpc_clnt *clnt = NFS_CLIENT(file->f_mapping->host);
494 struct inode *inode = file->f_mapping->host;
495
496 if (!file->f_mapping->a_ops->swap_rw)
497 /* Cannot support swap */
498 return -EINVAL;
499
500 spin_lock(&inode->i_lock);
501 blocks = inode->i_blocks;
502 isize = inode->i_size;
503 spin_unlock(&inode->i_lock);
504 if (blocks*512 < isize) {
505 pr_warn("swap activate: swapfile has holes\n");
506 return -EINVAL;
507 }
508
509 ret = rpc_clnt_swap_activate(clnt);
510 if (ret)
511 return ret;
> 512 ret = add_swap_extent(sis, 0, sis->max, 0);
513 if (ret < 0) {
514 rpc_clnt_swap_deactivate(clnt);
515 return ret;
516 }
517 *span = sis->pages;
518 sis->flags |= SWP_FS_OPS;
519 return ret;
520 }
521
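
add_swap_extent() is declared in <linux/swap.h> only under CONFIG_SWAP, so a swap-less config now fails to build fs/nfs/file.c once ->swap_activate() is required to call it. A sketch of one possible fix, assuming that is what this randconfig hits, is a stub alongside the other !CONFIG_SWAP helpers in <linux/swap.h>:

#ifndef CONFIG_SWAP
static inline int add_swap_extent(struct swap_info_struct *sis,
				  unsigned long start_page,
				  unsigned long nr_pages,
				  sector_t start_block)
{
	/* swap can never be activated, so extent setup must fail */
	return -EINVAL;
}
#endif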

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

2021-12-17 21:29:23

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH 00/18 V2] Repair SWAP-over-NFS

Hi Neil,

On Thu, Dec 16, 2021 at 7:07 PM NeilBrown <[email protected]> wrote:
>
> swap-over-NFS currently has a variety of problems.
>
> swap writes call generic_write_checks(), which always fails on a swap
> file, so it completely fails.
> Even without this, various deadlocks are possible - largely due to
> improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> which weren't tested against swap-out.
>
> NFS is the only filesystem that has supported fs-based swap IO, and it
> hasn't worked for several releases, so now is a convenient time to clean
> up the swap-via-filesystem interfaces - we cannot break anything !
>
> So the first few patches here clean up and improve various parts of the
> swap-via-filesystem code. ->activate_swap() is given a cleaner
> interface, a new ->swap_rw is introduced instead of burdening
> ->direct_IO, etc.
>
> Current swap-to-filesystem code only ever submits single-page reads and
> writes. These patches change that to allow multi-page IO when adjacent
> requests are submitted. Writes are also changed to be async rather than
> sync. This substantially speeds up write throughput for swap-over-NFS.
>
> Some of the NFS patches can land independently of the MM patches. A few
> require the MM patches to land first.

Thanks for fixing swap-over-NFS! Looks like it passes all the
swap-related xfstests except for generic/357 on NFS v4.2. This test
checks that we get -EINVAL on a reflinked swapfile. I'm not sure if
there is a way to check for that on the client side, but if you have
any ideas it would be nice to get that test passing while you're at it!

Anna

>
> Thanks,
> NeilBrown
>
>
> ---
>
> NeilBrown (18):
> Structural cleanup for filesystem-based swap
> MM: create new mm/swap.h header file.
> MM: use ->swap_rw for reads from SWP_FS_OPS swap-space
> MM: perform async writes to SWP_FS_OPS swap-space
> MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space
> MM: submit multipage reads for SWP_FS_OPS swap-space
> MM: submit multipage write for SWP_FS_OPS swap-space
> MM: Add AS_CAN_DIO mapping flag
> NFS: rename nfs_direct_IO and use as ->swap_rw
> NFS: swap IO handling is slightly different for O_DIRECT IO
> SUNRPC/call_alloc: async tasks mustn't block waiting for memory
> SUNRPC/auth: async tasks mustn't block waiting for memory
> SUNRPC/xprt: async tasks mustn't block waiting for memory
> SUNRPC: remove scheduling boost for "SWAPPER" tasks.
> NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
> SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
> NFSv4: keep state manager thread active if swap is enabled
> NFS: swap-out must always use STABLE writes.
>
>
> drivers/block/loop.c | 4 +-
> fs/fcntl.c | 5 +-
> fs/inode.c | 3 +
> fs/nfs/direct.c | 56 ++++++----
> fs/nfs/file.c | 25 +++--
> fs/nfs/inode.c | 1 +
> fs/nfs/nfs4_fs.h | 1 +
> fs/nfs/nfs4proc.c | 20 ++++
> fs/nfs/nfs4state.c | 39 ++++++-
> fs/nfs/read.c | 4 -
> fs/nfs/write.c | 2 +
> fs/open.c | 2 +-
> fs/overlayfs/file.c | 10 +-
> include/linux/fs.h | 2 +-
> include/linux/nfs_fs.h | 11 +-
> include/linux/nfs_xdr.h | 2 +
> include/linux/pagemap.h | 3 +-
> include/linux/sunrpc/auth.h | 1 +
> include/linux/sunrpc/sched.h | 1 -
> include/linux/swap.h | 121 --------------------
> include/linux/writeback.h | 7 ++
> include/trace/events/sunrpc.h | 1 -
> mm/madvise.c | 9 +-
> mm/memory.c | 3 +-
> mm/mincore.c | 1 +
> mm/page_alloc.c | 1 +
> mm/page_io.c | 189 ++++++++++++++++++++++++++------
> mm/shmem.c | 1 +
> mm/swap.h | 140 +++++++++++++++++++++++
> mm/swap_state.c | 32 ++++--
> mm/swapfile.c | 6 +
> mm/util.c | 1 +
> mm/vmscan.c | 31 +++++-
> net/sunrpc/auth.c | 8 +-
> net/sunrpc/auth_gss/auth_gss.c | 6 +-
> net/sunrpc/auth_unix.c | 10 +-
> net/sunrpc/clnt.c | 7 +-
> net/sunrpc/sched.c | 29 +++--
> net/sunrpc/xprt.c | 19 ++--
> net/sunrpc/xprtrdma/transport.c | 10 +-
> net/sunrpc/xprtsock.c | 8 ++
> 41 files changed, 558 insertions(+), 274 deletions(-)
> create mode 100644 mm/swap.h
>
> --
> Signature
>

2021-12-19 13:38:54

by Mark Hemment

[permalink] [raw]
Subject: Re: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag

On Thu, 16 Dec 2021 at 23:57, NeilBrown <[email protected]> wrote:
>
> Currently various places test if direct IO is possible on a file by
> checking for the existence of the direct_IO address space operation.
> This is a poor choice, as the direct_IO operation may not be used - it is
> only used if the generic_file_*_iter functions are called for direct IO
> and some filesystems - particularly NFS - don't do this.
>
> Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
> places to check this (avoiding a pointer dereference).
> unlock_new_inode() will set this flag if ->direct_IO is present, so
> filesystems do not need to be changed.
...
> diff --git a/fs/inode.c b/fs/inode.c
> index 6b80a51129d5..bae65ccecdb1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1008,6 +1008,9 @@ EXPORT_SYMBOL(lockdep_annotate_inode_mutex_key);
> void unlock_new_inode(struct inode *inode)
> {
> lockdep_annotate_inode_mutex_key(inode);
> + if (inode->i_mapping->a_ops &&
> + inode->i_mapping->a_ops->direct_IO)
> + set_bit(AS_CAN_DIO, &inode->i_mapping->flags);
> spin_lock(&inode->i_lock);
> WARN_ON(!(inode->i_state & I_NEW));
> inode->i_state &= ~I_NEW & ~I_CREATING;

Does d_instantiate_new() also need to set AS_CAN_DIO?

Mark

2021-12-19 21:00:07

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag

On Mon, 20 Dec 2021, Mark Hemment wrote:
> On Thu, 16 Dec 2021 at 23:57, NeilBrown <[email protected]> wrote:
> >
> > Currently various places test if direct IO is possible on a file by
> > checking for the existence of the direct_IO address space operation.
> > This is a poor choice, as the direct_IO operation may not be used - it is
> > only used if the generic_file_*_iter functions are called for direct IO
> > and some filesystems - particularly NFS - don't do this.
> >
> > Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
> > places to check this (avoiding a pointer dereference).
> > unlock_new_inode() will set this flag if ->direct_IO is present, so
> > filesystems do not need to be changed.
> ...
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 6b80a51129d5..bae65ccecdb1 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1008,6 +1008,9 @@ EXPORT_SYMBOL(lockdep_annotate_inode_mutex_key);
> > void unlock_new_inode(struct inode *inode)
> > {
> > lockdep_annotate_inode_mutex_key(inode);
> > + if (inode->i_mapping->a_ops &&
> > + inode->i_mapping->a_ops->direct_IO)
> > + set_bit(AS_CAN_DIO, &inode->i_mapping->flags);
> > spin_lock(&inode->i_lock);
> > WARN_ON(!(inode->i_state & I_NEW));
> > inode->i_state &= ~I_NEW & ~I_CREATING;
>
> Does d_instantiate_new() also need to set AS_CAN_DIO?

Yes it does - thanks for catching that. I'll update my patch.
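
(For illustration only, a sketch of what that update might look like, mirroring the unlock_new_inode() hunk quoted above rather than the actual revised patch:)

	/* in d_instantiate_new(), alongside lockdep_annotate_inode_mutex_key() */
	if (inode->i_mapping->a_ops &&
	    inode->i_mapping->a_ops->direct_IO)
		set_bit(AS_CAN_DIO, &inode->i_mapping->flags);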

Thanks,
NeilBrown

2021-12-19 21:07:28

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 00/18 V2] Repair SWAP-over-NFS

On Sat, 18 Dec 2021, Anna Schumaker wrote:
> Hi Neil,
>
> On Thu, Dec 16, 2021 at 7:07 PM NeilBrown <[email protected]> wrote:
> >
> > swap-over-NFS currently has a variety of problems.
> >
> > swap writes call generic_write_checks(), which always fails on a swap
> > file, so it completely fails.
> > Even without this, various deadlocks are possible - largely due to
> > improvements in NFS memory allocation (using NOFS instead of ATOMIC)
> > which weren't tested against swap-out.
> >
> > NFS is the only filesystem that has supported fs-based swap IO, and it
> > hasn't worked for several releases, so now is a convenient time to clean
> > up the swap-via-filesystem interfaces - we cannot break anything !
> >
> > So the first few patches here clean up and improve various parts of the
> > swap-via-filesystem code. ->activate_swap() is given a cleaner
> > interface, a new ->swap_rw is introduced instead of burdening
> > ->direct_IO, etc.
> >
> > Current swap-to-filesystem code only ever submits single-page reads and
> > writes. These patches change that to allow multi-page IO when adjacent
> > requests are submitted. Writes are also changed to be async rather than
> > sync. This substantially speeds up write throughput for swap-over-NFS.
> >
> > Some of the NFS patches can land independently of the MM patches. A few
> > require the MM patches to land first.
>
> Thanks for fixing swap-over-NFS! Looks like it passes all the
> swap-related xfstests except for generic/357 on NFS v4.2. This test
> checks that we get -EINVAL on a reflinked swapfile. I'm not sure if
> there is a way to check for that on the client side, but if you have
> any ideas it would be nice to get that test passing while you're at it!

Thanks for testing!
I think that testing that swap fails on a reflinked file is bogus. This
isn't an important part of the API; it is just an internal
implementation detail.
I certainly understand that it could be problematic implementing swap on
a reflinked file within XFS and it is perfectly acceptable to fail such
a request. But if one day someone decided to implement it - should that
be seen as a regression?

Certainly over NFS there is no reason at all not to swap to a file that
happens to be reflinked on the server.
I don't think it even makes sense to test if the file has holes as the
current nfs_swap_activate() does. I don't exactly object to the test,
but I think it is misguided and pointless.

Thanks,
NeilBrown

2021-12-20 12:16:43

by Mark Hemment

[permalink] [raw]
Subject: Re: [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space

On Thu, 16 Dec 2021 at 23:54, NeilBrown <[email protected]> wrote:
>
> To submit an async read with ->swap_rw() we need to allocate
> a structure to hold the kiocb and other details. swap_readpage() cannot
> handle transient failure, so create a mempool to provide the structures.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> mm/page_io.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++------
> mm/swap.h | 1 +
> mm/swapfile.c | 5 +++++
> 3 files changed, 58 insertions(+), 6 deletions(-)
...
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f23d9ff21cf8..43539be38e68 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2401,6 +2401,11 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
> if (ret < 0)
> return ret;
> sis->flags |= SWP_ACTIVATED;
> + if ((sis->flags & SWP_FS_OPS) &&
> + sio_pool_init() != 0) {
> + destroy_swap_extents(sis);
> + return -ENOMEM;
> + }
> return ret;
> }

This code is called before 'swapon_mutex' is taken in the swapon code
path, so it is possible for multiple swapons to race here and create
two (or more) memory pools.

Mark

2021-12-20 12:21:54

by Mark Hemment

[permalink] [raw]
Subject: Re: [PATCH 07/18] MM: submit multipage write for SWP_FS_OPS swap-space

On Thu, 16 Dec 2021 at 23:56, NeilBrown <[email protected]> wrote:
>
> swap_writepage() is given one page at a time, but may be called repeatedly
> in succession.
> For block-device swapspace, the blk_plug functionality allows the
> multiple pages to be combined together at lower layers.
> That cannot be used for SWP_FS_OPS as blk_plug may not exist - it is
> only active when CONFIG_BLOCK=y. Consequently all swap writes over NFS
> are single page writes.
>
> With this patch we pass a pointer-to-pointer via the wbc.
> swap_writepage can store state between calls - much like the pointer
> passed explicitly to swap_readpage. After calling swap_writepage() some
> number of times, the state will be passed to swap_write_unplug() which
> can submit the combined request.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> include/linux/writeback.h | 7 +++
> mm/page_io.c | 98 ++++++++++++++++++++++++++++++---------------
> mm/swap.h | 1
> mm/vmscan.c | 9 +++-
> 4 files changed, 80 insertions(+), 35 deletions(-)
...
> +void swap_write_unplug(struct swap_iocb *sio)
> +{
> + struct iov_iter from;
> + struct address_space *mapping = sio->iocb.ki_filp->f_mapping;
> + int ret;
> +
> + iov_iter_bvec(&from, WRITE, sio->bvec, sio->pages,
> + PAGE_SIZE * sio->pages);
> + ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
> + if (ret != -EIOCBQUEUED)
> + sio_write_complete(&sio->iocb, ret);
> +}
> +

As swap_write_unplug() is called from vmscan.c, a 'no-op' version is
needed (in "swap.h") for when !CONFIG_SWAP.
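
A sketch of such a stub, assuming the !CONFIG_SWAP section of "swap.h" is the right home for it:

#ifndef CONFIG_SWAP
struct swap_iocb;
static inline void swap_write_unplug(struct swap_iocb *sio)
{
	/* nothing to submit when swap is not configured */
}
#endif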

Mark

2021-12-20 15:23:18

by Mark Hemment

[permalink] [raw]
Subject: Re: [PATCH 10/18] NFS: swap IO handling is slightly different for O_DIRECT IO

On Thu, 16 Dec 2021 at 23:58, NeilBrown <[email protected]> wrote:
>
> 1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
> possible deadlocks with "fs_reclaim". These deadlocks could, I believe,
> eventuate if a buffered read on the swapfile was attempted.
>
> We don't need coherence with the page cache for a swap file, and
> buffered writes are forbidden anyway. There is no other need for
> i_rwsem during direct IO. So never take it for swap_rw()

Agreed. I cannot see an issue with not taking the sem.

> 2/ generic_write_checks() explicitly forbids writes to swap, and
> performs checks that are not needed for swap. So bypass it
> for swap_rw().
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
...
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> @@ -625,7 +625,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
> return result;
>
> if (iocb->ki_flags & IOCB_DIRECT)
> - return nfs_file_direct_write(iocb, from);
> + return nfs_file_direct_write(iocb, from, false);
>
> dprintk("NFS: write(%pD2, %zu@%Ld)\n",
> file, iov_iter_count(from), (long long) iocb->ki_pos);

This code at the top of nfs/file.c should be removed;
/* Hack for future NFS swap support */
#ifndef IS_SWAPFILE
# define IS_SWAPFILE(inode) (0)
#endif
As IS_SWAPFILE() is used just below this diff (to prevent buffered
writes), it would be better to use the non-hacked version.
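
(The non-hacked version is the real macro in include/linux/fs.h, which tests the inode flag set at swapon time:)

#define IS_SWAPFILE(inode)	((inode)->i_flags & S_SWAPFILE)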

Mark

2021-12-21 08:34:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 01/18] Structural cleanup for filesystem-based swap

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Linux primarily uses IO to block devices for swap, but can send the IO
> requests to a filesystem. This has only ever worked for NFS, and that
> hasn't worked for a while due to a lack of testing. This seems like a
> good time for some tidy-up before restoring swap-over-NFS functionality.

The changes look good to me, but I think this needs to be split into
separate, self-contained patches.

>
> This patch:

Patch 1:

> - updates the documentation (both copies!) for swap_activate which
> is woefully out-of-date

Patch 2:

> - drops the call to the filesystem for ->set_page_dirty(). These
> pages do not belong to the filesystem, and it has no interest
> in the dirty status.

Patch 3:


> - moves the responsibility for setting SWP_FS_OPS to ->swap_activate()
> and also requires it to always call add_swap_extent(). This makes
> it much easier to find filesystems that require SWP_FS_OPS.

Patch 4:

> - introduces a new address_space operation "swap_rw" for swap IO.
> The code currently used ->readpage for reads and ->direct_IO for
> writes. The former imposes a limit of one-page-at-a-time, the
> later means that direct writes and swap writes are encouraged to
> use the same path. While similar, swap can often be simpler as
> it can assume that no allocation is needed, and coherence with the
> page cache is irrelevant.

2021-12-21 08:36:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 02/18] MM: create new mm/swap.h header file.

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Many functions declared in include/linux/swap.h are only used within mm/
>
> Create a new "mm/swap.h" and move some of these declarations there.
> Remove the redundant 'extern' from the function declarations.

Good idea! Looks good modulo the extra includes that the buildbot asks
for:

Reviewed-by: Christoph Hellwig <[email protected]>

2021-12-21 08:40:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 03/18] MM: use ->swap_rw for reads from SWP_FS_OPS swap-space

> +int sio_pool_init(void)
> +{
> + if (!sio_pool)
> + sio_pool = mempool_create_kmalloc_pool(
> + SWAP_CLUSTER_MAX, sizeof(struct swap_iocb));

I can't see anything serializing access here, so we'll need a lock or
cmpxchg dance.
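
One possible shape of that cmpxchg dance, shown here only as a sketch rather than as part of the patch, is to publish the pool with cmpxchg() and destroy the loser:

int sio_pool_init(void)
{
	if (!sio_pool) {
		mempool_t *pool = mempool_create_kmalloc_pool(
			SWAP_CLUSTER_MAX, sizeof(struct swap_iocb));
		/* another swapon may have raced us; keep only one pool */
		if (cmpxchg(&sio_pool, NULL, pool) != NULL)
			mempool_destroy(pool);
	}
	if (!sio_pool)
		return -ENOMEM;
	return 0;
}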

> + if (sio_pool)
> + return 0;
> + else
> + return -ENOMEM;

Nit: This would flow much nicer as:

if (!sio_pool)
return -ENOMEM;
return 0;

> int swap_readpage(struct page *page, bool synchronous)
> {
> struct bio *bio;
> @@ -378,13 +412,25 @@ int swap_readpage(struct page *page, bool synchronous)
> }
>
> if (data_race(sis->flags & SWP_FS_OPS)) {
> - //struct file *swap_file = sis->swap_file;
> - //struct address_space *mapping = swap_file->f_mapping;

This should not be left by the previous patch. In fact I suspect the
part of the previous patch that adds ->swap_rw should probably be folded
into this patch.

> + struct file *swap_file = sis->swap_file;
> + struct address_space *mapping = swap_file->f_mapping;
> + struct iov_iter from;
> + struct swap_iocb *sio;
> + loff_t pos = page_file_offset(page);
> +
> + sio = mempool_alloc(sio_pool, GFP_KERNEL);
> + init_sync_kiocb(&sio->iocb, swap_file);
> + sio->iocb.ki_pos = pos;
> + sio->iocb.ki_complete = sio_read_complete;
> + sio->bvec.bv_page = page;
> + sio->bvec.bv_len = PAGE_SIZE;
> + sio->bvec.bv_offset = 0;
> +
> + iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
> + ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
> + if (ret != -EIOCBQUEUED)
> + sio_read_complete(&sio->iocb, ret);
>
> goto out;

I'd be tempted to split the SWP_FS_OPS path into a helper to keep the
code tidy.
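
Such a helper might look roughly like this - a sketch with a hypothetical name that simply pulls the quoted code out of swap_readpage():

static void swap_readpage_fs(struct page *page)
{
	struct swap_info_struct *sis = page_swap_info(page);
	struct file *swap_file = sis->swap_file;
	struct address_space *mapping = swap_file->f_mapping;
	struct iov_iter from;
	struct swap_iocb *sio;
	loff_t pos = page_file_offset(page);
	int ret;

	sio = mempool_alloc(sio_pool, GFP_KERNEL);
	init_sync_kiocb(&sio->iocb, swap_file);
	sio->iocb.ki_pos = pos;
	sio->iocb.ki_complete = sio_read_complete;
	sio->bvec.bv_page = page;
	sio->bvec.bv_len = PAGE_SIZE;
	sio->bvec.bv_offset = 0;

	iov_iter_bvec(&from, READ, &sio->bvec, 1, PAGE_SIZE);
	ret = mapping->a_ops->swap_rw(&sio->iocb, &from);
	if (ret != -EIOCBQUEUED)
		/* completed synchronously (or failed): finish the read now */
		sio_read_complete(&sio->iocb, ret);
}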

2021-12-21 08:41:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 04/18] MM: perform async writes to SWP_FS_OPS swap-space

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Writes to SWP_FS_OPS swapspace are currently synchronous. To make them
> async we need to allocate the kiocb struct which may block, but won't
> block as long as waiting for the write to complete would block.

Again, a little helper for the SWP_FS_OPS case of __swap_writepage
would be nice. But otherwise this looks good to me.

2021-12-21 08:43:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 05/18] MM: reclaim mustn't enter FS for SWP_FS_OPS swap-space

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> +static bool test_may_enter_fs(struct page *page, gfp_t gfp_mask)
> +{
> + if (gfp_mask & __GFP_FS)
> + return true;
> + if (!PageSwapCache(page) || !(gfp_mask & __GFP_IO))
> + return false;
> + /* We can "enter_fs" for swap-cache with only __GFP_IO
> + * providing this isn't SWP_FS_OPS.
> + * ->flags can be updated non-atomicially (scan_swap_map_slots),
> + * but that will never affect SWP_FS_OPS, so the data_race
> + * is safe.
> + */
> + return !data_race(page_swap_info(page)->flags & SWP_FS_OPS);

Nit: the normal kernel comment style uses an empty

/*

line to start the comment.
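
That is, the quoted comment would become something like:

	/*
	 * We can "enter_fs" for swap-cache with only __GFP_IO
	 * providing this isn't SWP_FS_OPS.
	 * ->flags can be updated non-atomically (scan_swap_map_slots),
	 * but that will never affect SWP_FS_OPS, so the data_race
	 * is safe.
	 */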

> @@ -1514,8 +1529,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> if (!sc->may_unmap && page_mapped(page))
> goto keep_locked;
>
> - may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> - (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> + may_enter_fs = test_may_enter_fs(page, sc->gfp_mask);
>
> /*
> * The number of dirty pages determines if a node is marked
> @@ -1683,7 +1697,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> goto activate_locked_split;
> }
>
> - may_enter_fs = true;
> + may_enter_fs = test_may_enter_fs(page,
> + sc->gfp_mask);

Now that may_enter_fs is always reset using test_may_enter_fs, do we
even need the local variable?

2021-12-21 08:44:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 06/18] MM: submit multipage reads for SWP_FS_OPS swap-space

On Fri, Dec 17, 2021 at 10:48:22AM +1100, NeilBrown wrote:
> Some callers currently call blk_finish_plug() *before* the final call to
> swap_readpage(), so the last page cannot be included. This patch moves
> blk_finish_plug() to after the last call, and calls swap_read_unplug()
> there too.

Can you move this fix into a separate prep patch, preferably with a
Fixes tag so that it gets picked up for backports?

Otherwise this looks sensible to me.

2021-12-21 08:46:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 08/18] MM: Add AS_CAN_DIO mapping flag

On Fri, Dec 17, 2021 at 10:48:23AM +1100, NeilBrown wrote:
> Currently various places test if direct IO is possible on a file by
> checking for the existence of the direct_IO address space operation.
> This is a poor choice, as the direct_IO operation may not be used - it is
> only used if the generic_file_*_iter functions are called for direct IO
> and some filesystems - particularly NFS - don't do this.
>
> Instead, introduce a new mapping flag: AS_CAN_DIO and change the various
> places to check this (avoiding a pointer dereference).
> unlock_new_inode() will set this flag if ->direct_IO is present, so
> filesystems do not need to be changed.
>
> NFS *is* changed, to set the flag explicitly and discard the direct_IO
> entry in the address_space_operations for files.

For other "can" flags related to file operations we usually stash them
into file->f_mode. Any reason to treat this differently?

2021-12-21 08:48:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 00/18 V2] Repair SWAP-over-NFS

On Mon, Dec 20, 2021 at 08:07:15AM +1100, NeilBrown wrote:
> > Thanks for fixing swap-over-NFS! Looks like it passes all the
> > swap-related xfstests except for generic/357 on NFS v4.2. This test
> > checks that we get -EINVAL on a reflinked swapfile. I'm not sure if
> > there is a way to check for that on the client side, but if you have
> > any ideas it would be nice to get that test passing while you're at it!
>
> Thanks for testing!
> I think that testing that swap fails on a reflinked file is bogus. This
> isn't an important part of the API; it is just an internal
> implementation detail.
> I certainly understand that it could be problematic implementing swap on
> a reflinked file within XFS and it is perfectly acceptable to fail such
> a request. But if one day someone decided to implement it - should that
> be seen as a regression?

Yes, there is really no fundamental reason not to support swap to
reflinked files, especially for NFS. We'll need some kind of opt-in/out
for this test.