2014-04-16 04:17:49

by NeilBrown

Subject: [PATCH/RFC 00/19] Support loop-back NFS mounts

Loop-back NFS mounts are when the NFS client and server run on the
same host.

The use-case for this is a high availability cluster with shared
storage. The shared filesystem is mounted on any one machine and
NFS-mounted on the others.
If the NFS server fails, some other node will take over that service,
and it will then have a loop-back NFS mount which needs to keep
working.

This patch set addresses the "keep working" part, specifically
deadlocks and livelocks.
Allowing the fail-over itself to be deadlock free is a separate
challenge for another day.

The short description of how this works is:

deadlocks:
- Elevate PF_FSTRANS to apply globally instead of just in NFS and XFS.
PF_FSTRANS disables __GFP_FS in the same way that PF_MEMALLOC_NOIO
disables __GFP_IO.
- Set PF_FSTRANS in nfsd when handling requests related to
memory reclaim, or requests which could block requests related
to memory reclaim.
- Use lockdep to find all consequent deadlocks, in which some other
thread allocates memory while holding a lock that nfsd might
want.
- Fix those other deadlocks by setting PF_FSTRANS or using GFP_NOFS
as appropriate.
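The effect of the first two steps can be modelled outside the kernel:
the allocator masks __GFP_FS from a request whenever the task has
PF_FSTRANS set, exactly as PF_MEMALLOC_NOIO masks __GFP_IO. A minimal
user-space sketch (flag values and names here are illustrative, not
the kernel's):

```c
#include <assert.h>

/* Illustrative bit values -- the real kernel constants differ. */
#define GFP_IO_X           0x40u
#define GFP_FS_X           0x80u
#define PF_MEMALLOC_NOIO_X 0x1u
#define PF_FSTRANS_X       0x2u

/* Model of the gfp filtering this series makes universal:
 * PF_FSTRANS strips __GFP_FS just as PF_MEMALLOC_NOIO strips __GFP_IO,
 * so an allocation made while the flag is held can never recurse into
 * filesystem (or NFS) writeback. */
static unsigned int filter_gfp(unsigned int pflags, unsigned int gfp)
{
	if (pflags & PF_MEMALLOC_NOIO_X)
		gfp &= ~GFP_IO_X;
	if (pflags & PF_FSTRANS_X)
		gfp &= ~GFP_FS_X;
	return gfp;
}
```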

livelocks:
- identify throttling during reclaim and bypass it when
PF_LESS_THROTTLE is set
- only set PF_LESS_THROTTLE for nfsd when handling write requests
from the local host.
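The livelock side reduces to a single predicate in the reclaim path: a
task marked PF_LESS_THROTTLE is exempt from the throttle it might
otherwise wait on forever. A hedged sketch (names illustrative; the
real check would sit in shrink_inactive_list):

```c
#include <assert.h>

/* Illustrative bit value -- not the kernel's PF_LESS_THROTTLE. */
#define PF_LESS_THROTTLE_X 0x1u

/* Sketch of the livelock fix: nfsd handling a write from the local
 * host sets PF_LESS_THROTTLE, and reclaim then skips throttling that
 * task, since it may itself be the one responsible for completing the
 * writeback being waited on. */
static int may_throttle(unsigned int pflags)
{
	return !(pflags & PF_LESS_THROTTLE_X);
}
```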

The last 12 patches address various deadlocks due to locking chains.
11 were found by lockdep, 2 by testing. There is a reasonable chance
that there are more; I just need to exercise more code while
testing....

There is one issue that lockdep reports which I haven't fixed (I've
just hacked the code out for my testing). That issue relates to
freeze_super().
I may not be interpreting the lockdep reports perfectly, but I think
they are basically saying that if I were to freeze a filesystem that
was exported to the local host, then we could end up deadlocking.
This is to be expected. The NFS filesystem would need to be frozen
first. I don't know how to tell lockdep that I know that is a problem
and I don't want to be warned about it. Suggestions welcome.
Until this is addressed I cannot really ask others to test the code
with lockdep enabled.

There are more subsidiary places where I needed to add PF_FSTRANS than
I would have liked. The thought keeps crossing my mind that maybe we
can get rid of __GFP_FS and require that memory reclaim never ever
block on a filesystem. Then most of these patches go away.

Now that writeback doesn't happen from reclaim (but from kswapd), many
of the calls from reclaim to the FS are gone.
The ->releasepage call is the only one that I *know* causes me
problems so I'd like to just say that that must never block. I don't
really understand the consequences of that though.
There are a couple of other places where __GFP_FS is used and I'd need
to carefully analyze those. But if someone just said "no, that is
impossible", I could be happy and stick with the current approach....

I've cc:ed Peter Zijlstra and Ingo Molnar only on the lockdep-related
patches, Ming Lei only on the PF_MEMALLOC_NOIO related patches,
and net-dev only on the network-related patches.
There are probably other people I should CC. Apologies if I missed you.
I'll ensure better coverage if the nfs/mm/xfs people are reasonably happy.

Comments, criticisms, etc most welcome.

Thanks,
NeilBrown


---

NeilBrown (19):
Promote current_{set,restore}_flags_nested from xfs to global.
lockdep: lockdep_set_current_reclaim_state should save old value
lockdep: improve scenario messages for RECLAIM_FS errors.
Make effect of PF_FSTRANS to disable __GFP_FS universal.
SUNRPC: track whether a request is coming from a loop-back interface.
nfsd: set PF_FSTRANS for nfsd threads.
nfsd and VM: use PF_LESS_THROTTLE to avoid throttle in shrink_inactive_list.
Set PF_FSTRANS while write_cache_pages calls ->writepage
XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.
NET: set PF_FSTRANS while holding sk_lock
FS: set PF_FSTRANS while holding mmap_sem in exec.c
NET: set PF_FSTRANS while holding rtnl_lock
MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.
driver core: set PF_FSTRANS while holding gdp_mutex
nfsd: set PF_FSTRANS when client_mutex is held.
VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.
VFS: set PF_FSTRANS while namespace_sem is held.
nfsd: set PF_FSTRANS during nfsd4_do_callback_rpc.
XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks


drivers/base/core.c | 3 ++
drivers/base/power/runtime.c | 6 ++---
drivers/block/nbd.c | 6 ++---
drivers/md/dm-bufio.c | 6 ++---
drivers/md/dm-ioctl.c | 6 ++---
drivers/mtd/nand/nandsim.c | 28 ++++++---------------
drivers/scsi/iscsi_tcp.c | 6 ++---
drivers/usb/core/hub.c | 6 ++---
fs/dcache.c | 4 ++-
fs/exec.c | 6 +++++
fs/fs-writeback.c | 5 ++--
fs/namespace.c | 4 +++
fs/nfs/file.c | 3 +-
fs/nfsd/nfs4callback.c | 5 ++++
fs/nfsd/nfs4state.c | 3 ++
fs/nfsd/nfssvc.c | 24 ++++++++++++++----
fs/nfsd/vfs.c | 6 +++++
fs/xfs/kmem.h | 2 --
fs/xfs/xfs_aops.c | 7 -----
fs/xfs/xfs_bmap_util.c | 4 +++
fs/xfs/xfs_file.c | 12 +++++++++
fs/xfs/xfs_linux.h | 7 -----
include/linux/lockdep.h | 8 +++---
include/linux/sched.h | 32 +++++++++---------------
include/linux/sunrpc/svc.h | 2 ++
include/linux/sunrpc/svc_xprt.h | 1 +
include/net/sock.h | 1 +
kernel/locking/lockdep.c | 51 ++++++++++++++++++++++++++++-----------
kernel/softirq.c | 6 ++---
mm/migrate.c | 9 +++----
mm/page-writeback.c | 3 ++
mm/page_alloc.c | 18 ++++++++------
mm/percpu.c | 4 +++
mm/slab.c | 2 ++
mm/slob.c | 2 ++
mm/slub.c | 1 +
mm/vmscan.c | 31 +++++++++++++++---------
net/core/dev.c | 6 ++---
net/core/rtnetlink.c | 9 ++++++-
net/core/sock.c | 8 ++++--
net/sunrpc/sched.c | 5 ++--
net/sunrpc/svc.c | 6 +++++
net/sunrpc/svcsock.c | 10 ++++++++
net/sunrpc/xprtrdma/transport.c | 5 ++--
net/sunrpc/xprtsock.c | 17 ++++++++-----
45 files changed, 247 insertions(+), 149 deletions(-)

--
Signature


2014-04-16 04:18:00

by NeilBrown

Subject: [PATCH 01/19] Promote current_{set, restore}_flags_nested from xfs to global.

These are useful macros from xfs for modifying current->flags.
Other places in the kernel perform the same task in various different
ways.
This patch moves the macros from xfs to include/linux/sched.h and
changes all code which temporarily sets a current->flags flag to
use these macros.

This does not change functionality in any important way, but does fix
a few sites which assume that PF_FSTRANS is not already set and so
arbitrarily set and then clear it. The new code is more careful and
will only clear it if it was previously clear.
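The nesting behaviour can be demonstrated with a user-space model of
the macros: restore only puts back the saved bits, so a flag that was
already set on entry survives a nested set/restore pair. (This model
operates on a plain global rather than current->flags, and PF_X is an
illustrative bit, not a real PF_ flag.)

```c
#include <assert.h>

/* Stand-in for current->flags. */
static unsigned int task_flags;

/* User-space copies of the xfs macros being promoted. */
#define set_flags_nested(sp, f) \
	(*(sp) = task_flags, task_flags |= (f))
#define restore_flags_nested(sp, f) \
	(task_flags = ((task_flags & ~(f)) | (*(sp) & (f))))

#define PF_X 0x1u

/* Set and restore PF_X twice, nested; return the final flags.
 * Whatever state PF_X had on entry is what comes back out. */
static unsigned int nested_demo(unsigned int initial)
{
	unsigned int outer, inner;

	task_flags = initial;
	set_flags_nested(&outer, PF_X);
	set_flags_nested(&inner, PF_X);	/* nested use is safe */
	restore_flags_nested(&inner, PF_X);
	restore_flags_nested(&outer, PF_X);
	return task_flags;
}
```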

Signed-off-by: NeilBrown <[email protected]>
---
drivers/base/power/runtime.c | 6 +++---
drivers/block/nbd.c | 6 +++---
drivers/md/dm-bufio.c | 6 +++---
drivers/md/dm-ioctl.c | 6 +++---
drivers/mtd/nand/nandsim.c | 28 ++++++++--------------------
drivers/scsi/iscsi_tcp.c | 6 +++---
drivers/usb/core/hub.c | 6 +++---
fs/fs-writeback.c | 5 +++--
fs/xfs/xfs_linux.h | 7 -------
include/linux/sched.h | 27 ++++++++-------------------
kernel/softirq.c | 6 +++---
mm/migrate.c | 9 ++++-----
mm/page_alloc.c | 10 ++++++----
mm/vmscan.c | 10 ++++++----
net/core/dev.c | 6 +++---
net/core/sock.c | 6 +++---
net/sunrpc/sched.c | 5 +++--
net/sunrpc/xprtrdma/transport.c | 5 +++--
net/sunrpc/xprtsock.c | 17 ++++++++++-------
19 files changed, 78 insertions(+), 99 deletions(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 72e00e66ecc5..02448f11c879 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -348,7 +348,7 @@ static int rpm_callback(int (*cb)(struct device *), struct device *dev)
return -ENOSYS;

if (dev->power.memalloc_noio) {
- unsigned int noio_flag;
+ unsigned int pflags;

/*
* Deadlock might be caused if memory allocation with
@@ -359,9 +359,9 @@ static int rpm_callback(int (*cb)(struct device *), struct device *dev)
* device, so network device and its ancestor should
* be marked as memalloc_noio too.
*/
- noio_flag = memalloc_noio_save();
+ current_set_flags_nested(&pflags, PF_MEMALLOC_NOIO);
retval = __rpm_callback(cb, dev);
- memalloc_noio_restore(noio_flag);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC_NOIO);
} else {
retval = __rpm_callback(cb, dev);
}
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 55298db36b2d..d3ddfa8a4da4 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -158,7 +158,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
struct msghdr msg;
struct kvec iov;
sigset_t blocked, oldset;
- unsigned long pflags = current->flags;
+ unsigned int pflags;

if (unlikely(!sock)) {
dev_err(disk_to_dev(nbd->disk),
@@ -172,7 +172,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
siginitsetinv(&blocked, sigmask(SIGKILL));
sigprocmask(SIG_SETMASK, &blocked, &oldset);

- current->flags |= PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
do {
sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
iov.iov_base = buf;
@@ -220,7 +220,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
} while (size > 0);

sigprocmask(SIG_SETMASK, &oldset, NULL);
- tsk_restore_flags(current, pflags, PF_MEMALLOC);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);

return result;
}
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 66c5d130c8c2..f5fa93ea3a59 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -322,7 +322,7 @@ static void __cache_size_refresh(void)
static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
enum data_mode *data_mode)
{
- unsigned noio_flag;
+ unsigned int pflags;
void *ptr;

if (c->block_size <= DM_BUFIO_BLOCK_SIZE_SLAB_LIMIT) {
@@ -350,12 +350,12 @@ static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
*/

if (gfp_mask & __GFP_NORETRY)
- noio_flag = memalloc_noio_save();
+ current_set_flags_nested(&pflags, PF_MEMALLOC_NOIO);

ptr = __vmalloc(c->block_size, gfp_mask | __GFP_HIGHMEM, PAGE_KERNEL);

if (gfp_mask & __GFP_NORETRY)
- memalloc_noio_restore(noio_flag);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC_NOIO);

return ptr;
}
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 51521429fb59..5409533f22b5 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1716,10 +1716,10 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
}

if (!dmi) {
- unsigned noio_flag;
- noio_flag = memalloc_noio_save();
+ unsigned int pflags;
+ current_set_flags_nested(&pflags, PF_MEMALLOC_NOIO);
dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_REPEAT | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
- memalloc_noio_restore(noio_flag);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC_NOIO);
if (dmi)
*param_flags |= DM_PARAMS_VMALLOC;
}
diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
index 42e8a770e631..8c995f9bb020 100644
--- a/drivers/mtd/nand/nandsim.c
+++ b/drivers/mtd/nand/nandsim.c
@@ -1373,31 +1373,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t
return 0;
}

-static int set_memalloc(void)
-{
- if (current->flags & PF_MEMALLOC)
- return 0;
- current->flags |= PF_MEMALLOC;
- return 1;
-}
-
-static void clear_memalloc(int memalloc)
-{
- if (memalloc)
- current->flags &= ~PF_MEMALLOC;
-}
-
static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
{
ssize_t tx;
- int err, memalloc;
+ int err;
+ unsigned int pflags;

err = get_pages(ns, file, count, pos);
if (err)
return err;
- memalloc = set_memalloc();
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
tx = kernel_read(file, pos, buf, count);
- clear_memalloc(memalloc);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);
put_pages(ns);
return tx;
}
@@ -1405,14 +1392,15 @@ static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_
static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
{
ssize_t tx;
- int err, memalloc;
+ int err;
+ unsigned int pflags;

err = get_pages(ns, file, count, pos);
if (err)
return err;
- memalloc = set_memalloc();
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
tx = kernel_write(file, buf, count, pos);
- clear_memalloc(memalloc);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);
put_pages(ns);
return tx;
}
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index add6d1566ec8..834cc3afaadf 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -371,10 +371,10 @@ static inline int iscsi_sw_tcp_xmit_qlen(struct iscsi_conn *conn)
static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
{
struct iscsi_conn *conn = task->conn;
- unsigned long pflags = current->flags;
+ unsigned int pflags;
int rc = 0;

- current->flags |= PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);

while (iscsi_sw_tcp_xmit_qlen(conn)) {
rc = iscsi_sw_tcp_xmit(conn);
@@ -387,7 +387,7 @@ static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
rc = 0;
}

- tsk_restore_flags(current, pflags, PF_MEMALLOC);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);
return rc;
}

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 64ea21971be2..7622b8b09163 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -5282,7 +5282,7 @@ int usb_reset_device(struct usb_device *udev)
{
int ret;
int i;
- unsigned int noio_flag;
+ unsigned int pflags;
struct usb_host_config *config = udev->actconfig;

if (udev->state == USB_STATE_NOTATTACHED ||
@@ -5301,7 +5301,7 @@ int usb_reset_device(struct usb_device *udev)
* because the device 'memalloc_noio' flag may have
* not been set before reseting the usb device.
*/
- noio_flag = memalloc_noio_save();
+ current_set_flags_nested(&pflags, PF_MEMALLOC_NOIO);

/* Prevent autosuspend during the reset */
usb_autoresume_device(udev);
@@ -5347,7 +5347,7 @@ int usb_reset_device(struct usb_device *udev)
}

usb_autosuspend_device(udev);
- memalloc_noio_restore(noio_flag);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC_NOIO);
return ret;
}
EXPORT_SYMBOL_GPL(usb_reset_device);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d754e3cf99a8..73beb4d86ab1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1012,9 +1012,10 @@ void bdi_writeback_workfn(struct work_struct *work)
struct bdi_writeback, dwork);
struct backing_dev_info *bdi = wb->bdi;
long pages_written;
+ unsigned int pflags;

set_worker_desc("flush-%s", dev_name(bdi->dev));
- current->flags |= PF_SWAPWRITE;
+ current_set_flags_nested(&pflags, PF_SWAPWRITE);

if (likely(!current_is_workqueue_rescuer() ||
list_empty(&bdi->bdi_list))) {
@@ -1044,7 +1045,7 @@ void bdi_writeback_workfn(struct work_struct *work)
queue_delayed_work(bdi_wq, &wb->dwork,
msecs_to_jiffies(dirty_writeback_interval * 10));

- current->flags &= ~PF_SWAPWRITE;
+ current_restore_flags_nested(&pflags, PF_SWAPWRITE);
}

/*
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index f9bb590acc0e..7c5b9eaebd0d 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -154,13 +154,6 @@ typedef __uint64_t __psunsigned_t;

#define current_cpu() (raw_smp_processor_id())
#define current_pid() (current->pid)
-#define current_test_flags(f) (current->flags & (f))
-#define current_set_flags_nested(sp, f) \
- (*(sp) = current->flags, current->flags |= (f))
-#define current_clear_flags_nested(sp, f) \
- (*(sp) = current->flags, current->flags &= ~(f))
-#define current_restore_flags_nested(sp, f) \
- (current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))

#define spinlock_destroy(lock)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec1cd0b..56fa52a0654c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1826,6 +1826,14 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */

+#define current_test_flags(f) (current->flags & (f))
+#define current_set_flags_nested(sp, f) \
+ (*(sp) = current->flags, current->flags |= (f))
+#define current_clear_flags_nested(sp, f) \
+ (*(sp) = current->flags, current->flags &= ~(f))
+#define current_restore_flags_nested(sp, f) \
+ (current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))
+
/*
* Only the _current_ task can read/write to tsk->flags, but other
* tasks can access tsk->flags in readonly mode for example
@@ -1859,18 +1867,6 @@ static inline gfp_t memalloc_noio_flags(gfp_t flags)
return flags;
}

-static inline unsigned int memalloc_noio_save(void)
-{
- unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
- current->flags |= PF_MEMALLOC_NOIO;
- return flags;
-}
-
-static inline void memalloc_noio_restore(unsigned int flags)
-{
- current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
-}
-
/*
* task->jobctl flags
*/
@@ -1927,13 +1923,6 @@ static inline void rcu_copy_process(struct task_struct *p)

#endif

-static inline void tsk_restore_flags(struct task_struct *task,
- unsigned long orig_flags, unsigned long flags)
-{
- task->flags &= ~flags;
- task->flags |= orig_flags & flags;
-}
-
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p,
const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 490fcbb1dc5b..dff051dae277 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -225,7 +225,7 @@ static inline void lockdep_softirq_end(bool in_hardirq) { }
asmlinkage void __do_softirq(void)
{
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
- unsigned long old_flags = current->flags;
+ unsigned int pflags;
int max_restart = MAX_SOFTIRQ_RESTART;
struct softirq_action *h;
bool in_hardirq;
@@ -238,7 +238,7 @@ asmlinkage void __do_softirq(void)
* softirq. A softirq handled such as network RX might set PF_MEMALLOC
* again if the socket is related to swap
*/
- current->flags &= ~PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);

pending = local_softirq_pending();
account_irq_enter_time(current);
@@ -295,7 +295,7 @@ restart:
account_irq_exit_time(current);
__local_bh_enable(SOFTIRQ_OFFSET);
WARN_ON_ONCE(in_interrupt());
- tsk_restore_flags(current, old_flags, PF_MEMALLOC);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);
}

asmlinkage void do_softirq(void)
diff --git a/mm/migrate.c b/mm/migrate.c
index bed48809e5d0..2b7574860b2b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1107,11 +1107,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
int pass = 0;
struct page *page;
struct page *page2;
- int swapwrite = current->flags & PF_SWAPWRITE;
+ unsigned int pflags;
int rc;

- if (!swapwrite)
- current->flags |= PF_SWAPWRITE;
+
+ current_set_flags_nested(&pflags, PF_SWAPWRITE);

for(pass = 0; pass < 10 && retry; pass++) {
retry = 0;
@@ -1155,8 +1155,7 @@ out:
count_vm_events(PGMIGRATE_FAIL, nr_failed);
trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);

- if (!swapwrite)
- current->flags &= ~PF_SWAPWRITE;
+ current_restore_flags_nested(&pflags, PF_SWAPWRITE);

return rc;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bac76ae4b30..a3d1f5da2f21 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2254,6 +2254,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
+ unsigned int pflags;
if (!order)
return NULL;

@@ -2262,11 +2263,11 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
return NULL;
}

- current->flags |= PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
nodemask, sync_migration,
contended_compaction);
- current->flags &= ~PF_MEMALLOC;
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);

if (*did_some_progress != COMPACT_SKIPPED) {
struct page *page;
@@ -2325,12 +2326,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist,
{
struct reclaim_state reclaim_state;
int progress;
+ unsigned int pflags;

cond_resched();

/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
- current->flags |= PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = &reclaim_state;
@@ -2339,7 +2341,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist,

current->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
- current->flags &= ~PF_MEMALLOC;
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);

cond_resched();

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..94acf53d9abf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3343,8 +3343,9 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
struct task_struct *p = current;
unsigned long nr_reclaimed;
+ unsigned int pflags;

- p->flags |= PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
lockdep_set_current_reclaim_state(sc.gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
@@ -3353,7 +3354,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)

p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
- p->flags &= ~PF_MEMALLOC;
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);

return nr_reclaimed;
}
@@ -3530,6 +3531,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.gfp_mask = sc.gfp_mask,
};
unsigned long nr_slab_pages0, nr_slab_pages1;
+ unsigned int pflags;

cond_resched();
/*
@@ -3537,7 +3539,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* and we also need to be able to write out pages for RECLAIM_WRITE
* and RECLAIM_SWAP.
*/
- p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+ current_set_flags_nested(&pflags, PF_MEMALLOC | PF_SWAPWRITE);
lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
@@ -3587,7 +3589,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
}

p->reclaim_state = NULL;
- current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC | PF_SWAPWRITE);
lockdep_clear_current_reclaim_state();
return sc.nr_reclaimed >= nr_pages;
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 45fa2f11f84d..b4eaac7c0d63 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3664,7 +3664,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
int ret;

if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
- unsigned long pflags = current->flags;
+ unsigned int pflags;

/*
* PFMEMALLOC skbs are special, they should
@@ -3675,9 +3675,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
* Use PF_MEMALLOC as this saves us from propagating the allocation
* context down to all allocation sites.
*/
- current->flags |= PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
ret = __netif_receive_skb_core(skb, true);
- tsk_restore_flags(current, pflags, PF_MEMALLOC);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);
} else
ret = __netif_receive_skb_core(skb, false);

diff --git a/net/core/sock.c b/net/core/sock.c
index c0fc6bdad1e3..cf9bd24e4099 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -318,14 +318,14 @@ EXPORT_SYMBOL_GPL(sk_clear_memalloc);
int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
int ret;
- unsigned long pflags = current->flags;
+ unsigned int pflags;

/* these should have been dropped before queueing */
BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));

- current->flags |= PF_MEMALLOC;
+ current_set_flags_nested(&pflags, PF_MEMALLOC);
ret = sk->sk_backlog_rcv(sk, skb);
- tsk_restore_flags(current, pflags, PF_MEMALLOC);
+ current_restore_flags_nested(&pflags, PF_MEMALLOC);

return ret;
}
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index ff3cc4bf4b24..c110dec833cd 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -820,9 +820,10 @@ void rpc_execute(struct rpc_task *task)

static void rpc_async_schedule(struct work_struct *work)
{
- current->flags |= PF_FSTRANS;
+ unsigned int pflags;
+ current_set_flags_nested(&pflags, PF_FSTRANS);
__rpc_execute(container_of(work, struct rpc_task, u.tk_work));
- current->flags &= ~PF_FSTRANS;
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
}

/**
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 285dc0884115..ac339b5ccf22 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -199,8 +199,9 @@ xprt_rdma_connect_worker(struct work_struct *work)
container_of(work, struct rpcrdma_xprt, rdma_connect.work);
struct rpc_xprt *xprt = &r_xprt->xprt;
int rc = 0;
+ unsigned int pflags;

- current->flags |= PF_FSTRANS;
+ current_set_flags_nested(&pflags, PF_FSTRANS);
xprt_clear_connected(xprt);

dprintk("RPC: %s: %sconnect\n", __func__,
@@ -211,7 +212,7 @@ xprt_rdma_connect_worker(struct work_struct *work)

dprintk("RPC: %s: exit\n", __func__);
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
}

/*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 0addefca8e77..8015e7b7d87c 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1932,8 +1932,9 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock;
int status = -EIO;
+ unsigned int pflags;

- current->flags |= PF_FSTRANS;
+ current_set_flags_nested(&pflags, PF_FSTRANS);

clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
status = __sock_create(xprt->xprt_net, AF_LOCAL,
@@ -1973,7 +1974,7 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
return status;
}

@@ -2076,8 +2077,9 @@ static void xs_udp_setup_socket(struct work_struct *work)
struct rpc_xprt *xprt = &transport->xprt;
struct socket *sock = transport->sock;
int status = -EIO;
+ unsigned int pflags;

- current->flags |= PF_FSTRANS;
+ current_set_flags_nested(&pflags, PF_FSTRANS);

/* Start by resetting any existing state */
xs_reset_transport(transport);
@@ -2098,7 +2100,7 @@ static void xs_udp_setup_socket(struct work_struct *work)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
}

/*
@@ -2234,8 +2236,9 @@ static void xs_tcp_setup_socket(struct work_struct *work)
struct socket *sock = transport->sock;
struct rpc_xprt *xprt = &transport->xprt;
int status = -EIO;
+ unsigned int pflags;

- current->flags |= PF_FSTRANS;
+ current_set_flags_nested(&pflags, PF_FSTRANS);

if (!sock) {
clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
@@ -2282,7 +2285,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
case -EINPROGRESS:
case -EALREADY:
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
return;
case -EINVAL:
/* Happens, for instance, if the user specified a link
@@ -2299,7 +2302,7 @@ out_eagain:
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
}

/**

2014-04-16 04:18:13

by NeilBrown

Subject: [PATCH 03/19] lockdep: improve scenario messages for RECLAIM_FS errors.

lockdep can check for locking problems involving reclaim using
the same infrastructure as used for interrupts.

However a number of the messages still refer to interrupts even when
the problem was actually reclaim-related.

So determine whether the problem was caused by reclaim or by an irq,
and adjust the messages accordingly.

Signed-off-by: NeilBrown <[email protected]>
---
kernel/locking/lockdep.c | 43 ++++++++++++++++++++++++++++++++-----------
1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e05b82e92373..33d2ac7519dc 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1423,7 +1423,8 @@ static void
print_irq_lock_scenario(struct lock_list *safe_entry,
struct lock_list *unsafe_entry,
struct lock_class *prev_class,
- struct lock_class *next_class)
+ struct lock_class *next_class,
+ int reclaim)
{
struct lock_class *safe_class = safe_entry->class;
struct lock_class *unsafe_class = unsafe_entry->class;
@@ -1455,20 +1456,27 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
printk("\n\n");
}

- printk(" Possible interrupt unsafe locking scenario:\n\n");
+ if (reclaim)
+ printk(" Possible reclaim unsafe locking scenario:\n\n");
+ else
+ printk(" Possible interrupt unsafe locking scenario:\n\n");
printk(" CPU0 CPU1\n");
printk(" ---- ----\n");
printk(" lock(");
__print_lock_name(unsafe_class);
printk(");\n");
- printk(" local_irq_disable();\n");
+ if (!reclaim)
+ printk(" local_irq_disable();\n");
printk(" lock(");
__print_lock_name(safe_class);
printk(");\n");
printk(" lock(");
__print_lock_name(middle_class);
printk(");\n");
- printk(" <Interrupt>\n");
+ if (reclaim)
+ printk(" <Memory allocation/reclaim>\n");
+ else
+ printk(" <Interrupt>\n");
printk(" lock(");
__print_lock_name(safe_class);
printk(");\n");
@@ -1487,6 +1495,8 @@ print_bad_irq_dependency(struct task_struct *curr,
enum lock_usage_bit bit2,
const char *irqclass)
{
+ int reclaim = strncmp(irqclass, "RECLAIM", 7) == 0;
+
if (!debug_locks_off_graph_unlock() || debug_locks_silent)
return 0;

@@ -1528,7 +1538,7 @@ print_bad_irq_dependency(struct task_struct *curr,

printk("\nother info that might help us debug this:\n\n");
print_irq_lock_scenario(backwards_entry, forwards_entry,
- hlock_class(prev), hlock_class(next));
+ hlock_class(prev), hlock_class(next), reclaim);

lockdep_print_held_locks(curr);

@@ -2200,7 +2210,7 @@ static void check_chain_key(struct task_struct *curr)
}

static void
-print_usage_bug_scenario(struct held_lock *lock)
+print_usage_bug_scenario(struct held_lock *lock, enum lock_usage_bit new_bit)
{
struct lock_class *class = hlock_class(lock);

@@ -2210,7 +2220,11 @@ print_usage_bug_scenario(struct held_lock *lock)
printk(" lock(");
__print_lock_name(class);
printk(");\n");
- printk(" <Interrupt>\n");
+ if (new_bit == LOCK_USED_IN_RECLAIM_FS ||
+ new_bit == LOCK_USED_IN_RECLAIM_FS_READ)
+ printk(" <Memory allocation/reclaim>\n");
+ else
+ printk(" <Interrupt>\n");
printk(" lock(");
__print_lock_name(class);
printk(");\n");
@@ -2246,7 +2260,7 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this,

print_irqtrace_events(curr);
printk("\nother info that might help us debug this:\n");
- print_usage_bug_scenario(this);
+ print_usage_bug_scenario(this, new_bit);

lockdep_print_held_locks(curr);

@@ -2285,13 +2299,17 @@ print_irq_inversion_bug(struct task_struct *curr,
struct lock_list *entry = other;
struct lock_list *middle = NULL;
int depth;
+ int reclaim = strncmp(irqclass, "RECLAIM", 7) == 0;

if (!debug_locks_off_graph_unlock() || debug_locks_silent)
return 0;

printk("\n");
printk("=========================================================\n");
- printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+ if (reclaim)
+ printk("[ INFO: possible memory reclaim lock inversion dependency detected ]\n");
+ else
+ printk("[ INFO: possible irq lock inversion dependency detected ]\n");
print_kernel_ident();
printk("---------------------------------------------------------\n");
printk("%s/%d just changed the state of lock:\n",
@@ -2302,6 +2320,9 @@ print_irq_inversion_bug(struct task_struct *curr,
else
printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
print_lock_name(other->class);
+ if (reclaim)
+ printk("\n\nand memory reclaim could create inverse lock ordering between them.\n\n");
+ else
printk("\n\nand interrupts could create inverse lock ordering between them.\n\n");

printk("\nother info that might help us debug this:\n");
@@ -2319,10 +2340,10 @@ print_irq_inversion_bug(struct task_struct *curr,
} while (entry && entry != root && (depth >= 0));
if (forwards)
print_irq_lock_scenario(root, other,
- middle ? middle->class : root->class, other->class);
+ middle ? middle->class : root->class, other->class, reclaim);
else
print_irq_lock_scenario(other, root,
- middle ? middle->class : other->class, root->class);
+ middle ? middle->class : other->class, root->class, reclaim);

lockdep_print_held_locks(curr);


2014-04-16 04:18:07

by NeilBrown

Subject: [PATCH 02/19] lockdep: lockdep_set_current_reclaim_state should save old value

Currently kswapd sets current->lockdep_reclaim_gfp, but the first
memory allocation it performs will clear it again, so the
lockdep_set_current_reclaim_state() call in kswapd() is ineffective.

With this patch we always save the old value and then restore it,
so lockdep gets to properly check the locks that kswapd takes.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/lockdep.h | 8 ++++----
kernel/locking/lockdep.c | 8 +++++---
mm/page_alloc.c | 5 +++--
mm/vmscan.c | 10 ++++++----
4 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 92b1bfc5da60..18eedd692d16 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -351,8 +351,8 @@ static inline void lock_set_subclass(struct lockdep_map *lock,
lock_set_class(lock, lock->name, lock->key, subclass, ip);
}

-extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
-extern void lockdep_clear_current_reclaim_state(void);
+extern gfp_t lockdep_set_current_reclaim_state(gfp_t gfp_mask);
+extern void lockdep_restore_current_reclaim_state(gfp_t old_mask);
extern void lockdep_trace_alloc(gfp_t mask);

# define INIT_LOCKDEP .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0,
@@ -379,8 +379,8 @@ static inline void lockdep_on(void)
# define lock_release(l, n, i) do { } while (0)
# define lock_set_class(l, n, k, s, i) do { } while (0)
# define lock_set_subclass(l, s, i) do { } while (0)
-# define lockdep_set_current_reclaim_state(g) do { } while (0)
-# define lockdep_clear_current_reclaim_state() do { } while (0)
+# define lockdep_set_current_reclaim_state(g) (0)
+# define lockdep_restore_current_reclaim_state(g) do { } while (0)
# define lockdep_trace_alloc(g) do { } while (0)
# define lockdep_init() do { } while (0)
# define lockdep_info() do { } while (0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index eb8a54783fa0..e05b82e92373 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3645,14 +3645,16 @@ int lock_is_held(struct lockdep_map *lock)
}
EXPORT_SYMBOL_GPL(lock_is_held);

-void lockdep_set_current_reclaim_state(gfp_t gfp_mask)
+gfp_t lockdep_set_current_reclaim_state(gfp_t gfp_mask)
{
+ gfp_t old = current->lockdep_reclaim_gfp;
current->lockdep_reclaim_gfp = gfp_mask;
+ return old;
}

-void lockdep_clear_current_reclaim_state(void)
+void lockdep_restore_current_reclaim_state(gfp_t gfp_mask)
{
- current->lockdep_reclaim_gfp = 0;
+ current->lockdep_reclaim_gfp = gfp_mask;
}

#ifdef CONFIG_LOCK_STAT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3d1f5da2f21..ff8b91aa0b87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2327,20 +2327,21 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order, struct zonelist *zonelist,
struct reclaim_state reclaim_state;
int progress;
unsigned int pflags;
+ gfp_t old_mask;

cond_resched();

/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
current_set_flags_nested(&pflags, PF_MEMALLOC);
- lockdep_set_current_reclaim_state(gfp_mask);
+ old_mask = lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = &reclaim_state;

progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);

current->reclaim_state = NULL;
- lockdep_clear_current_reclaim_state();
+ lockdep_restore_current_reclaim_state(old_mask);
current_restore_flags_nested(&pflags, PF_MEMALLOC);

cond_resched();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 94acf53d9abf..67165f839936 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3344,16 +3344,17 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
struct task_struct *p = current;
unsigned long nr_reclaimed;
unsigned int pflags;
+ gfp_t old_mask;

current_set_flags_nested(&pflags, PF_MEMALLOC);
- lockdep_set_current_reclaim_state(sc.gfp_mask);
+ old_mask = lockdep_set_current_reclaim_state(sc.gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);

p->reclaim_state = NULL;
- lockdep_clear_current_reclaim_state();
+ lockdep_restore_current_reclaim_state(old_mask);
current_restore_flags_nested(&pflags, PF_MEMALLOC);

return nr_reclaimed;
@@ -3532,6 +3533,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
};
unsigned long nr_slab_pages0, nr_slab_pages1;
unsigned int pflags;
+ gfp_t old_mask;

cond_resched();
/*
@@ -3540,7 +3542,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* and RECLAIM_SWAP.
*/
current_set_flags_nested(&pflags, PF_MEMALLOC | PF_SWAPWRITE);
- lockdep_set_current_reclaim_state(gfp_mask);
+ old_mask = lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

@@ -3590,7 +3592,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)

p->reclaim_state = NULL;
current_restore_flags_nested(&pflags, PF_MEMALLOC | PF_SWAPWRITE);
- lockdep_clear_current_reclaim_state();
+ lockdep_restore_current_reclaim_state(old_mask);
return sc.nr_reclaimed >= nr_pages;
}


2014-04-16 04:18:38

by NeilBrown

Subject: [PATCH 06/19] nfsd: set PF_FSTRANS for nfsd threads.

If a localhost mount is present, then it is easy to deadlock NFS by
nfsd entering direct reclaim and calling nfs_release_page() which
requires nfsd to perform an fsync() (which it cannot do because it is
reclaiming memory).

By setting PF_FSTRANS we stop the memory allocator from ever
attempting any FS operation that could deadlock.

We need this flag set for any thread which is handling a request from
the local host, but we also need to keep it set on at least two
threads at all times so that we don't end up with every thread blocked
in allocation.

When we set PF_FSTRANS we also tell lockdep that we are handling
reclaim so that it can detect deadlocks for us.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfssvc.c | 18 ++++++++++++++++++
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 6 ++++++
3 files changed, 25 insertions(+)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 9a4a5f9e7468..6af8bc2daf7d 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -565,6 +565,8 @@ nfsd(void *vrqstp)
struct svc_xprt *perm_sock = list_entry(rqstp->rq_server->sv_permsocks.next, typeof(struct svc_xprt), xpt_list);
struct net *net = perm_sock->xpt_net;
int err;
+ unsigned int pflags = 0;
+ gfp_t reclaim_state = 0;

/* Lock module and set up kernel thread */
mutex_lock(&nfsd_mutex);
@@ -611,14 +613,30 @@ nfsd(void *vrqstp)
;
if (err == -EINTR)
break;
+ if (rqstp->rq_local && !current_test_flags(PF_FSTRANS)) {
+ current_set_flags_nested(&pflags, PF_FSTRANS);
+ atomic_inc(&rqstp->rq_pool->sp_nr_fstrans);
+ reclaim_state = lockdep_set_current_reclaim_state(GFP_KERNEL);
+ }
validate_process_creds();
svc_process(rqstp);
validate_process_creds();
+ if (current_test_flags(PF_FSTRANS) &&
+ atomic_dec_if_positive(&rqstp->rq_pool->sp_nr_fstrans) >= 0) {
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
+ lockdep_restore_current_reclaim_state(reclaim_state);
+ }
}

/* Clear signals before calling svc_exit_thread() */
flush_signals(current);

+ if (current_test_flags(PF_FSTRANS)) {
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
+ lockdep_restore_current_reclaim_state(reclaim_state);
+ atomic_dec(&rqstp->rq_pool->sp_nr_fstrans);
+ }
+
mutex_lock(&nfsd_mutex);
nfsdstats.th_cnt --;

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index a0dbbd1e00e9..4b274aba51dd 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -48,6 +48,7 @@ struct svc_pool {
struct list_head sp_threads; /* idle server threads */
struct list_head sp_sockets; /* pending sockets */
unsigned int sp_nrthreads; /* # of threads in pool */
+ atomic_t sp_nr_fstrans; /* # threads with PF_FSTRANS */
struct list_head sp_all_threads; /* all server threads */
struct svc_pool_stats sp_stats; /* statistics on pool operation */
int sp_task_pending;/* has pending task */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 5de6801cd924..8b13f35b6cbb 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -477,6 +477,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
INIT_LIST_HEAD(&pool->sp_threads);
INIT_LIST_HEAD(&pool->sp_sockets);
INIT_LIST_HEAD(&pool->sp_all_threads);
+ /* The number of threads with PF_FSTRANS set
+ * should never be reduced below 2, except when
+ * threads exit. So we use atomic_dec_if_positive()
+ * on this value.
+ */
+ atomic_set(&pool->sp_nr_fstrans, -2);
spin_lock_init(&pool->sp_lock);
}


2014-04-16 04:18:43

by NeilBrown

Subject: [PATCH 07/19] nfsd and VM: use PF_LESS_THROTTLE to avoid throttle in shrink_inactive_list.

nfsd already uses PF_LESS_THROTTLE (and is the only user) to avoid
throttling while dirtying pages. Use it also to avoid throttling while
doing direct reclaim as this can stall nfsd in the same way.

Also only set PF_LESS_THROTTLE when handling a 'write' request for a
local connection. This is the only time when the throttling can cause
a problem. In other cases we should throttle if the system is busy.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfssvc.c | 6 ------
fs/nfsd/vfs.c | 6 ++++++
mm/vmscan.c | 7 +++++--
3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 6af8bc2daf7d..cd24aa76e58d 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -593,12 +593,6 @@ nfsd(void *vrqstp)
nfsdstats.th_cnt++;
mutex_unlock(&nfsd_mutex);

- /*
- * We want less throttling in balance_dirty_pages() so that nfs to
- * localhost doesn't cause nfsd to lock up due to all the client's
- * dirty pages.
- */
- current->flags |= PF_LESS_THROTTLE;
set_freezable();

/*
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 6d7be3f80356..be2d7af3beee 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -913,6 +913,10 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
int stable = *stablep;
int use_wgather;
loff_t pos = offset;
+ unsigned int pflags;
+
+ if (rqstp->rq_local)
+ current_set_flags_nested(&pflags, PF_LESS_THROTTLE);

dentry = file->f_path.dentry;
inode = dentry->d_inode;
@@ -950,6 +954,8 @@ out_nfserr:
err = 0;
else
err = nfserrno(host_err);
+ if (rqstp->rq_local)
+ current_restore_flags_nested(&pflags, PF_LESS_THROTTLE);
return err;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 05de3289d031..1b7c4e44f0a1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1552,7 +1552,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* implies that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (nr_unqueued_dirty == nr_taken || nr_immediate)
+ if ((nr_unqueued_dirty == nr_taken || nr_immediate)
+ && !current_test_flags(PF_LESS_THROTTLE))
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

@@ -1561,7 +1562,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* is congested. Allow kswapd to continue until it starts encountering
* unqueued dirty pages or cycling through the LRU too quickly.
*/
- if (!sc->hibernation_mode && !current_is_kswapd())
+ if (!sc->hibernation_mode &&
+ !current_is_kswapd() &&
+ !current_test_flags(PF_LESS_THROTTLE))
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,

2014-04-16 04:18:50

by NeilBrown

Subject: [PATCH 08/19] Set PF_FSTRANS while write_cache_pages calls ->writepage

It is normally safe for direct reclaim to enter filesystems
while a page is locked, as can happen when ->writepage
allocates memory with GFP_KERNEL (which xfs does).

However if a localhost NFS mount is present, then a flush-*
thread might hold a page locked and then in direct reclaim,
ask nfs to commit an inode (nfs_release_page). When nfsd
performs the fsync it might try to lock the same page, which leads to
a deadlock.

A ->writepage should not allocate much memory, or do so very often, so
it is safe to set PF_FSTRANS, and this removes the possible deadlock.

This was not detected by lockdep as it doesn't monitor the page lock.
It was found as a real deadlock in testing.

Signed-off-by: NeilBrown <[email protected]>
---
mm/page-writeback.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7106cb1aca8e..572e70b9a3f7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1909,6 +1909,7 @@ retry:

for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
+ unsigned int pflags;

/*
* At this point, the page may be truncated or
@@ -1960,8 +1961,10 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;

+ current_set_flags_nested(&pflags, PF_FSTRANS);
trace_wbc_writepage(wbc, mapping->backing_dev_info);
ret = (*writepage)(page, wbc, data);
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
unlock_page(page);

2014-04-16 04:18:58

by NeilBrown

Subject: [PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

xfs_file_*_read holds an inode lock while calling a generic 'read'
function. These functions perform read-ahead and are quite likely to
allocate memory.
So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
into a filesystem to free memory.

This can be a problem with loop-back NFS mounts, if free_pages ends up
waiting in nfs_release_page(), and nfsd is blocked waiting for the
lock that this code holds.

This was found both by lockdep and as a real deadlock during testing.

Signed-off-by: NeilBrown <[email protected]>
---
fs/xfs/xfs_file.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..88b33ef64668 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -243,6 +243,7 @@ xfs_file_aio_read(
ssize_t ret = 0;
int ioflags = 0;
xfs_fsize_t n;
+ unsigned int pflags;

XFS_STATS_INC(xs_read_calls);

@@ -290,6 +291,10 @@ xfs_file_aio_read(
* proceeed concurrently without serialisation.
*/
xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ /* As we hold a lock, we must ensure that any allocation
+ * in generic_file_aio_read avoid __GFP_FS
+ */
+ current_set_flags_nested(&pflags, PF_FSTRANS);
if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
@@ -313,6 +318,7 @@ xfs_file_aio_read(
if (ret > 0)
XFS_STATS_ADD(xs_read_bytes, ret);

+ current_restore_flags_nested(&pflags, PF_FSTRANS);
xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
return ret;
}
@@ -328,6 +334,7 @@ xfs_file_splice_read(
struct xfs_inode *ip = XFS_I(infilp->f_mapping->host);
int ioflags = 0;
ssize_t ret;
+ unsigned int pflags;

XFS_STATS_INC(xs_read_calls);

@@ -338,6 +345,10 @@ xfs_file_splice_read(
return -EIO;

xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+ /* As we hold a lock, we must ensure that any allocation
+ * in generic_file_splice_read avoid __GFP_FS
+ */
+ current_set_flags_nested(&pflags, PF_FSTRANS);

trace_xfs_file_splice_read(ip, count, *ppos, ioflags);

@@ -345,6 +356,7 @@ xfs_file_splice_read(
if (ret > 0)
XFS_STATS_ADD(xs_read_bytes, ret);

+ current_restore_flags_nested(&pflags, PF_FSTRANS);
xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
return ret;
}

2014-04-16 04:19:06

by NeilBrown

Subject: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

sk_lock can be taken while reclaiming memory (in nfsd for loop-back
NFS mounts, and presumably in nfs), and memory can be allocated while
holding sk_lock, at least via:

inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc

So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.

This deadlock was found by lockdep.

Signed-off-by: NeilBrown <[email protected]>
---
include/net/sock.h | 1 +
net/core/sock.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index b9586a137cad..27c355637e44 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -324,6 +324,7 @@ struct sock {
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr

socket_lock_t sk_lock;
+ unsigned int sk_pflags; /* process flags before taking lock */
struct sk_buff_head sk_receive_queue;
/*
* The backlog queue is special, it is always used with
diff --git a/net/core/sock.c b/net/core/sock.c
index cf9bd24e4099..8bc677ef072e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2341,6 +2341,7 @@ void lock_sock_nested(struct sock *sk, int subclass)
/*
* The sk_lock has mutex_lock() semantics here:
*/
+ current_set_flags_nested(&sk->sk_pflags, PF_FSTRANS);
mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
local_bh_enable();
}
@@ -2352,6 +2353,7 @@ void release_sock(struct sock *sk)
* The sk_lock has mutex_unlock() semantics:
*/
mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
+ current_restore_flags_nested(&sk->sk_pflags, PF_FSTRANS);

spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)

2014-04-16 04:19:14

by NeilBrown

Subject: [PATCH 11/19] FS: set PF_FSTRANS while holding mmap_sem in exec.c

Because mmap_sem is sometimes(*) taken while holding a sock lock,
and the sock lock might be needed for reclaim (at least when loop-back
NFS is active), we must not block on FS reclaim while mmap_sem is
held.

exec.c allocates memory while holding mmap_sem, and so needs
PF_FSTRANS protection.

* lockdep reports:
[ 57.653355] [<ffffffff810eb068>] lock_acquire+0xa8/0x1f0
[ 57.653355] [<ffffffff811835a4>] might_fault+0x84/0xb0
[ 57.653355] [<ffffffff81aec65d>] do_ip_setsockopt.isra.18+0x93d/0xed0
[ 57.653355] [<ffffffff81aecc17>] ip_setsockopt+0x27/0x90
[ 57.653355] [<ffffffff81b15146>] udp_setsockopt+0x16/0x30
[ 57.653355] [<ffffffff81a913cf>] sock_common_setsockopt+0xf/0x20
[ 57.653355] [<ffffffff81a9075e>] SyS_setsockopt+0x5e/0xc0
[ 57.653355] [<ffffffff81c3d062>] system_call_fastpath+0x16/0x1b

to explain why mmap_sem might be taken while sock lock is held.

Signed-off-by: NeilBrown <[email protected]>
---
fs/exec.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 3d78fccdd723..2c70a03ddb2b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -652,6 +652,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
unsigned long stack_size;
unsigned long stack_expand;
unsigned long rlim_stack;
+ unsigned int pflags;

#ifdef CONFIG_STACK_GROWSUP
/* Limit stack size to 1GB */
@@ -688,6 +689,7 @@ int setup_arg_pages(struct linux_binprm *bprm,

down_write(&mm->mmap_sem);
vm_flags = VM_STACK_FLAGS;
+ current_set_flags_nested(&pflags, PF_FSTRANS);

/*
* Adjust stack execute permissions; explicitly enable for
@@ -741,6 +743,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
ret = -EFAULT;

out_unlock:
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
up_write(&mm->mmap_sem);
return ret;
}
@@ -1369,6 +1372,7 @@ int search_binary_handler(struct linux_binprm *bprm)
bool need_retry = IS_ENABLED(CONFIG_MODULES);
struct linux_binfmt *fmt;
int retval;
+ unsigned int pflags;

/* This allows 4 levels of binfmt rewrites before failing hard. */
if (bprm->recursion_depth > 5)
@@ -1381,6 +1385,7 @@ int search_binary_handler(struct linux_binprm *bprm)
retval = -ENOENT;
retry:
read_lock(&binfmt_lock);
+ current_set_flags_nested(&pflags, PF_FSTRANS);
list_for_each_entry(fmt, &formats, lh) {
if (!try_module_get(fmt->module))
continue;
@@ -1396,6 +1401,7 @@ int search_binary_handler(struct linux_binprm *bprm)
read_lock(&binfmt_lock);
put_binfmt(fmt);
}
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
read_unlock(&binfmt_lock);

if (need_retry && retval == -ENOEXEC) {

2014-04-16 04:19:20

by NeilBrown

Subject: [PATCH 12/19] NET: set PF_FSTRANS while holding rtnl_lock

As rtnl_mutex can be taken while holding sk_lock, and sk_lock can be
taken while performing memory reclaim (at least when loop-back NFS is
active), any memory allocation under rtnl_mutex must avoid __GFP_FS,
which is most easily done by setting PF_FSTRANS.


CPU0 CPU1
---- ----
lock(rtnl_mutex);
lock(sk_lock-AF_INET);
lock(rtnl_mutex);
<Memory allocation/reclaim>
lock(sk_lock-AF_INET);

*** DEADLOCK ***

1/ rtnl_mutex is taken while holding sk_lock:

[<ffffffff81abb442>] rtnl_lock+0x12/0x20
[<ffffffff81b28c3a>] ip_mc_leave_group+0x2a/0x160
[<ffffffff81aec70b>] do_ip_setsockopt.isra.18+0x96b/0xed0
[<ffffffff81aecc97>] ip_setsockopt+0x27/0x90
[<ffffffff81b151c6>] udp_setsockopt+0x16/0x30
[<ffffffff81a9144f>] sock_common_setsockopt+0xf/0x20
[<ffffffff81a907de>] SyS_setsockopt+0x5e/0xc0

2/ memory is allocated under rtnl_mutex:
[<ffffffff8166eb41>] kobject_set_name_vargs+0x21/0x70
[<ffffffff81840d92>] dev_set_name+0x42/0x50
[<ffffffff81ac5e97>] netdev_register_kobject+0x57/0x130
[<ffffffff81aaf574>] register_netdevice+0x354/0x550
[<ffffffff81aaf785>] register_netdev+0x15/0x30


Signed-off-by: NeilBrown <[email protected]>
---
net/core/rtnetlink.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 120eecc0f5a4..6870211e93a6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -61,15 +61,18 @@ struct rtnl_link {
};

static DEFINE_MUTEX(rtnl_mutex);
+static int rtnl_pflags;

void rtnl_lock(void)
{
mutex_lock(&rtnl_mutex);
+ current_set_flags_nested(&rtnl_pflags, PF_FSTRANS);
}
EXPORT_SYMBOL(rtnl_lock);

void __rtnl_unlock(void)
{
+ current_restore_flags_nested(&rtnl_pflags, PF_FSTRANS);
mutex_unlock(&rtnl_mutex);
}

@@ -82,7 +85,11 @@ EXPORT_SYMBOL(rtnl_unlock);

int rtnl_trylock(void)
{
- return mutex_trylock(&rtnl_mutex);
+ if (mutex_trylock(&rtnl_mutex)) {
+ current_set_flags_nested(&rtnl_pflags, PF_FSTRANS);
+ return 1;
+ }
+ return 0;
}
EXPORT_SYMBOL(rtnl_trylock);


2014-04-16 04:19:26

by NeilBrown

Subject: [PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

lockdep reports a locking chain

sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex

As sk_lock may be needed to reclaim memory, allowing that
reclaim while pcpu_alloc_mutex is held can lead to deadlock.
So set PF_FSTRANS while it is held to avoid the FS reclaim.

pcpu_alloc_mutex can be taken when rtnl_mutex is held:

[<ffffffff8117f979>] pcpu_alloc+0x49/0x960
[<ffffffff8118029b>] __alloc_percpu+0xb/0x10
[<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
[<ffffffff81aaf30c>] register_netdevice+0xec/0x550
[<ffffffff81aaf785>] register_netdev+0x15/0x30

Signed-off-by: NeilBrown <[email protected]>
---
mm/percpu.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index 036cfe07050f..77dd24032f41 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -712,6 +712,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
int slot, off, new_alloc;
unsigned long flags;
void __percpu *ptr;
+ unsigned int pflags;

if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) {
WARN(true, "illegal size (%zu) or align (%zu) for "
@@ -720,6 +721,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved)
}

mutex_lock(&pcpu_alloc_mutex);
+ current_set_flags_nested(&pflags, PF_FSTRANS);
spin_lock_irqsave(&pcpu_lock, flags);

/* serve reserved allocations from the reserved chunk if available */
@@ -801,6 +803,7 @@ area_found:
goto fail_unlock;
}

+ current_restore_flags_nested(&pflags, PF_FSTRANS);
mutex_unlock(&pcpu_alloc_mutex);

/* return address relative to base address */
@@ -811,6 +814,7 @@ area_found:
fail_unlock:
spin_unlock_irqrestore(&pcpu_lock, flags);
fail_unlock_mutex:
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
mutex_unlock(&pcpu_alloc_mutex);
if (warn_limit) {
pr_warning("PERCPU: allocation failed, size=%zu align=%zu, "

2014-04-16 04:19:33

by NeilBrown

Subject: [PATCH 14/19] driver core: set PF_FSTRANS while holding gdp_mutex

lockdep reports a locking chain:

sk_lock-AF_INET --> rtnl_mutex --> gdp_mutex

As sk_lock can be needed for memory reclaim (when loop-back NFS is in
use at least), any memory allocation under gdp_mutex needs to
be protected by PF_FSTRANS.

The path from rtnl_mutex to gdp_mutex is

[<ffffffff81841ecc>] get_device_parent+0x4c/0x1f0
[<ffffffff81842496>] device_add+0xe6/0x610
[<ffffffff81ac5f2a>] netdev_register_kobject+0x7a/0x130
[<ffffffff81aaf5d4>] register_netdevice+0x354/0x550
[<ffffffff81aaf7e5>] register_netdev+0x15/0x30

Signed-off-by: NeilBrown <[email protected]>
---
drivers/base/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 2b567177ef78..1a2735237650 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -750,6 +750,7 @@ static struct kobject *get_device_parent(struct device *dev,
struct kobject *kobj = NULL;
struct kobject *parent_kobj;
struct kobject *k;
+ unsigned int pflags;

#ifdef CONFIG_BLOCK
/* block disks show up in /sys/block */
@@ -788,7 +789,9 @@ static struct kobject *get_device_parent(struct device *dev,
}

/* or create a new class-directory at the parent device */
+ current_set_flags_nested(&pflags, PF_FSTRANS);
k = class_dir_create_and_add(dev->class, parent_kobj);
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
/* do not emit an uevent for this simple "glue" directory */
mutex_unlock(&gdp_mutex);
return k;

2014-04-16 04:19:39

by NeilBrown

Subject: [PATCH 15/19] nfsd: set PF_FSTRANS when client_mutex is held.

When loop-back NFS with NFSv4 is in use, client_mutex might be needed
to reclaim memory, so any memory allocation while client_mutex is held
must avoid __GFP_FS, so best to set PF_FSTRANS.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfs4state.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index d5d070fbeb35..7b7fbcbe20cb 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -75,6 +75,7 @@ static int check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner

/* Currently used for almost all code touching nfsv4 state: */
static DEFINE_MUTEX(client_mutex);
+static unsigned int client_mutex_pflags;

/*
* Currently used for the del_recall_lru and file hash table. In an
@@ -93,6 +94,7 @@ void
nfs4_lock_state(void)
{
mutex_lock(&client_mutex);
+ current_set_flags_nested(&client_mutex_pflags, PF_FSTRANS);
}

static void free_session(struct nfsd4_session *);
@@ -127,6 +129,7 @@ static __be32 nfsd4_get_session_locked(struct nfsd4_session *ses)
void
nfs4_unlock_state(void)
{
+ current_restore_flags_nested(&client_mutex_pflags, PF_FSTRANS);
mutex_unlock(&client_mutex);
}


2014-04-16 04:19:47

by NeilBrown

Subject: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

__d_alloc can be called with i_mutex held, so it is safer to
use GFP_NOFS.

lockdep reports this can deadlock when loop-back NFS is in use,
as nfsd may be required to write out for reclaim, and nfsd certainly
takes i_mutex.

Signed-off-by: NeilBrown <[email protected]>
---
fs/dcache.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca02c13a84aa..3651ff6185b4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1483,7 +1483,7 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
struct dentry *dentry;
char *dname;

- dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+ dentry = kmem_cache_alloc(dentry_cache, GFP_NOFS);
if (!dentry)
return NULL;

@@ -1495,7 +1495,7 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
*/
dentry->d_iname[DNAME_INLINE_LEN-1] = 0;
if (name->len > DNAME_INLINE_LEN-1) {
- dname = kmalloc(name->len + 1, GFP_KERNEL);
+ dname = kmalloc(name->len + 1, GFP_NOFS);
if (!dname) {
kmem_cache_free(dentry_cache, dentry);
return NULL;

2014-04-16 04:19:54

by NeilBrown

Subject: [PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

namespace_sem can be taken while various i_mutex locks are held, so we
need to avoid reclaim from blocking on an FS (particularly loop-back
NFS).

A memory allocation happens under namespace_sem at least in:

[<ffffffff8119d16f>] kmem_cache_alloc+0x4f/0x290
[<ffffffff811c2fff>] alloc_vfsmnt+0x1f/0x1d0
[<ffffffff811c339a>] clone_mnt+0x2a/0x310
[<ffffffff811c57e3>] copy_tree+0x53/0x380
[<ffffffff811c6aef>] copy_mnt_ns+0x7f/0x280
[<ffffffff810c16fc>] create_new_namespaces+0x5c/0x190
[<ffffffff810c1ab9>] unshare_nsproxy_namespaces+0x59/0x90

So set PF_FSTRANS in namespace_lock() and restore in
namespace_unlock().

Signed-off-by: NeilBrown <[email protected]>
---
fs/namespace.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2ffc5a2905d4..83dcd5083dbb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -63,6 +63,7 @@ static struct hlist_head *mount_hashtable __read_mostly;
static struct hlist_head *mountpoint_hashtable __read_mostly;
static struct kmem_cache *mnt_cache __read_mostly;
static DECLARE_RWSEM(namespace_sem);
+static unsigned long namespace_sem_pflags;

/* /sys/fs */
struct kobject *fs_kobj;
@@ -1196,6 +1197,8 @@ static void namespace_unlock(void)
struct mount *mnt;
struct hlist_head head = unmounted;

+ current_restore_flags_nested(&namespace_sem_pflags, PF_FSTRANS);
+
if (likely(hlist_empty(&head))) {
up_write(&namespace_sem);
return;
@@ -1220,6 +1223,7 @@ static void namespace_unlock(void)
static inline void namespace_lock(void)
{
down_write(&namespace_sem);
+ current_set_flags_nested(&namespace_sem_pflags, PF_FSTRANS);
}

/*

2014-04-16 04:20:02

by NeilBrown

Subject: [PATCH 18/19] nfsd: set PF_FSTRANS during nfsd4_do_callback_rpc.

nfsd will sometimes call flush_workqueue on the workqueue running
nfsd4_do_callback_rpc, so we must ensure that it doesn't block in
filesystem reclaim.
So set PF_FSTRANS.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfsd/nfs4callback.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 7f05cd140de3..7784b5d4edf0 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -997,6 +997,9 @@ static void nfsd4_do_callback_rpc(struct work_struct *w)
struct nfsd4_callback *cb = container_of(w, struct nfsd4_callback, cb_work);
struct nfs4_client *clp = cb->cb_clp;
struct rpc_clnt *clnt;
+ unsigned int pflags;
+
+ current_set_flags_nested(&pflags, PF_FSTRANS);

if (clp->cl_flags & NFSD4_CLIENT_CB_FLAG_MASK)
nfsd4_process_cb_update(cb);
@@ -1005,11 +1008,13 @@ static void nfsd4_do_callback_rpc(struct work_struct *w)
if (!clnt) {
/* Callback channel broken, or client killed; give up: */
nfsd4_release_cb(cb);
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
return;
}
cb->cb_msg.rpc_cred = clp->cl_cb_cred;
rpc_call_async(clnt, &cb->cb_msg, RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
cb->cb_ops, cb);
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
}

void nfsd4_init_callback(struct nfsd4_callback *cb)

2014-04-16 04:20:08

by NeilBrown

Subject: [PATCH 19/19] XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks

Memory allocations can happen while the xfs ilock is held in
xfs_free_eofblocks(), particularly via:

[<ffffffff813e6667>] kmem_zone_alloc+0x67/0xc0
[<ffffffff813e5945>] xfs_trans_add_item+0x25/0x50
[<ffffffff8143d64c>] xfs_trans_ijoin+0x2c/0x60
[<ffffffff8142275e>] xfs_itruncate_extents+0xbe/0x400
[<ffffffff813c72f4>] xfs_free_eofblocks+0x1c4/0x240

So set PF_FSTRANS to avoid this causing a deadlock.

Care is needed here as xfs_trans_reserve() also sets PF_FSTRANS, while
xfs_trans_cancel and xfs_trans_commit will clear it.
So our extra setting must fully nest these calls.

Signed-off-by: NeilBrown <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f264616080ca..53761fe4fada 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -889,6 +889,7 @@ xfs_free_eofblocks(
xfs_filblks_t map_len;
int nimaps;
xfs_bmbt_irec_t imap;
+ unsigned int pflags;

/*
* Figure out if there are any blocks beyond the end
@@ -929,12 +930,14 @@ xfs_free_eofblocks(
}
}

+ current_set_flags_nested(&pflags, PF_FSTRANS);
error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
if (error) {
ASSERT(XFS_FORCED_SHUTDOWN(mp));
xfs_trans_cancel(tp, 0);
if (need_iolock)
xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ current_restore_flags_nested(&pflags, PF_FSTRANS);
return error;
}

@@ -964,6 +967,7 @@ xfs_free_eofblocks(
xfs_inode_clear_eofblocks_tag(ip);
}

+ current_restore_flags_nested(&pflags, PF_FSTRANS);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
if (need_iolock)
xfs_iunlock(ip, XFS_IOLOCK_EXCL);

2014-04-16 04:18:29

by NeilBrown

[permalink] [raw]
Subject: [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.

If an incoming NFS request comes from the local host, then
nfsd will need to perform some special handling. So detect that
possibility and make it visible in rq_local.

Signed-off-by: NeilBrown <[email protected]>
---
include/linux/sunrpc/svc.h | 1 +
include/linux/sunrpc/svc_xprt.h | 1 +
net/sunrpc/svcsock.c | 10 ++++++++++
3 files changed, 12 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 04e763221246..a0dbbd1e00e9 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -254,6 +254,7 @@ struct svc_rqst {
u32 rq_prot; /* IP protocol */
unsigned short
rq_secure : 1; /* secure port */
+ unsigned short rq_local : 1; /* local request */

void * rq_argp; /* decoded arguments */
void * rq_resp; /* xdr'd results */
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index b05963f09ebf..b99bdfb0fcf9 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -63,6 +63,7 @@ struct svc_xprt {
#define XPT_DETACHED 10 /* detached from tempsocks list */
#define XPT_LISTENER 11 /* listening endpoint */
#define XPT_CACHE_AUTH 12 /* cache auth info */
+#define XPT_LOCAL 13 /* connection from loopback interface */

struct svc_serv *xpt_server; /* service for transport */
atomic_t xpt_reserved; /* space on outq that is rsvd */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index b6e59f0a9475..193115fe968c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -811,6 +811,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
struct socket *newsock;
struct svc_sock *newsvsk;
int err, slen;
+ struct dst_entry *dst;
RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);

dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
@@ -867,6 +868,14 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
}
svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);

+ clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
+ rcu_read_lock();
+ dst = rcu_dereference(newsock->sk->sk_dst_cache);
+ if (dst && dst->dev &&
+ (dst->dev->features & NETIF_F_LOOPBACK))
+ set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
+ rcu_read_unlock();
+
if (serv->sv_stats)
serv->sv_stats->nettcpconn++;

@@ -1112,6 +1121,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)

rqstp->rq_xprt_ctxt = NULL;
rqstp->rq_prot = IPPROTO_TCP;
+ rqstp->rq_local = !!test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags);

p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
calldir = p[1];

2014-04-16 04:25:18

by NeilBrown

[permalink] [raw]
Subject: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

Currently both xfs and nfs will handle PF_FSTRANS by disabling
__GFP_FS.

Make this effect global by repurposing memalloc_noio_flags (which
does the same thing for PF_MEMALLOC_NOIO and __GFP_IO) to generally
impose the task flags on a gfp_t.
Due to this repurposing we rename memalloc_noio_flags
to gfp_from_current().

As PF_FSTRANS now uniformly removes __GFP_FS we can remove special
code for this from xfs and nfs.

As we can now expect other code to set PF_FSTRANS, its meaning is more
general, so the WARN_ON in xfs_vm_writepage() which fires when
PF_FSTRANS is set is no longer appropriate: PF_FSTRANS may now be set
for reasons other than an XFS transaction.

As lockdep cares about __GFP_FS, we need to translate PF_FSTRANS to
__GFP_FS before calling lockdep_trace_alloc() in various places.

Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 3 +--
fs/xfs/kmem.h | 2 --
fs/xfs/xfs_aops.c | 7 -------
include/linux/sched.h | 5 ++++-
mm/page_alloc.c | 3 ++-
mm/slab.c | 2 ++
mm/slob.c | 2 ++
mm/slub.c | 1 +
mm/vmscan.c | 4 ++--
9 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..ed863f52bae7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -472,8 +472,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
* doing this memory reclaim for a fs-related allocation.
*/
- if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
- !(current->flags & PF_FSTRANS)) {
+ if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL) {
int how = FLUSH_SYNC;

/* Don't let kswapd deadlock waiting for OOM RPC calls */
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 64db0e53edea..882b86270ebe 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,8 +50,6 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
- lflags &= ~__GFP_FS;
}

if (flags & KM_ZERO)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index db2cfb067d0b..207a7f86d5d7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -952,13 +952,6 @@ xfs_vm_writepage(
PF_MEMALLOC))
goto redirty;

- /*
- * Given that we do not allow direct reclaim to call us, we should
- * never be called while in a filesystem transaction.
- */
- if (WARN_ON(current->flags & PF_FSTRANS))
- goto redirty;
-
/* Is this page beyond the end of the file? */
offset = i_size_read(inode);
end_index = offset >> PAGE_CACHE_SHIFT;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 56fa52a0654c..f3291ed33c27 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1860,10 +1860,13 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define used_math() tsk_used_math(current)

/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+/* __GFP_FS isn't allowed if PF_FSTRANS is set in current->flags */
+static inline gfp_t gfp_from_current(gfp_t flags)
{
if (unlikely(current->flags & PF_MEMALLOC_NOIO))
flags &= ~__GFP_IO;
+ if (unlikely(current->flags & PF_FSTRANS))
+ flags &= ~__GFP_FS;
return flags;
}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff8b91aa0b87..5e9225df3447 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2718,6 +2718,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct mem_cgroup *memcg = NULL;

gfp_mask &= gfp_allowed_mask;
+ gfp_mask = gfp_from_current(gfp_mask);

lockdep_trace_alloc(gfp_mask);

@@ -2765,7 +2766,7 @@ retry_cpuset:
* can deadlock because I/O on the device might not
* complete.
*/
- gfp_mask = memalloc_noio_flags(gfp_mask);
+ gfp_mask = gfp_from_current(gfp_mask);
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
diff --git a/mm/slab.c b/mm/slab.c
index b264214c77ea..914d88661f3d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3206,6 +3206,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
int slab_node = numa_mem_id();

flags &= gfp_allowed_mask;
+ flags = gfp_from_current(flags);

lockdep_trace_alloc(flags);

@@ -3293,6 +3294,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
void *objp;

flags &= gfp_allowed_mask;
+ flags = gfp_from_current(flags);

lockdep_trace_alloc(flags);

diff --git a/mm/slob.c b/mm/slob.c
index 4bf8809dfcce..18206f54d227 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -431,6 +431,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
void *ret;

gfp &= gfp_allowed_mask;
+ gfp = gfp_from_current(gfp);

lockdep_trace_alloc(gfp);

@@ -539,6 +540,7 @@ void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
void *b;

flags &= gfp_allowed_mask;
+ flags = gfp_from_current(flags);

lockdep_trace_alloc(flags);

diff --git a/mm/slub.c b/mm/slub.c
index 25f14ad8f817..ff7c15e977da 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -961,6 +961,7 @@ static inline void kfree_hook(const void *x)
static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
{
flags &= gfp_allowed_mask;
+ flags = gfp_from_current(flags);
lockdep_trace_alloc(flags);
might_sleep_if(flags & __GFP_WAIT);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67165f839936..05de3289d031 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2592,7 +2592,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
{
unsigned long nr_reclaimed;
struct scan_control sc = {
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = gfp_from_current(gfp_mask)),
.may_writepage = !laptop_mode,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
@@ -3524,7 +3524,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = gfp_from_current(gfp_mask)),
.order = order,
.priority = ZONE_RECLAIM_PRIORITY,
};

2014-04-16 04:46:24

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> namespace_sem can be taken while various i_mutex locks are held, so we
> need to avoid reclaim from blocking on an FS (particularly loop-back
> NFS).

I would really prefer to deal with that differently - by explicit change of
gfp_t arguments of allocators.

The thing is, namespace_sem is held *only* over allocations, and not a lot
of them, at that - only mnt_alloc_id(), mnt_alloc_group_id(), alloc_vfsmnt()
and new_mountpoint(). That is all that is allowed.

Again, actual work with filesystems (setup, shutdown, remount, pathname
resolution, etc.) is all done outside of namespace_sem; it's held only
for manipulations of fs/{namespace,pnode}.c data structures and the only
reason it isn't a spinlock is that we need to do some allocations.

So I'd rather slap GFP_NOFS on those few allocations...

2014-04-16 05:13:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> NFS mounts, and presumably in nfs), and memory can be allocated while
> holding sk_lock, at least via:
>
> inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc
>
> So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
>
> This deadlock was found by lockdep.

Wow, this is adding expensive stuff in fast path, only for nfsd :(

BTW, why should the current->flags should be saved on a socket field,
and not a current->save_flags. This really looks a thread property, not
a socket one.

Why nfsd could not have PF_FSTRANS in its current->flags ?

For applications handling millions of sockets, this makes a difference.

2014-04-16 05:38:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> Currently both xfs and nfs will handle PF_FSTRANS by disabling
> __GFP_FS.
>
> Make this effect global by repurposing memalloc_noio_flags (which
> does the same thing for PF_MEMALLOC_NOIO and __GFP_IO) to generally
> impose the task flags on a gfp_t.
> Due to this repurposing we change the name of memalloc_noio_flags
> to gfp_from_current().
>
> As PF_FSTRANS now uniformly removes __GFP_FS we can remove special
> code for this from xfs and nfs.
>
> As we can now expect other code to set PF_FSTRANS, its meaning is more
> general, so the WARN_ON in xfs_vm_writepage() which checks PF_FSTRANS
> is not set is no longer appropriate. PF_FSTRANS may be set for other
> reasons than an XFS transaction.

So PF_FSTRANS no longer means "filesystem in transaction context".
Are you going to rename to match whatever it's meaning is now?
I'm not exactly clear on what it means now...


> As lockdep cares about __GFP_FS, we need to translate PF_FSTRANS to
> __GFP_FS before calling lockdep_trace_alloc() in various places.
>
> Signed-off-by: NeilBrown <[email protected]>
....
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 64db0e53edea..882b86270ebe 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,8 +50,6 @@ kmem_flags_convert(xfs_km_flags_t flags)
> lflags = GFP_ATOMIC | __GFP_NOWARN;
> } else {
> lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> - lflags &= ~__GFP_FS;
> }

I think KM_NOFS needs to remain here, as it has use outside of
transaction contexts that set PF_FSTRANS....

> if (flags & KM_ZERO)
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index db2cfb067d0b..207a7f86d5d7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -952,13 +952,6 @@ xfs_vm_writepage(
> PF_MEMALLOC))
> goto redirty;
>
> - /*
> - * Given that we do not allow direct reclaim to call us, we should
> - * never be called while in a filesystem transaction.
> - */
> - if (WARN_ON(current->flags & PF_FSTRANS))
> - goto redirty;

We still need to ensure this rule isn't broken. If it is, the
filesystem will silently deadlock in delayed allocation rather than
gracefully handle the problem with a warning....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-16 05:47:58

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Tue, 15 Apr 2014 22:13:46 -0700 Eric Dumazet <[email protected]>
wrote:

> On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> > sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> > NFS mounts, and presumably in nfs), and memory can be allocated while
> > holding sk_lock, at least via:
> >
> > inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc
> >
> > So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> >
> > This deadlock was found by lockdep.
>
> Wow, this is adding expensive stuff in fast path, only for nfsd :(

Yes, this was probably one part that I was least comfortable about.

>
> BTW, why should current->flags be saved in a socket field,
> and not in a current->save_flags? This really looks like a thread
> property, not a socket one.
>
> Why nfsd could not have PF_FSTRANS in its current->flags ?

nfsd does have PF_FSTRANS set in current->flags. But some other processes
might not.

If any process takes sk_lock, allocates memory, and then blocks in reclaim it
could be waiting for nfsd. If nfsd waits for that sk_lock, it would cause a
deadlock.

Thinking a bit more carefully .... I suspect that any socket that nfsd
created would only ever be locked by nfsd. If that is the case then the
problem can be resolved entirely within nfsd. We would need to tell lockdep
that there are two sorts of sk_locks, those which nfsd uses and all the
rest. That might get a little messy, but wouldn't impact performance.

Is it justified to assume that sockets created by nfsd threads would only
ever be locked by nfsd threads (and interrupts, which won't be allocating
memory so don't matter), or might they be locked by other threads - e.g. for
'netstat -a' etc.?


>
> For applications handling millions of sockets, this makes a difference.
>

Thanks,
NeilBrown


Attachments:
signature.asc (828.00 B)

2014-04-16 05:49:50

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> lockdep reports a locking chain
>
> sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex
>
> As sk_lock may be needed to reclaim memory, allowing that
> reclaim while pcpu_alloc_mutex is held can lead to deadlock.
> So set PF_FSTRANS while it is held to avoid FS reclaim.
>
> pcpu_alloc_mutex can be taken when rtnl_mutex is held:
>
> [<ffffffff8117f979>] pcpu_alloc+0x49/0x960
> [<ffffffff8118029b>] __alloc_percpu+0xb/0x10
> [<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
> [<ffffffff81aaf30c>] register_netdevice+0xec/0x550
> [<ffffffff81aaf785>] register_netdev+0x15/0x30
>
> Signed-off-by: NeilBrown <[email protected]>

This looks like a workaround to avoid passing a gfp mask around to
describe the context in which the allocation is taking place.
Whether or not that's the right solution, I can't say, but spreading
this "we can turn off all reclaim of filesystem objects" mechanism
all around the kernel doesn't sit well with me...

And, again, PF_FSTRANS looks plainly wrong in this code - it sure
isn't a fs transaction context we are worried about here...


--
Dave Chinner
[email protected]

2014-04-16 05:52:40

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

On Wed, 16 Apr 2014 05:46:18 +0100 Al Viro <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > namespace_sem can be taken while various i_mutex locks are held, so we
> > need to avoid reclaim from blocking on an FS (particularly loop-back
> > NFS).
>
> I would really prefer to deal with that differently - by explicit change of
> gfp_t arguments of allocators.
>
> The thing is, namespace_sem is held *only* over allocations, and not a lot
> of them, at that - only mnt_alloc_id(), mnt_alloc_group_id(), alloc_vfsmnt()
> and new_mountpoint(). That is all that is allowed.
>
> Again, actual work with filesystems (setup, shutdown, remount, pathname
> resolution, etc.) is all done outside of namespace_sem; it's held only
> for manipulations of fs/{namespace,pnode}.c data structures and the only
> reason it isn't a spinlock is that we need to do some allocations.
>
> So I'd rather slap GFP_NOFS on those few allocations...

So something like this? I've put it into my testing instead.

Thanks,
NeilBrown

diff --git a/fs/namespace.c b/fs/namespace.c
index 83dcd5083dbb..8e103b8c8323 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -103,7 +103,7 @@ static int mnt_alloc_id(struct mount *mnt)
int res;

retry:
- ida_pre_get(&mnt_id_ida, GFP_KERNEL);
+ ida_pre_get(&mnt_id_ida, GFP_NOFS);
spin_lock(&mnt_id_lock);
res = ida_get_new_above(&mnt_id_ida, mnt_id_start, &mnt->mnt_id);
if (!res)
@@ -134,7 +134,7 @@ static int mnt_alloc_group_id(struct mount *mnt)
{
int res;

- if (!ida_pre_get(&mnt_group_ida, GFP_KERNEL))
+ if (!ida_pre_get(&mnt_group_ida, GFP_NOFS))
return -ENOMEM;

res = ida_get_new_above(&mnt_group_ida,
@@ -193,7 +193,7 @@ unsigned int mnt_get_count(struct mount *mnt)

static struct mount *alloc_vfsmnt(const char *name)
{
- struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
+ struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_NOFS);
if (mnt) {
int err;

@@ -202,7 +202,7 @@ static struct mount *alloc_vfsmnt(const char *name)
goto out_free_cache;

if (name) {
- mnt->mnt_devname = kstrdup(name, GFP_KERNEL);
+ mnt->mnt_devname = kstrdup(name, GFP_NOFS);
if (!mnt->mnt_devname)
goto out_free_id;
}
@@ -682,7 +682,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
}
}

- mp = kmalloc(sizeof(struct mountpoint), GFP_KERNEL);
+ mp = kmalloc(sizeof(struct mountpoint), GFP_NOFS);
if (!mp)
return ERR_PTR(-ENOMEM);



2014-04-16 06:05:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> xfs_file_*_read holds an inode lock while calling a generic 'read'
> function. These functions perform read-ahead and are quite likely to
> allocate memory.

Yes, that's what reading data from disk requires.

> So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
> into a filesystem to free memory.

We already have that protection via the
>
> This can be a problem with loop-back NFS mounts, if free_pages ends up
> > waiting in nfs_release_page(), and nfsd is blocked waiting for the lock
> that this code holds.
>
> This was found both by lockdep and as a real deadlock during testing.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/xfs/xfs_file.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 64b48eade91d..88b33ef64668 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -243,6 +243,7 @@ xfs_file_aio_read(
> ssize_t ret = 0;
> int ioflags = 0;
> xfs_fsize_t n;
> + unsigned int pflags;
>
> XFS_STATS_INC(xs_read_calls);
>
> @@ -290,6 +291,10 @@ xfs_file_aio_read(
> * proceeed concurrently without serialisation.
> */
> xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> + /* As we hold a lock, we must ensure that any allocation
> + * in generic_file_aio_read avoid __GFP_FS
> + */
> + current_set_flags_nested(&pflags, PF_FSTRANS);

Ugh. No. This is Simply Wrong.

We handle the memory allocations in the IO path with
GFP_NOFS/KM_NOFS where necessary.

We also do this when setting up regular file inodes in
xfs_setup_inode():

/*
* Ensure all page cache allocations are done from GFP_NOFS context to
* prevent direct reclaim recursion back into the filesystem and blowing
* stacks or deadlocking.
*/
gfp_mask = mapping_gfp_mask(inode->i_mapping);
mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));

Which handles all of the mapping allocations that occur within the
page cache read/write paths.

Remember, you removed the KM_NOFS code from the XFS allocator that
> caused it to clear __GFP_FS in an earlier patch - the read IO path
is one of the things you broke by doing that....

If there are places where we don't use GFP_NOFS context allocations
that we should, then we need to fix them individually....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-16 06:17:38

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Wed, 16 Apr 2014 15:37:56 +1000 Dave Chinner <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > Currently both xfs and nfs will handle PF_FSTRANS by disabling
> > __GFP_FS.
> >
> > Make this effect global by repurposing memalloc_noio_flags (which
> > does the same thing for PF_MEMALLOC_NOIO and __GFP_IO) to generally
> > impose the task flags on a gfp_t.
> > Due to this repurposing we change the name of memalloc_noio_flags
> > to gfp_from_current().
> >
> > As PF_FSTRANS now uniformly removes __GFP_FS we can remove special
> > code for this from xfs and nfs.
> >
> > As we can now expect other code to set PF_FSTRANS, its meaning is more
> > general, so the WARN_ON in xfs_vm_writepage() which checks PF_FSTRANS
> > is not set is no longer appropriate. PF_FSTRANS may be set for other
> > reasons than an XFS transaction.
>
> So PF_FSTRANS no longer means "filesystem in transaction context".
> Are you going to rename to match whatever it's meaning is now?
> I'm not exactly clear on what it means now...

I did consider renaming it to "PF_MEMALLOC_NOFS" as it is similar to
"PF_MEMALLOC_NOIO", except that it disables __GFP_FS rather than __GFP_IO.
Maybe I should go ahead with that.

>
>
> > As lockdep cares about __GFP_FS, we need to translate PF_FSTRANS to
> > __GFP_FS before calling lockdep_trace_alloc() in various places.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> ....
> > diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> > index 64db0e53edea..882b86270ebe 100644
> > --- a/fs/xfs/kmem.h
> > +++ b/fs/xfs/kmem.h
> > @@ -50,8 +50,6 @@ kmem_flags_convert(xfs_km_flags_t flags)
> > lflags = GFP_ATOMIC | __GFP_NOWARN;
> > } else {
> > lflags = GFP_KERNEL | __GFP_NOWARN;
> > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> > - lflags &= ~__GFP_FS;
> > }
>
> I think KM_NOFS needs to remain here, as it has use outside of
> transaction contexts that set PF_FSTRANS....

Argh, yes of course.
I'll have to re-test the other xfs changes now to see if they are really
needed.

Thanks!


>
> > if (flags & KM_ZERO)
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index db2cfb067d0b..207a7f86d5d7 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -952,13 +952,6 @@ xfs_vm_writepage(
> > PF_MEMALLOC))
> > goto redirty;
> >
> > - /*
> > - * Given that we do not allow direct reclaim to call us, we should
> > - * never be called while in a filesystem transaction.
> > - */
> > - if (WARN_ON(current->flags & PF_FSTRANS))
> > - goto redirty;
>
> We still need to ensure this rule isn't broken. If it is, the
> filesystem will silently deadlock in delayed allocation rather than
> gracefully handle the problem with a warning....

Hmm... that might be tricky. The 'new' PF_FSTRANS can definitely be set when
xfs_vm_writepage is called and we really want the write to happen.
I don't suppose there is any other way to detect if a transaction is
happening?

Thanks,
NeilBrown



2014-04-16 06:18:38

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 19/19] XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks

On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> Memory allocations can happen while the xfs ilock is held in
> xfs_free_eofblocks, particularly
>
> [<ffffffff813e6667>] kmem_zone_alloc+0x67/0xc0
> [<ffffffff813e5945>] xfs_trans_add_item+0x25/0x50
> [<ffffffff8143d64c>] xfs_trans_ijoin+0x2c/0x60
> [<ffffffff8142275e>] xfs_itruncate_extents+0xbe/0x400
> [<ffffffff813c72f4>] xfs_free_eofblocks+0x1c4/0x240
>
> So set PF_FSTRANS to avoid this causing a deadlock.

Another "You broke KM_NOFS" moment. You win a Kit Kat. ;)

xfs_trans_add_item():

lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP | KM_NOFS);

KM_NOFS needs to work, otherwise XFS is just a huge steaming pile of
memory reclaim deadlocks regardless of whether you are using
loopback NFS or not.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-16 06:22:13

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

On Wed, 16 Apr 2014 15:49:42 +1000 Dave Chinner <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > lockdep reports a locking chain
> >
> > sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex
> >
> > As sk_lock may be needed to reclaim memory, allowing that
> > reclaim while pcpu_alloc_mutex is held can lead to deadlock.
> > So set PF_FSTRANS while it is held to avoid FS reclaim.
> >
> > pcpu_alloc_mutex can be taken when rtnl_mutex is held:
> >
> > [<ffffffff8117f979>] pcpu_alloc+0x49/0x960
> > [<ffffffff8118029b>] __alloc_percpu+0xb/0x10
> > [<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
> > [<ffffffff81aaf30c>] register_netdevice+0xec/0x550
> > [<ffffffff81aaf785>] register_netdev+0x15/0x30
> >
> > Signed-off-by: NeilBrown <[email protected]>
>
> This looks like a workaround to avoid passing a gfp mask around to
> describe the context in which the allocation is taking place.
> Whether or not that's the right solution, I can't say, but spreading
> this "we can turn off all reclaim of filesystem objects" mechanism
> all around the kernel doesn't sit well with me...

We are (effectively) passing a gfp mask around, except that it lives in
'current' rather than lots of other places.
I actually like the idea of discarding PF_MEMALLOC, PF_FSTRANS and
PF_MEMALLOC_NOIO, and just having current->gfp_allowed_mask (to match the
global variable of the same name).

>
> And, again, PF_FSTRANS looks plainly wrong in this code - it sure
> isn't a fs transaction context we are worried about here...

So would PF_MEMALLOC_NOFS work for you?

NeilBrown



2014-04-16 06:25:24

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> __d_alloc can be called with i_mutex held, so it is safer to
> use GFP_NOFS.
>
> lockdep reports this can deadlock when loop-back NFS is in use,
> as nfsd may be required to write out for reclaim, and nfsd certainly
> takes i_mutex.

But not the same i_mutex as is currently held. To me, this seems
like a false positive? If you are holding the i_mutex on an inode,
then you have a reference to the inode and hence memory reclaim
won't ever take the i_mutex on that inode.

FWIW, this sort of false positive was a long-standing problem for
XFS - we managed to get rid of most of the false positives like this
by ensuring that only the ilock is taken within memory reclaim and
memory reclaim can't be entered while we hold the ilock.

You can't do that with the i_mutex, though....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-16 06:27:56

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

On Wed, 16 Apr 2014 16:04:59 +1000 Dave Chinner <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > xfs_file_*_read holds an inode lock while calling a generic 'read'
> > function. These functions perform read-ahead and are quite likely to
> > allocate memory.
>
> Yes, that's what reading data from disk requires.
>
> > So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
> > into a filesystem to free memory.
>
> We already have that protection via the
> >
> > This can be a problem with loop-back NFS mounts, if free_pages ends up
> > waiting in nfs_release_page(), and nfsd is blocked waiting for the lock
> > that this code holds.
> >
> > This was found both by lockdep and as a real deadlock during testing.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > fs/xfs/xfs_file.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 64b48eade91d..88b33ef64668 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -243,6 +243,7 @@ xfs_file_aio_read(
> > ssize_t ret = 0;
> > int ioflags = 0;
> > xfs_fsize_t n;
> > + unsigned int pflags;
> >
> > XFS_STATS_INC(xs_read_calls);
> >
> > @@ -290,6 +291,10 @@ xfs_file_aio_read(
> > * proceeed concurrently without serialisation.
> > */
> > xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
> > + /* As we hold a lock, we must ensure that any allocation
> > + * in generic_file_aio_read avoid __GFP_FS
> > + */
> > + current_set_flags_nested(&pflags, PF_FSTRANS);
>
> Ugh. No. This is Simply Wrong.
>
> We handle the memory allocations in the IO path with
> GFP_NOFS/KM_NOFS where necessary.
>
> We also do this when setting up regular file inodes in
> xfs_setup_inode():
>
> /*
> * Ensure all page cache allocations are done from GFP_NOFS context to
> * prevent direct reclaim recursion back into the filesystem and blowing
> * stacks or deadlocking.
> */
> gfp_mask = mapping_gfp_mask(inode->i_mapping);
> mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
>
> Which handles all of the mapping allocations that occur within the
> page cache read/write paths.
>
> Remember, you removed the KM_NOFS code from the XFS allocator that
> caused it to clear __GFP_FS in an earlier patch - the read IO path
> is one of the things you broke by doing that....
>
> If there are places where we don't use GFP_NOFS context allocations
> that we should, then we need to fix them individually....
>
> Cheers,
>
> Dave.

Thanks Dave. Having fixed the KM_NOFS error I'll start retesting with all
the other xfs patches removed.

NeilBrown



2014-04-16 06:30:42

by Dave Chinner

Subject: Re: [PATCH 13/19] MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.

On Wed, Apr 16, 2014 at 04:22:01PM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 15:49:42 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > > lockdep reports a locking chain
> > >
> > > sk_lock-AF_INET --> rtnl_mutex --> pcpu_alloc_mutex
> > >
> > > As sk_lock may be needed to reclaim memory, allowing that
> > > reclaim while pcpu_alloc_mutex is held can lead to deadlock.
> > > So set PF_FSTRANS while it is held to avoid FS reclaim.
> > >
> > > pcpu_alloc_mutex can be taken when rtnl_mutex is held:
> > >
> > > [<ffffffff8117f979>] pcpu_alloc+0x49/0x960
> > > [<ffffffff8118029b>] __alloc_percpu+0xb/0x10
> > > [<ffffffff8193b9f7>] loopback_dev_init+0x17/0x60
> > > [<ffffffff81aaf30c>] register_netdevice+0xec/0x550
> > > [<ffffffff81aaf785>] register_netdev+0x15/0x30
> > >
> > > Signed-off-by: NeilBrown <[email protected]>
> >
> > This looks like a workaround to avoid passing a gfp mask around to
> > describe the context in which the allocation is taking place.
> > Whether or not that's the right solution, I can't say, but spreading
> > this "we can turn off all reclaim of filesystem objects" mechanism
> > all around the kernel doesn't sit well with me...
>
> We are (effectively) passing a gfp mask around, except that it lives in
> 'current' rather than lots of other places.
> I actually like the idea of discarding PF_MEMALLOC, PF_FSTRANS and
> PF_MEMALLOC_NOIO, and just having current->gfp_allowed_mask (to match the
> global variable of the same name).

Given that we've had problems getting gfp flags propagated into the
VM code (vmalloc, I'm looking at you!), making the current task
carry the valid memory allocation and reclaim context mask would be
a good idea. That's effectively the problem PF_MEMALLOC_NOIO is
working around, and we've recently added it to XFS to silence all
the lockdep warnings that using vm_map_ram in GFP_NOFS contexts has
been causing us....
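Neil's per-task mask idea can be sketched in isolation (a userspace sketch; the struct, field, and helper names are hypothetical stand-ins for kernel state, and the gfp bit values are invented):

```c
/* Illustrative gfp bits; the values are invented for this sketch. */
#define __GFP_IO    0x40u
#define __GFP_FS    0x80u
#define GFP_KERNEL  (0x10u | __GFP_IO | __GFP_FS)

/* Hypothetical stand-in for task_struct carrying the proposed
 * current->gfp_allowed_mask, mirroring the global variable of the
 * same name. */
struct task {
	unsigned int gfp_allowed_mask;
};

/* Every allocation request is filtered through the task's mask,
 * replacing the PF_MEMALLOC / PF_FSTRANS / PF_MEMALLOC_NOIO flag
 * bits with one explicit piece of per-task state. */
static unsigned int task_gfp(const struct task *tsk, unsigned int gfp)
{
	return gfp & tsk->gfp_allowed_mask;
}
```

A task entering a no-FS-reclaim section would clear __GFP_FS in its mask on entry and restore the old mask on exit, much as the nested set/restore helpers do for flag bits today.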

> > And, again, PF_FSTRANS looks plainly wrong in this code - it sure
> > isn't a fs transaction context we are worried about here...
>
> So would PF_MEMALLOC_NOFS work for you?

Better than PF_FSTRANS, that's for sure ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-16 06:31:33

by Dave Chinner

Subject: Re: [PATCH 09/19] XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.

On Wed, Apr 16, 2014 at 04:04:59PM +1000, Dave Chinner wrote:
> On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> > xfs_file_*_read holds an inode lock while calling a generic 'read'
> > function. These functions perform read-ahead and are quite likely to
> > allocate memory.
>
> Yes, that's what reading data from disk requires.
>
> > So set PF_FSTRANS to ensure they avoid __GFP_FS and so don't recurse
> > into a filesystem to free memory.
>
> We already have that protection via the

Oops, stray paste. Ignore that comment.

-Dave.
--
Dave Chinner
[email protected]

2014-04-16 06:49:53

by NeilBrown

Subject: Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > __d_alloc can be called with i_mutex held, so it is safer to
> > use GFP_NOFS.
> >
> > lockdep reports this can deadlock when loop-back NFS is in use,
> > as nfsd may be required to write out for reclaim, and nfsd certainly
> > takes i_mutex.
>
> But not the same i_mutex as is currently held. To me, this seems
> like a false positive? If you are holding the i_mutex on an inode,
> then you have a reference to the inode and hence memory reclaim
> won't ever take the i_mutex on that inode.
>
> FWIW, this sort of false positive was a long standing problem for
> XFS - we managed to get rid of most of the false positives like this
> by ensuring that only the ilock is taken within memory reclaim and
> memory reclaim can't be entered while we hold the ilock.
>
> You can't do that with the i_mutex, though....
>
> Cheers,
>
> Dave.

I'm not sure this is a false positive.
You can call __d_alloc when creating a file and so are holding i_mutex on the
directory.
nfsd might also want to access that directory.

If there was only 1 nfsd thread, it would need to get i_mutex and do its
thing before replying to that request and so before it could handle the
COMMIT which __d_alloc is waiting for.

Obviously we would normally have multiple nfsd threads but if they were all
blocked on an i_mutex which itself was blocked on nfs_release_page which was
waiting for an nfsd thread to handle its COMMIT request, this could be a
real deadlock.

Thanks,
NeilBrown



2014-04-16 07:22:12

by Peter Zijlstra

Subject: Re: [PATCH 03/19] lockdep: improve scenario messages for RECLAIM_FS errors.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> lockdep can check for locking problems involving reclaim using
> the same infrastructure as used for interrupts.
>
> However a number of the messages still refer to interrupts even
> if it was actually a reclaim-related problem.
>
> So determine where the problem was caused by reclaim or irq and adjust
> messages accordingly.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> kernel/locking/lockdep.c | 43 ++++++++++++++++++++++++++++++++-----------
> 1 file changed, 32 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index e05b82e92373..33d2ac7519dc 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -1423,7 +1423,8 @@ static void
> print_irq_lock_scenario(struct lock_list *safe_entry,
> struct lock_list *unsafe_entry,
> struct lock_class *prev_class,
> - struct lock_class *next_class)
> + struct lock_class *next_class,
> + int reclaim)

I would rather we just pass enum lock_usage_bit along from the
callsites.

> {
> struct lock_class *safe_class = safe_entry->class;
> struct lock_class *unsafe_class = unsafe_entry->class;

> @@ -1487,6 +1495,8 @@ print_bad_irq_dependency(struct task_struct *curr,
> enum lock_usage_bit bit2,
> const char *irqclass)
> {
> + int reclaim = strncmp(irqclass, "RECLAIM", 7) == 0;
> +

irqclass := state_name(bit2), so instead of relying on an unreliable
string comparison, why not use the lock_usage_bit?

> if (!debug_locks_off_graph_unlock() || debug_locks_silent)
> return 0;
>
> @@ -1528,7 +1538,7 @@ print_bad_irq_dependency(struct task_struct *curr,
>
> printk("\nother info that might help us debug this:\n\n");
> print_irq_lock_scenario(backwards_entry, forwards_entry,
> - hlock_class(prev), hlock_class(next));
> + hlock_class(prev), hlock_class(next), reclaim);

So that would become bit2.

>
> lockdep_print_held_locks(curr);
>
> @@ -2200,7 +2210,7 @@ static void check_chain_key(struct task_struct *curr)
> }
>
> static void
> -print_usage_bug_scenario(struct held_lock *lock)
> +print_usage_bug_scenario(struct held_lock *lock, enum lock_usage_bit new_bit)

Like you did here.

> {
> struct lock_class *class = hlock_class(lock);
>
> @@ -2210,7 +2220,11 @@ print_usage_bug_scenario(struct held_lock *lock)
> printk(" lock(");
> __print_lock_name(class);
> printk(");\n");
> - printk(" <Interrupt>\n");
> + if (new_bit == LOCK_USED_IN_RECLAIM_FS ||
> + new_bit == LOCK_USED_IN_RECLAIM_FS_READ)

And if we're going to do this all over, we might want a helper for this
condition.

> + printk(" <Memory allocation/reclaim>\n");
> + else
> + printk(" <Interrupt>\n");
> printk(" lock(");
> __print_lock_name(class);
> printk(");\n");

Same for the rest I think..
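The helper Peter suggests for the repeated reclaim-bit test could look something like this (a compilable sketch; the enum values here are placeholders, not the kernel's actual bit layout, and the helper name is hypothetical):

```c
#include <stdbool.h>

/* Placeholder usage bits; the real ones live in lockdep's
 * enum lock_usage_bit. */
enum lock_usage_bit {
	LOCK_USED_IN_HARDIRQ,
	LOCK_USED_IN_RECLAIM_FS,
	LOCK_USED_IN_RECLAIM_FS_READ,
};

/* One predicate for "is this a reclaim-related usage bit?", so every
 * scenario printer tests the bit the same way instead of
 * string-matching on the irqclass name. */
static inline bool lock_bit_is_reclaim(enum lock_usage_bit bit)
{
	return bit == LOCK_USED_IN_RECLAIM_FS ||
	       bit == LOCK_USED_IN_RECLAIM_FS_READ;
}
```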

2014-04-16 07:28:12

by Peter Zijlstra

Subject: Re: [PATCH 06/19] nfsd: set PF_FSTRANS for nfsd threads.

On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
> If a localhost mount is present, then it is easy to deadlock NFS by
> nfsd entering direct reclaim and calling nfs_release_page() which
> requires nfsd to perform an fsync() (which it cannot do because it is
> reclaiming memory).
>
> By setting PF_FSTRANS we stop the memory allocator from ever
> attempting any FS operation that could deadlock.
>
> We need this flag set for any thread which is handling a request from
> the local host, but we also need to always have it for at least 1 or 2
> threads so that we don't end up with all threads blocked in allocation.
>
> When we set PF_FSTRANS we also tell lockdep that we are handling
> reclaim so that it can detect deadlocks for us.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/nfsd/nfssvc.c | 18 ++++++++++++++++++
> include/linux/sunrpc/svc.h | 1 +
> net/sunrpc/svc.c | 6 ++++++
> 3 files changed, 25 insertions(+)
>
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 9a4a5f9e7468..6af8bc2daf7d 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -565,6 +565,8 @@ nfsd(void *vrqstp)
> struct svc_xprt *perm_sock = list_entry(rqstp->rq_server->sv_permsocks.next, typeof(struct svc_xprt), xpt_list);
> struct net *net = perm_sock->xpt_net;
> int err;
> + unsigned int pflags = 0;
> + gfp_t reclaim_state = 0;
>
> /* Lock module and set up kernel thread */
> mutex_lock(&nfsd_mutex);
> @@ -611,14 +613,30 @@ nfsd(void *vrqstp)
> ;
> if (err == -EINTR)
> break;
> + if (rqstp->rq_local && !current_test_flags(PF_FSTRANS)) {
> + current_set_flags_nested(&pflags, PF_FSTRANS);
> + atomic_inc(&rqstp->rq_pool->sp_nr_fstrans);
> + reclaim_state = lockdep_set_current_reclaim_state(GFP_KERNEL);
> + }
> validate_process_creds();
> svc_process(rqstp);
> validate_process_creds();
> + if (current_test_flags(PF_FSTRANS) &&
> + atomic_dec_if_positive(&rqstp->rq_pool->sp_nr_fstrans) >= 0) {

Yuck, that function has horrible semantics :/ Not your fault though.

> + current_restore_flags_nested(&pflags, PF_FSTRANS);
> + lockdep_restore_current_reclaim_state(reclaim_state);
> + }


Not really knowing much about NFS anymore (all I did know seems to have
bitrotten) the above looks ok I suppose. Refcounting that task state in
something !task_struct feels weird though.

> }
>
> /* Clear signals before calling svc_exit_thread() */
> flush_signals(current);
>
> + if (current_test_flags(PF_FSTRANS)) {
> + current_restore_flags_nested(&pflags, PF_FSTRANS);
> + lockdep_restore_current_reclaim_state(reclaim_state);
> + atomic_dec(&rqstp->rq_pool->sp_nr_fstrans);
> + }
> +
> mutex_lock(&nfsd_mutex);
> nfsdstats.th_cnt --;
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index a0dbbd1e00e9..4b274aba51dd 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -48,6 +48,7 @@ struct svc_pool {
> struct list_head sp_threads; /* idle server threads */
> struct list_head sp_sockets; /* pending sockets */
> unsigned int sp_nrthreads; /* # of threads in pool */
> + atomic_t sp_nr_fstrans; /* # threads with PF_FSTRANS */
> struct list_head sp_all_threads; /* all server threads */
> struct svc_pool_stats sp_stats; /* statistics on pool operation */
> int sp_task_pending;/* has pending task */
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 5de6801cd924..8b13f35b6cbb 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -477,6 +477,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> INIT_LIST_HEAD(&pool->sp_threads);
> INIT_LIST_HEAD(&pool->sp_sockets);
> INIT_LIST_HEAD(&pool->sp_all_threads);
> + /* The number of threads with PF_FSTRANS set
> + * should never be reduced below 2, except when
> + * threads exit. So we use atomic_dec_if_positive()
> + * on this value.
> + */
> + atomic_set(&pool->sp_nr_fstrans, -2);

Its not my code, but that's not the recommended multi-line comment
style.

> spin_lock_init(&pool->sp_lock);
> }
>
>
>

2014-04-16 09:00:57

by Dave Chinner

Subject: Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, Apr 16, 2014 at 04:49:41PM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > > __d_alloc can be called with i_mutex held, so it is safer to
> > > use GFP_NOFS.
> > >
> > > lockdep reports this can deadlock when loop-back NFS is in use,
> > > as nfsd may be required to write out for reclaim, and nfsd certainly
> > > takes i_mutex.
> >
> > But not the same i_mutex as is currently held. To me, this seems
> > like a false positive? If you are holding the i_mutex on an inode,
> > then you have a reference to the inode and hence memory reclaim
> > won't ever take the i_mutex on that inode.
> >
> > FWIW, this sort of false positive was a long standing problem for
> > XFS - we managed to get rid of most of the false positives like this
> > by ensuring that only the ilock is taken within memory reclaim and
> > memory reclaim can't be entered while we hold the ilock.
> >
> > You can't do that with the i_mutex, though....
> >
> > Cheers,
> >
> > Dave.
>
> I'm not sure this is a false positive.
> You can call __d_alloc when creating a file and so are holding i_mutex on the
> directory.
> nfsd might also want to access that directory.
>
> If there was only 1 nfsd thread, it would need to get i_mutex and do its
> thing before replying to that request and so before it could handle the
> COMMIT which __d_alloc is waiting for.

That seems wrong - the NFS client in __d_alloc holds a mutex on a
NFS client directory inode. The NFS server can't access that
specific mutex - it's on the other side of the "network". The NFS
server accesses mutexes from local filesystems, so __d_alloc would
have to be blocked on a local filesystem inode i_mutex for the nfsd
to get hung up behind it...

However, my confusion comes from the fact that we do GFP_KERNEL
memory allocation with the i_mutex held all over the place. If the
problem is:

local fs access -> i_mutex
.....
nfsd -> i_mutex (blocked)
.....
local fs access -> kmalloc(GFP_KERNEL)
-> direct reclaim
-> nfs_release_page
-> <send write/commit request to blocked nfsds>
<deadlock>

then why is it just __d_alloc that needs this fix? Either this is a
problem *everywhere* or it's not a problem at all.

If it's a problem everywhere it means that we simply can't allow
reclaim from localhost NFS mounts to run from contexts that could
block an NFSD. i.e. you cannot run NFS client memory reclaim from
filesystems that are NFS server exported filesystems.....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-16 13:00:09

by David Miller

Subject: Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

From: Eric Dumazet <[email protected]>
Date: Tue, 15 Apr 2014 22:13:46 -0700

> For applications handling millions of sockets, this makes a difference.

Indeed, this really is not acceptable.

2014-04-16 14:42:42

by Jeff Layton

Subject: Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Wed, 16 Apr 2014 14:03:35 +1000
NeilBrown <[email protected]> wrote:

> Loop-back NFS mounts are when the NFS client and server run on the
> same host.
>
> The use-case for this is a high availability cluster with shared
> storage. The shared filesystem is mounted on any one machine and
> NFS-mounted on the others.
> If the nfs server fails, some other node will take over that service,
> and then it will have a loop-back NFS mount which needs to keep
> working.
>
> This patch set addresses the "keep working" bit and specifically
> addresses deadlocks and livelocks.
> Allowing the fail-over itself to be deadlock free is a separate
> challenge for another day.
>
> The short description of how this works is:
>
> deadlocks:
> - Elevate PF_FSTRANS to apply globally instead of just in NFS and XFS.
> > PF_FSTRANS disables __GFP_FS in the same way that PF_MEMALLOC_NOIO
> disables __GFP_IO.
> - Set PF_FSTRANS in nfsd when handling requests related to
> memory reclaim, or requests which could block requests related
> to memory reclaim.
> - Use lockdep to find all consequent deadlocks from some other
> thread allocating memory while holding a lock that nfsd might
> want.
> - Fix those other deadlocks by setting PF_FSTRANS or using GFP_NOFS
> as appropriate.
>
> livelocks:
> - identify throttling during reclaim and bypass it when
> PF_LESS_THROTTLE is set
> - only set PF_LESS_THROTTLE for nfsd when handling write requests
> from the local host.
>
> The last 12 patches address various deadlocks due to locking chains.
> 11 were found by lockdep, 2 by testing. There is a reasonable chance
> that there are more, I just need to exercise more code while
> testing....
>
> There is one issue that lockdep reports which I haven't fixed (I've
> just hacked the code out for my testing). That issue relates to
> freeze_super().
> I may not be interpreting the lockdep reports perfectly, but I think
> they are basically saying that if I were to freeze a filesystem that
> was exported to the local host, then we could end up deadlocking.
> This is to be expected. The NFS filesystem would need to be frozen
> first. I don't know how to tell lockdep that I know that is a problem
> and I don't want to be warned about it. Suggestions welcome.
> Until this is addressed I cannot really ask others to test the code
> with lockdep enabled.
>
> There are more subsidiary places that I needed to add PF_FSTRANS than
> I would have liked. The thought keeps crossing my mind that maybe we
> can get rid of __GFP_FS and require that memory reclaim never ever
> block on a filesystem. Then most of these patches go away.
>
> Now that writeback doesn't happen from reclaim (but from kswapd) much
> of the calls from reclaim to FS are gone.
> The ->releasepage call is the only one that I *know* causes me
> problems so I'd like to just say that that must never block. I don't
> really understand the consequences of that though.
> There are a couple of other places where __GFP_FS is used and I'd need
> to carefully analyze those. But if someone just said "no, that is
> impossible", I could be happy and stick with the current approach....
>
> I've cc:ed Peter Zijlstra and Ingo Molnar only on the lockdep-related
> patches, Ming Lei only on the PF_MEMALLOC_NOIO related patches,
> and net-dev only on the network-related patches.
> There are probably other people I should CC. Apologies if I missed you.
> I'll ensure better coverage if the nfs/mm/xfs people are reasonably happy.
>
> Comments, criticisms, etc most welcome.
>
> Thanks,
> NeilBrown
>

I've only given this a once-over, but the basic concept seems a bit
flawed. IIUC, the basic idea is to disallow allocations done in knfsd
thread context from doing fs-based reclaim.

This seems very heavy-handed, and like it could cause problems on a
busy NFS server. Those sorts of servers are likely to have a lot of
data in pagecache and thus we generally want to allow them to do
writeback when memory is tight.

It's generally acceptable for knfsd to recurse into local filesystem
code for writeback. What you want to avoid in this situation is reclaim
on NFS filesystems that happen to be served by knfsd on the same box.

If you really want to fix this, what may make more sense is trying to
plumb that information down more granularly. Maybe GFP_NONETFS and/or
PF_NETFSTRANS flags?

>
> ---
>
> NeilBrown (19):
> Promote current_{set,restore}_flags_nested from xfs to global.
> lockdep: lockdep_set_current_reclaim_state should save old value
> lockdep: improve scenario messages for RECLAIM_FS errors.
> Make effect of PF_FSTRANS to disable __GFP_FS universal.
> SUNRPC: track whether a request is coming from a loop-back interface.
> nfsd: set PF_FSTRANS for nfsd threads.
> nfsd and VM: use PF_LESS_THROTTLE to avoid throttle in shrink_inactive_list.
> Set PF_FSTRANS while write_cache_pages calls ->writepage
> XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.
> NET: set PF_FSTRANS while holding sk_lock
> FS: set PF_FSTRANS while holding mmap_sem in exec.c
> NET: set PF_FSTRANS while holding rtnl_lock
> MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.
> driver core: set PF_FSTRANS while holding gdp_mutex
> nfsd: set PF_FSTRANS when client_mutex is held.
> VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.
> VFS: set PF_FSTRANS while namespace_sem is held.
> nfsd: set PF_FSTRANS during nfsd4_do_callback_rpc.
> XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks
>
>
> drivers/base/core.c | 3 ++
> drivers/base/power/runtime.c | 6 ++---
> drivers/block/nbd.c | 6 ++---
> drivers/md/dm-bufio.c | 6 ++---
> drivers/md/dm-ioctl.c | 6 ++---
> drivers/mtd/nand/nandsim.c | 28 ++++++---------------
> drivers/scsi/iscsi_tcp.c | 6 ++---
> drivers/usb/core/hub.c | 6 ++---
> fs/dcache.c | 4 ++-
> fs/exec.c | 6 +++++
> fs/fs-writeback.c | 5 ++--
> fs/namespace.c | 4 +++
> fs/nfs/file.c | 3 +-
> fs/nfsd/nfs4callback.c | 5 ++++
> fs/nfsd/nfs4state.c | 3 ++
> fs/nfsd/nfssvc.c | 24 ++++++++++++++----
> fs/nfsd/vfs.c | 6 +++++
> fs/xfs/kmem.h | 2 --
> fs/xfs/xfs_aops.c | 7 -----
> fs/xfs/xfs_bmap_util.c | 4 +++
> fs/xfs/xfs_file.c | 12 +++++++++
> fs/xfs/xfs_linux.h | 7 -----
> include/linux/lockdep.h | 8 +++---
> include/linux/sched.h | 32 +++++++++---------------
> include/linux/sunrpc/svc.h | 2 ++
> include/linux/sunrpc/svc_xprt.h | 1 +
> include/net/sock.h | 1 +
> kernel/locking/lockdep.c | 51 ++++++++++++++++++++++++++++-----------
> kernel/softirq.c | 6 ++---
> mm/migrate.c | 9 +++----
> mm/page-writeback.c | 3 ++
> mm/page_alloc.c | 18 ++++++++------
> mm/percpu.c | 4 +++
> mm/slab.c | 2 ++
> mm/slob.c | 2 ++
> mm/slub.c | 1 +
> mm/vmscan.c | 31 +++++++++++++++---------
> net/core/dev.c | 6 ++---
> net/core/rtnetlink.c | 9 ++++++-
> net/core/sock.c | 8 ++++--
> net/sunrpc/sched.c | 5 ++--
> net/sunrpc/svc.c | 6 +++++
> net/sunrpc/svcsock.c | 10 ++++++++
> net/sunrpc/xprtrdma/transport.c | 5 ++--
> net/sunrpc/xprtsock.c | 17 ++++++++-----
> 45 files changed, 247 insertions(+), 149 deletions(-)
>


--
Jeff Layton <[email protected]>

2014-04-16 14:47:35

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.

On Wed, 16 Apr 2014 14:03:36 +1000
NeilBrown <[email protected]> wrote:

> If an incoming NFS request is coming from the local host, then
> nfsd will need to perform some special handling. So detect that
> possibility and make the source visible in rq_local.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> include/linux/sunrpc/svc.h | 1 +
> include/linux/sunrpc/svc_xprt.h | 1 +
> net/sunrpc/svcsock.c | 10 ++++++++++
> 3 files changed, 12 insertions(+)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 04e763221246..a0dbbd1e00e9 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -254,6 +254,7 @@ struct svc_rqst {
> u32 rq_prot; /* IP protocol */
> unsigned short
> rq_secure : 1; /* secure port */
> + unsigned short rq_local : 1; /* local request */
>
> void * rq_argp; /* decoded arguments */
> void * rq_resp; /* xdr'd results */
> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index b05963f09ebf..b99bdfb0fcf9 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h
> @@ -63,6 +63,7 @@ struct svc_xprt {
> #define XPT_DETACHED 10 /* detached from tempsocks list */
> #define XPT_LISTENER 11 /* listening endpoint */
> #define XPT_CACHE_AUTH 12 /* cache auth info */
> +#define XPT_LOCAL 13 /* connection from loopback interface */
>
> struct svc_serv *xpt_server; /* service for transport */
> atomic_t xpt_reserved; /* space on outq that is rsvd */
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index b6e59f0a9475..193115fe968c 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -811,6 +811,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
> struct socket *newsock;
> struct svc_sock *newsvsk;
> int err, slen;
> + struct dst_entry *dst;
> RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
>
> dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
> @@ -867,6 +868,14 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
> }
> svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);
>
> + clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> + rcu_read_lock();
> + dst = rcu_dereference(newsock->sk->sk_dst_cache);
> + if (dst && dst->dev &&
> + (dst->dev->features & NETIF_F_LOOPBACK))
> + set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> + rcu_read_unlock();
> +

In the use-case you describe, the client is unlikely to have mounted
"localhost", but is more likely to be mounting a floating address in
the cluster.

Will this flag end up being set in such a situation? It looks like
NETIF_F_LOOPBACK isn't set on all adapters -- mostly on "lo" and a few
others that support MAC-LOOPBACK.


> if (serv->sv_stats)
> serv->sv_stats->nettcpconn++;
>
> @@ -1112,6 +1121,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>
> rqstp->rq_xprt_ctxt = NULL;
> rqstp->rq_prot = IPPROTO_TCP;
> + rqstp->rq_local = !!test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags);
>
> p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
> calldir = p[1];
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


--
Jeff Layton <[email protected]>

2014-04-16 16:37:45

by Al Viro

Subject: Re: [PATCH 17/19] VFS: set PF_FSTRANS while namespace_sem is held.

On Wed, Apr 16, 2014 at 03:52:30PM +1000, NeilBrown wrote:

> So something like this? I put that in to my testing instead.

Something like this, yes... And TBH I would prefer the same approach
elsewhere - this kind of hidden state makes it hard to do any kind of
analysis.

Put it that way - the simplest situation is when the allocation flags
depend only on the call site. Next one is when it's a function of
call chain - somewhat harder to review. And the worst is when it's
a function of previous history of execution - not just the call chain,
but the things that had been called (and returned) prior to that one.

How many of those locations need to be of the third kind? All fs/namespace.c
ones are of the first one...

2014-04-16 23:25:26

by NeilBrown

Subject: Re: [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.

On Wed, 16 Apr 2014 10:47:02 -0400 Jeff Layton <[email protected]>
wrote:

> On Wed, 16 Apr 2014 14:03:36 +1000
> NeilBrown <[email protected]> wrote:
>
> > If an incoming NFS request is coming from the local host, then
> > nfsd will need to perform some special handling. So detect that
> > possibility and make the source visible in rq_local.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > include/linux/sunrpc/svc.h | 1 +
> > include/linux/sunrpc/svc_xprt.h | 1 +
> > net/sunrpc/svcsock.c | 10 ++++++++++
> > 3 files changed, 12 insertions(+)
> >
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 04e763221246..a0dbbd1e00e9 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -254,6 +254,7 @@ struct svc_rqst {
> > u32 rq_prot; /* IP protocol */
> > unsigned short
> > rq_secure : 1; /* secure port */
> > + unsigned short rq_local : 1; /* local request */
> >
> > void * rq_argp; /* decoded arguments */
> > void * rq_resp; /* xdr'd results */
> > diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> > index b05963f09ebf..b99bdfb0fcf9 100644
> > --- a/include/linux/sunrpc/svc_xprt.h
> > +++ b/include/linux/sunrpc/svc_xprt.h
> > @@ -63,6 +63,7 @@ struct svc_xprt {
> > #define XPT_DETACHED 10 /* detached from tempsocks list */
> > #define XPT_LISTENER 11 /* listening endpoint */
> > #define XPT_CACHE_AUTH 12 /* cache auth info */
> > +#define XPT_LOCAL 13 /* connection from loopback interface */
> >
> > struct svc_serv *xpt_server; /* service for transport */
> > atomic_t xpt_reserved; /* space on outq that is rsvd */
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index b6e59f0a9475..193115fe968c 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -811,6 +811,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
> > struct socket *newsock;
> > struct svc_sock *newsvsk;
> > int err, slen;
> > + struct dst_entry *dst;
> > RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
> >
> > dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
> > @@ -867,6 +868,14 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
> > }
> > svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);
> >
> > + clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> > + rcu_read_lock();
> > + dst = rcu_dereference(newsock->sk->sk_dst_cache);
> > + if (dst && dst->dev &&
> > + (dst->dev->features & NETIF_F_LOOPBACK))
> > + set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> > + rcu_read_unlock();
> > +
>
> In the use-case you describe, the client is unlikely to have mounted
> "localhost", but is more likely to be mounting a floating address in
> the cluster.
>
> Will this flag end up being set in such a situation? It looks like
> NETIF_F_LOOPBACK isn't set on all adapters -- mostly on "lo" and a few
> others that support MAC-LOOPBACK.

I was hoping someone on net-dev would have commented if it didn't work - I
probably should have explicitly asked.

My testing certainly suggests that this works if I use any local address to
perform the mount, not just 127.0.0.1.
I don't know enough about routing and neighbours and the dst cache to be
certain but my shallow understanding was always that traffic to any local
address was pseudo-routed through the lo interface and would never get
anywhere near e.g. the eth0 interface.

Can any network people confirm?

Thanks,
NeilBrown


>
>
> > if (serv->sv_stats)
> > serv->sv_stats->nettcpconn++;
> >
> > @@ -1112,6 +1121,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> >
> > rqstp->rq_xprt_ctxt = NULL;
> > rqstp->rq_prot = IPPROTO_TCP;
> > + rqstp->rq_local = !!test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags);
> >
> > p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
> > calldir = p[1];
> >
> >
>
>



2014-04-17 00:21:00

by NeilBrown

Subject: Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Wed, 16 Apr 2014 10:42:07 -0400 Jeff Layton <[email protected]> wrote:

> On Wed, 16 Apr 2014 14:03:35 +1000
> NeilBrown <[email protected]> wrote:
>

> > Comments, criticisms, etc most welcome.
> >
> > Thanks,
> > NeilBrown
> >
>
> I've only given this a once-over, but the basic concept seems a bit
> flawed. IIUC, the basic idea is to disallow allocations done in knfsd
> threads context from doing fs-based reclaim.
>
> This seems very heavy-handed, and like it could cause problems on a
> busy NFS server. Those sorts of servers are likely to have a lot of
> data in pagecache and thus we generally want to allow them to do do
> writeback when memory is tight.
>
> It's generally acceptable for knfsd to recurse into local filesystem
> code for writeback. What you want to avoid in this situation is reclaim
> on NFS filesystems that happen to be from knfsd on the same box.
>
> If you really want to fix this, what may make more sense is trying to
> plumb that information down more granularly. Maybe GFP_NONETFS and/or
> PF_NETFSTRANS flags?

Hi Jeff,
a few clarifications first:

1/ These changes probably won't affect a "busy NFS server" at all. The
PF_FSTRANS flag only gets set in nfsd when it sees a request from the local
host. Most busy NFS servers would never see that, and so would never set
PF_FSTRANS.

2/ Setting PF_FSTRANS does not affect where writeback is done. Direct
reclaim hasn't performed filesystem writeback since 3.2; it is all done
by kswapd (I think direct reclaim still writes to swap sometimes).
The main effects of setting PF_FSTRANS (as modified by this patch set)
are:
- when reclaim calls ->releasepage __GFP_FS is not set in the gfp_t arg
- various caches like dcache, icache etc are not shrunk from
direct reclaim
There are other effects, but I'm less clear on exactly what they mean.

A flag specific to network filesystems might make sense, but I don't think it
would solve all the deadlocks.

A good example is the deadlock with the flush-* threads.
flush-* will lock a page, and then call ->writepage. If ->writepage
allocates memory it can enter reclaim, call ->releasepage on NFS, and block
waiting for a COMMIT to complete.
The COMMIT might already be running, performing fsync on that same file that
flush-* is flushing. It locks each page in turn. When it gets to the page
that flush-* has locked, it will deadlock.

xfs_vm_writepage does allocate memory with __GFP_FS set:
xfs_vm_writepage -> xfs_setfilesize_trans_alloc -> xfs_trans_alloc ->
_xfs_trans_alloc

and I have had this deadlock happen. To avoid this we need flush-* to ensure
that no memory allocation blocks on NFS. We could set a PF_NETFSTRANS there,
but as that code really has nothing to do with networks it would seem an odd
place to put a network-fs-specific flag.

In general, if nfsd is allowed to block on local filesystem, and local
filesystem is allowed to block on NFS, then a deadlock can happen.
We would need a clear hierarchy

__GFP_NETFS > __GFP_FS > __GFP_IO

for it to work. I'm not sure the extra level really helps a lot and it would
be a lot of churn.


Thanks,
NeilBrown



2014-04-17 00:51:16

by NeilBrown

Subject: Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Wed, 16 Apr 2014 19:00:51 +1000 Dave Chinner <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 04:49:41PM +1000, NeilBrown wrote:
> > On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <[email protected]> wrote:
> >
> > > On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > > > __d_alloc can be called with i_mutex held, so it is safer to
> > > > use GFP_NOFS.
> > > >
> > > > lockdep reports this can deadlock when loop-back NFS is in use,
> > > > as nfsd may be required to write out for reclaim, and nfsd certainly
> > > > takes i_mutex.
> > >
> > > But not the same i_mutex as is currently held. To me, this seems
> > > like a false positive? If you are holding the i_mutex on an inode,
> > > then you have a reference to the inode and hence memory reclaim
> > > won't ever take the i_mutex on that inode.
> > >
> > > FWIW, this sort of false positive was a long standing problem for
> > > XFS - we managed to get rid of most of the false positives like this
> > > by ensuring that only the ilock is taken within memory reclaim and
> > > memory reclaim can't be entered while we hold the ilock.
> > >
> > > You can't do that with the i_mutex, though....
> > >
> > > Cheers,
> > >
> > > Dave.
> >
> > I'm not sure this is a false positive.
> > You can call __d_alloc when creating a file and so are holding i_mutex on the
> > directory.
> > nfsd might also want to access that directory.
> >
> > If there was only 1 nfsd thread, it would need to get i_mutex and do its
> > thing before replying to that request and so before it could handle the
> > COMMIT which __d_alloc is waiting for.
>
> That seems wrong - the NFS client in __d_alloc holds a mutex on a
> NFS client directory inode. The NFS server can't access that
> specific mutex - it's on the other side of the "network". The NFS
> server accesses mutexes from local filesystems, so __d_alloc would
> have to be blocked on a local filesystem inode i_mutex for the nfsd
> to get hung up behind it...

I'm not thinking of mutexes on the NFS inodes but the local filesystem inodes
exactly as you describe below.

>
> However, my confusion comes from the fact that we do GFP_KERNEL
> memory allocation with the i_mutex held all over the place.

Do we? Should we? Isn't the whole point of GFP_NOFS to use it when holding
any filesystem lock?

> If the
> problem is:
>
> local fs access -> i_mutex
> .....
> nfsd -> i_mutex (blocked)
> .....
> local fs access -> kmalloc(GFP_KERNEL)
> -> direct reclaim
> -> nfs_release_page
> -> <send write/commit request to blocked nfsds>
> <deadlock>
>
> then why is it just __d_alloc that needs this fix? Either this is a
> problem *everywhere* or it's not a problem at all.

I think it is a problem everywhere that it is a problem :-)
If you are holding an FS lock, then you should be using GFP_NOFS.
Currently a given filesystem can get away with sometimes using GFP_KERNEL
because that particular lock never causes contention during reclaim for that
particular filesystem.

Adding loop-back NFS into the mix broadens the number of locks which can
cause a problem as it creates interdependencies between different filesystems.

>
> If it's a problem everywhere it means that we simply can't allow
> reclaim from localhost NFS mounts to run from contexts that could
> block an NFSD. i.e. you cannot run NFS client memory reclaim from
> filesystems that are NFS server exported filesystems.....

Well.. you cannot allow NFS client memory reclaim *while holding locks in*
filesystems that are NFS exported.

I think this is most effectively generalised to:
you cannot allow FS memory reclaim while holding locks in filesystems which
can be NFS exported

which I think is largely the case already - and lockdep can help us find
those places where we currently do allow FS reclaim while holding an FS lock.

Thanks,
NeilBrown



2014-04-17 01:04:02

by NeilBrown

Subject: Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Wed, 16 Apr 2014 16:17:26 +1000 NeilBrown <[email protected]> wrote:

> On Wed, 16 Apr 2014 15:37:56 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:

> > > - /*
> > > - * Given that we do not allow direct reclaim to call us, we should
> > > - * never be called while in a filesystem transaction.
> > > - */
> > > - if (WARN_ON(current->flags & PF_FSTRANS))
> > > - goto redirty;
> >
> > We still need to ensure this rule isn't broken. If it is, the
> > filesystem will silently deadlock in delayed allocation rather than
> > gracefully handle the problem with a warning....
>
> Hmm... that might be tricky. The 'new' PF_FSTRANS can definitely be set when
> xfs_vm_writepage is called and we really want the write to happen.
> I don't suppose there is any other way to detect if a transaction is
> happening?

I've been thinking about this some more....

That code is in xfs_vm_writepage which is only called as ->writepage.
xfs never calls that directly so it could only possibly be called during
reclaim?

We know that doesn't happen, but if it does then PF_MEMALLOC would be set,
but PF_KSWAPD would not... and you already have a test for that.

How about every time we set PF_FSTRANS, we store the corresponding
xfs_trans_t in current->journal_info, and clear that field when PF_FSTRANS is
cleared. Then xfs_vm_writepage can test for current->journal_info being
clear.
That is the field that several other filesystems use to keep track of the
'current' transaction.
??

I don't know what xfs_trans_t we would use in xfs_bmapi_allocate_worker, but
I suspect you do :-)

Thanks,
NeilBrown



2014-04-17 01:27:46

by Dave Chinner

Subject: Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> A good example is the deadlock with the flush-* threads.
> flush-* will lock a page, and then call ->writepage. If ->writepage
> allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> waiting for a COMMIT to complete.
> The COMMIT might already be running, performing fsync on that same file that
> flush-* is flushing. It locks each page in turn. When it gets to the page
> that flush-* has locked, it will deadlock.

It's nfs_release_page() again....

> In general, if nfsd is allowed to block on local filesystem, and local
> filesystem is allowed to block on NFS, then a deadlock can happen.
> We would need a clear hierarchy
>
> __GFP_NETFS > __GFP_FS > __GFP_IO
>
> for it to work. I'm not sure the extra level really helps a lot and it would
> be a lot of churn.

I think you are looking at this the wrong way - it's not the other
filesystems that have to avoid memory reclaim recursion, it's the
NFS client mount that is on loopback that needs to avoid recursion.

IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
on the same host during memory reclaim. That is, nfs_release_page()
cannot send commit messages to the server if the server is on
localhost. Instead, it just tells memory reclaim that it can't
reclaim that page.

If nfs_release_page() no longer blocks in memory reclaim, all
these nfsd-gets-blocked-in-GFP_KERNEL-memory-allocation recursion
problems go away. Do the same for all the other memory reclaim
operations in the NFS client, and you've got a solution that should
work without needing to walk all over the rest of the kernel....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-17 01:50:32

by NeilBrown

Subject: Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Thu, 17 Apr 2014 11:27:39 +1000 Dave Chinner <[email protected]> wrote:

> On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> > A good example is the deadlock with the flush-* threads.
> > flush-* will lock a page, and then call ->writepage. If ->writepage
> > allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> > waiting for a COMMIT to complete.
> > The COMMIT might already be running, performing fsync on that same file that
> > flush-* is flushing. It locks each page in turn. When it gets to the page
> > that flush-* has locked, it will deadlock.
>
> It's nfs_release_page() again....
>
> > In general, if nfsd is allowed to block on local filesystem, and local
> > filesystem is allowed to block on NFS, then a deadlock can happen.
> > We would need a clear hierarchy
> >
> > __GFP_NETFS > __GFP_FS > __GFP_IO
> >
> > for it to work. I'm not sure the extra level really helps a lot and it would
> > be a lot of churn.
>
> I think you are looking at this the wrong way - it's not the other
> filesystems that have to avoid memory reclaim recursion, it's the
> NFS client mount that is on loopback that needs to avoid recursion.
>
> IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
> on the same host during memory reclaim. That is, nfs_release_page()
> cannot send commit messages to the server if the server is on
> localhost. Instead, it just tells memory reclaim that it can't
> reclaim that page.
>
> If nfs_release_page() no longer blocks in memory reclaim, all
> these nfsd-gets-blocked-in-GFP_KERNEL-memory-allocation recursion
> problems go away. Do the same for all the other memory reclaim
> operations in the NFS client, and you've got a solution that should
> work without needing to walk all over the rest of the kernel....

Maybe.
It is nfs_release_page() today. I wonder if it could be other things another
day. I want to be sure I have a solution that really makes sense.

However ... the thing that nfs_release_page is doing is sending a COMMIT to
tell the server to flush to stable storage. It does that so that if the
server crashes, then the client can re-send.
Of course when it is a loop-back mount the client is the server so the COMMIT
is completely pointless. If the client notices that it is sending a COMMIT
to itself, it can simply assume a positive reply.

You are right, that would make the patch set a lot less intrusive. I'll give
it some serious thought - thanks.

NeilBrown



2014-04-17 02:38:49

by NeilBrown

Subject: Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock

On Wed, 16 Apr 2014 09:00:02 -0400 (EDT) David Miller <[email protected]>
wrote:

> From: Eric Dumazet <[email protected]>
> Date: Tue, 15 Apr 2014 22:13:46 -0700
>
> > For applications handling millions of sockets, this makes a difference.
>
> Indeed, this really is not acceptable.

As you say...
I've just discovered that I can get rid of the lockdep message (and hence
presumably the deadlock risk) with a well placed:

newsock->sk->sk_allocation = GFP_NOFS;

which surprised me as it seemed to be an explicit GFP_KERNEL allocation that
was mentioned in the lockdep trace. Obviously these traces require quite
some sophistication to understand.

So - thanks for the feedback, patch can be ignored.

Thanks,
NeilBrown



2014-04-17 04:23:22

by Dave Chinner

Subject: Re: [PATCH/RFC 00/19] Support loop-back NFS mounts

On Thu, Apr 17, 2014 at 11:50:18AM +1000, NeilBrown wrote:
> On Thu, 17 Apr 2014 11:27:39 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> > > A good example is the deadlock with the flush-* threads.
> > > flush-* will lock a page, and then call ->writepage. If ->writepage
> > > allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> > > waiting for a COMMIT to complete.
> > > The COMMIT might already be running, performing fsync on that same file that
> > > flush-* is flushing. It locks each page in turn. When it gets to the page
> > > that flush-* has locked, it will deadlock.
> >
> > It's nfs_release_page() again....
> >
> > > In general, if nfsd is allowed to block on local filesystem, and local
> > > filesystem is allowed to block on NFS, then a deadlock can happen.
> > > We would need a clear hierarchy
> > >
> > > __GFP_NETFS > __GFP_FS > __GFP_IO
> > >
> > > for it to work. I'm not sure the extra level really helps a lot and it would
> > > be a lot of churn.
> >
> > I think you are looking at this the wrong way - it's not the other
> > filesystems that have to avoid memory reclaim recursion, it's the
> > NFS client mount that is on loopback that needs to avoid recursion.
> >
> > IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
> > on the same host during memory reclaim. That is, nfs_release_page()
> > cannot send commit messages to the server if the server is on
> > localhost. Instead, it just tells memory reclaim that it can't
> > reclaim that page.
> >
> > If nfs_release_page() no longer blocks in memory reclaim, all
> > these nfsd-gets-blocked-in-GFP_KERNEL-memory-allocation recursion
> > problems go away. Do the same for all the other memory reclaim
> > operations in the NFS client, and you've got a solution that should
> > work without needing to walk all over the rest of the kernel....
>
> Maybe.
> It is nfs_release_page() today. I wonder if it could be other things another
> day. I want to be sure I have a solution that really makes sense.

There could be other things, but in the absence of those things,
I don't think that adding another layer to memory reclaim
dependencies for this niche corner case makes a lot of sense. ;)

> However ... the thing that nfs_release_page is doing is sending a COMMIT to
> tell the server to flush to stable storage. It does that so that if the
> server crashes, then the client can re-send.
> Of course when it is a loop-back mount the client is the server so the COMMIT
> is completely pointless. If the client notices that it is sending a COMMIT
> to itself, it can simply assume a positive reply.

Yes, that's very true. You might have to treat ->writepage
specially, too, if that can block, say, on the number of outstanding
requests that can be sent to the server.

> You are right, that would make the patch set a lot less intrusive. I'll give
> it some serious thought - thanks.

No worries. :)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-17 04:42:34

by Dave Chinner

Subject: Re: [PATCH 04/19] Make effect of PF_FSTRANS to disable __GFP_FS universal.

On Thu, Apr 17, 2014 at 11:03:50AM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 16:17:26 +1000 NeilBrown <[email protected]> wrote:
>
> > On Wed, 16 Apr 2014 15:37:56 +1000 Dave Chinner <[email protected]> wrote:
> >
> > > On Wed, Apr 16, 2014 at 02:03:36PM +1000, NeilBrown wrote:
>
> > > > - /*
> > > > - * Given that we do not allow direct reclaim to call us, we should
> > > > - * never be called while in a filesystem transaction.
> > > > - */
> > > > - if (WARN_ON(current->flags & PF_FSTRANS))
> > > > - goto redirty;
> > >
> > > We still need to ensure this rule isn't broken. If it is, the
> > > filesystem will silently deadlock in delayed allocation rather than
> > > gracefully handle the problem with a warning....
> >
> > Hmm... that might be tricky. The 'new' PF_FSTRANS can definitely be set when
> > xfs_vm_writepage is called and we really want the write to happen.
> > I don't suppose there is any other way to detect if a transaction is
> > happening?
>
> I've been thinking about this some more....
>
> That code is in xfs_vm_writepage which is only called as ->writepage.
> xfs never calls that directly so it could only possibly be called during
> reclaim?

__filemap_fdatawrite_range or __writeback_single_inode
do_writepages
->writepages
xfs_vm_writepages
write_cache_pages
->writepage
xfs_vm_writepage

So explicit data flushes or background writeback still end up in
xfs_vm_writepage.

> We know that doesn't happen, but if it does then PF_MEMALLOC would be set,
> but PF_KSWAPD would not... and you already have a test for that.
>
> How about every time we set PF_FSTRANS, we store the corresponding
> xfs_trans_t in current->journal_info, and clear that field when PF_FSTRANS is
> cleared. Then xfs_vm_writepage can test for current->journal_info being
> clear.
> That is the field that several other filesystems use to keep track of the
> 'current' transaction.

The difference is that we have an explicit transaction handle in XFS
which defines the transaction context. i.e. we don't hide
transactions in thread contexts - the transaction defines the atomic
context of the modification being made.....

> I don't know what xfs_trans_t we would use in
> xfs_bmapi_allocate_worker, but I suspect you do :-)

The same one we use now.

But that's exactly my point. i.e. the transaction handle belongs to
the operation being executed, not the thread that is currently
executing it. We also hand transaction contexts to IO completion,
do interesting things with log space reservations for operations
that require multiple commits to complete and so pass state when
handles are duplicated prior to commit, etc. We still need direct
manipulation and control of the transaction structure, regardless of
where it is stored.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-17 05:58:21

by Dave Chinner

Subject: Re: [PATCH 16/19] VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.

On Thu, Apr 17, 2014 at 10:51:05AM +1000, NeilBrown wrote:
> On Wed, 16 Apr 2014 19:00:51 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Wed, Apr 16, 2014 at 04:49:41PM +1000, NeilBrown wrote:
> > > On Wed, 16 Apr 2014 16:25:20 +1000 Dave Chinner <[email protected]> wrote:
> > >
> > > > On Wed, Apr 16, 2014 at 02:03:37PM +1000, NeilBrown wrote:
> > > > > __d_alloc can be called with i_mutex held, so it is safer to
> > > > > use GFP_NOFS.
> > > > >
> > > > > lockdep reports this can deadlock when loop-back NFS is in use,
> > > > > as nfsd may be required to write out for reclaim, and nfsd certainly
> > > > > takes i_mutex.
> > > >
> > > > But not the same i_mutex as is currently held. To me, this seems
> > > > like a false positive? If you are holding the i_mutex on an inode,
> > > > then you have a reference to the inode and hence memory reclaim
> > > > won't ever take the i_mutex on that inode.
> > > >
> > > > FWIW, this sort of false positive was a long standing problem for
> > > > XFS - we managed to get rid of most of the false positives like this
> > > > by ensuring that only the ilock is taken within memory reclaim and
> > > > memory reclaim can't be entered while we hold the ilock.
> > > >
> > > > You can't do that with the i_mutex, though....
> > > >
> > > > Cheers,
> > > >
> > > > Dave.
> > >
> > > I'm not sure this is a false positive.
> > > You can call __d_alloc when creating a file and so are holding i_mutex on the
> > > directory.
> > > nfsd might also want to access that directory.
> > >
> > > If there was only 1 nfsd thread, it would need to get i_mutex and do its
> > > thing before replying to that request and so before it could handle the
> > > COMMIT which __d_alloc is waiting for.
> >
> > That seems wrong - the NFS client in __d_alloc holds a mutex on a
> > NFS client directory inode. The NFS server can't access that
> > specific mutex - it's on the other side of the "network". The NFS
> > server accesses mutexes from local filesystems, so __d_alloc would
> > have to be blocked on a local filesystem inode i_mutex for the nfsd
> > to get hung up behind it...
>
> I'm not thinking of mutexes on the NFS inodes but the local filesystem inodes
> exactly as you describe below.
>
> >
> > However, my confusion comes from the fact that we do GFP_KERNEL
> > memory allocation with the i_mutex held all over the place.
>
> Do we?

Yes.

A simple example: fs/xattr.c. Setting or removing an xattr is done
under the i_mutex, yet have a look at the simple_xattr_*
implementation that in-memory/pseudo filesystems can use. They use
GFP_KERNEL for all their allocations....

> Should we?

No, I don't think so, because it means that under heavy filesystem
memory pressure workloads, direct reclaim can effectively shut off
and only kswapd can free memory.

> Isn't the whole point of GFP_NOFS to use it when holding
> any filesystem lock?

Not the way I understand it - we've always used it in XFS to prevent
known deadlocks due to recursion, not as a big hammer that we hit
everything with. Every filesystem has different recursion deadlock
triggers, so uses GFP_NOFS differently. Using it as a big hammer
doesn't play well with memory reclaim on workloads that generate
filesystem memory pressure...

> > If the
> > problem is:
> >
> > local fs access -> i_mutex
> > .....
> > nfsd -> i_mutex (blocked)
> > .....
> > local fs access -> kmalloc(GFP_KERNEL)
> > -> direct reclaim
> > -> nfs_release_page
> > -> <send write/commit request to blocked nfsds>
> > <deadlock>
> >
> > then why is it just __d_alloc that needs this fix? Either this is a
> > problem *everywhere* or it's not a problem at all.
>
> I think it is a problem everywhere that it is a problem :-)

<head implosion>

> If you are holding an FS lock, then you should be using GFP_NOFS.

Only if reclaim recursion can cause a deadlock.

> Currently a given filesystem can get away with sometimes using GFP_KERNEL
> because that particular lock never causes contention during reclaim for that
> particular filesystem.

Right, because we directly control what recursion can happen.

What you are doing is changing the global reclaim recursion context.
You're introducing recursion loops between filesystems *of different
types*. IOWs, at no point is it safe for anyone to allow any
recursion because we don't have the state available to determine if
recursion is safe or not.

> Adding loop-back NFS into the mix broadens the number of locks which can
> cause a problem as it creates interdependencies between different filesystems.

And that's a major architectural change to the memory reclaim
hierarchy, and that means we're going to be breaking assumptions all
over the place, including in places we didn't know we had
dependencies...

> > If it's a problem everywhere it means that we simply can't allow
> > reclaim from localhost NFS mounts to run from contexts that could
> > block an NFSD. i.e. you cannot run NFS client memory reclaim from
> > filesystems that are NFS server exported filesystems.....
>
> Well.. you cannot allow NFS client memory reclaim *while holding locks in*
> filesystems that are NFS exported.
>
> I think this is most effectively generalised to:
> you cannot allow FS memory reclaim while holding locks in filesystems which
> can be NFS exported

Yes, that's exactly what I said yesterday.

> which I think is largely the case already

It most definitely isn't.

> - and lockdep can help us find
> those places where we currently do allow FS reclaim while holding an FS lock.

A pox on lockdep. Lockdep is not a crutch we can use to wave away
problems an architectural change doesn't identify. We need to think
about the architectural impact of the change - if we can't see all
the assumptions we're breaking and the downstream impacts, then we
need to think about the problem harder. We should not wave away
concerns by saying "lockdep will find the things we missed"...

For example, if you make all filesystem allocations GFP_NOFS, then
the only thing that will be able to do reclaim of inodes and
dentries is kswapd, because the shrinkers don't run in GFP_NOFS
context. Hence filesystem memory pressure (e.g. find) won't be able
to directly reclaim the objects that it is generating, and that will
significantly change the behaviour and performance of the system.

IOWs, there are lots of nasty, subtle downstream effects of making a
blanket "filesystems must use GFP_NOFS if they hold any lock" rule
that I can see, so I really think that we should be looking to solve
this problem in the NFS client and exhausting all possibilities
there before we even consider changing something as fundamental as
memory reclaim recursion hierarchies....

Cheers,

Dave.
--
Dave Chinner
[email protected]