This set includes acked-by's from Andrew and Peter so it should be
OK for all five patches to go upstream through the NFS tree.
I split the congestion tracking patch out from the wait-for-PG_private
patch as they are conceptually separate.
This set continues to perform well in my tests and addresses all
issues that have been raised.
Thanks a lot,
NeilBrown
---
NeilBrown (5):
SCHED: add some "wait..on_bit...timeout()" interfaces.
MM: export page_wakeup functions
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
NFS: avoid waiting at all in nfs_release_page when congested.
NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
fs/nfs/file.c | 29 +++++++++++++++++++----------
fs/nfs/write.c | 7 +++++++
include/linux/pagemap.h | 12 ++++++++++--
include/linux/wait.h | 5 ++++-
kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
mm/filemap.c | 21 +++++++++++++++------
net/sunrpc/sched.c | 2 --
net/sunrpc/xprtrdma/transport.c | 2 --
net/sunrpc/xprtsock.c | 10 ----------
9 files changed, 91 insertions(+), 33 deletions(-)
--
Signature
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.
If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service. If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.
nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().
This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second. If the commit doesn't complete
in this time, nfs_release_page() will fail. This means it might now
fail in some cases where it wouldn't before. These cases are only
when 'gfp' includes '__GFP_WAIT'.
nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
- page_cache_pipe_buf_steal() in splice.c
- shrink_page_list() in vmscan.c
- invalidate_inode_pages2_range() in truncate.c
The first two handle failure quite safely. The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.
So aborting if the commit takes longer than 1 second is perfectly safe.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 26 ++++++++++++++++----------
fs/nfs/write.c | 2 ++
2 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80d1898..ef5513322cf6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -468,17 +468,23 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
- /* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
- * doing this memory reclaim for a fs-related allocation.
+ /* Always try to initiate a 'commit' if relevant, but only
+ * wait for it if __GFP_WAIT is set and the calling process is
+ * allowed to block. Even then, only wait 1 second.
+ * Waiting indefinitely can cause deadlocks when the NFS
+ * server is on this machine, and there is no particular need
+ * to wait extensively here. A short wait has the benefit
+ * that someone else can worry about the freezer.
*/
- if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
- !(current->flags & PF_FSTRANS)) {
- int how = FLUSH_SYNC;
-
- /* Don't let kswapd deadlock waiting for OOM RPC calls */
- if (current_is_kswapd())
- how = 0;
- nfs_commit_inode(mapping->host, how);
+ if (mapping) {
+ struct nfs_server *nfss = NFS_SERVER(mapping->host);
+ nfs_commit_inode(mapping->host, 0);
+ if ((gfp & __GFP_WAIT) &&
+ !current_is_kswapd() &&
+ !(current->flags & PF_FSTRANS)) {
+ wait_on_page_bit_killable_timeout(page, PG_private,
+ HZ);
+ }
}
/* If PagePrivate() is set, then the page is not freeable */
if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 175d5d073ccf..b5d83c7545d4 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
if (likely(!PageSwapCache(head->wb_page))) {
set_page_private(head->wb_page, 0);
ClearPagePrivate(head->wb_page);
+ smp_mb__after_atomic();
+ wake_up_page(head->wb_page, PG_private);
clear_bit(PG_MAPPED, &head->wb_flags);
}
nfsi->npages--;
In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
sched: Allow wait_on_bit_action() functions to support a timeout
I suggested that a "wait_on_bit_timeout()" interface would not meet my
need. This isn't true - I was just over-engineering.
Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization. If some other
use is ever found, it can be generalized or added later.
So this patch renames "private" to "timeout" with a meaning "stop
waiting when "jiffies" reaches or passes "timeout",
and adds two of the many possible wait..bit..timeout() interfaces:
wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example. Others can be added as needed.
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
include/linux/pagemap.h | 2 ++
include/linux/wait.h | 5 ++++-
kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
mm/filemap.c | 13 +++++++++++++
4 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7db7a4e..87f9e4230d3a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -502,6 +502,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern int wait_on_page_bit_killable_timeout(struct page *page,
+ int bit_nr, unsigned long timeout);
static inline int wait_on_page_locked_killable(struct page *page)
{
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6fb1ba5f9b2f..80115bf88671 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -25,7 +25,7 @@ struct wait_bit_key {
void *flags;
int bit_nr;
#define WAIT_ATOMIC_T_BIT_NR -1
- unsigned long private;
+ unsigned long timeout;
};
struct wait_bit_queue {
@@ -154,6 +154,7 @@ int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_ac
void wake_up_bit(void *, int);
void wake_up_atomic_t(atomic_t *);
int out_of_line_wait_on_bit(void *, int, wait_bit_action_f *, unsigned);
+int out_of_line_wait_on_bit_timeout(void *, int, wait_bit_action_f *, unsigned, unsigned long);
int out_of_line_wait_on_bit_lock(void *, int, wait_bit_action_f *, unsigned);
int out_of_line_wait_on_atomic_t(atomic_t *, int (*)(atomic_t *), unsigned);
wait_queue_head_t *bit_waitqueue(void *, int);
@@ -859,6 +860,8 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
extern int bit_wait(struct wait_bit_key *);
extern int bit_wait_io(struct wait_bit_key *);
+extern int bit_wait_timeout(struct wait_bit_key *);
+extern int bit_wait_io_timeout(struct wait_bit_key *);
/**
* wait_on_bit - wait for a bit to be cleared
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 15cab1a4f84e..380678b3cba4 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -343,6 +343,18 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
}
EXPORT_SYMBOL(out_of_line_wait_on_bit);
+int __sched out_of_line_wait_on_bit_timeout(
+ void *word, int bit, wait_bit_action_f *action,
+ unsigned mode, unsigned long timeout)
+{
+ wait_queue_head_t *wq = bit_waitqueue(word, bit);
+ DEFINE_WAIT_BIT(wait, word, bit);
+
+ wait.key.timeout = jiffies + timeout;
+ return __wait_on_bit(wq, &wait, action, mode);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_bit_timeout);
+
int __sched
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
wait_bit_action_f *action, unsigned mode)
@@ -520,3 +532,27 @@ __sched int bit_wait_io(struct wait_bit_key *word)
return 0;
}
EXPORT_SYMBOL(bit_wait_io);
+
+__sched int bit_wait_timeout(struct wait_bit_key *word)
+{
+ unsigned long now = ACCESS_ONCE(jiffies);
+ if (signal_pending_state(current->state, current))
+ return 1;
+ if (time_after_eq(now, word->timeout))
+ return -EAGAIN;
+ schedule_timeout(word->timeout - now);
+ return 0;
+}
+EXPORT_SYMBOL(bit_wait_timeout);
+
+__sched int bit_wait_io_timeout(struct wait_bit_key *word)
+{
+ unsigned long now = ACCESS_ONCE(jiffies);
+ if (signal_pending_state(current->state, current))
+ return 1;
+ if (time_after_eq(now, word->timeout))
+ return -EAGAIN;
+ io_schedule_timeout(word->timeout - now);
+ return 0;
+}
+EXPORT_SYMBOL(bit_wait_io_timeout);
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcdf948d..4a19c084bdb1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -703,6 +703,19 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
bit_wait_io, TASK_KILLABLE);
}
+int wait_on_page_bit_killable_timeout(struct page *page,
+ int bit_nr, unsigned long timeout)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+
+ wait.key.timeout = jiffies + timeout;
+ if (!test_bit(bit_nr, &page->flags))
+ return 0;
+ return __wait_on_bit(page_waitqueue(page), &wait,
+ bit_wait_io_timeout, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(wait_on_page_bit_killable_timeout);
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
On Wed, 24 Sep 2014 11:28:32 +1000
NeilBrown <[email protected]> wrote:
> This set includes acked-by's from Andrew and Peter so it should be
> OK for all five patches to go upstream through the NFS tree.
>
> I split the congestion tracking patch out from the wait-for-PG_private
> patch as they are conceptually separate.
>
> This set continues to perform well in my tests and addresses all
> issues that have been raised.
>
> Thanks a lot,
> NeilBrown
>
>
> ---
>
> NeilBrown (5):
> SCHED: add some "wait..on_bit...timeout()" interfaces.
> MM: export page_wakeup functions
> NFS: avoid deadlocks with loop-back mounted NFS filesystems.
> NFS: avoid waiting at all in nfs_release_page when congested.
> NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
>
>
> fs/nfs/file.c | 29 +++++++++++++++++++----------
> fs/nfs/write.c | 7 +++++++
> include/linux/pagemap.h | 12 ++++++++++--
> include/linux/wait.h | 5 ++++-
> kernel/sched/wait.c | 36 ++++++++++++++++++++++++++++++++++++
> mm/filemap.c | 21 +++++++++++++++------
> net/sunrpc/sched.c | 2 --
> net/sunrpc/xprtrdma/transport.c | 2 --
> net/sunrpc/xprtsock.c | 10 ----------
> 9 files changed, 91 insertions(+), 33 deletions(-)
>
Cool!
This looks like it'll address my earlier concern about setting the BDI
congested inappropriately. You can add this to the set if you like:
Acked-by: Jeff Layton <[email protected]>
* NeilBrown <[email protected]> wrote:
> @@ -859,6 +860,8 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
>
> extern int bit_wait(struct wait_bit_key *);
> extern int bit_wait_io(struct wait_bit_key *);
> +extern int bit_wait_timeout(struct wait_bit_key *);
> +extern int bit_wait_io_timeout(struct wait_bit_key *);
>
> /**
> * wait_on_bit - wait for a bit to be cleared
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 15cab1a4f84e..380678b3cba4 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -343,6 +343,18 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
> }
> EXPORT_SYMBOL(out_of_line_wait_on_bit);
>
> +int __sched out_of_line_wait_on_bit_timeout(
> + void *word, int bit, wait_bit_action_f *action,
> + unsigned mode, unsigned long timeout)
> +{
> + wait_queue_head_t *wq = bit_waitqueue(word, bit);
> + DEFINE_WAIT_BIT(wait, word, bit);
> +
> + wait.key.timeout = jiffies + timeout;
> + return __wait_on_bit(wq, &wait, action, mode);
> +}
> +EXPORT_SYMBOL(out_of_line_wait_on_bit_timeout);
> +
> int __sched
> __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> wait_bit_action_f *action, unsigned mode)
> @@ -520,3 +532,27 @@ __sched int bit_wait_io(struct wait_bit_key *word)
> return 0;
> }
> EXPORT_SYMBOL(bit_wait_io);
> +
> +__sched int bit_wait_timeout(struct wait_bit_key *word)
> +{
> + unsigned long now = ACCESS_ONCE(jiffies);
> + if (signal_pending_state(current->state, current))
> + return 1;
> + if (time_after_eq(now, word->timeout))
> + return -EAGAIN;
> + schedule_timeout(word->timeout - now);
> + return 0;
> +}
> +EXPORT_SYMBOL(bit_wait_timeout);
> +
> +__sched int bit_wait_io_timeout(struct wait_bit_key *word)
> +{
> + unsigned long now = ACCESS_ONCE(jiffies);
> + if (signal_pending_state(current->state, current))
> + return 1;
> + if (time_after_eq(now, word->timeout))
> + return -EAGAIN;
> + io_schedule_timeout(word->timeout - now);
> + return 0;
> +}
> +EXPORT_SYMBOL(bit_wait_io_timeout);
New scheduler APIs should be exported via EXPORT_SYMBOL_GPL().
Thanks,
Ingo
On Tue, Sep 23, 2014 at 9:28 PM, NeilBrown <[email protected]> wrote:
> This set includes acked-by's from Andrew and Peter so it should be
> OK for all five patches to go upstream through the NFS tree.
>
> I split the congestion tracking patch out from the wait-for-PG_private
> patch as they are conceptually separate.
>
> This set continues to perform well in my tests and addresses all
> issues that have been raised.
>
> Thanks a lot,
> NeilBrown
>
Thanks Neil! I'll give them a final review tomorrow, and then queue
them up for the 3.18 merge window.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could be come excessive. I have
seen delays of as much as 208 seconds.
To keep the delay to one second, mark the bdi as write-congested
if the commit didn't finished. Once it does finish, the
write-congested flag will be cleared by nfs_commit_release_pages().
With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds. With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 9 +++++++--
fs/nfs/write.c | 5 +++++
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index ef5513322cf6..1243a15438d0 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -470,7 +470,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
/* Always try to initiate a 'commit' if relevant, but only
* wait for it if __GFP_WAIT is set and the calling process is
- * allowed to block. Even then, only wait 1 second.
+ * allowed to block. Even then, only wait 1 second and only
+ * if the 'bdi' is not congested.
* Waiting indefinitely can cause deadlocks when the NFS
* server is on this machine, and there is no particular need
* to wait extensively here. A short wait has the benefit
@@ -481,9 +482,13 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
nfs_commit_inode(mapping->host, 0);
if ((gfp & __GFP_WAIT) &&
!current_is_kswapd() &&
- !(current->flags & PF_FSTRANS)) {
+ !(current->flags & PF_FSTRANS) &&
+ !bdi_write_congested(&nfss->backing_dev_info)) {
wait_on_page_bit_killable_timeout(page, PG_private,
HZ);
+ if (PagePrivate(page))
+ set_bdi_congested(&nfss->backing_dev_info,
+ BLK_RW_ASYNC);
}
}
/* If PagePrivate() is set, then the page is not freeable */
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b5d83c7545d4..3066c7fcb565 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1638,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
struct nfs_page *req;
int status = data->task.tk_status;
struct nfs_commit_info cinfo;
+ struct nfs_server *nfss;
while (!list_empty(&data->pages)) {
req = nfs_list_entry(data->pages.next);
@@ -1671,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
next:
nfs_unlock_and_release_request(req);
}
+ nfss = NFS_SERVER(data->inode);
+ if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+ clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
nfs_init_cinfo(&cinfo, data->inode, data->dreq);
if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
nfs_commit_clear_lock(NFS_I(data->inode));
* NeilBrown <[email protected]> wrote:
>
> In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
> sched: Allow wait_on_bit_action() functions to support a timeout
>
> I suggested that a "wait_on_bit_timeout()" interface would not meet my
> need. This isn't true - I was just over-engineering.
>
> Including a 'private' field in wait_bit_key instead of a focused
> "timeout" field was just premature generalization. If some other
> use is ever found, it can be generalized or added later.
>
> So this patch renames "private" to "timeout" with a meaning "stop
> waiting when "jiffies" reaches or passes "timeout",
> and adds two of the many possible wait..bit..timeout() interfaces:
>
> wait_on_page_bit_killable_timeout(), which is the one I want to use,
> and out_of_line_wait_on_bit_timeout() which is a reasonably general
> example. Others can be added as needed.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: NeilBrown <[email protected]>
>
> ---
> This time with EXPORT_SYMBOL_GPL.
Looks good to me, thanks!
Acked-by: Ingo Molnar <[email protected]>
Ingo
On Wed, 24 Sep 2014 09:04:18 +0200 Ingo Molnar <[email protected]> wrote:
>
> * NeilBrown <[email protected]> wrote:
>
> > @@ -859,6 +860,8 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
> >
> > extern int bit_wait(struct wait_bit_key *);
> > extern int bit_wait_io(struct wait_bit_key *);
> > +extern int bit_wait_timeout(struct wait_bit_key *);
> > +extern int bit_wait_io_timeout(struct wait_bit_key *);
> >
> > /**
> > * wait_on_bit - wait for a bit to be cleared
> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 15cab1a4f84e..380678b3cba4 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -343,6 +343,18 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
> > }
> > EXPORT_SYMBOL(out_of_line_wait_on_bit);
> >
> > +int __sched out_of_line_wait_on_bit_timeout(
> > + void *word, int bit, wait_bit_action_f *action,
> > + unsigned mode, unsigned long timeout)
> > +{
> > + wait_queue_head_t *wq = bit_waitqueue(word, bit);
> > + DEFINE_WAIT_BIT(wait, word, bit);
> > +
> > + wait.key.timeout = jiffies + timeout;
> > + return __wait_on_bit(wq, &wait, action, mode);
> > +}
> > +EXPORT_SYMBOL(out_of_line_wait_on_bit_timeout);
> > +
> > int __sched
> > __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > wait_bit_action_f *action, unsigned mode)
> > @@ -520,3 +532,27 @@ __sched int bit_wait_io(struct wait_bit_key *word)
> > return 0;
> > }
> > EXPORT_SYMBOL(bit_wait_io);
> > +
> > +__sched int bit_wait_timeout(struct wait_bit_key *word)
> > +{
> > + unsigned long now = ACCESS_ONCE(jiffies);
> > + if (signal_pending_state(current->state, current))
> > + return 1;
> > + if (time_after_eq(now, word->timeout))
> > + return -EAGAIN;
> > + schedule_timeout(word->timeout - now);
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(bit_wait_timeout);
> > +
> > +__sched int bit_wait_io_timeout(struct wait_bit_key *word)
> > +{
> > + unsigned long now = ACCESS_ONCE(jiffies);
> > + if (signal_pending_state(current->state, current))
> > + return 1;
> > + if (time_after_eq(now, word->timeout))
> > + return -EAGAIN;
> > + io_schedule_timeout(word->timeout - now);
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(bit_wait_io_timeout);
>
> New scheduler APIs should be exported via EXPORT_SYMBOL_GPL().
>
Fine with me.
Trond, can you just edit that into the patch you have, or do you want me to
re-send?
Also maybe added Jeff's
Acked-by: Jeff Layton <[email protected]>
to the NFS bits.
Thanks,
NeilBrown
In commit c1221321b7c25b53204447cff9949a6d5a7ddddc
sched: Allow wait_on_bit_action() functions to support a timeout
I suggested that a "wait_on_bit_timeout()" interface would not meet my
need. This isn't true - I was just over-engineering.
Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization. If some other
use is ever found, it can be generalized or added later.
So this patch renames "private" to "timeout" with a meaning "stop
waiting when "jiffies" reaches or passes "timeout",
and adds two of the many possible wait..bit..timeout() interfaces:
wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example. Others can be added as needed.
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
This time with EXPORT_SYMBOL_GPL.
Thanks,
NeilBrown
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7db7a4e..87f9e4230d3a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -502,6 +502,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern int wait_on_page_bit_killable_timeout(struct page *page,
+ int bit_nr, unsigned long timeout);
static inline int wait_on_page_locked_killable(struct page *page)
{
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6fb1ba5f9b2f..80115bf88671 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -25,7 +25,7 @@ struct wait_bit_key {
void *flags;
int bit_nr;
#define WAIT_ATOMIC_T_BIT_NR -1
- unsigned long private;
+ unsigned long timeout;
};
struct wait_bit_queue {
@@ -154,6 +154,7 @@ int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_ac
void wake_up_bit(void *, int);
void wake_up_atomic_t(atomic_t *);
int out_of_line_wait_on_bit(void *, int, wait_bit_action_f *, unsigned);
+int out_of_line_wait_on_bit_timeout(void *, int, wait_bit_action_f *, unsigned, unsigned long);
int out_of_line_wait_on_bit_lock(void *, int, wait_bit_action_f *, unsigned);
int out_of_line_wait_on_atomic_t(atomic_t *, int (*)(atomic_t *), unsigned);
wait_queue_head_t *bit_waitqueue(void *, int);
@@ -859,6 +860,8 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
extern int bit_wait(struct wait_bit_key *);
extern int bit_wait_io(struct wait_bit_key *);
+extern int bit_wait_timeout(struct wait_bit_key *);
+extern int bit_wait_io_timeout(struct wait_bit_key *);
/**
* wait_on_bit - wait for a bit to be cleared
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 15cab1a4f84e..5a62915f47a8 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -343,6 +343,18 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
}
EXPORT_SYMBOL(out_of_line_wait_on_bit);
+int __sched out_of_line_wait_on_bit_timeout(
+ void *word, int bit, wait_bit_action_f *action,
+ unsigned mode, unsigned long timeout)
+{
+ wait_queue_head_t *wq = bit_waitqueue(word, bit);
+ DEFINE_WAIT_BIT(wait, word, bit);
+
+ wait.key.timeout = jiffies + timeout;
+ return __wait_on_bit(wq, &wait, action, mode);
+}
+EXPORT_SYMBOL_GPL(out_of_line_wait_on_bit_timeout);
+
int __sched
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
wait_bit_action_f *action, unsigned mode)
@@ -520,3 +532,27 @@ __sched int bit_wait_io(struct wait_bit_key *word)
return 0;
}
EXPORT_SYMBOL(bit_wait_io);
+
+__sched int bit_wait_timeout(struct wait_bit_key *word)
+{
+ unsigned long now = ACCESS_ONCE(jiffies);
+ if (signal_pending_state(current->state, current))
+ return 1;
+ if (time_after_eq(now, word->timeout))
+ return -EAGAIN;
+ schedule_timeout(word->timeout - now);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(bit_wait_timeout);
+
+__sched int bit_wait_io_timeout(struct wait_bit_key *word)
+{
+ unsigned long now = ACCESS_ONCE(jiffies);
+ if (signal_pending_state(current->state, current))
+ return 1;
+ if (time_after_eq(now, word->timeout))
+ return -EAGAIN;
+ io_schedule_timeout(word->timeout - now);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(bit_wait_io_timeout);
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcdf948d..cbe5a9013f70 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -703,6 +703,19 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
bit_wait_io, TASK_KILLABLE);
}
+int wait_on_page_bit_killable_timeout(struct page *page,
+ int bit_nr, unsigned long timeout)
+{
+ DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+
+ wait.key.timeout = jiffies + timeout;
+ if (!test_bit(bit_nr, &page->flags))
+ return 0;
+ return __wait_on_bit(page_waitqueue(page), &wait,
+ bit_wait_io_timeout, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(wait_on_page_bit_killable_timeout);
+
/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
* @page: Page defining the wait queue of interest
This will allow NFS to wait for PG_private to be cleared and,
particularly, to send a wake-up when it is.
Signed-off-by: NeilBrown <[email protected]>
Acked-by: Andrew Morton <[email protected]>
---
include/linux/pagemap.h | 10 ++++++++--
mm/filemap.c | 8 ++------
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 87f9e4230d3a..2dca0cef3506 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -496,8 +496,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
}
/*
- * This is exported only for wait_on_page_locked/wait_on_page_writeback.
- * Never use this directly!
+ * This is exported only for wait_on_page_locked/wait_on_page_writeback,
+ * and for filesystems which need to wait on PG_private.
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
@@ -512,6 +512,12 @@ static inline int wait_on_page_locked_killable(struct page *page)
return 0;
}
+extern wait_queue_head_t *page_waitqueue(struct page *page);
+static inline void wake_up_page(struct page *page, int bit)
+{
+ __wake_up_bit(page_waitqueue(page), &page->flags, bit);
+}
+
/*
* Wait for a page to be unlocked.
*
diff --git a/mm/filemap.c b/mm/filemap.c
index 4a19c084bdb1..c9ba09f2ad3c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -670,17 +670,13 @@ EXPORT_SYMBOL(__page_cache_alloc);
* at a cost of "thundering herd" phenomena during rare hash
* collisions.
*/
-static wait_queue_head_t *page_waitqueue(struct page *page)
+wait_queue_head_t *page_waitqueue(struct page *page)
{
const struct zone *zone = page_zone(page);
return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}
-
-static inline void wake_up_page(struct page *page, int bit)
-{
- __wake_up_bit(page_waitqueue(page), &page->flags, bit);
-}
+EXPORT_SYMBOL(page_waitqueue);
void wait_on_page_bit(struct page *page, int bit_nr)
{
Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
- it doesn't hurt for kswapd to block occasionally. If it doesn't
want to block it would clear __GFP_WAIT. The current_is_kswapd()
was only added to avoid deadlocks and we have a new approach for
that.
- memory allocation in the SUNRPC layer can very rarely try to
->releasepage() a page it is trying to handle. The deadlock
is removed as nfs_release_page() doesn't block indefinitely.
So we don't need to set PF_FSTRANS for sunrpc network operations any
more.
Signed-off-by: NeilBrown <[email protected]>
---
fs/nfs/file.c | 14 ++++++--------
net/sunrpc/sched.c | 2 --
net/sunrpc/xprtrdma/transport.c | 2 --
net/sunrpc/xprtsock.c | 10 ----------
4 files changed, 6 insertions(+), 22 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 1243a15438d0..de322d3f4a29 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -469,20 +469,18 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
/* Always try to initiate a 'commit' if relevant, but only
- * wait for it if __GFP_WAIT is set and the calling process is
- * allowed to block. Even then, only wait 1 second and only
- * if the 'bdi' is not congested.
+ * wait for it if __GFP_WAIT is set. Even then, only wait 1
+ * second and only if the 'bdi' is not congested.
* Waiting indefinitely can cause deadlocks when the NFS
- * server is on this machine, and there is no particular need
- * to wait extensively here. A short wait has the benefit
- * that someone else can worry about the freezer.
+ * server is on this machine, when a new TCP connection is
+ * needed and in other rare cases. There is no particular
+ * need to wait extensively here. A short wait has the
+ * benefit that someone else can worry about the freezer.
*/
if (mapping) {
struct nfs_server *nfss = NFS_SERVER(mapping->host);
nfs_commit_inode(mapping->host, 0);
if ((gfp & __GFP_WAIT) &&
- !current_is_kswapd() &&
- !(current->flags & PF_FSTRANS) &&
!bdi_write_congested(&nfss->backing_dev_info)) {
wait_on_page_bit_killable_timeout(page, PG_private,
HZ);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79fd589..fe3441abdbe5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -821,9 +821,7 @@ void rpc_execute(struct rpc_task *task)
static void rpc_async_schedule(struct work_struct *work)
{
- current->flags |= PF_FSTRANS;
__rpc_execute(container_of(work, struct rpc_task, u.tk_work));
- current->flags &= ~PF_FSTRANS;
}
/**
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac4940563..6a4615dd0261 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -205,7 +205,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
struct rpc_xprt *xprt = &r_xprt->xprt;
int rc = 0;
- current->flags |= PF_FSTRANS;
xprt_clear_connected(xprt);
dprintk("RPC: %s: %sconnect\n", __func__,
@@ -216,7 +215,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
dprintk("RPC: %s: exit\n", __func__);
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
}
/*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89eacfab..4707c0c8568b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1927,8 +1927,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
struct socket *sock;
int status = -EIO;
- current->flags |= PF_FSTRANS;
-
clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
status = __sock_create(xprt->xprt_net, AF_LOCAL,
SOCK_STREAM, 0, &sock, 1);
@@ -1968,7 +1966,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
return status;
}
@@ -2071,8 +2068,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
struct socket *sock = transport->sock;
int status = -EIO;
- current->flags |= PF_FSTRANS;
-
/* Start by resetting any existing state */
xs_reset_transport(transport);
sock = xs_create_sock(xprt, transport,
@@ -2092,7 +2087,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
}
/*
@@ -2229,8 +2223,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
struct rpc_xprt *xprt = &transport->xprt;
int status = -EIO;
- current->flags |= PF_FSTRANS;
-
if (!sock) {
clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
sock = xs_create_sock(xprt, transport,
@@ -2276,7 +2268,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
case -EINPROGRESS:
case -EALREADY:
xprt_clear_connecting(xprt);
- current->flags &= ~PF_FSTRANS;
return;
case -EINVAL:
/* Happens, for instance, if the user specified a link
@@ -2294,7 +2285,6 @@ out_eagain:
out:
xprt_clear_connecting(xprt);
xprt_wake_pending_tasks(xprt, status);
- current->flags &= ~PF_FSTRANS;
}
/**
On Wed, Sep 24, 2014 at 11:23 PM, NeilBrown <[email protected]> wrote:
> On Wed, 24 Sep 2014 09:04:18 +0200 Ingo Molnar <[email protected]> wrote:
>
>>
>> * NeilBrown <[email protected]> wrote:
>>
>> > @@ -859,6 +860,8 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
>> >
>> > extern int bit_wait(struct wait_bit_key *);
>> > extern int bit_wait_io(struct wait_bit_key *);
>> > +extern int bit_wait_timeout(struct wait_bit_key *);
>> > +extern int bit_wait_io_timeout(struct wait_bit_key *);
>> >
>> > /**
>> > * wait_on_bit - wait for a bit to be cleared
>> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>> > index 15cab1a4f84e..380678b3cba4 100644
>> > --- a/kernel/sched/wait.c
>> > +++ b/kernel/sched/wait.c
>> > @@ -343,6 +343,18 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
>> > }
>> > EXPORT_SYMBOL(out_of_line_wait_on_bit);
>> >
>> > +int __sched out_of_line_wait_on_bit_timeout(
>> > + void *word, int bit, wait_bit_action_f *action,
>> > + unsigned mode, unsigned long timeout)
>> > +{
>> > + wait_queue_head_t *wq = bit_waitqueue(word, bit);
>> > + DEFINE_WAIT_BIT(wait, word, bit);
>> > +
>> > + wait.key.timeout = jiffies + timeout;
>> > + return __wait_on_bit(wq, &wait, action, mode);
>> > +}
>> > +EXPORT_SYMBOL(out_of_line_wait_on_bit_timeout);
>> > +
>> > int __sched
>> > __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
>> > wait_bit_action_f *action, unsigned mode)
>> > @@ -520,3 +532,27 @@ __sched int bit_wait_io(struct wait_bit_key *word)
>> > return 0;
>> > }
>> > EXPORT_SYMBOL(bit_wait_io);
>> > +
>> > +__sched int bit_wait_timeout(struct wait_bit_key *word)
>> > +{
>> > + unsigned long now = ACCESS_ONCE(jiffies);
>> > + if (signal_pending_state(current->state, current))
>> > + return 1;
>> > + if (time_after_eq(now, word->timeout))
>> > + return -EAGAIN;
>> > + schedule_timeout(word->timeout - now);
>> > + return 0;
>> > +}
>> > +EXPORT_SYMBOL(bit_wait_timeout);
>> > +
>> > +__sched int bit_wait_io_timeout(struct wait_bit_key *word)
>> > +{
>> > + unsigned long now = ACCESS_ONCE(jiffies);
>> > + if (signal_pending_state(current->state, current))
>> > + return 1;
>> > + if (time_after_eq(now, word->timeout))
>> > + return -EAGAIN;
>> > + io_schedule_timeout(word->timeout - now);
>> > + return 0;
>> > +}
>> > +EXPORT_SYMBOL(bit_wait_io_timeout);
>>
>> New scheduler APIs should be exported via EXPORT_SYMBOL_GPL().
>>
>
> Fine with me.
>
> Trond, can you just edit that into the patch you have, or do you want me to
> re-send?
> Also maybe added Jeff's
> Acked-by: Jeff Layton <[email protected]>
> to the NFS bits.
>
Can you please resend just this patch so that the final version goes
out to linux-mm, linux-kernel etc?
I can edit in the Acked-by Jeff to the NFS bits as I apply.
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]