2015-08-05 09:51:16

by Michal Hocko

[permalink] [raw]
Subject: [RFC 0/8] Allow GFP_NOFS allocation to fail

Hi,
small GFP_NOFS, like GFP_KERNEL, allocations have not been not failing
traditionally even though their reclaim capabilities are restricted
because the VM code cannot recurse into filesystems to clean dirty
pages. At the same time these allocation requests do not allow to
trigger the OOM killer because that would lead to pre-mature OOM killing
during heavy fs metadata workloads.

This leaves the VM code in an unfortunate situation where GFP_NOFS
requests is looping inside the allocator relying on somebody else to
make a progress on its behalf. This is prone to deadlocks when the
request is holding resources which are necessary for other task to make
a progress and release memory (e.g. OOM victim is blocked on the lock
held by the NONFS request). Another drawback is that the caller of
the allocator cannot define any fallback strategy because the request
doesn't fail.

As the VM cannot do much about these requests we should face the reality
and allow those allocations to fail. Johannes has already posted the
patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2)
but the discussion died pretty quickly.

I was playing with this patch and xfs, ext[34] and btrfs for a while to
see what is the effect under heavy memory pressure. As expected this led
to some fallouts.
My test consisted of a simple memory hog which allocates a lot of
anonymous memory and writes to a fs mainly to trigger a fs activity on
exit. In parallel there is a parallel fs metadata load (multiple tasks
creating thousands of empty files and directories). All is running
in a VM with small amount of memory to emulate an under provisioned
system. The metadata load is triggering a sufficient load to invoke the
direct reclaim even without the memory hog. The memory hog forks several
tasks sharing the VM and OOM killer manages to kill it without locking
up the system (this was based on the test case from Tetsuo Handa -
http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't
want to kill my machine ;)).
With all the patches applied none of the 4 filesystems gets aborted
transactions and RO remount (well xfs didn't need any special
treatment). This is obviously not sufficient to claim that failing
GFP_NOFS is OK now but I think it is a good start for the further
discussion. I would be grateful if FS people could have a look at those
patches. I have simply used __GFP_NOFAIL in the critical paths. This
might be not the best strategy but it sounds like a good first step.

The first patch in the series also allows __GFP_NOFAIL allocations to
access memory reserves when the system is OOM which should help those
requests to make a forward progress - especially in combination with
GFP_NOFS.

The second patch tries to address a potential pre-mature OOM killer from
the page fault path. I have posted it separately but it didn't get much
traction.

The third patch allows GFP_NOFS to fail and I believe it should see much
more testing coverage. It would be really great if it could sit in the
mmotm tree for few release cycles so that we can catch more fallouts.

The rest are the FS specific patches to fortify allocations
requests which are really needed to finish transactions without RO
remounts. There might be more needed but my test case survives with
these in place.
They would obviously need some rewording if they are going to be applied
even without Patch3 and I will do that if respective maintainers will
take them. Ext3 and JBD are going away soon so they might be dropped but
they have been in the tree while I was testing so I've kept them.

Thoughts? Opinions?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


2015-08-05 09:51:17

by Michal Hocko

[permalink] [raw]
Subject: [RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

From: Michal Hocko <[email protected]>

__GFP_NOFAIL is a big hammer used to ensure that the allocation
request can never fail. This is a strong requirement and as such
it also deserves a special treatment when the system is OOM. The
primary problem here is that the allocation request might have
come with some locks held and the oom victim might be blocked
on the same locks. This is basically an OOM deadlock situation.

This patch tries to reduce the risk of such a deadlocks by giving
__GFP_NOFAIL allocations a special treatment and let them dive into
memory reserves after oom killer invocation. This should help them
to make a progress and release resources they are holding. The OOM
victim should compensate for the reserves consumption.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f9ffbb087cb..ee69c338ca2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2732,8 +2732,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
}
/* Exhausted what can be done so it's blamo time */
if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
- || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+ || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
*did_some_progress = 1;
+
+ if (gfp_mask & __GFP_NOFAIL) {
+ page = get_page_from_freelist(gfp_mask, order,
+ ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+ WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
+ " Consider increasing min_free_kbytes.\n");
+ }
+ }
out:
mutex_unlock(&oom_lock);
return page;
--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 09:51:19

by Michal Hocko

[permalink] [raw]
Subject: [RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM

From: Johannes Weiner <[email protected]>

GFP_NOFS allocations are not allowed to invoke the OOM killer since
their reclaim abilities are severely diminished. However, without the
OOM killer available there is no hope of progress once the reclaimable
pages have been exhausted.

Don't risk hanging these allocations. Leave it to the allocation site
to implement the fallback policy for failing allocations.

Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee69c338ca2a..024d45d51700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2715,15 +2715,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
if (ac->high_zoneidx < ZONE_NORMAL)
goto out;
/* The OOM killer does not compensate for IO-less reclaim */
- if (!(gfp_mask & __GFP_FS)) {
- /*
- * XXX: Page reclaim didn't yield anything,
- * and the OOM killer can't be invoked, but
- * keep looping as per tradition.
- */
- *did_some_progress = 1;
+ if (!(gfp_mask & __GFP_FS))
goto out;
- }
if (pm_suspended_storage())
goto out;
/* The OOM killer may not free memory on a specific node */
--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 09:51:18

by Michal Hocko

[permalink] [raw]
Subject: [RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation

From: Michal Hocko <[email protected]>

page_cache_read has been historically using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems are setting this mask to
GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
however, not called from the fs layera directly so it doesn't need this
protection normally.

ceph and ocfs2 which call filemap_fault from their fault handlers
seem to be OK because they are not taking any fs lock before invoking
generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe
from the reclaim recursion POV because this lock serializes truncate
and punch hole with the page faults and it doesn't get involved in the
reclaim.

The GFP_NOFS protection might be even harmful. There is a push to fail
GFP_NOFS allocations rather than loop within allocator indefinitely with
a very limited reclaim ability. Once we start failing those requests
the OOM killer might be triggered prematurely because the page cache
allocation failure is propagated up the page fault path and end up in
pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy
wrt. parallel page faults and it might interfere with other users who
really rely on NOFS semantic from the stored gfp_mask. The mask is also
inode proper so it would even be a layering violation. What we can do
instead is to push the gfp_mask into struct vm_fault and allow fs layer
to overwrite it should the callback need to be called with a different
allocation context.

Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this
should be safe from the page fault path normally. Why do we care
about mapping_gfp_mask at all then? Because this doesn't hold only
reclaim protection flags but it also might contain zone and movability
restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have to respect
those.

Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 4 ++++
mm/filemap.c | 9 ++++-----
mm/memory.c | 17 +++++++++++++++++
3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..962e37c7cd6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -220,10 +220,14 @@ extern pgprot_t protection_map[16];
* ->fault function. The vma's ->fault is responsible for returning a bitmask
* of VM_FAULT_xxx flags that give details about how the fault was handled.
*
+ * MM layer fills up gfp_mask for page allocations but fault handler might
+ * alter it if its implementation requires a different allocation context.
+ *
* pgoff should be used in favour of virtual_address, if possible.
*/
struct vm_fault {
unsigned int flags; /* FAULT_FLAG_xxx flags */
+ gfp_t gfp_mask; /* gfp mask to be used for allocations */
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */

diff --git a/mm/filemap.c b/mm/filemap.c
index b63fb81df336..8a16a07bbe02 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1774,19 +1774,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
* This adds the requested page to the page cache if it isn't already there,
* and schedules an I/O to read in its contents from disk.
*/
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
{
struct address_space *mapping = file->f_mapping;
struct page *page;
int ret;

do {
- page = page_cache_alloc_cold(mapping);
+ page = __page_cache_alloc(gfp_mask|__GFP_COLD);
if (!page)
return -ENOMEM;

- ret = add_to_page_cache_lru(page, mapping, offset,
- GFP_KERNEL & mapping_gfp_mask(mapping));
+ ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask);
if (ret == 0)
ret = mapping->a_ops->readpage(file, page);
else if (ret == -EEXIST)
@@ -1969,7 +1968,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We're only likely to ever get here if MADV_RANDOM is in
* effect.
*/
- error = page_cache_read(file, offset);
+ error = page_cache_read(file, offset, vmf->gfp_mask);

/*
* The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index 8a2fc9945b46..25ab29560dca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1949,6 +1949,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
copy_user_highpage(dst, src, va, vma);
}

+static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
+{
+ struct file *vm_file = vma->vm_file;
+
+ if (vm_file)
+ return mapping_gfp_mask(vm_file->f_mapping) | GFP_IOFS;
+
+ /*
+ * Special mappings (e.g. VDSO) do not have any file so fake
+ * a default GFP_KERNEL for them.
+ */
+ return GFP_KERNEL;
+}
+
/*
* Notify the address space that the page is about to become writable so that
* it can prohibit this or wait for the page to get into an appropriate state.
@@ -1964,6 +1978,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
vmf.virtual_address = (void __user *)(address & PAGE_MASK);
vmf.pgoff = page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+ vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.page = page;
vmf.cow_page = NULL;

@@ -2763,6 +2778,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.page = NULL;
+ vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.cow_page = cow_page;

ret = vma->vm_ops->fault(vma, &vmf);
@@ -2929,6 +2945,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
vmf.pgoff = pgoff;
vmf.max_pgoff = max_pgoff;
vmf.flags = flags;
+ vmf.gfp_mask = __get_fault_gfp_mask(vma);
vma->vm_ops->map_pages(vma, &vmf);
}

--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 09:51:20

by Michal Hocko

[permalink] [raw]
Subject: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

From: Michal Hocko <[email protected]>

Journal transaction might fail prematurely because the frozen_buffer
is allocated by GFP_NOFS request:
[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped....)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
allocations upon OOM" because small GPF_NOFS allocations never failed.
This allocation seems essential for the journal and GFP_NOFS is too
restrictive to the memory allocator so let's use __GFP_NOFAIL here to
emulate the previous behavior.

jbd code has the very same issue so let's do the same there as well.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/jbd/transaction.c | 11 +----------
fs/jbd2/transaction.c | 14 +++-----------
2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
jbd_unlock_bh_state(bh);
frozen_buffer =
jbd_alloc(jh2bh(jh)->b_size,
- GFP_NOFS);
- if (!frozen_buffer) {
- printk(KERN_ERR
- "%s: OOM for frozen_buffer\n",
- __func__);
- JBUFFER_TRACE(jh, "oom!");
- error = -ENOMEM;
- jbd_lock_bh_state(bh);
- goto done;
- }
+ GFP_NOFS|__GFP_NOFAIL);
goto repeat;
}
jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
jbd_unlock_bh_state(bh);
frozen_buffer =
jbd2_alloc(jh2bh(jh)->b_size,
- GFP_NOFS);
- if (!frozen_buffer) {
- printk(KERN_ERR
- "%s: OOM for frozen_buffer\n",
- __func__);
- JBUFFER_TRACE(jh, "oom!");
- error = -ENOMEM;
- jbd_lock_bh_state(bh);
- goto done;
- }
+ GFP_NOFS|__GFP_NOFAIL);
goto repeat;
}
jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)

repeat:
if (!jh->b_committed_data) {
- committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+ GFP_NOFS|__GFP_NOFAIL);
if (!committed_data) {
printk(KERN_ERR "%s: No memory for committed data\n",
__func__);
--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 09:51:22

by Michal Hocko

[permalink] [raw]
Subject: [RFC 6/8] ext3: Do not abort journal prematurely

From: Michal Hocko <[email protected]>

journal_get_undo_access is relying on GFP_NOFS allocation yet it is
essential for the journal transaction:

[ 83.256914] journal_get_undo_access: No memory for committed data
[ 83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out
of memory in __ext3_journal_get_undo_access
[ 83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of
memory
[ 83.267130] Aborting journal on device hdb1.
[ 83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected
aborted journal
[ 83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
these allocation requests are allowed to fail so we need to use
__GFP_NOFAIL to imitate the previous behavior.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/jbd/transaction.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index bf7474deda2f..6c60376a29bc 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -887,7 +887,7 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh)

repeat:
if (!jh->b_committed_data) {
- committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL);
if (!committed_data) {
printk(KERN_ERR "%s: No memory for committed data\n",
__func__);
--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 09:51:23

by Michal Hocko

[permalink] [raw]
Subject: [RFC 7/8] btrfs: Prevent from early transaction abort

From: Michal Hocko <[email protected]>

Btrfs relies on GFP_NOFS allocation when commiting the transaction but
since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
those allocations are allowed to fail which can lead to a pre-mature
transaction abort:

[ 55.328093] Call Trace:
[ 55.328890] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.330518] [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[ 55.332738] [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[ 55.336844] [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[ 55.349434] [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.370654] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.372246] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.373851] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] ------------[ cut here ]------------
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.384357] [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[ 55.384359] [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.384593] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.384594] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.384595] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path
with the explicit __GFP_NOFAIL.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/btrfs/extent_io.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..88fad7051e38 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,7 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
{
struct extent_buffer *eb = NULL;

- eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
+ eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
if (eb == NULL)
return NULL;
eb->start = start;
@@ -4867,7 +4867,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
return NULL;

for (i = 0; i < num_pages; i++, index++) {
- p = find_or_create_page(mapping, index, GFP_NOFS);
+ p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
if (!p)
goto free_eb;

--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 09:51:21

by Michal Hocko

[permalink] [raw]
Subject: [RFC 5/8] ext4: Do not fail journal due to block allocator

From: Michal Hocko <[email protected]>

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
memory allocator doesn't endlessly loop to satisfy low-order allocations
and instead fails them to allow callers to handle them gracefully.

Some of the callers are not yet prepared for this behavior though. ext4
block allocator relies solely on GFP_NOFS allocation requests and
allocation failures lead to aborting yournal too easily:

[ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G W 4.0.0-nofs3-00006-gdfe9931f5f68 #588
[ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 345.028339] 0000000000000000 ffff880005a17708 ffffffff81538a54 ffffffff8107a40f
[ 345.028341] 0000000000000050 ffff880005a17798 ffffffff810fe854 0000000180000000
[ 345.028342] 0000000000000046 0000000000000000 ffffffff81a52100 0000000000000246
[ 345.028343] Call Trace:
[ 345.028348] [<ffffffff81538a54>] dump_stack+0x4f/0x7b
[ 345.028370] [<ffffffff810fe854>] warn_alloc_failed+0x12a/0x13f
[ 345.028373] [<ffffffff81101bd2>] __alloc_pages_nodemask+0x7f3/0x8aa
[ 345.028375] [<ffffffff810f9933>] pagecache_get_page+0x12a/0x1c9
[ 345.028390] [<ffffffffa005bc64>] ext4_mb_load_buddy+0x220/0x367 [ext4]
[ 345.028414] [<ffffffffa006014f>] ext4_free_blocks+0x522/0xa4c [ext4]
[ 345.028425] [<ffffffffa0054e14>] ext4_ext_remove_space+0x833/0xf22 [ext4]
[ 345.028434] [<ffffffffa005677e>] ext4_ext_truncate+0x8c/0xb0 [ext4]
[ 345.028441] [<ffffffffa00342bf>] ext4_truncate+0x20b/0x38d [ext4]
[ 345.028462] [<ffffffffa003573c>] ext4_evict_inode+0x32b/0x4c1 [ext4]
[ 345.028464] [<ffffffff8116d04f>] evict+0xa0/0x148
[ 345.028466] [<ffffffff8116dca8>] iput+0x1a1/0x1f0
[ 345.028468] [<ffffffff811697b4>] __dentry_kill+0x136/0x1a6
[ 345.028470] [<ffffffff81169a3e>] dput+0x21a/0x243
[ 345.028472] [<ffffffff81157cda>] __fput+0x184/0x19b
[ 345.028473] [<ffffffff81157d29>] ____fput+0xe/0x10
[ 345.028475] [<ffffffff8105a05f>] task_work_run+0x8a/0xa1
[ 345.028477] [<ffffffff810452f0>] do_exit+0x3c6/0x8dc
[ 345.028482] [<ffffffff8104588a>] do_group_exit+0x4d/0xb2
[ 345.028483] [<ffffffff8104eeeb>] get_signal+0x5b1/0x5f5
[ 345.028488] [<ffffffff81002202>] do_signal+0x28/0x5d0
[...]
[ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[ 345.033097] Aborting journal on device hdb1-8.
[ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because GFP_NOFS allocation context is
very restricted - especially in the fs metadata heavy loads. Before we
go with a more sofisticated solution, let's simply imitate the previous
behavior of non-failing NOFS allocation and use __GFP_NOFAIL for the
buddy block allocator. I wasn't able to trigger the issue with this
patch anymore.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/ext4/mballoc.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..e6361622bfd5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
block = group * 2;
pnum = block / blocks_per_page;
poff = block % blocks_per_page;
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
if (!page)
return -ENOMEM;
BUG_ON(page->mapping != inode->i_mapping);
@@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,

block++;
pnum = block / blocks_per_page;
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
if (!page)
return -ENOMEM;
BUG_ON(page->mapping != inode->i_mapping);
@@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
* wait for it to initialize.
*/
page_cache_release(page);
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
if (page) {
BUG_ON(page->mapping != inode->i_mapping);
if (!PageUptodate(page)) {
@@ -1194,7 +1197,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
if (page == NULL || !PageUptodate(page)) {
if (page)
page_cache_release(page);
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
if (page) {
BUG_ON(page->mapping != inode->i_mapping);
if (!PageUptodate(page)) {
--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 09:51:24

by Michal Hocko

[permalink] [raw]
Subject: [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio

From: Michal Hocko <[email protected]>

alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm:
page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is
allowed to fail which can lead to
[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/btrfs/volumes.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..57a99d19533d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,7 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
* and the stripes
*/
sizeof(u64) * (total_stripes),
- GFP_NOFS);
+ GFP_NOFS|__GFP_NOFAIL);
if (!bbio)
return NULL;

--
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 11:42:35

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

On Wed 05-08-15 11:51:20, [email protected] wrote:
> From: Michal Hocko <[email protected]>
>
> Journal transaction might fail prematurely because the frozen_buffer
> is allocated by GFP_NOFS request:
> [ 72.440013] do_get_write_access: OOM for frozen_buffer
> [ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
> (...snipped....)
> [ 72.495559] do_get_write_access: OOM for frozen_buffer
> [ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.496839] do_get_write_access: OOM for frozen_buffer
> [ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.505766] Aborting journal on device sda1-8.
> [ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only
>
> This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> allocations upon OOM" because small GPF_NOFS allocations never failed.
> This allocation seems essential for the journal and GFP_NOFS is too
> restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> emulate the previous behavior.
>
> jbd code has the very same issue so let's do the same there as well.

The patch looks good. Btw, the patch 6 can be folded into this patch since
it fixes the issue you fix for jbd2 here... But jbd parts will be dropped
in the next merge window anyway so it doesn't really matter.

You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/jbd/transaction.c | 11 +----------
> fs/jbd2/transaction.c | 14 +++-----------
> 2 files changed, 4 insertions(+), 21 deletions(-)
>
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..bf7474deda2f 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> jbd_alloc(jh2bh(jh)->b_size,
> - GFP_NOFS);
> - if (!frozen_buffer) {
> - printk(KERN_ERR
> - "%s: OOM for frozen_buffer\n",
> - __func__);
> - JBUFFER_TRACE(jh, "oom!");
> - error = -ENOMEM;
> - jbd_lock_bh_state(bh);
> - goto done;
> - }
> + GFP_NOFS|__GFP_NOFAIL);
> goto repeat;
> }
> jh->b_frozen_data = frozen_buffer;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index ff2f2e6ad311..bff071e21553 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> jbd2_alloc(jh2bh(jh)->b_size,
> - GFP_NOFS);
> - if (!frozen_buffer) {
> - printk(KERN_ERR
> - "%s: OOM for frozen_buffer\n",
> - __func__);
> - JBUFFER_TRACE(jh, "oom!");
> - error = -ENOMEM;
> - jbd_lock_bh_state(bh);
> - goto done;
> - }
> + GFP_NOFS|__GFP_NOFAIL);
> goto repeat;
> }
> jh->b_frozen_data = frozen_buffer;
> @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
>
> repeat:
> if (!jh->b_committed_data) {
> - committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> + committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> + GFP_NOFS|__GFP_NOFAIL);
> if (!committed_data) {
> printk(KERN_ERR "%s: No memory for committed data\n",
> __func__);
> --
> 2.5.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 11:43:36

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 5/8] ext4: Do not fail journal due to block allocator

On Wed 05-08-15 11:51:21, [email protected] wrote:
> From: Michal Hocko <[email protected]>
>
> Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> memory allocator doesn't endlessly loop to satisfy low-order allocations
> and instead fails them to allow callers to handle them gracefully.
>
> Some of the callers are not yet prepared for this behavior though. ext4
> block allocator relies solely on GFP_NOFS allocation requests and
> allocation failures lead to aborting yournal too easily:
>
> [ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
> [ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G W 4.0.0-nofs3-00006-gdfe9931f5f68 #588
> [ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
> [ 345.028339] 0000000000000000 ffff880005a17708 ffffffff81538a54 ffffffff8107a40f
> [ 345.028341] 0000000000000050 ffff880005a17798 ffffffff810fe854 0000000180000000
> [ 345.028342] 0000000000000046 0000000000000000 ffffffff81a52100 0000000000000246
> [ 345.028343] Call Trace:
> [ 345.028348] [<ffffffff81538a54>] dump_stack+0x4f/0x7b
> [ 345.028370] [<ffffffff810fe854>] warn_alloc_failed+0x12a/0x13f
> [ 345.028373] [<ffffffff81101bd2>] __alloc_pages_nodemask+0x7f3/0x8aa
> [ 345.028375] [<ffffffff810f9933>] pagecache_get_page+0x12a/0x1c9
> [ 345.028390] [<ffffffffa005bc64>] ext4_mb_load_buddy+0x220/0x367 [ext4]
> [ 345.028414] [<ffffffffa006014f>] ext4_free_blocks+0x522/0xa4c [ext4]
> [ 345.028425] [<ffffffffa0054e14>] ext4_ext_remove_space+0x833/0xf22 [ext4]
> [ 345.028434] [<ffffffffa005677e>] ext4_ext_truncate+0x8c/0xb0 [ext4]
> [ 345.028441] [<ffffffffa00342bf>] ext4_truncate+0x20b/0x38d [ext4]
> [ 345.028462] [<ffffffffa003573c>] ext4_evict_inode+0x32b/0x4c1 [ext4]
> [ 345.028464] [<ffffffff8116d04f>] evict+0xa0/0x148
> [ 345.028466] [<ffffffff8116dca8>] iput+0x1a1/0x1f0
> [ 345.028468] [<ffffffff811697b4>] __dentry_kill+0x136/0x1a6
> [ 345.028470] [<ffffffff81169a3e>] dput+0x21a/0x243
> [ 345.028472] [<ffffffff81157cda>] __fput+0x184/0x19b
> [ 345.028473] [<ffffffff81157d29>] ____fput+0xe/0x10
> [ 345.028475] [<ffffffff8105a05f>] task_work_run+0x8a/0xa1
> [ 345.028477] [<ffffffff810452f0>] do_exit+0x3c6/0x8dc
> [ 345.028482] [<ffffffff8104588a>] do_group_exit+0x4d/0xb2
> [ 345.028483] [<ffffffff8104eeeb>] get_signal+0x5b1/0x5f5
> [ 345.028488] [<ffffffff81002202>] do_signal+0x28/0x5d0
> [...]
> [ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
> [ 345.033097] Aborting journal on device hdb1-8.
> [ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
> [ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
> [ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
> [ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
> [ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
> [ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
>
> The failure is really premature because GFP_NOFS allocation context is
> very restricted - especially in the fs metadata heavy loads. Before we
> go with a more sofisticated solution, let's simply imitate the previous
> behavior of non-failing NOFS allocation and use __GFP_NOFAIL for the
> buddy block allocator. I wasn't able to trigger the issue with this
> patch anymore.

The patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/ext4/mballoc.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5b1613a54307..e6361622bfd5 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
> block = group * 2;
> pnum = block / blocks_per_page;
> poff = block % blocks_per_page;
> - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> + page = find_or_create_page(inode->i_mapping, pnum,
> + GFP_NOFS|__GFP_NOFAIL);
> if (!page)
> return -ENOMEM;
> BUG_ON(page->mapping != inode->i_mapping);
> @@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
>
> block++;
> pnum = block / blocks_per_page;
> - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> + page = find_or_create_page(inode->i_mapping, pnum,
> + GFP_NOFS|__GFP_NOFAIL);
> if (!page)
> return -ENOMEM;
> BUG_ON(page->mapping != inode->i_mapping);
> @@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
> * wait for it to initialize.
> */
> page_cache_release(page);
> - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> + page = find_or_create_page(inode->i_mapping, pnum,
> + GFP_NOFS|__GFP_NOFAIL);
> if (page) {
> BUG_ON(page->mapping != inode->i_mapping);
> if (!PageUptodate(page)) {
> @@ -1194,7 +1197,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
> if (page == NULL || !PageUptodate(page)) {
> if (page)
> page_cache_release(page);
> - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> + page = find_or_create_page(inode->i_mapping, pnum,
> + GFP_NOFS|__GFP_NOFAIL);
> if (page) {
> BUG_ON(page->mapping != inode->i_mapping);
> if (!PageUptodate(page)) {
> --
> 2.5.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 16:31:54

by David Sterba

[permalink] [raw]
Subject: Re: [RFC 7/8] btrfs: Prevent from early transaction abort

On Wed, Aug 05, 2015 at 11:51:23AM +0200, [email protected] wrote:
> From: Michal Hocko <[email protected]>
...
> Fix this by reintroducing the no-fail behavior of this allocation path
> with the explicit __GFP_NOFAIL.
>
> Signed-off-by: Michal Hocko <[email protected]>

Reviewed-by: David Sterba <[email protected]>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 16:32:12

by David Sterba

[permalink] [raw]
Subject: Re: [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio

On Wed, Aug 05, 2015 at 11:51:24AM +0200, [email protected] wrote:
> From: Michal Hocko <[email protected]>
>
> alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm:
> page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is
> allowed to fail which can lead to
> [ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
>
> This is clearly undesirable and the nofail behavior should be explicit
> if the allocation failure cannot be tolerated.
>
> Signed-off-by: Michal Hocko <[email protected]>

Reviewed-by: David Sterba <[email protected]>

2015-08-05 16:49:24

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure


[email protected] wrote:

> From: Michal Hocko <[email protected]>
>
> Journal transaction might fail prematurely because the frozen_buffer
> is allocated by GFP_NOFS request:
> [ 72.440013] do_get_write_access: OOM for frozen_buffer
> [ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
> (...snipped....)
> [ 72.495559] do_get_write_access: OOM for frozen_buffer
> [ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.496839] do_get_write_access: OOM for frozen_buffer
> [ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.505766] Aborting journal on device sda1-8.
> [ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only
>
> This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> allocations upon OOM" because small GPF_NOFS allocations never failed.
> This allocation seems essential for the journal and GFP_NOFS is too
> restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> emulate the previous behavior.
>
> jbd code has the very same issue so let's do the same there as well.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/jbd/transaction.c | 11 +----------
> fs/jbd2/transaction.c | 14 +++-----------
> 2 files changed, 4 insertions(+), 21 deletions(-)
>
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..bf7474deda2f 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> jbd_alloc(jh2bh(jh)->b_size,
> - GFP_NOFS);
> - if (!frozen_buffer) {
> - printk(KERN_ERR
> - "%s: OOM for frozen_buffer\n",
> - __func__);
> - JBUFFER_TRACE(jh, "oom!");
> - error = -ENOMEM;
> - jbd_lock_bh_state(bh);
> - goto done;
> - }
> + GFP_NOFS|__GFP_NOFAIL);
> goto repeat;
> }
> jh->b_frozen_data = frozen_buffer;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index ff2f2e6ad311..bff071e21553 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> jbd2_alloc(jh2bh(jh)->b_size,
> - GFP_NOFS);
> - if (!frozen_buffer) {
> - printk(KERN_ERR
> - "%s: OOM for frozen_buffer\n",
> - __func__);
> - JBUFFER_TRACE(jh, "oom!");
> - error = -ENOMEM;
> - jbd_lock_bh_state(bh);
> - goto done;
> - }
> + GFP_NOFS|__GFP_NOFAIL);
> goto repeat;
> }
> jh->b_frozen_data = frozen_buffer;
> @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
>
> repeat:
> if (!jh->b_committed_data) {
> - committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> + committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> + GFP_NOFS|__GFP_NOFAIL);
> if (!committed_data) {
> printk(KERN_ERR "%s: No memory for committed data\n",
> __func__);

Is this "if (!committed_data) {" check now dead code?

I also see other similar suspected dead sites in the rest of the series.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-05 19:58:50

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC 0/8] Allow GFP_NOFS allocation to fail

On Aug 5, 2015, at 3:51 AM, [email protected] wrote:
> Hi,
> small GFP_NOFS, like GFP_KERNEL, allocations have not been not failing
> traditionally even though their reclaim capabilities are restricted
> because the VM code cannot recurse into filesystems to clean dirty
> pages. At the same time these allocation requests do not allow to
> trigger the OOM killer because that would lead to pre-mature OOM killing
> during heavy fs metadata workloads.
>
> This leaves the VM code in an unfortunate situation where GFP_NOFS
> requests is looping inside the allocator relying on somebody else to
> make a progress on its behalf. This is prone to deadlocks when the
> request is holding resources which are necessary for other task to make
> a progress and release memory (e.g. OOM victim is blocked on the lock
> held by the NONFS request). Another drawback is that the caller of
> the allocator cannot define any fallback strategy because the request
> doesn't fail.
>
> As the VM cannot do much about these requests we should face the reality
> and allow those allocations to fail. Johannes has already posted the
> patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2)
> but the discussion died pretty quickly.
>
> I was playing with this patch and xfs, ext[34] and btrfs for a while
> to see what is the effect under heavy memory pressure. As expected
> this led to some fallouts.
>
> My test consisted of a simple memory hog which allocates a lot of
> anonymous memory and writes to a fs mainly to trigger a fs activity on
> exit. In parallel there is a parallel fs metadata load (multiple tasks
> creating thousands of empty files and directories). All is running
> in a VM with small amount of memory to emulate an under provisioned
> system. The metadata load is triggering a sufficient load to invoke
> the direct reclaim even without the memory hog. The memory hog forks
> several tasks sharing the VM and OOM killer manages to kill it without
> locking up the system (this was based on the test case from Tetsuo
> Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html -
> I just didn't want to kill my machine ;)).
>
> With all the patches applied none of the 4 filesystems gets aborted
> transactions and RO remount (well xfs didn't need any special
> treatment). This is obviously not sufficient to claim that failing
> GFP_NOFS is OK now but I think it is a good start for the further
> discussion. I would be grateful if FS people could have a look at
> those patches. I have simply used __GFP_NOFAIL in the critical paths.
> This might be not the best strategy but it sounds like a good first
> step.
>
> The first patch in the series also allows __GFP_NOFAIL allocations to
> access memory reserves when the system is OOM which should help those
> requests to make a forward progress - especially in combination with
> GFP_NOFS.
>
> The second patch tries to address a potential pre-mature OOM killer
> from the page fault path. I have posted it separately but it didn't
> get much traction.
>
> The third patch allows GFP_NOFS to fail and I believe it should see
> much more testing coverage. It would be really great if it could sit
> in the mmotm tree for few release cycles so that we can catch more
> fallouts.
>
> The rest are the FS specific patches to fortify allocations
> requests which are really needed to finish transactions without RO
> remounts. There might be more needed but my test case survives with
> these in place.

Wouldn't it make more sense to order the fs-specific patches _before_
the "GFP_NOFS can fail" patch (#3), so that once that patch is applied
all known failures have already been fixed? Otherwise it could show
test failures during bisection that would be confusing.

Cheers, Andreas

> They would obviously need some rewording if they are going to be
> applied even without Patch3 and I will do that if respective
> maintainers will take them. Ext3 and JBD are going away soon so they
> might be dropped but they have been in the tree while I was testing
> so I've kept them.
>
> Thoughts? Opinions?
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


Cheers, Andreas





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-06 14:34:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 0/8] Allow GFP_NOFS allocation to fail

On Wed 05-08-15 20:58:25, Andreas Dilger wrote:
> On Aug 5, 2015, at 3:51 AM, [email protected] wrote:
[...]
> > The rest are the FS specific patches to fortify allocations
> > requests which are really needed to finish transactions without RO
> > remounts. There might be more needed but my test case survives with
> > these in place.
>
> Wouldn't it make more sense to order the fs-specific patches _before_
> the "GFP_NOFS can fail" patch (#3), so that once that patch is applied
> all known failures have already been fixed? Otherwise it could show
> test failures during bisection that would be confusing.

As I write below. If maintainers consider them useful even when GFP_NOFS
doesn't fail I will reword them and resend. But you cannot fix the world
without breaking it first in this case ;)

> > They would obviously need some rewording if they are going to be
> > applied even without Patch3 and I will do that if respective
> > maintainers will take them. Ext3 and JBD are going away soon so they
> > might be dropped but they have been in the tree while I was testing
> > so I've kept them.

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-12 09:14:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

On Wed 05-08-15 09:49:24, Greg Thelen wrote:
>
> [email protected] wrote:
>
> > From: Michal Hocko <[email protected]>
> >
> > Journal transaction might fail prematurely because the frozen_buffer
> > is allocated by GFP_NOFS request:
> > [ 72.440013] do_get_write_access: OOM for frozen_buffer
> > [ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> > [ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
> > (...snipped....)
> > [ 72.495559] do_get_write_access: OOM for frozen_buffer
> > [ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> > [ 72.496839] do_get_write_access: OOM for frozen_buffer
> > [ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> > [ 72.505766] Aborting journal on device sda1-8.
> > [ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only
> >
> > This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> > allocations upon OOM" because small GPF_NOFS allocations never failed.
> > This allocation seems essential for the journal and GFP_NOFS is too
> > restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> > emulate the previous behavior.
> >
> > jbd code has the very same issue so let's do the same there as well.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > fs/jbd/transaction.c | 11 +----------
> > fs/jbd2/transaction.c | 14 +++-----------
> > 2 files changed, 4 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> > index 1695ba8334a2..bf7474deda2f 100644
> > --- a/fs/jbd/transaction.c
> > +++ b/fs/jbd/transaction.c
> > @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> > jbd_unlock_bh_state(bh);
> > frozen_buffer =
> > jbd_alloc(jh2bh(jh)->b_size,
> > - GFP_NOFS);
> > - if (!frozen_buffer) {
> > - printk(KERN_ERR
> > - "%s: OOM for frozen_buffer\n",
> > - __func__);
> > - JBUFFER_TRACE(jh, "oom!");
> > - error = -ENOMEM;
> > - jbd_lock_bh_state(bh);
> > - goto done;
> > - }
> > + GFP_NOFS|__GFP_NOFAIL);
> > goto repeat;
> > }
> > jh->b_frozen_data = frozen_buffer;
> > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > index ff2f2e6ad311..bff071e21553 100644
> > --- a/fs/jbd2/transaction.c
> > +++ b/fs/jbd2/transaction.c
> > @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> > jbd_unlock_bh_state(bh);
> > frozen_buffer =
> > jbd2_alloc(jh2bh(jh)->b_size,
> > - GFP_NOFS);
> > - if (!frozen_buffer) {
> > - printk(KERN_ERR
> > - "%s: OOM for frozen_buffer\n",
> > - __func__);
> > - JBUFFER_TRACE(jh, "oom!");
> > - error = -ENOMEM;
> > - jbd_lock_bh_state(bh);
> > - goto done;
> > - }
> > + GFP_NOFS|__GFP_NOFAIL);
> > goto repeat;
> > }
> > jh->b_frozen_data = frozen_buffer;
> > @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
> >
> > repeat:
> > if (!jh->b_committed_data) {
> > - committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> > + committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> > + GFP_NOFS|__GFP_NOFAIL);
> > if (!committed_data) {
> > printk(KERN_ERR "%s: No memory for committed data\n",
> > __func__);
>
> Is this "if (!committed_data) {" check now dead code?
>
> I also see other similar suspected dead sites in the rest of the series.

You are absolutely right. I have updated the patches.

Thanks!
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-15 13:54:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
> > Is this "if (!committed_data) {" check now dead code?
> >
> > I also see other similar suspected dead sites in the rest of the series.
>
> You are absolutely right. I have updated the patches.

Have you sent out an updated version of these patches? Maybe I missed
it, but I don't think I saw them.

Thanks,

- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-18 10:36:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

On Sat 15-08-15 09:54:22, Theodore Ts'o wrote:
> On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
> > > Is this "if (!committed_data) {" check now dead code?
> > >
> > > I also see other similar suspected dead sites in the rest of the series.
> >
> > You are absolutely right. I have updated the patches.
>
> Have you sent out an updated version of these patches? Maybe I missed
> it, but I don't think I saw them.

I haven't yet. I was waiting for more feedback and didn't want to spam
the mailing list too much. I will post them now.
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-18 10:38:23

by Michal Hocko

[permalink] [raw]
Subject: [RFC -v2 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

From: Michal Hocko <[email protected]>

Journal transaction might fail prematurely because the frozen_buffer
is allocated by GFP_NOFS request:
[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped....)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
allocations upon OOM" because small GPF_NOFS allocations never failed.
This allocation seems essential for the journal and GFP_NOFS is too
restrictive to the memory allocator so let's use __GFP_NOFAIL here to
emulate the previous behavior.

jbd code has the very same issue so let's do the same there as well.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/jbd/transaction.c | 11 +----------
fs/jbd2/transaction.c | 23 ++++-------------------
2 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
jbd_unlock_bh_state(bh);
frozen_buffer =
jbd_alloc(jh2bh(jh)->b_size,
- GFP_NOFS);
- if (!frozen_buffer) {
- printk(KERN_ERR
- "%s: OOM for frozen_buffer\n",
- __func__);
- JBUFFER_TRACE(jh, "oom!");
- error = -ENOMEM;
- jbd_lock_bh_state(bh);
- goto done;
- }
+ GFP_NOFS|__GFP_NOFAIL);
goto repeat;
}
jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..4d63c5911afa 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
jbd_unlock_bh_state(bh);
frozen_buffer =
jbd2_alloc(jh2bh(jh)->b_size,
- GFP_NOFS);
- if (!frozen_buffer) {
- printk(KERN_ERR
- "%s: OOM for frozen_buffer\n",
- __func__);
- JBUFFER_TRACE(jh, "oom!");
- error = -ENOMEM;
- jbd_lock_bh_state(bh);
- goto done;
- }
+ GFP_NOFS|__GFP_NOFAIL);
goto repeat;
}
jh->b_frozen_data = frozen_buffer;
@@ -1156,15 +1147,9 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
goto out;

repeat:
- if (!jh->b_committed_data) {
- committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
- if (!committed_data) {
- printk(KERN_ERR "%s: No memory for committed data\n",
- __func__);
- err = -ENOMEM;
- goto out;
- }
- }
+ if (!jh->b_committed_data)
+ committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+ GFP_NOFS|__GFP_NOFAIL);

jbd_lock_bh_state(bh);
if (!jh->b_committed_data) {
--
2.5.0

--
Michal Hocko
SUSE Labs

2015-08-18 10:39:03

by Michal Hocko

[permalink] [raw]
Subject: [RFC -v2 5/8] ext4: Do not fail journal due to block allocator

From: Michal Hocko <[email protected]>

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
memory allocator doesn't endlessly loop to satisfy low-order allocations
and instead fails them to allow callers to handle them gracefully.

Some of the callers are not yet prepared for this behavior though. ext4
block allocator relies solely on GFP_NOFS allocation requests and
allocation failures lead to aborting yournal too easily:

[ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G W 4.0.0-nofs3-00006-gdfe9931f5f68 #588
[ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 345.028339] 0000000000000000 ffff880005a17708 ffffffff81538a54 ffffffff8107a40f
[ 345.028341] 0000000000000050 ffff880005a17798 ffffffff810fe854 0000000180000000
[ 345.028342] 0000000000000046 0000000000000000 ffffffff81a52100 0000000000000246
[ 345.028343] Call Trace:
[ 345.028348] [<ffffffff81538a54>] dump_stack+0x4f/0x7b
[ 345.028370] [<ffffffff810fe854>] warn_alloc_failed+0x12a/0x13f
[ 345.028373] [<ffffffff81101bd2>] __alloc_pages_nodemask+0x7f3/0x8aa
[ 345.028375] [<ffffffff810f9933>] pagecache_get_page+0x12a/0x1c9
[ 345.028390] [<ffffffffa005bc64>] ext4_mb_load_buddy+0x220/0x367 [ext4]
[ 345.028414] [<ffffffffa006014f>] ext4_free_blocks+0x522/0xa4c [ext4]
[ 345.028425] [<ffffffffa0054e14>] ext4_ext_remove_space+0x833/0xf22 [ext4]
[ 345.028434] [<ffffffffa005677e>] ext4_ext_truncate+0x8c/0xb0 [ext4]
[ 345.028441] [<ffffffffa00342bf>] ext4_truncate+0x20b/0x38d [ext4]
[ 345.028462] [<ffffffffa003573c>] ext4_evict_inode+0x32b/0x4c1 [ext4]
[ 345.028464] [<ffffffff8116d04f>] evict+0xa0/0x148
[ 345.028466] [<ffffffff8116dca8>] iput+0x1a1/0x1f0
[ 345.028468] [<ffffffff811697b4>] __dentry_kill+0x136/0x1a6
[ 345.028470] [<ffffffff81169a3e>] dput+0x21a/0x243
[ 345.028472] [<ffffffff81157cda>] __fput+0x184/0x19b
[ 345.028473] [<ffffffff81157d29>] ____fput+0xe/0x10
[ 345.028475] [<ffffffff8105a05f>] task_work_run+0x8a/0xa1
[ 345.028477] [<ffffffff810452f0>] do_exit+0x3c6/0x8dc
[ 345.028482] [<ffffffff8104588a>] do_group_exit+0x4d/0xb2
[ 345.028483] [<ffffffff8104eeeb>] get_signal+0x5b1/0x5f5
[ 345.028488] [<ffffffff81002202>] do_signal+0x28/0x5d0
[...]
[ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[ 345.033097] Aborting journal on device hdb1-8.
[ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because GFP_NOFS allocation context is
very restricted - especially in the fs metadata heavy loads. Before we
go with a more sofisticated solution, let's simply imitate the previous
behavior of non-failing NOFS allocation and use __GFP_NOFAIL for the
buddy block allocator. I wasn't able to trigger the issue with this
patch anymore.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/ext4/mballoc.c | 52 ++++++++++++++++++++++++----------------------------
1 file changed, 24 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..0360ea32c30f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,9 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
block = group * 2;
pnum = block / blocks_per_page;
poff = block % blocks_per_page;
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
- if (!page)
- return -ENOMEM;
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
BUG_ON(page->mapping != inode->i_mapping);
e4b->bd_bitmap_page = page;
e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
@@ -1006,9 +1005,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,

block++;
pnum = block / blocks_per_page;
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
- if (!page)
- return -ENOMEM;
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
BUG_ON(page->mapping != inode->i_mapping);
e4b->bd_buddy_page = page;
return 0;
@@ -1158,20 +1156,19 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
* wait for it to initialize.
*/
page_cache_release(page);
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
- if (page) {
- BUG_ON(page->mapping != inode->i_mapping);
- if (!PageUptodate(page)) {
- ret = ext4_mb_init_cache(page, NULL);
- if (ret) {
- unlock_page(page);
- goto err;
- }
- mb_cmp_bitmaps(e4b, page_address(page) +
- (poff * sb->s_blocksize));
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
+ BUG_ON(page->mapping != inode->i_mapping);
+ if (!PageUptodate(page)) {
+ ret = ext4_mb_init_cache(page, NULL);
+ if (ret) {
+ unlock_page(page);
+ goto err;
}
- unlock_page(page);
+ mb_cmp_bitmaps(e4b, page_address(page) +
+ (poff * sb->s_blocksize));
}
+ unlock_page(page);
}
if (page == NULL) {
ret = -ENOMEM;
@@ -1194,18 +1191,17 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
if (page == NULL || !PageUptodate(page)) {
if (page)
page_cache_release(page);
- page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
- if (page) {
- BUG_ON(page->mapping != inode->i_mapping);
- if (!PageUptodate(page)) {
- ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
- if (ret) {
- unlock_page(page);
- goto err;
- }
+ page = find_or_create_page(inode->i_mapping, pnum,
+ GFP_NOFS|__GFP_NOFAIL);
+ BUG_ON(page->mapping != inode->i_mapping);
+ if (!PageUptodate(page)) {
+ ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
+ if (ret) {
+ unlock_page(page);
+ goto err;
}
- unlock_page(page);
}
+ unlock_page(page);
}
if (page == NULL) {
ret = -ENOMEM;
--
2.5.0

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-18 10:39:37

by Michal Hocko

[permalink] [raw]
Subject: [RFC -v2 6/8] ext3: Do not abort journal prematurely

From: Michal Hocko <[email protected]>

journal_get_undo_access is relying on GFP_NOFS allocation yet it is
essential for the journal transaction:

[ 83.256914] journal_get_undo_access: No memory for committed data
[ 83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out
of memory in __ext3_journal_get_undo_access
[ 83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of
memory
[ 83.267130] Aborting journal on device hdb1.
[ 83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected
aborted journal
[ 83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
these allocation requests are allowed to fail so we need to use
__GFP_NOFAIL to imitate the previous behavior.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/jbd/transaction.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index bf7474deda2f..2151b80276c3 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -886,15 +886,8 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
goto out;

repeat:
- if (!jh->b_committed_data) {
- committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
- if (!committed_data) {
- printk(KERN_ERR "%s: No memory for committed data\n",
- __func__);
- err = -ENOMEM;
- goto out;
- }
- }
+ if (!jh->b_committed_data)
+ committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL);

jbd_lock_bh_state(bh);
if (!jh->b_committed_data) {
--
2.5.0

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-18 10:40:32

by Michal Hocko

[permalink] [raw]
Subject: [RFC -v2 7/8] btrfs: Prevent from early transaction abort

From: Michal Hocko <[email protected]>

Btrfs relies on GFP_NOFS allocation when commiting the transaction but
since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
those allocations are allowed to fail which can lead to a pre-mature
transaction abort:

[ 55.328093] Call Trace:
[ 55.328890] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.330518] [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[ 55.332738] [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[ 55.336844] [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[ 55.349434] [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.370654] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.372246] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.373851] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] ------------[ cut here ]------------
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.384357] [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[ 55.384359] [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.384593] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.384594] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.384595] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path
with the explicit __GFP_NOFAIL.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/btrfs/extent_io.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..d855ddffd5fe 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,9 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
{
struct extent_buffer *eb = NULL;

- eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
- if (eb == NULL)
- return NULL;
+ eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
eb->start = start;
eb->len = len;
eb->fs_info = fs_info;
@@ -4867,9 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
return NULL;

for (i = 0; i < num_pages; i++, index++) {
- p = find_or_create_page(mapping, index, GFP_NOFS);
- if (!p)
- goto free_eb;
+ p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);

spin_lock(&mapping->private_lock);
if (PagePrivate(p)) {
--
2.5.0

--
Michal Hocko
SUSE Labs

2015-08-18 10:41:16

by Michal Hocko

[permalink] [raw]
Subject: [RFC -v2 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio

From: Michal Hocko <[email protected]>

alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm:
page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is
allowed to fail which can lead to
[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/btrfs/volumes.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..42b9949dd71d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,9 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
* and the stripes
*/
sizeof(u64) * (total_stripes),
- GFP_NOFS);
- if (!bbio)
- return NULL;
+ GFP_NOFS|__GFP_NOFAIL);

atomic_set(&bbio->error, 0);
atomic_set(&bbio->refs, 1);
--
2.5.0

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-18 10:55:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2 5/8] ext4: Do not fail journal due to block allocator

On Tue 18-08-15 12:39:03, Michal Hocko wrote:
[...]
> @@ -992,9 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
> block = group * 2;
> pnum = block / blocks_per_page;
> poff = block % blocks_per_page;
> - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> - if (!page)
> - return -ENOMEM;
> + page = find_or_create_page(inode->i_mapping, pnum,
> + GFP_NOFS|__GFP_NOFAIL);

Scratch this one. find_or_create_page is allowed to return NULL. The
patch is bogus. I was overly eager to turn all places to not check the
return value.

--
Michal Hocko
SUSE Labs

2015-08-18 11:01:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort

On Tue 18-08-15 12:40:31, Michal Hocko wrote:
[...]
> @@ -4867,9 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> return NULL;
>
> for (i = 0; i < num_pages; i++, index++) {
> - p = find_or_create_page(mapping, index, GFP_NOFS);
> - if (!p)
> - goto free_eb;
> + p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>
> spin_lock(&mapping->private_lock);
> if (PagePrivate(p)) {

Same here. find_or_create_page might return NULL.
---
>From f430e5f54367b8815e1099f26fedd2873b597a07 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 15 Jul 2015 19:27:06 +0200
Subject: [PATCH] btrfs: Prevent from early transaction abort

Btrfs relies on GFP_NOFS allocation when commiting the transaction but
since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
those allocations are allowed to fail which can lead to a pre-mature
transaction abort:

[ 55.328093] Call Trace:
[ 55.328890] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.330518] [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[ 55.332738] [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[ 55.336844] [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[ 55.349434] [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.370654] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.372246] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.373851] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] ------------[ cut here ]------------
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[ 55.384357] [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[ 55.384359] [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[ 55.384593] [<ffffffff81186869>] do_fsync+0x34/0x4e
[ 55.384594] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[ 55.384595] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path
with the explicit __GFP_NOFAIL.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/btrfs/extent_io.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..f4d6eea975d7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,9 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
{
struct extent_buffer *eb = NULL;

- eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
- if (eb == NULL)
- return NULL;
+ eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
eb->start = start;
eb->len = len;
eb->fs_info = fs_info;
@@ -4867,7 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
return NULL;

for (i = 0; i < num_pages; i++, index++) {
- p = find_or_create_page(mapping, index, GFP_NOFS);
+ p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
if (!p)
goto free_eb;

--
2.5.0

--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-18 17:11:44

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort

On Tue, Aug 18, 2015 at 12:40:32PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Btrfs relies on GFP_NOFS allocation when commiting the transaction but
> since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> those allocations are allowed to fail which can lead to a pre-mature
> transaction abort:

I can either put the btrfs nofail ones on my pull for Linus, or you can
add my sob and send as one unit. Just let me know how you'd rather do
it.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-18 17:29:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort

On Tue 18-08-15 13:11:44, Chris Mason wrote:
> On Tue, Aug 18, 2015 at 12:40:32PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Btrfs relies on GFP_NOFS allocation when commiting the transaction but
> > since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> > those allocations are allowed to fail which can lead to a pre-mature
> > transaction abort:
>
> I can either put the btrfs nofail ones on my pull for Linus, or you can
> add my sob and send as one unit. Just let me know how you'd rather do
> it.

OK, I will rephrase the changelogs (tomorrow) to not refer to an
unmerged patch and would appreciate if you can take them and route them
through your tree. I will then drop them from my pile.

Thanks.
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2015-08-19 12:26:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort

On Tue 18-08-15 19:29:14, Michal Hocko wrote:
> On Tue 18-08-15 13:11:44, Chris Mason wrote:
> > On Tue, Aug 18, 2015 at 12:40:32PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Btrfs relies on GFP_NOFS allocation when commiting the transaction but
> > > since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> > > those allocations are allowed to fail which can lead to a pre-mature
> > > transaction abort:
> >
> > I can either put the btrfs nofail ones on my pull for Linus, or you can
> > add my sob and send as one unit. Just let me know how you'd rather do
> > it.
>
> OK, I will rephrase the changelogs (tomorrow) to not refer to an
> unmerged patch and would appreciate if you can take them and route them
> through your tree. I will then drop them from my pile.

Poste in a separate thread
http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2015-08-24 12:06:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

Hi Ted,

On Sat 15-08-15 09:54:22, Theodore Ts'o wrote:
> On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
> > > Is this "if (!committed_data) {" check now dead code?
> > >
> > > I also see other similar suspected dead sites in the rest of the series.
> >
> > You are absolutely right. I have updated the patches.
>
> Have you sent out an updated version of these patches? Maybe I missed
> it, but I don't think I saw them.

would you be interested in these two patches sent with rephrased
changelog to not depend on the patch which allows GFP_NOFS to fail? The
way this has been handled for btrfs...
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2016-03-13 21:37:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC -v2 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

On Tue, Aug 18, 2015 at 12:38:23PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Journal transaction might fail prematurely because the frozen_buffer
> is allocated by GFP_NOFS request:
> [ 72.440013] do_get_write_access: OOM for frozen_buffer
> [ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
> (...snipped....)
> [ 72.495559] do_get_write_access: OOM for frozen_buffer
> [ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.496839] do_get_write_access: OOM for frozen_buffer
> [ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [ 72.505766] Aborting journal on device sda1-8.
> [ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only
>
> This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> allocations upon OOM" because small GPF_NOFS allocations never failed.
> This allocation seems essential for the journal and GFP_NOFS is too
> restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> emulate the previous behavior.
>
> jbd code has the very same issue so let's do the same there as well.
>
> Signed-off-by: Michal Hocko <[email protected]>

Applied, thanks.

- Ted