Hi,
I encounter a problem which I'd like to discuss here (tested on 3.10 and 4.5).
While running some workloads we noticed that in case of "improper" application
exit (like SIGTERM) quite a bit (a few GBs) of memory is not being reclaimed
after process termination.
Executing echo 1 > /proc/sys/vm/compact_memory makes the memory available again.
This memory is not reclaimed so OOM will kill process trying to allocate memory
which technically should be available.
Such behavior is present only when THP are [always] enabled.
Disabling it makes the issue not visible to the naked eye.
An important information is that it is visible mostly due to large amount of CPUs
in the system (>200) and amount of missing memory varies with the number of CPUs.
This memory seems to not be accounted anywhere, but I was able to found it on
per cpu lru_add_pvec lists thanks to Dave Hansen's suggestion.
Knowing that I am able to reproduce this problem with much simpler code:
//compile with: gcc repro.c -o repro -fopenmp
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include "omp.h"
int main() {
#pragma omp parallel
{
size_t size = 55*1000*1000; // tweaked for 288cpus, "leaks" ~3.5GB
unsigned long nodemask = 1;
void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
if(p)
memset(p, 0, size);
//munmap(p, size); // uncomment to make the problem go away
}
return 0;
}
Exemplary execution:
$ numactl -H | grep "node 1" | grep MB
node 1 size: 16122 MB
node 1 free: 16026 MB
$ ./repro
$ numactl -H | grep "node 1" | grep MB
node 1 size: 16122 MB
node 1 free: 13527 MB
After a couple of minutes on idle system some of this memory is reclaimed, but never all
unless I run tasks on every CPU:
node 1 size: 16122 MB
node 1 free: 14823 MB
Pieces of the puzzle:
A) after process termination memory is not getting freed nor accounted as free
B) memory cannot be allocated by other processes (unless it is allocated by all CPUs)
I am not sure whether it is expected behavior or a side effect of something else not
going as it should. Temporarily I added lru_add_drain_all() to try_to_free_pages()
which sort of hammers B case, but A is still present.
I am not familiar with this code, but I feel like draining lru_add work should be split
into smaller pieces and done by kswapd to fix A and drain only as much pages as
needed in try_to_free_pages to fix B.
Any comments/ideas/patches for a proper fix are welcome.
Thanks,
Lukas
On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
> Pieces of the puzzle:
> A) after process termination memory is not getting freed nor accounted as free
I don't think this part is necessarily a bug. As long as we have stats
*somewhere*, and we really do "reclaim" them, I don't think we need to
call these pages "free".
> I am not sure whether it is expected behavior or a side effect of something else not
> going as it should. Temporarily I added lru_add_drain_all() to try_to_free_pages()
> which sort of hammers B case, but A is still present.
It's not expected behavior. It's an unanticipated side effect of large
numbers of cpu threads, large pages on the LRU, and (relatively) small
zones.
> I am not familiar with this code, but I feel like draining lru_add work should be split
> into smaller pieces and done by kswapd to fix A and drain only as much pages as
> needed in try_to_free_pages to fix B.
>
> Any comments/ideas/patches for a proper fix are welcome.
Here are my suggestions. I've passed these along multiple times, but I
guess I'll repeat them again for good measure.
> 1. We need some statistics on the number and total *SIZES* of all pages
> in the lru pagevecs. It's too opaque now.
> 2. We need to make darn sure we drain the lru pagevecs before failing
> any kind of allocation.
> 3. We need some way to drain the lru pagevecs directly. Maybe the buddy
> pcp lists too.
> 4. We need to make sure that a zone_reclaim_mode=0 system still drains
> too.
> 5. The VM stats and their updates are now related to how often
> drain_zone_pages() gets run. That might be interacting here too.
6. Perhaps don't use the LRU pagevecs for large pages. It limits the
severity of the problem.
On Wed 27-04-16 10:11:04, Dave Hansen wrote:
> On 04/27/2016 10:01 AM, Odzioba, Lukasz wrote:
[...]
> > 1. We need some statistics on the number and total *SIZES* of all pages
> > in the lru pagevecs. It's too opaque now.
> > 2. We need to make darn sure we drain the lru pagevecs before failing
> > any kind of allocation.
lru_add_drain_all is unfortunatelly too costly (especially on large
machines). You are right that failing an allocation with a lot of cached
pages is less than suboptimal though. So maybe we can do it from the
slow path after the first round of direct reclaim failed to allocate
anything. Something like the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd65d9fb76a..0743c58c2e9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3559,6 +3559,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
enum compact_result compact_result;
int compaction_retries = 0;
int no_progress_loops = 0;
+ bool drained_lru = false;
/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3667,6 +3668,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;
+ if (!drained_lru) {
+ drained_lru = true;
+ lru_add_drain_all();
+ }
+
/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto noretry;
The downside would be that we really depend on the WQ to make any
progress here. If we are really out of memory then we are screwed so
we would need a flush_work_timeout() or something else that would
guarantee maximum timeout. That something else might be to stop using WQ
and move the flushing into the IRQ context. Not for free too but at
least not dependant on having some memory to make a progress.
> > 3. We need some way to drain the lru pagevecs directly. Maybe the buddy
> > pcp lists too.
> > 4. We need to make sure that a zone_reclaim_mode=0 system still drains
> > too.
> > 5. The VM stats and their updates are now related to how often
> > drain_zone_pages() gets run. That might be interacting here too.
>
> 6. Perhaps don't use the LRU pagevecs for large pages. It limits the
> severity of the problem.
7. Hook into vmstat and flush from there? This would drain them
periodically but it would also introduce an undeterministic interference
as well.
--
Michal Hocko
SUSE Labs