Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752758AbcJNX6G (ORCPT ); Fri, 14 Oct 2016 19:58:06 -0400
Received: from userp1040.oracle.com ([156.151.31.81]:40696 "EHLO
        userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751195AbcJNX55 (ORCPT );
        Fri, 14 Oct 2016 19:57:57 -0400
Subject: Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
To: Jan Stancek, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <57FF7BB4.1070202@redhat.com> <277142fc-330d-76c7-1f03-a1c8ac0cf336@oracle.com> <58009BE2.5010805@redhat.com>
Cc: hillf.zj@alibaba-inc.com, dave.hansen@linux.intel.com,
        kirill.shutemov@linux.intel.com, mhocko@suse.cz,
        n-horiguchi@ah.jp.nec.com, aneesh.kumar@linux.vnet.ibm.com,
        iamjoonsoo.kim@lge.com
From: Mike Kravetz
Message-ID: <0c9e132e-694c-17cd-1890-66fcfd2e8a0d@oracle.com>
Date: Fri, 14 Oct 2016 16:57:31 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0
MIME-Version: 1.0
In-Reply-To: <58009BE2.5010805@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Source-IP: aserv0022.oracle.com [141.146.126.234]
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 10583
Lines: 277

On 10/14/2016 01:48 AM, Jan Stancek wrote:
> On 10/14/2016 01:26 AM, Mike Kravetz wrote:
>>
>> Hi Jan,
>>
>> Any chance you can get the contents of /sys/kernel/mm/hugepages
>> before and after the first run of libhugetlbfs testsuite on Power?
>> Perhaps a script like:
>>
>> cd /sys/kernel/mm/hugepages
>> for f in hugepages-*/*; do
>>     n=`cat $f`;
>>     echo -e "$n\t$f";
>> done
>>
>> Just want to make sure the numbers look as they should.
>>
>
> Hi Mike,
>
> Numbers are below. I have also isolated a single testcase from "func"
> group of tests: corrupt-by-cow-opt [1]. This test stops working if I
> run it 19 times (with 20 hugepages). And if I disable this test,
> "func" group tests can all pass repeatedly.

Thanks Jan, I appreciate your efforts.

>
> [1] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/corrupt-by-cow-opt.c
>
> Regards,
> Jan
>
> Kernel is v4.8-14230-gb67be92, with reboot between each run.
> 1) Only func tests
> System boot
> After setup:
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 0 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>
> After func tests:
> ********** TEST SUMMARY
> *                      16M
> *                      32-bit 64-bit
> *     Total testcases:     0     85
> *             Skipped:     0      0
> *                PASS:     0     81
> *                FAIL:     0      4
> *    Killed by signal:     0      0
> *   Bad configuration:     0      0
> *       Expected FAIL:     0      0
> *     Unexpected PASS:     0      0
> * Strange test result:     0      0
>
> 26 hugepages-16384kB/free_hugepages
> 26 hugepages-16384kB/nr_hugepages
> 26 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 1 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>
> After test cleanup:
> umount -a -t hugetlbfs
> hugeadm --pool-pages-max ${HPSIZE}:0
>
> 1 hugepages-16384kB/free_hugepages
> 1 hugepages-16384kB/nr_hugepages
> 1 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 1 hugepages-16384kB/resv_hugepages
> 1 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>

I am guessing the leaked reserve page is the one triggered by running
the test you isolated, corrupt-by-cow-opt.

> ---
>
> 2) Only stress tests
> System boot
> After setup:
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 0 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>
> After stress tests:
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 17 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>
> After cleanup:
> 17 hugepages-16384kB/free_hugepages
> 17 hugepages-16384kB/nr_hugepages
> 17 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 17 hugepages-16384kB/resv_hugepages
> 17 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>

This looks worse than the summary after running the functional tests.

> ---
>
> 3) only corrupt-by-cow-opt
>
> System boot
> After setup:
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 0 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>
> libhugetlbfs-2.18# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3298
> Write s to 0x3effff000000 via shared mapping
> Write p to 0x3effff000000 via private mapping
> Read s from 0x3effff000000 via shared mapping
> PASS
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 1 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages

Leaked one reserve page

>
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3312
> Write s to 0x3effff000000 via shared mapping
> Write p to 0x3effff000000 via private mapping
> Read s from 0x3effff000000 via shared mapping
> PASS
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 2 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages

It is pretty consistent that we leak a reserve page every time this
test is run.

The interesting thing is that corrupt-by-cow-opt is a very simple
test case. Commit 67961f9db8c4 potentially changes the return value
of the functions vma_has_reserves() and vma_needs/commit_reservation()
for the owner (HPAGE_RESV_OWNER) of private mappings. Running the
test with and without the commit results in the same return values for
these routines on x86, and no leaked reserve pages.

Is it possible to revert this commit and run the libhugetlbfs tests
(func and stress) again while monitoring the counts in /sys? The
counts should go to zero after cleanup as you describe above. I just
want to make sure that this commit is causing all the problems you are
seeing. If it is, then we can consider reverting and I can try to
think of another way to address the original issue.

Thanks for your efforts on this. I cannot reproduce on x86 or sparc
and do not see any similar symptoms on these architectures.

--
Mike Kravetz

>
> (... output cut from ~17 iterations ...)
>
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3686
> Write s to 0x3effff000000 via shared mapping
> Bus error
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 19 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3700
> Write s to 0x3effff000000 via shared mapping
> FAIL mmap() 2: Cannot allocate memory
> 20 hugepages-16384kB/free_hugepages
> 20 hugepages-16384kB/nr_hugepages
> 20 hugepages-16384kB/nr_hugepages_mempolicy
> 0 hugepages-16384kB/nr_overcommit_hugepages
> 19 hugepages-16384kB/resv_hugepages
> 0 hugepages-16384kB/surplus_hugepages
> 0 hugepages-16777216kB/free_hugepages
> 0 hugepages-16777216kB/nr_hugepages
> 0 hugepages-16777216kB/nr_hugepages_mempolicy
> 0 hugepages-16777216kB/nr_overcommit_hugepages
> 0 hugepages-16777216kB/resv_hugepages
> 0 hugepages-16777216kB/surplus_hugepages
>
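
The check discussed in this thread (resv_hugepages should return to
zero once the tests are done and cleaned up) could be automated roughly
as below. This is only a sketch: leaked_reserves and the HUGEPAGES_ROOT
override are hypothetical names introduced here, not part of
libhugetlbfs or the kernel, and the script assumes the sysfs layout
shown in the dumps above.

```shell
#!/bin/sh
# Hypothetical helper, not part of libhugetlbfs or the kernel tree.
# HUGEPAGES_ROOT is an assumed override so the check can also be run
# against a saved copy of the sysfs directory.
HUGEPAGES_ROOT="${HUGEPAGES_ROOT:-/sys/kernel/mm/hugepages}"

# Print "<pool> <count>" for every pool whose resv_hugepages is nonzero.
# Returns 1 if any pool still holds reservations, 0 otherwise.
leaked_reserves() {
    leaked=0
    for f in "$HUGEPAGES_ROOT"/hugepages-*/resv_hugepages; do
        # Skip the unexpanded glob when no pool directories exist.
        [ -r "$f" ] || continue
        n=$(cat "$f")
        if [ "$n" -ne 0 ]; then
            pool=$(basename "$(dirname "$f")")
            echo "$pool $n"
            leaked=1
        fi
    done
    return "$leaked"
}
```

Calling leaked_reserves after the umount/hugeadm cleanup step, with no
hugetlbfs users left, should print nothing on a healthy system; any
output at that point would match the leak seen in the dumps above.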