Date: Fri, 16 Dec 2016 16:07:30 +0300
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Michal Hocko <mhocko@kernel.org>
Cc: Vegard Nossum <vegard.nossum@oracle.com>, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org, Rik van Riel <riel@redhat.com>,
        Matthew Wilcox <mawilcox@microsoft.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Al Viro <viro@zeniv.linux.org.uk>, Ingo Molnar <mingo@kernel.org>,
        Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: crash during oom reaper
Message-ID: <20161216130730.GF27758@node>
References: <20161216082202.21044-1-vegard.nossum@oracle.com>
 <20161216082202.21044-4-vegard.nossum@oracle.com>
 <20161216090157.GA13940@dhcp22.suse.cz>
 <d944e3ca-07d4-c7d6-5025-dc101406b3a7@oracle.com>
 <20161216101113.GE13940@dhcp22.suse.cz>
 <20161216104438.GD27758@node>
 <20161216114243.GG13940@dhcp22.suse.cz>
 <20161216123555.GE27758@node>
 <20161216125650.GJ13940@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161216125650.GJ13940@dhcp22.suse.cz>
User-Agent: Mutt/1.5.23.1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3848
Lines: 99

On Fri, Dec 16, 2016 at 01:56:50PM +0100, Michal Hocko wrote:
> On Fri 16-12-16 15:35:55, Kirill A. Shutemov wrote:
> > On Fri, Dec 16, 2016 at 12:42:43PM +0100, Michal Hocko wrote:
> > > On Fri 16-12-16 13:44:38, Kirill A. Shutemov wrote:
> > > > On Fri, Dec 16, 2016 at 11:11:13AM +0100, Michal Hocko wrote:
> > > > > On Fri 16-12-16 10:43:52, Vegard Nossum wrote:
> > > > > [...]
> > > > > > I don't think it's a bug in the OOM reaper itself, but either of the
> > > > > > > following two patches will fix the problem (without my understanding how or
> > > > > > why):
> > > > > > 
> > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > index ec9f11d4f094..37b14b2e2af4 100644
> > > > > > --- a/mm/oom_kill.c
> > > > > > +++ b/mm/oom_kill.c
> > > > > > @@ -485,7 +485,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk,
> > > > > > struct mm_struct *mm)
> > > > > >  	 */
> > > > > >  	mutex_lock(&oom_lock);
> > > > > > 
> > > > > > -	if (!down_read_trylock(&mm->mmap_sem)) {
> > > > > > +	if (!down_write_trylock(&mm->mmap_sem)) {
> > > > > 
> > > > > __oom_reap_task_mm is basically the same thing as MADV_DONTNEED and that
> > > > > > doesn't require the exclusive mmap_sem. So this looks correct to me.
> > > > 
> > > > BTW, shouldn't we filter out all VM_SPECIAL VMAs there? Or VM_PFNMAP at
> > > > least.
> > > > 
> > > > MADV_DONTNEED doesn't touch VM_PFNMAP, but I don't see anything matching
> > > > on __oom_reap_task_mm() side.
> > > 
> > > I guess you are right and we should match the MADV_DONTNEED behavior
> > > here. Care to send a patch?
> > 
> > Below. Testing required.
> > 
> > > > Another difference is that you use unmap_page_range(), which doesn't touch
> > > > mmu_notifiers. MADV_DONTNEED goes via zap_page_range(), which invalidates
> > > > the range. Not sure if it can make any difference here.
> > > 
> > > Which mmu notifier would care about this? I am not really familiar with
> > > those users so I might miss something easily.
> > 
> > No idea either.
> > 
> > Is there any reason not to use zap_page_range here too?
> 
> Yes, zap_page_range is much heavier and performs operations which
> might lock AFAIR, which I would really like to avoid.

What exactly can block there? I don't see anything with that potential.

> > Few more notes:
> > 
> > I'm probably missing something, but why do we need details->ignore_dirty?
> >
> > It is only applied to non-anon pages, but since we filter out shared
> > mappings, how can we have pte_dirty() for !PageAnon()?
> 
> Why couldn't we have dirty pages on the private file mappings? The
> underlying page might be still in the page cache, right?

The check is about dirty PTE, not dirty page.
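
To illustrate from userspace (a rough sketch, not kernel code; the
function name and the temp file path are made up for the example, and it
assumes Linux with a writable /tmp): a write through a MAP_PRIVATE file
mapping lands in a private copy and the file is left untouched.
MADV_DONTNEED then drops that copy and the mapping reads the file
contents again.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper, not taken from the kernel tree.
 * Returns 0 if the behaviour described above is observed,
 * 1 if it is not, and -1 on a setup error.
 */
int private_file_write_is_cow(void)
{
	char path[] = "/tmp/cow-demo-XXXXXX";	/* assumed writable location */
	long page = sysconf(_SC_PAGESIZE);
	char in_file = 0;
	char *p;
	int fd, ok;

	if (page <= 0)
		return -1;
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the file lives on until fd is closed */

	/* One page of file data whose first byte is 'F'. */
	if (ftruncate(fd, page) != 0 || pwrite(fd, "F", 1, 0) != 1) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	p[0] = 'M';	/* write fault: we get a private copy of the page */

	/* The mapping sees 'M', the file itself must still hold 'F'. */
	ok = pread(fd, &in_file, 1, 0) == 1 && in_file == 'F' && p[0] == 'M';

	/* Drop the private copy; the next read faults the file page back in. */
	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0)
		ok = 0;
	ok = ok && p[0] == 'F';

	munmap(p, (size_t)page);
	close(fd);
	return ok ? 0 : 1;
}
```

It only shows the visible effect of the private copy. It says nothing
about what the reaper should do with details->ignore_dirty.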

> > check_swap_entries is also sloppy: the behavior doesn't match the comment:
> > details == NULL makes it check swap entries. I removed it and restored the
> > details->check_mapping test as we had before.
> 
> The reason is unmap_mapping_range, which didn't use to check swap entries,
> so I wanted to have it opt in AFAIR.

details == NULL would give you that in both cases.

> > @@ -531,8 +519,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> >  		 * count elevated without a good reason.
> >  		 */
> >  		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > -			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > -					 &details);
> > +			madvise_dontneed(vma, &vma, vma->vm_start, vma->vm_end);
> 
> I would rather keep the unmap_page_range because it is the bare minimum
> we have to do. Currently we are doing
> 
> 		if (is_vm_hugetlb_page(vma))
> 			continue;
> 
> so I would rather do something like
> 		if (!can_vma_madv_dontneed(vma))
> 			continue;
> instead.

We can do that.
But let's first understand why the code should differ from madvise_dontneed().
It's not obvious to me.
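
For reference, this is the userspace-visible behavior of MADV_DONTNEED
that the reaper is being compared to, as a minimal sketch (assumes Linux;
the function name is made up for the example and none of this is kernel
code): on a private anonymous mapping the pages are dropped and the next
access gets zero-filled memory.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo helper, not taken from the kernel tree.
 * Returns 0 if the page reads back as zeros after MADV_DONTNEED,
 * 1 if old data survived, and -1 on a setup error.
 */
int anon_dontneed_zero_fills(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t len, i;
	int result = 0;

	if (page <= 0)
		return -1;
	len = (size_t)page;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAB, len);	/* touch the page so it is really populated */

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* Reading faults in a fresh zero page; the old contents are gone. */
	for (i = 0; i < len; i++) {
		if (p[i] != 0) {
			result = 1;
			break;
		}
	}

	munmap(p, len);
	return result;
}
```

The mapping stays valid across the call, only its contents are thrown
away.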

-- 
 Kirill A. Shutemov