Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
From: Mike Kravetz
To: Michal Hocko
Cc: Nadav Amit, Nadia Yvette Chambers, Linux Kernel Mailing List, Eric Biggers, Andrew Morton
Date: Mon, 28 Aug 2017 11:51:28 -0700

On 08/28/2017 11:09 AM, Michal Hocko wrote:
> On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
>> Adding Andrew, Michal on CC
>>
>> On 08/27/2017 01:08 PM, Nadav Amit wrote:
>>> Mike Kravetz wrote:
>>>
>>>> On 08/26/2017 12:11 PM, Nadav Amit wrote:
>>>>> hugetlbfs_fallocate() currently performs put_page() before unlock_page().
>>>>> This scenario opens a small time window, from the time the page is added
>>>>> to the page cache, until it is unlocked, in which the page might be
>>>>> removed from the page-cache by another core. If the page is removed
>>>>> during this time window, it might cause a memory corruption, as the
>>>>> wrong page will be unlocked.
>>>>>
>>>>> It is arguable whether this scenario can happen in a real system, and
>>>>> there are several mitigating factors. The issue was found by code
>>>>> inspection (actually grep), and not by actually triggering the flow.
>>>>> Yet, since putting the page before unlocking is incorrect it should be
>>>>> fixed, if only to prevent future breakage or someone copy-pasting this
>>>>> code.
>>>>>
>>>>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
>>>>>
>>>>> cc: Eric Biggers
>>>>> cc: Mike Kravetz
>>>>>
>>>>> Signed-off-by: Nadav Amit
>>>>
>>>> Thank you Nadav.
>>>
>>> No problem.
>>>
>>>>
>>>> Reviewed-by: Mike Kravetz
>>>>
>>>> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
>>>> able to remove a page (file content) is through an inode operation such as
>>>> truncate, hole punch, or unlink. That was the basis for my response that
>>>> the inode lock would be required for page freeing.
>>>>
>>>> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
>>>> I was expecting to see a check for hugetlbfs pages and exit (without
>>>> modification) if encountered. A quick review of the code did not find
>>>> any such checks.
>>>>
>>>> I'll take a closer look to determine exactly how hugetlbfs files are
>>>> handled. IMO, there should be something similar to the DAX check where
>>>> the routine quickly exits.
>>>
>>> I did not cc stable when submitting the patch, based on your previous
>>> response. Let me know if you want me to send v2 which does so.
>>
>> I still do not believe there is a need to change this in stable. Your patch
>> should be sufficient to ensure we do the right thing going forward.
>>
>> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
>> hugetlbfs does indeed show a more general problem. One can use
>> sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
>> not go through the special hugetlbfs page handling code, but rather the
>> normal mm paths.
>> As a result hugetlbfs accounting (like reserve counts)
>> gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!
>>
>> I will address this issue in a separate patch.
>
> I didn't check very carefully but it seems that
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
> should help here, right?

Thanks Michal. Yes, that patch addresses the above issue with hugetlbfs. I was
also wondering if there were similar issues with other in memory filesystems.
Looks like there are.

-- 
Mike Kravetz
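
The put_page()/unlock_page() ordering described in the quoted patch can be
illustrated with a small userspace model. This is only a sketch under stated
assumptions, not kernel code: struct toy_page and every toy_* function are
hypothetical stand-ins invented for the illustration, the "other core" is
simulated by an ordinary function call, and real page-cache locking rules are
not modelled. It follows the scenario as the patch description states it: a
page holds one reference for the caller and one for the page cache, and a
racing removal drops the page cache's reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; NOT the kernel's definition. */
struct toy_page {
    int refcount;            /* references currently held on the page */
    bool locked;             /* models the page lock */
    bool freed;              /* set once the last reference is dropped */
    bool touched_after_free; /* set if the page is touched after being freed */
};

/* Models put_page(): drop one reference, "free" the page on the last one. */
static void toy_put_page(struct toy_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0)
        p->freed = true;
}

/* Models unlock_page(): clears the lock and records a use-after-free. */
static void toy_unlock_page(struct toy_page *p)
{
    if (p->freed)
        p->touched_after_free = true;
    p->locked = false;
}

/* Models another core removing the page from the page cache,
 * which drops the reference the page cache was holding. */
static void toy_remove_from_cache(struct toy_page *p)
{
    toy_put_page(p);
}

/* A freshly added, locked page: one reference for the caller,
 * one for the page cache. */
static struct toy_page toy_new_locked_page(void)
{
    struct toy_page p = { .refcount = 2, .locked = true,
                          .freed = false, .touched_after_free = false };
    return p;
}

/* Old order: put, then (racing removal), then unlock.
 * Returns true if the unlock touched an already freed page. */
bool toy_put_then_unlock(void)
{
    struct toy_page p = toy_new_locked_page();
    toy_put_page(&p);          /* caller's reference is gone */
    toy_remove_from_cache(&p); /* racing removal drops the last reference */
    toy_unlock_page(&p);       /* unlock now touches a freed page */
    return p.touched_after_free;
}

/* Fixed order: unlock while the caller still holds its reference. */
bool toy_unlock_then_put(void)
{
    struct toy_page p = toy_new_locked_page();
    toy_unlock_page(&p);       /* page cannot be freed yet */
    toy_remove_from_cache(&p); /* racing removal drops the cache reference */
    toy_put_page(&p);          /* caller drops the last reference */
    return p.touched_after_free;
}
```

In this model toy_put_then_unlock() reports that the unlock touched a freed
page, while toy_unlock_then_put() does not, because the caller's reference
keeps the page alive until after the unlock.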