Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751396AbdH1SJT (ORCPT ); Mon, 28 Aug 2017 14:09:19 -0400
Received: from mx2.suse.de ([195.135.220.15]:54724 "EHLO mx1.suse.de"
        rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
        id S1751213AbdH1SJS (ORCPT ); Mon, 28 Aug 2017 14:09:18 -0400
Date: Mon, 28 Aug 2017 20:09:15 +0200
From: Michal Hocko
To: Mike Kravetz
Cc: Nadav Amit, Nadia Yvette Chambers, Linux Kernel Mailing List,
        Eric Biggers, Andrew Morton
Subject: Re: [PATCH] hugetlbfs: change put_page/unlock_page order in hugetlbfs_fallocate()
Message-ID: <20170828180913.GA22106@dhcp22.suse.cz>
References: <20170826210905.GA21712@zzz.localdomain>
        <20170826191124.51642-1-namit@vmware.com>
        <6bf36198-0693-5735-7180-6529aa4c29e4@oracle.com>
        <09e63000-97fd-dbc3-6a3b-c606e0d73e15@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <09e63000-97fd-dbc3-6a3b-c606e0d73e15@oracle.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3123
Lines: 71

On Mon 28-08-17 10:45:58, Mike Kravetz wrote:
> Adding Andrew, Michal on CC
>
> On 08/27/2017 01:08 PM, Nadav Amit wrote:
> > Mike Kravetz wrote:
> >
> >> On 08/26/2017 12:11 PM, Nadav Amit wrote:
> >>> hugetlfs_fallocate() currently performs put_page() before unlock_page().
> >>> This scenario opens a small time window, from the time the page is added
> >>> to the page cache, until it is unlocked, in which the page might be
> >>> removed from the page-cache by another core. If the page is removed
> >>> during this time windows, it might cause a memory corruption, as the
> >>> wrong page will be unlocked.
> >>>
> >>> It is arguable whether this scenario can happen in a real system, and
> >>> there are several mitigating factors. The issue was found by code
> >>> inspection (actually grep), and not by actually triggering the flow.
> >>> Yet, since putting the page before unlocking is incorrect it should be
> >>> fixed, if only to prevent future breakage or someone copy-pasting this
> >>> code.
> >>>
> >>> Fixes: 70c3547e36f5c ("hugetlbfs: add hugetlbfs_fallocate()")
> >>>
> >>> cc: Eric Biggers
> >>> cc: Mike Kravetz
> >>>
> >>> Signed-off-by: Nadav Amit
> >>
> >> Thank you Nadav.
> >
> > No problem.
> >
> >>
> >> Reviewed-by: Mike Kravetz
> >>
> >> Since hugetlbfs is an in memory filesystem, the only way one 'should' be
> >> able to remove a page (file content) is through an inode operation such as
> >> truncate, hole punch, or unlink. That was the basis for my response that
> >> the inode lock would be required for page freeing.
> >>
> >> Eric's question about sys_fadvise64(POSIX_FADV_DONTNEED) is interesting.
> >> I was expecting to see a check for hugetlbfs pages and exit (without
> >> modification) if encountered. A quick review of the code did not find
> >> any such checks.
> >>
> >> I'll take a closer look to determine exactly how hugetlbfs files are
> >> handled. IMO, there should be something similar to the DAX check where
> >> the routine quickly exits.
> >
> > I did not cc stable when submitting the patch, based on your previous
> > response. Let me know if you want me to send v2 which does so.
>
> I still do not believe there is a need to change this in stable. Your patch
> should be sufficient to ensure we do the right thing going forward.
>
> Looking at and testing the sys_fadvise64(POSIX_FADV_DONTNEED) code with
> hugetlbfs does indeed show a more general problem. One can use
> sys_fadvise64() to remove a huge page from a hugetlbfs file. :( This does
> not go through the special hugetlbfs page handling code, but rather the
> normal mm paths. As a result hugetlbfs accounting (like reserve counts)
> gets out of sync and the hugetlbfs filesystem may become unusable. Sigh!!!
>
> I will address this issue in a separate patch.

I didn't check very carefully but it seems that
http://ozlabs.org/~akpm/mmotm/broken-out/mm-fadvise-avoid-fadvise-for-fs-without-backing-device.patch
should help here, right?
--
Michal Hocko
SUSE Labs