Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752114AbbKKDj7 (ORCPT ); Tue, 10 Nov 2015 22:39:59 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:24608 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751621AbbKKDj6 (ORCPT ); Tue, 10 Nov 2015 22:39:58 -0500 Subject: Re: [PATCH v2] mm/hugetlbfs Fix bugs in fallocate hole punch of areas with holes To: Naoya Horiguchi References: <1447205881-11629-1-git-send-email-mike.kravetz@oracle.com> <20151111025802.GA18535@hori1.linux.bs1.fc.nec.co.jp> From: Mike Kravetz Cc: "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , Hugh Dickins , Andrew Morton , Dave Hansen , Davidlohr Bueso Message-ID: <5642B881.4030401@oracle.com> Date: Tue, 10 Nov 2015 19:39:45 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: <20151111025802.GA18535@hori1.linux.bs1.fc.nec.co.jp> Content-Type: text/plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-Source-IP: aserv0022.oracle.com [141.146.126.234] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7074 Lines: 212 On 11/10/2015 06:58 PM, Naoya Horiguchi wrote: > Hello Mike, > > On Tue, Nov 10, 2015 at 05:38:01PM -0800, Mike Kravetz wrote: >> This is against linux-stable 4.3. Will send to stable@vger.kernel.org >> when Ack'ed here. > > This is not what stable stuff works, please see > Documentation/stable_kernel_rules.txt. Ok. I'll resend with the Cc. > >> Hugh Dickins pointed out problems with the new hugetlbfs fallocate >> hole punch code. These problems are in the routine remove_inode_hugepages >> and mostly occur in the case where there are holes in the range of >> pages to be removed. These holes could be the result of a previous hole >> punch or simply sparse allocation. >> >> remove_inode_hugepages handles both hole punch and truncate operations. >> Page index handling was fixed/cleaned up so that the loop index always >> matches the page being processed. The code now only makes a single pass >> through the range of pages as it was determined page faults could not >> race with truncate. A cond_resched() was added after removing up to >> PAGEVEC_SIZE pages. >> >> Some totally unnecessary code in hugetlbfs_fallocate() that remained from >> early development was also removed. >> >> v2: >> Make remove_inode_hugepages simpler after verifying truncate can not >> race with page faults here. >> >> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") > > Cc: stable@vger.kernel.org [4.3] Will add. > >> Signed-off-by: Mike Kravetz >> --- >> fs/hugetlbfs/inode.c | 57 ++++++++++++++++++++++++++-------------------------- >> 1 file changed, 29 insertions(+), 28 deletions(-) >> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >> index 316adb9..8290f39 100644 >> --- a/fs/hugetlbfs/inode.c >> +++ b/fs/hugetlbfs/inode.c >> @@ -332,12 +332,15 @@ static void remove_huge_page(struct page *page) >> * truncation is indicated by end of range being LLONG_MAX >> * In this case, we first scan the range and release found pages. >> * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv >> - * maps and global counts. >> + * maps and global counts. Page faults can not race with truncation >> + * in this routine. hugetlb_no_page() prevents page faults in the >> + * truncated range. > > Could you be specific about how/why? Maybe hugetlb_fault_mutex_hash and/or > i_size check should be mentioned, because it's not so obvious. The long explanation is here: http://marc.info/?l=linux-mm&m=144719585221409&w=2 I will include a brief summary here. > >> * hole punch is indicated if end is not LLONG_MAX >> * In the hole punch case we scan the range and release found pages. >> * Only when releasing a page is the associated region/reserv map >> * deleted. The region/reserv map for ranges without associated >> - * pages are not modified. >> + * pages are not modified. Page faults can race with hole punch. >> + * This is indicated if we find a mapped page. >> * Note: If the passed end of range value is beyond the end of file, but >> * not LLONG_MAX this routine still performs a hole punch operation. >> */ >> @@ -361,44 +364,38 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, >> next = start; >> while (next < end) { >> /* >> - * Make sure to never grab more pages that we >> - * might possibly need. >> + * Don't grab more pages than the number left in the range. >> */ >> if (end - next < lookup_nr) >> lookup_nr = end - next; >> >> /* >> - * This pagevec_lookup() may return pages past 'end', >> - * so we must check for page->index > end. >> + * When no more pages are found, we are done. >> */ >> - if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) { >> - if (next == start) >> - break; >> - next = start; >> - continue; >> - } >> + if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) >> + break; >> >> for (i = 0; i < pagevec_count(&pvec); ++i) { >> struct page *page = pvec.pages[i]; >> u32 hash; >> >> + /* >> + * The page (index) could be beyond end. This is >> + * only possible in the punch hole case as end is >> + * max page offset in the truncate case. >> + */ >> + next = page->index; >> + if (next >= end) >> + break; >> + >> hash = hugetlb_fault_mutex_hash(h, current->mm, >> &pseudo_vma, >> mapping, next, 0); >> mutex_lock(&hugetlb_fault_mutex_table[hash]); >> >> lock_page(page); >> - if (page->index >= end) { >> - unlock_page(page); >> - mutex_unlock(&hugetlb_fault_mutex_table[hash]); >> - next = end; /* we are done */ >> - break; >> - } >> - >> /* >> - * If page is mapped, it was faulted in after being >> - * unmapped. Do nothing in this race case. In the >> - * normal case page is not mapped. >> + * In the normal case the page is not mapped. >> */ >> if (!page_mapped(page)) { > > I feel that doing like "likely(!page_mapped(page))" without comment is enough > and self-descriptive. > Ok, makes sense >> bool rsv_on_error = !PagePrivate(page); >> @@ -421,17 +418,24 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, >> hugetlb_fix_reserve_counts( >> inode, rsv_on_error); >> } >> + } else { >> + /* >> + * If page is mapped, it was faulted in after >> + * being unmapped. It indicates a race between >> + * hole punch and page fault. Do nothing in >> + * this case. Getting here in a truncate >> + * operation is a bug. >> + */ >> + BUG_ON(truncate_op); >> } >> >> - if (page->index > next) >> - next = page->index; >> - >> ++next; > > My comment was ignored for some reason? > http://marc.info/?l=linux-kernel&m=144705235903057&w=2 My apologies. I somehow overlooked that e-mail. It was not my intention to ignore your comments. >From that comment, I agree than the ++next should be moved outside the for look. > > Anyway, I think the patch's idea is OK, so > > Reviewed-by: Naoya Horiguchi Thanks for your comments. I'll respin shortly and incorporate your comments. -- Mike Kravetz > > Thanks, > Naoya Horiguchi > >> unlock_page(page); >> >> mutex_unlock(&hugetlb_fault_mutex_table[hash]); >> } >> huge_pagevec_release(&pvec); >> + cond_resched(); >> } >> >> if (truncate_op) >> @@ -647,9 +651,6 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, >> if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) >> i_size_write(inode, offset + len); >> inode->i_ctime = CURRENT_TIME; >> - spin_lock(&inode->i_lock); >> - inode->i_private = NULL; >> - spin_unlock(&inode->i_lock); >> out: >> mutex_unlock(&inode->i_mutex); >> return error; >> -- >> 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/