Date: Fri, 15 May 2015 16:31:24 +0300
From: "Kirill A. Shutemov"
To: Vlastimil Babka
Cc: "Kirill A. Shutemov", Andrew Morton, Andrea Arcangeli, Hugh Dickins,
    Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
    Naoya Horiguchi, Steve Capper, "Aneesh Kumar K.V", Johannes Weiner,
    Michal Hocko, Jerome Marchand, Sasha Levin,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCHv5 00/28] THP refcounting redesign
Message-ID: <20150515133124.GB6625@node.dhcp.inet.fi>
In-Reply-To: <5555B49B.3050901@suse.cz>

On Fri, May 15, 2015 at 10:55:55AM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >Hello everybody,
> >
> >Here's reworked version of my patchset. All known issues were addressed.
> >
> >The goal of patchset is to make refcounting on THP pages cheaper with
> >simpler semantics and allow the same THP compound page to be mapped with
> >PMD and PTEs. This is required to get reasonable THP-pagecache
> >implementation.
> >
> >With the new refcounting design it's much easier to protect against
> >split_huge_page(): simple reference on a page will make you the deal.
> >It makes gup_fast() implementation simpler and doesn't require
> >special-case in futex code to handle tail THP pages.
> >
> >It should improve THP utilization over the system since splitting THP in
> >one process doesn't necessary lead to splitting the page in all other
> >processes have the page mapped.
> >
> >The patchset drastically lower complexity of get_page()/put_page()
> >codepaths. I encourage reviewers look on this code before-and-after to
> >justify time budget on reviewing this patchset.
> >
> >= Changelog =
> >
> >v5:
> >  - Tested-by: Sasha Levin!™
> >  - re-split patchset in hope to improve readability;
> >  - rebased on top of page flags and ->mapping sanitizing patchset;
> >  - uncharge compound_mapcount rather than mapcount for hugetlb pages
> >    during removing from rmap;
> >  - differentiate page_mapped() from page_mapcount() for compound pages;
> >  - rework deferred_split_huge_page() to use shrinker interface;
> >  - fix race in page_remove_rmap();
> >  - get rid of __get_page_tail();
> >  - few random bug fixes;
> >v4:
> >  - fix sizes reported in smaps;
> >  - defines instead of enum for RMAP_{EXCLUSIVE,COMPOUND};
> >  - skip THP pages on munlock_vma_pages_range(): they are never mlocked;
> >  - properly handle huge zero page on FOLL_SPLIT;
> >  - fix lock_page() slow path on tail pages;
> >  - account page_get_anon_vma() fail to THP_SPLIT_PAGE_FAILED;
> >  - fix split_huge_page() on huge page with unmapped head page;
> >  - fix transfering 'write' and 'young' from pmd to ptes on split_huge_pmd;
> >  - call page_remove_rmap() in unfreeze_page under ptl.
> >
> >= Design overview =
> >
> >The main reason why we can't map THP with 4k is how refcounting on THP
> >designed. It built around two requirements:
> >
> >  - split of huge page should never fail;
> >  - we can't change interface of get_user_page();
> >
> >To be able to split huge page at any point we have to track which tail
> >page was pinned. It leads to tricky and expensive get_page() on tail pages
> >and also occupy tail_page->_mapcount.
> >
> >Most split_huge_page*() users want PMD to be split into table of PTEs and
> >don't care whether compound page is going to be split or not.
> >
> >The plan is:
> >
> >  - allow split_huge_page() to fail if the page is pinned. It's trivial to
> >    split non-pinned page and it doesn't require tail page refcounting, so
> >    tail_page->_mapcount is free to be reused.
> >
> >  - introduce new routine -- split_huge_pmd() -- to split PMD into table of
> >    PTEs. It splits only one PMD, not touching other PMDs the page is
> >    mapped with or underlying compound page. Unlike new split_huge_page(),
> >    split_huge_pmd() never fails.
> >
> >Fortunately, we have only few places where split_huge_page() is needed:
> >swap out, memory failure, migration, KSM. And all of them can handle
> >split_huge_page() fail.
> >
> >In new scheme we use page->_mapcount is used to account how many time
> >the page is mapped with PTEs. We have separate compound_mapcount() to
> >count mappings with PMD. page_mapcount() returns sum of PTE and PMD
> >mappings of the page.
>
> It would be very beneficial to describe the scheme in full, both before in
> after. The latter goes also for the Documentation patch, where you fixed
> what wasn't true anymore, but I think the picture wasn't complete neither
> before, nor is it now. There's the lwn article [1] which helps a lot, but we
> shouldn't rely on that exclusively.
>
> So the full scheme should include at least:
> - where were/are pins and mapcounts stored
> - what exactly get_page()/put_page() did/does now
> - etc.
>
> [1] https://lwn.net/Articles/619738/

Okay. Will do.

-- 
 Kirill A. Shutemov
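
The accounting described in the quoted design overview can be sketched as a
toy user-space model. This is only an illustration under stated assumptions,
not code from the patchset: struct toy_thp and every toy_* helper are
hypothetical names, the counters start at 0 (the kernel's _mapcount starts
at -1), 512 subpages assumes 4k base pages under a 2M huge page, and pinning
is reduced to an explicit counter.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_SUBPAGES 512 /* assumption: 2M huge page made of 4k subpages */

/* Hypothetical toy model of one THP compound page; not a kernel structure. */
struct toy_thp {
	int pte_mapcount[TOY_NR_SUBPAGES]; /* per-subpage PTE mappings (role of page->_mapcount) */
	int compound_mapcount;             /* PMD mappings of the whole page */
	int pins;                          /* explicit pin counter, a simplification */
};

/* Map the whole page with one PMD. */
static void toy_map_pmd(struct toy_thp *p)
{
	p->compound_mapcount++;
}

/* Map a single subpage with a PTE. */
static void toy_map_pte(struct toy_thp *p, size_t idx)
{
	assert(idx < TOY_NR_SUBPAGES);
	p->pte_mapcount[idx]++;
}

/* Sum of PTE and PMD mappings of one subpage, as page_mapcount() is described. */
static int toy_page_mapcount(const struct toy_thp *p, size_t idx)
{
	assert(idx < TOY_NR_SUBPAGES);
	return p->pte_mapcount[idx] + p->compound_mapcount;
}

/* Split one PMD mapping into a table of PTEs; never fails, page stays compound. */
static void toy_split_huge_pmd(struct toy_thp *p)
{
	assert(p->compound_mapcount > 0);
	for (size_t i = 0; i < TOY_NR_SUBPAGES; i++)
		p->pte_mapcount[i]++;
	p->compound_mapcount--;
}

/* Splitting the compound page itself is allowed to fail: refuse while pinned. */
static int toy_split_huge_page(const struct toy_thp *p)
{
	return p->pins > 0 ? -1 : 0;
}
```

In this model, splitting a PMD moves one mapping from compound_mapcount to
every per-subpage counter, so the sum reported for any subpage is the same
before and after toy_split_huge_pmd(). Only toy_split_huge_page() can fail,
and only while the page is pinned.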