From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Jerome Glisse
Cc: Jérôme Glisse, akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Linus Torvalds, joro@8bytes.org, Mel Gorman,
	"H. Peter Anvin", Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy, Joe Donohue,
	Christophe Harle, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Liran Liss, Roland Dreier,
	Ben Sander, Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Leonid Shamis, Laurent Morichetti, Alexander Deucher, Jatin Kumar
Subject: Re: [PATCH v12 08/29] HMM: add device page fault support v6.
Date: Wed, 23 Mar 2016 15:59:32 +0530
Message-ID: <87egb1trlf.fsf@linux.vnet.ibm.com>
In-Reply-To: <20160323100919.GA2888@gmail.com>
References: <1457469802-11850-1-git-send-email-jglisse@redhat.com>
	<1457469802-11850-9-git-send-email-jglisse@redhat.com>
	<87h9fxu1nc.fsf@linux.vnet.ibm.com> <20160323100919.GA2888@gmail.com>

Jerome Glisse writes:

> [ text/plain ]
> On Wed, Mar 23, 2016 at 12:22:23PM +0530, Aneesh Kumar K.V wrote:
>> Jérôme Glisse writes:
>>
>> > [ text/plain ]
>> > This patch adds helpers for device page faults. These helpers fill
>> > the mirror page table using the CPU page table, synchronizing with
>> > any update to the CPU page table.
>> >
>> > Changed since v1:
>> >   - Add comment about directory lock.
>> >
>> > Changed since v2:
>> >   - Check for mirror->hmm in hmm_mirror_fault()
>> >
>> > Changed since v3:
>> >   - Adapt to HMM page table changes.
>> >
>> > Changed since v4:
>> >   - Fix PROT_NONE, ie do not populate from protnone pte.
>> >   - Fix huge pmd handling (start address may != pmd start address)
>> >   - Fix missing entry case.
>> >
>> > Signed-off-by: Jérôme Glisse
>> > Signed-off-by: Sherry Cheung
>> > Signed-off-by: Subhash Gutti
>> > Signed-off-by: Mark Hairgrove
>> > Signed-off-by: John Hubbard
>> > Signed-off-by: Jatin Kumar
>> > ---
>>
>>
>> ....
>> ....
>>
>> > +static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
>> > +				  struct hmm_event *event,
>> > +				  struct vm_area_struct *vma,
>> > +				  struct hmm_pt_iter *iter,
>> > +				  pmd_t *pmdp,
>> > +				  struct hmm_mirror_fault *mirror_fault,
>> > +				  unsigned long start,
>> > +				  unsigned long end)
>> > +{
>> > +	struct page *page;
>> > +	unsigned long addr, pfn;
>> > +	unsigned flags = FOLL_TOUCH;
>> > +	spinlock_t *ptl;
>> > +	int ret;
>> > +
>> > +	ptl = pmd_lock(mirror->hmm->mm, pmdp);
>> > +	if (unlikely(!pmd_trans_huge(*pmdp))) {
>> > +		spin_unlock(ptl);
>> > +		return -EAGAIN;
>> > +	}
>> > +	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
>> > +	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
>> > +	pfn = page_to_pfn(page);
>> > +	spin_unlock(ptl);
>> > +
>> > +	/* Just fault in the whole PMD. */
>> > +	start &= PMD_MASK;
>> > +	end = start + PMD_SIZE - 1;
>> > +
>> > +	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
>> > +		return -ENOENT;
>> > +
>> > +	for (ret = 0, addr = start; !ret && addr < end;) {
>> > +		unsigned long i, next = end;
>> > +		dma_addr_t *hmm_pte;
>> > +
>> > +		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
>> > +		if (!hmm_pte)
>> > +			return -ENOMEM;
>> > +
>> > +		i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
>> > +
>> > +		/*
>> > +		 * The directory lock protect against concurrent clearing of
>> > +		 * page table bit flags. Exceptions being the dirty bit and
>> > +		 * the device driver private flags.
>> > +		 */
>> > +		hmm_pt_iter_directory_lock(iter);
>> > +		do {
>> > +			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
>> > +				hmm_pte[i] = hmm_pte_from_pfn(pfn);
>> > +				hmm_pt_iter_directory_ref(iter);
>>
>> I looked at that and it is actually
>>
>> static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
>> {
>> 	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
>> 	hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
>> }
>>
>> static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
>> 					struct page *ptd)
>> {
>> 	if (!atomic_inc_not_zero(&ptd->_mapcount))
>> 		/* Illegal this should not happen. */
>> 		BUG();
>> }
>>
>> What is the mapcount update about ?
>
> Unlike the regular CPU page table we do not rely on unmap to prune the HMM
> mirror page table. Rather we free/prune it aggressively once the device no
> longer has anything mirrored in a given range.

Which patch does this ?

> As such, mapcount is used to keep track of how many valid entries there are
> per directory.
>
> Moreover, mapcount is also used to protect against concurrent pruning: when
> you walk through the page table you increment the refcount by one along
> your way, and when you are done walking you decrement it.
>
> Because of that last aspect, the mapcount can never reach zero just because
> we unmap pages; it can only reach zero once we clean up the page table walk.
>
>> > +			}
>> > +			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
>> > +			if (pmd_write(*pmdp))
>> > +				hmm_pte_set_write(&hmm_pte[i]);
>> > +		} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
>> > +		hmm_pt_iter_directory_unlock(iter);
>> > +		mirror_fault->addr = addr;
>> > +	}
>> > +
>>
>> So we don't have huge page mapping in the hmm page table ?
>
> No we don't right now. The first reason is that I wanted to keep things
> simple for device drivers. The second motivation is to keep the first
> patchset simpler, especially the page migration code.
>
> Memory overhead is 2MB per GB of virtual memory mirrored. There is no TLB
> here. I believe adding huge pages can be done as part of a later patchset
> if it makes sense.
>
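The 2MB per GB number adds up, assuming 8 bytes per leaf entry: 1GB / 4KB is
262144 ptes, and 262144 * 8 bytes is 2MB of leaf directories (plus a little
for the upper levels).

Going back to the mapcount, let me restate my reading: the _mapcount of a
mirror directory page is a combined count of valid entries plus in-flight
walkers, so a directory can only be freed once the last entry is cleared
*and* the last walker has dropped its reference. Below is a grossly
simplified userspace toy model I wrote to check that reading; every name in
it is mine, nothing comes from the series, and it ignores the initial
reference the iterator itself holds while populating:

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of a mirror directory: one counter standing in for
 * page->_mapcount, covering both valid entries and active walkers. */
struct toy_directory {
	atomic_int refcount;
};

/* A walker pins the directory before touching it... */
static void toy_walk_start(struct toy_directory *ptd)
{
	atomic_fetch_add(&ptd->refcount, 1);
}

/* ...and unpins it when done; returns true if the caller should free it. */
static bool toy_walk_end(struct toy_directory *ptd)
{
	return atomic_fetch_sub(&ptd->refcount, 1) == 1;
}

/* Filling a valid entry also takes a reference on the directory. */
static void toy_entry_set(struct toy_directory *ptd)
{
	atomic_fetch_add(&ptd->refcount, 1);
}

/* Clearing an entry drops it; returns true if the caller should free it. */
static bool toy_entry_clear(struct toy_directory *ptd)
{
	return atomic_fetch_sub(&ptd->refcount, 1) == 1;
}

int main(void)
{
	struct toy_directory ptd = { .refcount = 0 };

	toy_walk_start(&ptd);           /* walker holds a reference */
	toy_entry_set(&ptd);            /* device fault fills one entry */

	/* Aggressive prune clears the entry, but the directory survives
	 * because the walker still holds its reference. */
	assert(!toy_entry_clear(&ptd));

	/* Only once the walk ends can the count reach zero and the
	 * directory be freed. */
	assert(toy_walk_end(&ptd));

	printf("directory freed only after the walk finished\n");
	return 0;
}

If that reading is right, a comment near hmm_pt_directory_ref() spelling it
out would help.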
One of the things I am wondering is whether we can do the patch series in
such a way that the page table mirror moves into the device driver. That
is, an hmm fault would look at the CPU page table and call into a device
driver callback with the pte entry details, and it would be up to the
device driver to maintain a mirror table if it needs one. Similarly, for a
CPU fault we would call into an hmm callback to find the per-pte dma_addr
and do a migrate using a copy_from_device callback. I haven't fully looked
at how easy this would be, but I guess a lot of the code in this series has
to do with the mirror table, and I am wondering whether there is a simpler
version we can get upstream that hides it within a driver. Also, does it
simplify things to have interfaces that operate on one pte rather than on
an array of ptes ? A very rough sketch of the kind of callbacks I have in
mind is in the P.S. below.

-aneesh
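P.S. Here is that sketch. Every name in it is hypothetical, invented only to
make the idea concrete; nothing below comes from the series, and I have not
thought about locking or error handling at all:

/*
 * Hypothetical ops a driver would register with HMM if HMM itself kept no
 * mirror page table.  On a device fault HMM walks the CPU page table and
 * hands the pte details to the driver one pte at a time; on a CPU fault it
 * asks the driver where the data lives and to copy it back.
 */
struct hmm_device_ops {
	/* Device faulted: this is what the CPU page table says for addr. */
	int (*update_pte)(struct hmm_mirror *mirror, unsigned long addr,
			  unsigned long pfn, bool write);

	/* CPU faulted on memory migrated to the device: report its dma_addr. */
	int (*lookup_dma_addr)(struct hmm_mirror *mirror, unsigned long addr,
			       dma_addr_t *dma_addr);

	/* Copy one page back from device memory before the CPU touches it. */
	int (*copy_from_device)(struct hmm_mirror *mirror, unsigned long addr,
				dma_addr_t dma_addr, struct page *dst);
};

A driver that wants a full mirror page table could still build one inside
update_pte(); a driver that does not need one would just program its own
MMU directly from the callback.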