From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Jérôme Glisse, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Linus Torvalds, joro@8bytes.org, Mel Gorman, "H. Peter Anvin", Peter Zijlstra, Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle, Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander, Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti, Alexander Deucher, Jérôme Glisse
Subject: Re: [PATCH v12 21/29] HMM: mm add helper to update page table when migrating memory back v2.
In-Reply-To: <1457469802-11850-22-git-send-email-jglisse@redhat.com>
References: <1457469802-11850-1-git-send-email-jglisse@redhat.com> <1457469802-11850-22-git-send-email-jglisse@redhat.com>
Date: Mon, 21 Mar 2016 16:57:32 +0530
Message-ID: <877fgwul3v.fsf@linux.vnet.ibm.com>

Jérôme Glisse writes:

> To migrate memory back we first need to lock the HMM special CPU page
> table entries so we know no one else might try to migrate those
> entries back. The helper also allocates the new pages into which data
> will be copied back from the device. Then we can proceed with the
> device DMA operation.
>
> Once DMA is done we can update the CPU page table again to point to
> the new pages that hold the content copied back from device memory.
>
> Note that we do not need to invalidate the range as we are only
> modifying non-present CPU page table entries.
>
> Changed since v1:
>   - Save the memcg against which each page is precharged as it might
>     change along the way.
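
Just to check my understanding of the intended flow before diving into
the code: below is a rough sketch of how I picture a driver using these
two helpers. Everything outside the two mm_hmm_* calls (device, hmm_pte,
npages, my_device_dma_copy_back()) is made up for illustration and is
not part of this patch.

        unsigned long npages = (end - start) >> PAGE_SHIFT;
        pte_t *new_pte;
        int ret;

        new_pte = kcalloc(npages, sizeof(*new_pte), GFP_KERNEL);
        if (!new_pte)
                return -ENOMEM;

        /* Lock the HMM entries and allocate the destination pages. */
        ret = mm_hmm_migrate_back(mm, vma, new_pte, start, end);
        if (ret)
                goto out;

        /* Placeholder for the device DMA copying data into the new pages. */
        ret = my_device_dma_copy_back(device, new_pte, start, end);

        /*
         * I assume this is what makes the copied pages visible in the CPU
         * page table (and unlocks whatever could not be migrated).
         */
        mm_hmm_migrate_back_cleanup(mm, vma, new_pte, hmm_pte, start, end);
out:
        kfree(new_pte);
        return ret;

Is that roughly right?
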
>
> Signed-off-by: Jérôme Glisse
> ---
>  include/linux/mm.h |  12 +++
>  mm/memory.c        | 257 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 269 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c5c062e..1cd060f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2392,6 +2392,18 @@ static inline void hmm_mm_init(struct mm_struct *mm)
>  {
>          mm->hmm = NULL;
>  }
> +
> +int mm_hmm_migrate_back(struct mm_struct *mm,
> +                        struct vm_area_struct *vma,
> +                        pte_t *new_pte,
> +                        unsigned long start,
> +                        unsigned long end);
> +void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
> +                                 struct vm_area_struct *vma,
> +                                 pte_t *new_pte,
> +                                 dma_addr_t *hmm_pte,
> +                                 unsigned long start,
> +                                 unsigned long end);
>  #else /* !CONFIG_HMM */
>  static inline void hmm_mm_init(struct mm_struct *mm)
>  {
> diff --git a/mm/memory.c b/mm/memory.c
> index 3cb3653..d917911a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3513,6 +3513,263 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
>
> +
> +#ifdef CONFIG_HMM
> +/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
> + *
> + * @mm: The mm struct.
> + * @vma: The vm area struct the range is in.
> + * @new_pte: Array of new CPU page table entry value.
> + * @start: Start address of the range (inclusive).
> + * @end: End address of the range (exclusive).
> + *
> + * This function will lock HMM page table entry and allocate new page for entry
> + * it successfully locked.
> + */

Can you add more comments around this?

> +int mm_hmm_migrate_back(struct mm_struct *mm,
> +                        struct vm_area_struct *vma,
> +                        pte_t *new_pte,
> +                        unsigned long start,
> +                        unsigned long end)
> +{
> +        pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
> +        unsigned long addr, i;
> +        int ret = 0;
> +
> +        VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
> +
> +        if (unlikely(anon_vma_prepare(vma)))
> +                return -ENOMEM;
> +
> +        start &= PAGE_MASK;
> +        end = PAGE_ALIGN(end);
> +        memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
> +
> +        for (addr = start; addr < end;) {
> +                unsigned long cstart, next;
> +                spinlock_t *ptl;
> +                pgd_t *pgdp;
> +                pud_t *pudp;
> +                pmd_t *pmdp;
> +                pte_t *ptep;
> +
> +                pgdp = pgd_offset(mm, addr);
> +                pudp = pud_offset(pgdp, addr);
> +                /*
> +                 * Some other thread might already have migrated back the entry
> +                 * and freed the page table. Unlikely thought.
> +                 */
> +                if (unlikely(!pudp)) {
> +                        addr = min((addr + PUD_SIZE) & PUD_MASK, end);
> +                        continue;
> +                }
> +                pmdp = pmd_offset(pudp, addr);
> +                if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
> +                             pmd_trans_huge(*pmdp))) {
> +                        addr = min((addr + PMD_SIZE) & PMD_MASK, end);
> +                        continue;
> +                }
> +                ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +                for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
> +                     next = min((addr + PMD_SIZE) & PMD_MASK, end);
> +                     addr < next; addr += PAGE_SIZE, ptep++, i++) {
> +                        swp_entry_t entry;
> +
> +                        entry = pte_to_swp_entry(*ptep);
> +                        if (pte_none(*ptep) || pte_present(*ptep) ||
> +                            !is_hmm_entry(entry) ||
> +                            is_hmm_entry_locked(entry))
> +                                continue;
> +
> +                        set_pte_at(mm, addr, ptep, hmm_entry);
> +                        new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
> +                                                   vma->vm_page_prot));
> +                }
> +                pte_unmap_unlock(ptep - 1, ptl);

I guess this is fixing up all the PTEs in the CPU page table that map
this pmd entry. But then what is the code below for?
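
For what it is worth, my reading of this first loop is below, written as
the kind of comment I would like to see above it (this is my guess at
the intent, not something taken from the patch):

                /*
                 * First pass, under the pte lock: for every non-present,
                 * unlocked HMM swap entry in this pmd range, replace it
                 * with the locked HMM entry so that nobody else migrates
                 * it, and stage a special zero-pfn pte in new_pte[i] to
                 * remember that this slot still needs a real page.
                 */

If that is correct, please spell it out in the code.
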
> +
> +                for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> +                     addr < next; addr += PAGE_SIZE, i++) {

Your use of the variable addr, with multiple loops updating it, also
makes this hard to follow. We should definitely add more comments here.
I guess we are going through the same range we iterated over above?

> +                        struct mem_cgroup *memcg;
> +                        struct page *page;
> +
> +                        if (!pte_present(new_pte[i]))
> +                                continue;

What is that check for? We set that using pte_mkspecial() above?

> +
> +                        page = alloc_zeroed_user_highpage_movable(vma, addr);
> +                        if (!page) {
> +                                ret = -ENOMEM;
> +                                break;
> +                        }
> +                        __SetPageUptodate(page);
> +                        if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
> +                                                  &memcg)) {
> +                                page_cache_release(page);
> +                                ret = -ENOMEM;
> +                                break;
> +                        }
> +                        /*
> +                         * We can safely reuse the s_mem/mapping field of page
> +                         * struct to store the memcg as the page is only seen
> +                         * by HMM at this point and we can clear it before it
> +                         * is public see mm_hmm_migrate_back_cleanup().
> +                         */
> +                        page->s_mem = memcg;
> +                        new_pte[i] = mk_pte(page, vma->vm_page_prot);
> +                        if (vma->vm_flags & VM_WRITE) {
> +                                new_pte[i] = pte_mkdirty(new_pte[i]);
> +                                new_pte[i] = pte_mkwrite(new_pte[i]);
> +                        }

Why mark the pte dirty just because the vma has VM_WRITE?

> +                }
> +
> +                if (!ret)
> +                        continue;
> +
> +                hmm_entry = swp_entry_to_pte(make_hmm_entry());
> +                ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);

Again we loop through the same range?

> +                for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
> +                     addr < next; addr += PAGE_SIZE, ptep++, i++) {
> +                        unsigned long pfn = pte_pfn(new_pte[i]);
> +
> +                        if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
> +                                continue;

What is that check for?

> +
> +                        set_pte_at(mm, addr, ptep, hmm_entry);
> +                        pte_clear(mm, addr, &new_pte[i]);

What is that pte_clear() for? The handling of new_pte needs more code
comments.

> +                }
> +                pte_unmap_unlock(ptep - 1, ptl);
> +                break;
> +        }
> +        return ret;
> +}
> +EXPORT_SYMBOL(mm_hmm_migrate_back);
> +

-aneesh