From: Weixi Zhu
Subject: [RFC PATCH 6/6] mm/gmem: extending Linux core MM to support unified virtual address space
Date: Tue, 28 Nov 2023 20:50:25 +0800
Message-ID: <20231128125025.4449-7-weixi.zhu@huawei.com>
In-Reply-To: <20231128125025.4449-1-weixi.zhu@huawei.com>
References: <20231128125025.4449-1-weixi.zhu@huawei.com>
X-Mailer: git-send-email 2.25.1
X-Mailing-List: linux-kernel@vger.kernel.org

This patch extends Linux core MM to support a unified virtual address space. A unified virtual address space provides a coherent view of memory for the CPU and devices. This is achieved by maintaining coherent page tables for the CPU and any attached devices for each process, without assuming that the underlying interconnect between the CPU and a peripheral device is cache-coherent.

Specifically, for each mm_struct that has one or more device computing contexts attached, a per-process logical page table is utilized to track the mapping status of anonymous memory allocated via mmap(MAP_PRIVATE | MAP_PEER_SHARED). The CPU page fault handling path is modified to examine whether a faulted virtual page has already been faulted elsewhere, e.g. on a device, by looking up the logical page table in vm_object.
If so, the core MM should orchestrate a page migration to prepare the CPU physical page instead of zero-filling it. This is achieved by invoking gm_host_fault_locked(). The logical page table must also be updated whenever the CPU page table is modified. Ideally, the logical page table should always be looked up or modified before the CPU page table is changed, but the current implementation does the reverse. Also, the current implementation only considers anonymous memory, while a device may want to operate on a disk file directly via mmap(fd). In the future, the logical page table is planned to play a more generic role for anonymous memory, folios/huge pages and file-backed memory, as well as to provide a clean abstraction for CPU page table functions (including these stage-2 functions). Furthermore, the page fault handler path will be enhanced to deal with cache-coherent buses as well, since it may be desirable for devices to operate on sparse data remotely instead of migrating data at page granularity.
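As a rough userspace sketch of the fault-path decision described above (this is not the kernel's gm_mapping API; logical_pt, handle_cpu_fault and the state constants are illustrative stand-ins), the CPU fault handler first consults the per-process logical page table and only zero-fills when the page has never been faulted anywhere:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the per-process logical page table. */
enum gm_state { GM_NONE = 0, GM_PAGE_CPU, GM_PAGE_DEVICE };

enum fault_action {
	FAULT_ZERO_FILL,           /* first touch anywhere: zero-fill */
	FAULT_REUSE_CPU_PAGE,      /* already faulted on the CPU: reuse */
	FAULT_MIGRATE_FROM_DEVICE, /* valid data on a device: migrate, no zero-fill */
};

#define NPAGES 16

struct logical_pt {
	enum gm_state state[NPAGES]; /* one entry per virtual page */
};

/* Decide what the CPU page fault handler should do for page `idx`. */
enum fault_action handle_cpu_fault(struct logical_pt *pt, size_t idx)
{
	switch (pt->state[idx]) {
	case GM_PAGE_CPU:
		/* Mapping already backed by a CPU page: skip zero-filling. */
		return FAULT_REUSE_CPU_PAGE;
	case GM_PAGE_DEVICE:
		/* Page was faulted on a device; migrate its contents back. */
		pt->state[idx] = GM_PAGE_CPU;
		return FAULT_MIGRATE_FROM_DEVICE;
	default:
		/* Never faulted anywhere: ordinary zero-filled anon page. */
		pt->state[idx] = GM_PAGE_CPU;
		return FAULT_ZERO_FILL;
	}
}
```

Note that the sketch updates the logical table as part of taking the decision; as stated above, the current patch instead updates it after the CPU page table has been modified, which is the ordering the series intends to fix.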
Signed-off-by: Weixi Zhu
---
 kernel/fork.c    |  1 +
 mm/huge_memory.c | 85 +++++++++++++++++++++++++++++++++++++++++++-----
 mm/memory.c      | 42 +++++++++++++++++++++---
 mm/mmap.c        |  2 ++
 mm/oom_kill.c    |  2 ++
 mm/vm_object.c   | 84 +++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 203 insertions(+), 13 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index eab96cdb25a6..06130c73bf2e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -543,6 +543,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	free_gm_mappings(vma);
 #ifdef CONFIG_PER_VMA_LOCK
 	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
 #else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4f542444a91f..590000f63f04 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -684,6 +685,10 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	pgtable_t pgtable;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
 	vm_fault_t ret = 0;
+	struct gm_mapping *gm_mapping = NULL;
+
+	if (vma_is_peer_shared(vma))
+		gm_mapping = vm_object_lookup(vma->vm_mm->vm_obj, haddr);
 
 	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -691,7 +696,8 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		folio_put(folio);
 		count_vm_event(THP_FAULT_FALLBACK);
 		count_vm_event(THP_FAULT_FALLBACK_CHARGE);
-		return VM_FAULT_FALLBACK;
+		ret = VM_FAULT_FALLBACK;
+		goto gm_mapping_release;
 	}
 	folio_throttle_swaprate(folio, gfp);
@@ -701,7 +707,14 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		goto release;
 	}
 
-	clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
+	/*
+	 * Skip zero-filling page if the logical mapping indicates
+	 * that page contains valid data of the virtual address. This
+	 * could happen if the page was a victim of device memory
+	 * oversubscription.
+	 */
+	if (!(vma_is_peer_shared(vma) && gm_mapping_cpu(gm_mapping)))
+		clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
 
 	/*
 	 * The memory barrier inside __folio_mark_uptodate makes sure that
 	 * clear_huge_page writes become visible before the set_pmd_at()
@@ -726,7 +739,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 			pte_free(vma->vm_mm, pgtable);
 			ret = handle_userfault(vmf, VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
-			return ret;
+			goto gm_mapping_release;
 		}
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
@@ -734,6 +747,13 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		folio_add_new_anon_rmap(folio, vma, haddr);
 		folio_add_lru_vma(folio, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		if (vma_is_peer_shared(vma) && gm_mapping_device(gm_mapping)) {
+			vmf->page = page;
+			ret = gm_host_fault_locked(vmf, PMD_ORDER);
+			if (ret)
+				goto unlock_release;
+		}
+
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -741,6 +761,11 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		spin_unlock(vmf->ptl);
 		count_vm_event(THP_FAULT_ALLOC);
 		count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
+		if (vma_is_peer_shared(vma)) {
+			gm_mapping_flags_set(gm_mapping, GM_PAGE_CPU);
+			gm_mapping->page = page;
+			mutex_unlock(&gm_mapping->lock);
+		}
 	}
 
 	return 0;
@@ -750,6 +775,9 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	if (pgtable)
 		pte_free(vma->vm_mm, pgtable);
 	folio_put(folio);
+gm_mapping_release:
+	if (vma_is_peer_shared(vma))
+		mutex_unlock(&gm_mapping->lock);
 	return ret;
 }
 
@@ -808,17 +836,41 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	gfp_t gfp;
-	struct folio *folio;
+	struct folio *folio = NULL;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	vm_fault_t ret = 0;
+	struct gm_mapping *gm_mapping;
+
+	if (vma_is_peer_shared(vma)) {
+		struct vm_object *vm_obj = vma->vm_mm->vm_obj;
 
-	if (!transhuge_vma_suitable(vma, haddr))
-		return VM_FAULT_FALLBACK;
-	if (unlikely(anon_vma_prepare(vma)))
-		return VM_FAULT_OOM;
+		xa_lock(vm_obj->logical_page_table);
+		gm_mapping = vm_object_lookup(vm_obj, haddr);
+		if (!gm_mapping) {
+			vm_object_mapping_create(vm_obj, haddr);
+			gm_mapping = vm_object_lookup(vm_obj, haddr);
+		}
+		xa_unlock(vm_obj->logical_page_table);
+		mutex_lock(&gm_mapping->lock);
+		if (unlikely(!pmd_none(*vmf->pmd))) {
+			mutex_unlock(&gm_mapping->lock);
+			goto gm_mapping_release;
+		}
+	}
+
+	if (!transhuge_vma_suitable(vma, haddr)) {
+		ret = VM_FAULT_FALLBACK;
+		goto gm_mapping_release;
+	}
+	if (unlikely(anon_vma_prepare(vma))) {
+		ret = VM_FAULT_OOM;
+		goto gm_mapping_release;
+	}
 	khugepaged_enter_vma(vma, vma->vm_flags);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 	    !mm_forbids_zeropage(vma->vm_mm) &&
+	    !vma_is_peer_shared(vma) &&
 	    transparent_hugepage_use_zero_page()) {
 		pgtable_t pgtable;
 		struct page *zero_page;
@@ -857,12 +909,27 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		return ret;
 	}
 	gfp = vma_thp_gfp_mask(vma);
+
+	if (vma_is_peer_shared(vma) && gm_mapping_cpu(gm_mapping))
+		folio = page_folio(gm_mapping->page);
+	if (!folio) {
+		if (vma_is_peer_shared(vma))
+			gfp = GFP_TRANSHUGE;
+		folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
+	}
-	folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
+
 	if (unlikely(!folio)) {
 		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
+		ret = VM_FAULT_FALLBACK;
+		goto gm_mapping_release;
 	}
 	return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
+
+gm_mapping_release:
+	if (vma_is_peer_shared(vma))
+		mutex_unlock(&gm_mapping->lock);
+	return ret;
 }
 
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..d6cc278dc39b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -78,6 +78,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -1695,8 +1696,10 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 				__unmap_hugepage_range(tlb, vma, start, end,
 						       NULL, zap_flags);
 			}
-		} else
+		} else {
 			unmap_page_range(tlb, vma, start, end, details);
+			unmap_gm_mappings_range(vma, start, end);
+		}
 	}
 }
@@ -4126,7 +4129,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
 	bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
 	struct vm_area_struct *vma = vmf->vma;
-	struct folio *folio;
+	struct gm_mapping *gm_mapping;
+	bool skip_put_page = false;
+	struct folio *folio = NULL;
 	vm_fault_t ret = 0;
 	pte_t entry;
@@ -4141,8 +4146,25 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (pte_alloc(vma->vm_mm, vmf->pmd))
 		return VM_FAULT_OOM;
 
+	if (vma_is_peer_shared(vma)) {
+		xa_lock(vma->vm_mm->vm_obj->logical_page_table);
+		gm_mapping = vm_object_lookup(vma->vm_mm->vm_obj, vmf->address);
+		if (!gm_mapping) {
+			vm_object_mapping_create(vma->vm_mm->vm_obj, vmf->address);
+			gm_mapping = vm_object_lookup(vma->vm_mm->vm_obj, vmf->address);
+		}
+		xa_unlock(vma->vm_mm->vm_obj->logical_page_table);
+		mutex_lock(&gm_mapping->lock);
+
+		if (gm_mapping_cpu(gm_mapping)) {
+			folio = page_folio(gm_mapping->page);
+			skip_put_page = true;
+		}
+	}
+
 	/* Use the zero-page for reads */
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
+	    !vma_is_peer_shared(vma) &&
 	    !mm_forbids_zeropage(vma->vm_mm)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
 						vma->vm_page_prot));
@@ -4168,7 +4190,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+	if (!folio)
+		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
 	if (!folio)
 		goto oom;
@@ -4211,6 +4234,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	folio_add_new_anon_rmap(folio, vma, vmf->address);
 	folio_add_lru_vma(folio, vma);
+	if (vma_is_peer_shared(vma)) {
+		if (gm_mapping_device(gm_mapping)) {
+			vmf->page = &folio->page;
+			gm_host_fault_locked(vmf, 0);
+		}
+		gm_mapping_flags_set(gm_mapping, GM_PAGE_CPU);
+		gm_mapping->page = &folio->page;
+	}
 setpte:
 	if (uffd_wp)
 		entry = pte_mkuffd_wp(entry);
@@ -4221,9 +4252,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
+	if (vma_is_peer_shared(vma))
+		mutex_unlock(&gm_mapping->lock);
 	return ret;
 release:
-	folio_put(folio);
+	if (!skip_put_page)
+		folio_put(folio);
 	goto unlock;
 oom_free_page:
 	folio_put(folio);
diff --git a/mm/mmap.c b/mm/mmap.c
index 55d43763ea49..8b8faa007dbc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2616,6 +2616,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 #endif
 	} for_each_vma_range(*vmi, next, end);
 
+	munmap_in_peer_devices(mm, start, end);
+
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
 	{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e6071fde34a..31ec027e98c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "internal.h"
@@ -547,6 +548,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 				continue;
 			}
 			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
+			unmap_gm_mappings_range(vma, range.start, range.end);
 			mmu_notifier_invalidate_range_end(&range);
 			tlb_finish_mmu(&tlb);
 		}
diff --git a/mm/vm_object.c b/mm/vm_object.c
index 5432930d1226..e0d1b558df31 100644
--- a/mm/vm_object.c
+++ b/mm/vm_object.c
@@ -142,6 +142,9 @@ void free_gm_mappings(struct vm_area_struct *vma)
 	struct gm_mapping *gm_mapping;
 	struct vm_object *obj;
 
+	if (vma_is_peer_shared(vma))
+		return;
+
 	obj = vma->vm_mm->vm_obj;
 	if (!obj)
 		return;
@@ -223,3 +226,84 @@ void gm_release_vma(struct mm_struct *mm, struct list_head *head)
 		kfree(node);
 	}
 }
+
+static int munmap_in_peer_devices_inner(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long start, unsigned long end,
+					int page_size)
+{
+	struct vm_object *obj = mm->vm_obj;
+	struct gm_mapping *gm_mapping;
+	struct gm_fault_t gmf = {
+		.mm = mm,
+		.copy = false,
+	};
+	int ret;
+
+	start = start > vma->vm_start ? start : vma->vm_start;
+	end = end < vma->vm_end ? end : vma->vm_end;
+
+	for (; start < end; start += page_size) {
+		xa_lock(obj->logical_page_table);
+		gm_mapping = vm_object_lookup(obj, start);
+		if (!gm_mapping) {
+			xa_unlock(obj->logical_page_table);
+			continue;
+		}
+		xa_unlock(obj->logical_page_table);
+
+		mutex_lock(&gm_mapping->lock);
+		if (!gm_mapping_device(gm_mapping)) {
+			mutex_unlock(&gm_mapping->lock);
+			continue;
+		}
+
+		gmf.va = start;
+		gmf.size = page_size;
+		gmf.dev = gm_mapping->dev;
+		ret = gm_mapping->dev->mmu->peer_unmap(&gmf);
+		if (ret != GM_RET_SUCCESS) {
+			pr_err("%s: call dev peer_unmap error %d\n", __func__,
+			       ret);
+			mutex_unlock(&gm_mapping->lock);
+			continue;
+		}
+		mutex_unlock(&gm_mapping->lock);
+	}
+
+	return 0;
+}
+
+void munmap_in_peer_devices(struct mm_struct *mm, unsigned long start,
+			    unsigned long end)
+{
+	struct vm_object *obj = mm->vm_obj;
+	struct vm_area_struct *vma;
+
+	if (!gmem_is_enabled())
+		return;
+
+	if (!obj)
+		return;
+
+	if (!mm->gm_as)
+		return;
+
+	mmap_read_lock(mm);
+	do {
+		vma = find_vma_intersection(mm, start, end);
+		if (!vma) {
+			pr_debug("gmem: there is no valid vma\n");
+			break;
+		}
+
+		if (!vma_is_peer_shared(vma)) {
+			pr_debug("gmem: not peer-shared vma, skip dontneed\n");
+			start = vma->vm_end;
+			continue;
+		}
+
+		munmap_in_peer_devices_inner(mm, vma, start, end, HPAGE_SIZE);
+	} while (start < end);
+	mmap_read_unlock(mm);
+}
-- 
2.25.1