Date: Fri, 21 Oct 2022 14:15:26 +0800
Subject: Re: [PATCH 2/2] mm: migrate: Try again if THP split is failed due to page refcnt
To: Yang Shi
Cc: "Huang, Ying", akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com, jingshan@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References:
 <87mt9qnbrf.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Baolin Wang

On 10/21/2022 3:21 AM, Yang Shi wrote:
> On Thu, Oct 20, 2022 at 2:33 AM Baolin Wang wrote:
>>
>> On 10/20/2022 4:24 PM, Huang, Ying wrote:
>>> Baolin Wang writes:
>>>
>>>> When creating a virtual machine, we will use memfd_create() to get
>>>> a file descriptor which can be used to create shared memory mappings
>>>> using the mmap function, meanwhile the mmap() will set the MAP_POPULATE
>>>> flag to allocate physical pages for the virtual machine.
>>>>
>>>> When allocating physical pages for the guest, the host can fallback to
>>>> allocate some CMA pages for the guest when over half of the zone's free
>>>> memory is in the CMA area.
>>>>
>>>> In guest os, when the application wants to do some data transaction with
>>>> DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
>>>> create IOMMU mappings for the DMA pages. However, when calling
>>>> VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
>>>> failed to longterm-pin sometimes.
>>>>
>>>> After some investigation, we found the pages used to do DMA mapping can
>>>> contain some CMA pages, and these CMA pages will cause a possible
>>>> failure of the longterm-pin, due to failed to migrate the CMA pages.
>>>> The reason of migration failure may be temporary reference count or
>>>> memory allocation failure. So that will cause the VFIO_IOMMU_MAP_DMA
>>>> ioctl returns error, which makes the application failed to start.
>>>>
>>>> I observed one migration failure case (which is not easy to reproduce) is
>>>> that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
>>>> count is also 1.
>>>>
>>>> That means when migrating a THP which is in CMA area, but can not allocate
>>>> a new THP due to memory fragmentation, so it will split the THP. However
>>>> THP split is also failed, probably the reason is temporary reference count
>>>> of this THP. And the temporary reference count can be caused by dropping
>>>> page caches (I observed the drop caches operation in the system), but we
>>>> can not drop the shmem page caches due to they are already dirty at that time.
>>>>
>>>> Especially for THP split failure, which is caused by temporary reference
>>>> count, we can try again to mitigate the failure of migration in this case
>>>> according to previous discussion [1].
>>>
>>> Does the patch solved your problem?
>>
>> The problem is not easy to reproduce and I will test this patch on our
>> products. However I think this is a likely case to fail the migration,
>> which need to be addressed to mitigate the failure.
>
> You may try to trace all migrations across your fleet (or just pick
> some sample machines, this should make data analysis easier) and
> filter the migration by reasons, for example, MR_LONGTERM_PIN, then
> compare the migration success rate before and after the patch. It
> should be a good justification. But it may need some work on data
> aggregation, process and analysis, not sure how feasible it is.

IMO migration for MR_LONGTERM_PIN is very rare in this case, so a
longterm-pin migration failure is easy to notice: once it happens, the
application is aborted. However, like I said before, the problem is not
easy to reproduce :(

Anyway, we'll test these 2 patches on our products.

>>>> [1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
>>>> Signed-off-by: Baolin Wang
>>>> ---
>>>>  mm/huge_memory.c |  4 ++--
>>>>  mm/migrate.c     | 18 +++++++++++++++---
>>>>  2 files changed, 17 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index ad17c8d..a79f03b 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2666,7 +2666,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>>>  	 * split PMDs
>>>>  	 */
>>>>  	if (!can_split_folio(folio, &extra_pins)) {
>>>> -		ret = -EBUSY;
>>>> +		ret = -EAGAIN;
>>>>  		goto out_unlock;
>>>>  	}
>>>>
>>>> @@ -2716,7 +2716,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>>>  			xas_unlock(&xas);
>>>>  		local_irq_enable();
>>>>  		remap_page(folio, folio_nr_pages(folio));
>>>> -		ret = -EBUSY;
>>>> +		ret = -EAGAIN;
>>>>  	}
>>>>
>>>>  out_unlock:
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 8e5eb6e..55c7855 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1506,9 +1506,21 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>>>>  				if (is_thp) {
>>>>  					nr_thp_failed++;
>>>>  					/* THP NUMA faulting doesn't split THP to retry. */
>>>> -					if (!nosplit && !try_split_thp(page, &thp_split_pages)) {
>>>> -						nr_thp_split++;
>>>> -						break;
>>>> +					if (!nosplit) {
>>>> +						rc = try_split_thp(page, &thp_split_pages);
>>>> +						if (!rc) {
>>>> +							nr_thp_split++;
>>>> +							break;
>>>> +						} else if (reason == MR_LONGTERM_PIN &&
>>>> +							   rc == -EAGAIN) {
>>>
>>> In case reason != MR_LONGTERM_PIN, you change the return value of
>>> migrate_pages(). So you need to use another variable for return value.
>>
>> Good catch, will fix in next version. Thanks for your comments.
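
For illustration only, the point about the return value can be sketched in
plain user-space C. This is not the kernel code and not the next version of
the patch: handle_thp_failure_sketch(), enum reason_sketch and struct
counters_sketch are hypothetical names invented for the sketch. It only
shows the idea of keeping the split result in its own variable (split_rc),
so that the migration return code (rc) is left unchanged whether or not
the reason is a longterm pin.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's migration reason. */
enum reason_sketch { REASON_OTHER, REASON_LONGTERM_PIN };

/* Hypothetical counters, loosely modelled on the ones in the quoted hunk. */
struct counters_sketch {
	int nr_thp_split; /* THPs that were split successfully */
	int nr_retry;     /* THPs queued for another attempt   */
};

/*
 * Handle a THP whose migration failed with 'rc'.
 *
 * 'split_rc' is the result of the split attempt and lives in a separate
 * variable, so 'rc' is returned untouched in every branch. Only a
 * longterm-pin migration whose split failed with -EAGAIN is retried.
 */
static int handle_thp_failure_sketch(int rc, int split_rc,
				     enum reason_sketch reason,
				     struct counters_sketch *c)
{
	assert(c != NULL);

	if (split_rc == 0)
		c->nr_thp_split++;
	else if (reason == REASON_LONGTERM_PIN && split_rc == -EAGAIN)
		c->nr_retry++;

	return rc; /* never overwritten by the split result */
}
```

With rc = -ENOMEM and split_rc = -EAGAIN, the sketch returns -ENOMEM for
both reasons and only bumps nr_retry for REASON_LONGTERM_PIN.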