Message-ID: <42733616-5f8f-47ce-a861-b00701069221@arm.com>
Date: Wed, 8 May 2024 14:41:54 +0100
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries
To: Kefeng Wang, Yang Shi
Cc: David Hildenbrand, Matthew Wilcox, Yang Shi, riel@surriel.com,
 cl@linux.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Ze Zuo
References: <20231214223423.1133074-1-yang@os.amperecomputing.com>
 <1e8f5ac7-54ce-433a-ae53-81522b2320e1@arm.com>
 <1dc9a561-55f7-4d65-8b86-8a40fa0e84f9@arm.com>
 <6016c0e9-b567-4205-8368-1f1c76184a28@huawei.com>
 <2c14d9ad-c5a3-4f29-a6eb-633cdf3a5e9e@redhat.com>
 <2b403705-a03c-4cfe-8d95-b38dd83fca52@arm.com>
 <281aebf1-0bff-4858-b479-866eb05b9e94@huawei.com>
 <219cb8e3-a77b-468b-9d69-0b3e386f93f6@arm.com>
 <7d8c43b6-b1ef-428e-9d6a-1c26284feb26@huawei.com>
From: Ryan Roberts
In-Reply-To: <7d8c43b6-b1ef-428e-9d6a-1c26284feb26@huawei.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 08/05/2024 14:37, Kefeng Wang wrote:
>
>
> On 2024/5/8 16:36, Ryan Roberts wrote:
>> On 08/05/2024 08:48, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/5/8 1:17, Yang Shi wrote:
>>>> On Tue, May 7, 2024 at 8:53 AM Ryan Roberts wrote:
>>>>>
>>>>> On 07/05/2024 14:53, Kefeng Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/5/7 19:13, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>>>>>>
>>>>>>>>> suggest. If you want to try something semi-randomly; it might be
>>>>>>>>> useful to rule out the arm64 contpte feature. I don't see how that
>>>>>>>>> would be interacting here if mTHP is disabled (is it?). But it's new
>>>>>>>>> for 6.9 and arm64 only. Disable with ARM64_CONTPTE (needs EXPERT) at
>>>>>>>>> compile time.
>>>>>>>> I haven't enabled mTHP, so it should not be related to ARM64_CONTPTE,
>>>>>>>> but I will give it a try.
>>>>>>
>>>>>> With ARM64_CONTPTE disabled, memory read latency is similar to that with
>>>>>> ARM64_CONTPTE enabled (the default in 6.9-rc7), and still higher than
>>>>>> with the anon alignment patch reverted.
>>>>>
>>>>> OK thanks for trying.
>>>>>
>>>>> Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front
>>>>> and using that for all sizes. That will presumably be considered "large"
>>>>> by malloc and will be allocated using mmap. So with the patch, it will be
>>>>> 2M aligned. Without it, it probably won't.
>>>>> I'm still struggling to understand why not aligning it in virtual space
>>>>> would make it more performant though...
>>>>
>>>> Yeah, I'm confused too.
>>> Me too. I captured smaps[_rollup] for the 0.09375M size; the biggest
>>> difference for anon is shown below, and everything is attached.
>>
>> OK, a bit more insight; during initialization, the test makes 2 big malloc
>> calls; the first is 1M and the second is 512M+8K. I think those 2 are the 2 vmas
>> below (malloc is adding an extra page to the allocation, presumably for
>> management structures).
>>
>> With efa7df3e3bb5 applied, the 1M allocation is allocated at a non-THP-aligned
>> address. All of its pages are populated (see permutation() which allocates and
>> writes it) but none of them are THP (obviously - it's only 1M and THP is only
>> enabled for 2M). But the 512M region is allocated at a THP-aligned address. And
>> the first page is populated with a THP (presumably faulted when malloc writes to
>> its control structure page before the application even sees the allocated
>> buffer).
>>
>> In contrast, when efa7df3e3bb5 is reverted, neither of the vmas are THP-aligned,
>> and therefore the 512M region abuts the 1M region and the vmas are merged in
>> the kernel. So we end up with the single 525328 kB region. There are no THPs
>> allocated here (due to alignment constraints) so we end up with the 1M region
>> fully populated with 4K pages as before, and only the malloc control page plus
>> the parts of the buffer that the application actually touches being populated in
>> the 512M region.
>>
>> As far as I can tell, the application never touches the 1M region during the
>> test so it should be cache-cold. It only touches the first part of the 512M
>> buffer it needs for the size of the test (96K here?).
>> The latency of allocating the THP will have been consumed during test setup
>> so I doubt we are seeing that in the test results and I don't see why having
>> a single TLB entry vs 96K/4K=24 entries would make it slower.
>
> It is strange, and stranger still: I tried another machine (the old machine
> has 128 cores and the new machine 96 cores, but with the same per-core L1/L2
> cache size), and the new machine does not show this issue. I will contact our
> hardware team; maybe there is some difference in configuration (prefetch or
> some other similar hardware setting). Thanks for all the suggestions and
> analysis!

No problem, you're welcome!

>
>
>>
>> It would be interesting to know the address that gets returned from malloc for
>> the 512M region if that's possible to get (in both cases)? I guess it is offset
>> into the first page. Perhaps it is offset such that with the THP alignment case
>> the 96K of interest ends up straddling 3 cache lines (cache line is 64K I
>> assume?), but for the unaligned case, it ends up nicely packed in 2?
>
> CC zuoze, please help to check this.
>
> Thanks again.
>>
>> Thanks,
>> Ryan
>>
>>>
>>> 1) with efa7df3e3bb5 smaps
>>>
>>> ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
>>> Size:             524300 kB
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Rss:                2048 kB
>>> Pss:                2048 kB
>>> Pss_Dirty:          2048 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      2048 kB
>>> Referenced:         2048 kB
>>> Anonymous:          2048 kB // we have 1 anon thp
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:      2048 kB
>>
>> Yes, one 2M THP shown here.
>>
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>> THPeligible:           1
>>> VmFlags: rd wr mr mw me ac
>>> ffff88eff000-ffff89000000 rw-p 00000000 00:00 0
>>> Size:               1028 kB
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Rss:                1028 kB
>>> Pss:                1028 kB
>>> Pss_Dirty:          1028 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      1028 kB
>>> Referenced:         1028 kB
>>> Anonymous:          1028 kB // another large anon
>>
>> This is not THP, since you only have 2M THP enabled. This will be 1M of 4K page
>> allocations + 1 4K page malloc control structure, allocated and accessed by
>> permutation() during test setup.
>>
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:         0 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>> THPeligible:           0
>>> VmFlags: rd wr mr mw me ac
>>>
>>> and the smaps_rollup
>>>
>>> 00400000-fffff56bd000 ---p 00000000 00:00 0 [rollup]
>>> Rss:                4724 kB
>>> Pss:                3408 kB
>>> Pss_Dirty:          3338 kB
>>> Pss_Anon:           3338 kB
>>> Pss_File:             70 kB
>>> Pss_Shmem:             0 kB
>>> Shared_Clean:       1176 kB
>>> Shared_Dirty:        420 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      3128 kB
>>> Referenced:         4344 kB
>>> Anonymous:          3548 kB
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:      2048 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>>
>>> 2) without efa7df3e3bb5 smaps
>>>
>>> ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0
>>> Size:             525328 kB
>>
>> This is a merged-vma version of the above 2 regions.
>>
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Rss:                1128 kB
>>> Pss:                1128 kB
>>> Pss_Dirty:          1128 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      1128 kB
>>> Referenced:         1128 kB
>>> Anonymous:          1128 kB // only large anon
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:         0 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>> THPeligible:           1
>>> VmFlags: rd wr mr mw me ac
>>>
>>> and the smaps_rollup,
>>>
>>> 00400000-ffffca5dc000 ---p 00000000 00:00 0 [rollup]
>>> Rss:                2600 kB
>>> Pss:                1472 kB
>>> Pss_Dirty:          1388 kB
>>> Pss_Anon:           1388 kB
>>> Pss_File:             84 kB
>>> Pss_Shmem:             0 kB
>>> Shared_Clean:       1000 kB
>>> Shared_Dirty:        424 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      1176 kB
>>> Referenced:         2220 kB
>>> Anonymous:          1600 kB
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:         0 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>>
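
The sizes and alignment discussed in this thread can be cross-checked from the address ranges in the smaps dumps. What follows is only a minimal sketch, assuming 2M PMD-sized THP and 4K base pages as stated in the thread; the helper names describe_vma and base_pages are hypothetical and not part of lmbench or any kernel tool.

```python
THP_SIZE = 2 * 1024 * 1024   # assumption: 2M THP, the only size enabled in the thread
PAGE_SIZE = 4 * 1024         # assumption: 4K base pages (KernelPageSize in the dumps)


def describe_vma(range_str):
    """Return (size_in_kB, start_is_2M_aligned) for a 'start-end' smaps range."""
    start_s, end_s = range_str.split("-")
    start, end = int(start_s, 16), int(end_s, 16)
    return (end - start) // 1024, start % THP_SIZE == 0


def base_pages(nbytes):
    """Number of 4K pages needed to cover nbytes (rounded up)."""
    return (nbytes + PAGE_SIZE - 1) // PAGE_SIZE


if __name__ == "__main__":
    # Address ranges copied from the smaps output quoted above.
    vmas = [
        ("with patch, 512M+8K buffer", "ffff68e00000-ffff88e03000"),
        ("with patch, 1M buffer", "ffff88eff000-ffff89000000"),
        ("reverted, merged vma", "ffff9845b000-ffffb855f000"),
    ]
    for label, rng in vmas:
        size_kb, aligned = describe_vma(rng)
        print(f"{label}: {size_kb} kB, 2M-aligned start: {aligned}")

    # The 96K working set mentioned in the thread, expressed in 4K pages.
    print("4K pages covering 96K:", base_pages(96 * 1024))
```

Only the first range starts on a 2M boundary, and the two vma sizes from the patched kernel (524300 kB and 1028 kB) add up to the 525328 kB of the merged vma seen with the patch reverted, which is consistent with the analysis quoted above.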