Date: Wed, 10 Feb 2021 17:57:13 +0100
From: Michal Hocko
To: Vlastimil Babka
Cc: Milan Broz, linux-mm@kvack.org, Linux Kernel Mailing List, Mikulas Patocka
Subject: Re: Very slow unlockall()
References: <70885d37-62b7-748b-29df-9e94f3291736@gmail.com>
 <20210108134140.GA9883@dhcp22.suse.cz>
 <9474cd07-676a-56ed-1942-5090e0b9a82f@suse.cz>
 <6eebb858-d517-b70d-9202-f4e84221ed89@suse.cz>
 <273db3a6-28b1-6605-1743-ef86e7eb2b72@suse.cz>
In-Reply-To: <273db3a6-28b1-6605-1743-ef86e7eb2b72@suse.cz>

On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> On 2/1/21 8:19 PM, Milan Broz wrote:
> > On 01/02/2021 19:55, Vlastimil Babka wrote:
> >> On 2/1/21 7:00 PM, Milan Broz wrote:
> >>> On 01/02/2021 14:08, Vlastimil Babka wrote:
> >>>> On 1/8/21 3:39 PM, Milan Broz wrote:
> >>>>> On 08/01/2021 14:41, Michal Hocko wrote:
> >>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code,
> >>>>>>> and someone tried to use it with a hardened memory allocator library.
> >>>>>>>
> >>>>>>> Execution time increased to extremes (minutes) and, as we found, the problem
> >>>>>>> is in munlockall().
> >>>>>>>
> >>>>>>> Here is a plain reproducer for the core without any external code - unlocking
> >>>>>>> takes more than 30 seconds on a Fedora rawhide kernel!
> >>>>>>> I can reproduce it on 5.10 kernels and Linus' git.
> >>>>>>>
> >>>>>>> The reproducer below tries to mmap a large amount of memory with PROT_NONE (later never used).
> >>>>>>> The real code of course does something more useful, but the problem is the same.
> >>>>>>>
> >>>>>>> #include <stdio.h>
> >>>>>>> #include <stdlib.h>
> >>>>>>> #include <unistd.h>
> >>>>>>> #include <sys/mman.h>
> >>>>>>>
> >>>>>>> int main (int argc, char *argv[])
> >>>>>>> {
> >>>>>>>         void *p = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >>
> >> So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
> >> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
> >> PROT_WRITE there, the mlockall() starts taking ages.
> >>
> >> So does that reflect your use case? munlockall() with large PROT_NONE areas? If
> >> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
> >> expect such a scenario to be uncommon, so better clarify first.
> >
> > It is just a simple reproducer of the underlying problem, as suggested here
> > https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
> >
> > We use mlockall() in cryptsetup, and with hardened malloc it slows down unlocking significantly.
> > (For the real-world problem, please read the whole issue report above.)
>
> OK, finally read through the bug report, and learned two things:
>
> 1) the PROT_NONE is indeed an intentional part of the reproducer
> 2) Linux mailing lists still have a bad reputation and people avoid them. That's
> sad :( Well, thanks for overcoming that :)
>
> Daniel there says "I think the Linux kernel implementation of mlockall is quite
> broken and tries to lock all the reserved PROT_NONE regions in advance which
> doesn't make any sense."
>
> From my testing this doesn't seem to be the case, as the mlockall() part is very
> fast, so I don't think it faults in and mlocks PROT_NONE areas. It only starts
> to be slow when changed to PROT_READ|PROT_WRITE. But the munlockall() part is
> slow even with PROT_NONE, as we don't skip the PROT_NONE areas there. We probably
> can't just skip them, as they might actually contain mlocked pages if those were
> faulted first with PROT_READ/PROT_WRITE and only then changed to PROT_NONE.

The mlock code is quite easy to misunderstand, but IIRC the mlock part should be
rather straightforward. It will mark VMAs as locked, do some merging/splitting
where appropriate and finally populate the range by gup. That population should
fail because the VMA allows neither read nor write, right? And mlock should
report that. mlockall will not care because it ignores errors during population.
So there is no page table walk happening.

> And the munlock (munlock_vma_pages_range()) is slow, because it uses
> follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
> always traversing all levels of page tables from scratch. Funnily enough,
> speeding this up was my first linux-mm series years ago. But the speedup only
> works if PTEs are present, which is not the case for unpopulated PROT_NONE
> areas. That use case was unexpected back then. We should probably convert this
> code to a proper page table walk. If there are large areas with unpopulated pmd
> entries (or even higher levels) we would traverse them very quickly.

Yes, this is a good idea. I suspect it will be a little bit tricky to do without
duplicating a large part of the gup page table walker.
-- 
Michal Hocko
SUSE Labs
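
The reproducer is only partially quoted above (the quote stops after the mmap()
call). A minimal self-contained sketch along the same lines, with the
mlockall()/munlockall() calls and the timing added here for illustration rather
than taken verbatim from the original mail, could look like this:

    /* Map a large PROT_NONE area, then time mlockall() and munlockall()
     * separately.  The 1UL << 41 (2 TiB) mapping matches the snippet quoted
     * above.  Run as root or with RLIMIT_MEMLOCK raised, otherwise mlockall()
     * may fail with ENOMEM before anything interesting happens. */
    #include <stdio.h>
    #include <time.h>
    #include <sys/mman.h>

    static double seconds_since(const struct timespec *start)
    {
            struct timespec now;

            clock_gettime(CLOCK_MONOTONIC, &now);
            return (now.tv_sec - start->tv_sec) +
                   (now.tv_nsec - start->tv_nsec) / 1e9;
    }

    int main(void)
    {
            struct timespec t;
            void *p = mmap(NULL, 1UL << 41, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Per the discussion above, population of the PROT_NONE area is
             * expected to fail silently, so this part should return quickly. */
            clock_gettime(CLOCK_MONOTONIC, &t);
            if (mlockall(MCL_CURRENT | MCL_FUTURE))
                    perror("mlockall");
            printf("mlockall:   %.2f s\n", seconds_since(&t));

            /* This is the slow part: munlock walks the whole range page by
             * page even though nothing was ever populated. */
            clock_gettime(CLOCK_MONOTONIC, &t);
            if (munlockall())
                    perror("munlockall");
            printf("munlockall: %.2f s\n", seconds_since(&t));

            return 0;
    }

Changing PROT_NONE to PROT_READ | PROT_WRITE should move the cost into the
mlockall() timing as well, matching the behaviour described in the thread.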