Date: Wed, 20 Jun 2018 09:18:17 +0200
From: Michal Hocko
To: Nadav Amit
Cc: Yang Shi, Matthew Wilcox, ldufour@linux.vnet.ibm.com, Andrew Morton,
 Peter Zijlstra, Ingo Molnar, acme@kernel.org,
 alexander.shishkin@linux.intel.com, jolsa@redhat.com, namhyung@kernel.org,
 "open list:MEMORY MANAGEMENT", linux-kernel@vger.kernel.org
Subject: Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping
Message-ID: <20180620071817.GJ13685@dhcp22.suse.cz>
References: <1529364856-49589-1-git-send-email-yang.shi@linux.alibaba.com>
 <1529364856-49589-3-git-send-email-yang.shi@linux.alibaba.com>
 <3DDF2672-FCC4-4387-9624-92F33C309CAE@gmail.com>
 <158a4e4c-d290-77c4-a595-71332ede392b@linux.alibaba.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent:
 Mutt/1.9.5 (2018-04-13)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 19-06-18 17:31:27, Nadav Amit wrote:
> at 4:08 PM, Yang Shi wrote:
>
> > On 6/19/18 3:17 PM, Nadav Amit wrote:
> >> at 4:34 PM, Yang Shi wrote:
> >>
> >>> When running some mmap/munmap scalability tests with large memory
> >>> (i.e. > 300GB), the below hung task issue may happen occasionally.
> >>>
> >>> INFO: task ps:14018 blocked for more than 120 seconds.
> >>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> >>> message.
> >>> ps D 0 14018 1 0x00000004
> >>>
> >> (snip)
> >>
> >>> Zapping pages is the most time consuming part, according to the
> >>> suggestion from Michal Hock [1], zapping pages can be done with holding
> >>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> >>> mmap_sem to manipulate vmas.
> >>>
> >> Does munmap() == MADV_DONTNEED + munmap() ?
> >
> > Not exactly the same. So, I basically copied the page zapping used by
> > munmap instead of calling MADV_DONTNEED.
> >
> >> For example, what happens with userfaultfd in this case? Can you get an
> >> extra #PF, which would be visible to userspace, before the munmap is
> >> finished?
> >
> > userfaultfd is handled by regular munmap path. So, no change to
> > userfaultfd part.
>
> Right. I see it now.
>
> >> In addition, would it be ok for the user to potentially get a zeroed page in
> >> the time window after the MADV_DONTNEED finished removing a PTE and before
> >> the munmap() is done?
> >
> > This should be undefined behavior according to Michal. This has been
> > discussed in https://lwn.net/Articles/753269/.
>
> Thanks for the reference.
>
> Reading the man page I see: "All pages containing a part of the indicated
> range are unmapped, and subsequent references to these pages will generate
> SIGSEGV."

Yes, this is true but I guess what Yang Shi meant was that a userspace
access racing with munmap is not well defined. You never know whether you
get your data, a #PF or a SEGV because it depends on timing. The user
visible change might be that you lose content and get a zero page instead
if you hit the race window while we are unmapping, which was not possible
before. But wouldn't such an access pattern be buggy anyway? You need some
form of external synchronization AFAICS.

But maybe some userspace depends on the "get the right data or get SEGV"
semantic. If we have to preserve that then we can come up with a VM_DEAD
flag set before we tear the mapping down and force the SEGV on the #PF
path. We already do something similar for MMF_UNSTABLE.
-- 
Michal Hocko
SUSE Labs
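
To make the "zero page instead of your data" outcome concrete, here is a
minimal userspace sketch. It is not part of the patch and does not
reproduce the race itself: it only shows, deterministically, what a read
sees once the pages of a private anonymous mapping have been zapped while
the mapping is still in place, which is the state a racing access could
observe during the proposed munmap. It assumes Linux and Python 3.8 or
later (for mmap.madvise); the helper name read_after_dontneed is made up
for this illustration.

```python
import mmap


def read_after_dontneed(payload=b"data"):
    """Write payload to a private anonymous page, zap it, read it back.

    Hypothetical helper for illustration only; assumes Linux semantics
    for MADV_DONTNEED on private anonymous memory.
    """
    # fileno=-1 requests anonymous memory; MAP_PRIVATE pages are
    # zero-filled on demand when they are faulted in.
    region = mmap.mmap(-1, mmap.PAGESIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region[:len(payload)] = payload
        before = region[:len(payload)]
        # Drop the backing page while the mapping itself stays valid,
        # similar to the window between zapping and vma removal.
        region.madvise(mmap.MADV_DONTNEED)
        # This read faults the page back in; no SIGSEGV is raised.
        after = region[:len(payload)]
    finally:
        region.close()
    return before, after


if __name__ == "__main__":
    before, after = read_after_dontneed()
    print("before zap:", before)
    print("after zap: ", after)
```

On Linux the second read is expected to return zero bytes rather than
fault, so a program that relies on "right data or SEGV" would silently see
zeroes in that window unless something like the proposed VM_DEAD check
turns the fault into a SEGV.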