Date: Wed, 18 Jul 2018 15:43:08 +0200
From: Michal Hocko
To: David Hildenbrand
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Alexander Potapenko, Andrew Morton, Andrey Ryabinin, Balbir Singh,
    Baoquan He, Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
    Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
    Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
    Jérôme Glisse, Joonsoo Kim, Juergen Gross, Kate Stewart,
    "Kirill A. Shutemov", Matthew Wilcox, Mel Gorman, Michael Ellerman,
    Miles Chen, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
    Philippe Ombredanne, Rashmica Gupta, Reza Arbab, Souptick Joarder,
    Tetsuo Handa, Thomas Gleixner, Vlastimil Babka
Subject: Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
Message-ID: <20180718134308.GF7193@dhcp22.suse.cz>
References: <20180524075327.GU20441@dhcp22.suse.cz>
    <14d79dad-ad47-f090-2ec0-c5daf87ac529@redhat.com>
    <20180524093121.GZ20441@dhcp22.suse.cz>
    <20180524120341.GF20441@dhcp22.suse.cz>
    <1a03ac4e-9185-ce8e-a672-c747c3e40ff2@redhat.com>
    <20180524142241.GJ20441@dhcp22.suse.cz>
    <819e45c5-6ae3-1dff-3f1d-c0411b6e2e1d@redhat.com>
    <20180718131905.GB7193@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 18-07-18 15:39:29, David Hildenbrand wrote:
> On 18.07.2018 15:19, Michal Hocko wrote:
> > [got back to this really late. Sorry about that]
> >
> > On Thu 24-05-18 23:07:23, David Hildenbrand wrote:
> >> On 24.05.2018 16:22, Michal Hocko wrote:
> >>> I will go over the rest of the email later I just wanted to make this
> >>> point clear because I suspect we are talking past each other.
> >>
> >> It sounds like we are now talking about how to solve the problem. I like
> >> that :)
> >>
> >>>
> >>> On Thu 24-05-18 16:04:38, David Hildenbrand wrote:
> >>> [...]
> >>>> The point I was making is: I cannot allocate 8MB/128MB using the buddy
> >>>> allocator. All I want to do is manage the memory a virtio-mem device
> >>>> provides as flexible as possible.
> >>>
> >>> I didn't mean to use the page allocator to isolate pages from it. We do
> >>> have other means. Have a look at the page isolation framework and have a
> >>> look how the current memory hotplug (ab)uses it.
> >>> In short you mark the
> >>> desired physical memory range as isolated (nobody can allocate from it)
> >>> and then simply remove it from the page allocator. And you are done with
> >>> it. Your particular range is gone, nobody will ever use it. If you mark
> >>> those struct pages reserved then pfn walkers should already ignore them.
> >>> If you keep those pages with ref count 0 then even hotplug should work
> >>> seamlessly (I would have to double check).
> >>>
> >>> So all I am arguing is that whatever your driver wants to do can be
> >>> handled without touching the hotplug code much. You would still need
> >>> to add new ranges in the mem section units and manage on top of that.
> >>> You need to do that anyway to keep track of what parts are in use or
> >>> offlined anyway right? Now the mem sections. You have to do that anyway
> >>> for memmaps. Our sparse memory model simply works in those units. Even
> >>> if you make a part of that range unavailable then the section will still
> >>> be there.
> >>>
> >>> Do I make at least some sense or am I completely missing your point?
> >>>
> >>
> >> I think we're heading somewhere. I understand that you want to separate
> >> this "semi" offline part from the general offlining code. If so, we
> >> should definitely enforce segment alignment for online_pages/offline_pages.
> >>
> >> Importantly, what I need is:
> >>
> >> 1. Indicate and prepare memory sections to be used for adding memory
> >> chunks (right now add_memory())
> >
> > Yes, this is section based. So you will always get memmap (struct page)
> > for the whole section.
> >
> >> 2. Make memory chunks of a section available to the system (right now
> >> online_pages())
> >
> > Yes, this doesn't have to be section based. All you need is to mark
> > remaining pages as offline. They are reserved at this moment so nobody
> > should touch them.
> >
> >> 3. Remove memory chunks of a section from the system (right now
> >> offline_pages())
> >
> > Yes.
> > All we need is to note that those reserved pages are actually good
> > to offline. I have mentioned that reserved pages are yours at this stage
> > so you can note the special state without an additional page flag.
> >
> > The generic hotplug code just has to learn about this new state.
> > has_unmovable_pages sounds like a proper place to do that. You simply
> > clear the offline state and the PageReserved and you are done with the
> > page.
> >
>
> I agree. This would be minimally invasive - notifiers are still called on
> the whole segment.

That shouldn't matter because notifiers should never step on pages they
do not manage or own.

> >> 4. Remove memory sections from the system (right now remove_memory())
> >
> > no change needed
> >
> >> 5. Hinder dumping tools from reading memory chunks that are logically
> >> offline (right now PageOffline())
> >
> > I still fail to see why we even care about some dumping tools. Pages
> > are reserved so they simply shouldn't touch that memory at all.
> >
>
> Thanks for having a look!
>
> I wonder why reserved pages never got excluded by dump tools. So I
> assume there is some kind of magic hidden in it.
>
> `git grep SetPageReserved` returns a number of buffers that are not to
> be swapped. So "reserved" there is used for:
> "PG_reserved is set for special pages, which can never be swapped out"

That was an ancient meaning of the flag. The flag in general means that
you shouldn't touch it unless you own it.

> And my point would be that these pages are still to be dumped (just as
> it is being done now). They are valid memory.

Then fix kdump or whatever is touching them.

> It seems like this bit is used for two different purposes. My take would
> be then to have another way of indicating "don't swap" vs. "page not
> accessible / offline". And that's why I propose PageOffline.
>
> I would even go one step further and rename "reserved" to "dontswap".

No, it really hasn't had that meaning for years.
-- 
Michal Hocko
SUSE Labs
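
The five steps discussed above can be pictured with a small userspace toy
model. This is only a sketch under stated assumptions, not kernel code: the
class ToySection, its method names and the PageState values are hypothetical,
and the 4 KiB page and 128 MiB section sizes are assumed for illustration
(the 4 MiB chunk size comes from the patch subject). It shows per-page
metadata covering the whole section, chunks going online and offline
independently, and a walker that skips pages it does not own.

```python
from enum import Enum

PAGE_SIZE = 4 * 1024                 # assumption: 4 KiB pages
SECTION_SIZE = 128 * 1024 * 1024     # assumption: 128 MiB memory section
CHUNK_SIZE = 4 * 1024 * 1024         # 4 MiB chunks, as in the patch subject
PAGES_PER_SECTION = SECTION_SIZE // PAGE_SIZE
PAGES_PER_CHUNK = CHUNK_SIZE // PAGE_SIZE


class PageState(Enum):
    RESERVED = "reserved"   # metadata exists, page is owned by the driver
    ONLINE = "online"       # page has been handed to the toy allocator


class ToySection:
    """Hypothetical stand-in for one memory section; not a kernel structure."""

    def __init__(self):
        # Step 1: adding a section creates per-page metadata for the whole
        # section; every page starts out reserved.
        self.pages = [PageState.RESERVED] * PAGES_PER_SECTION

    def _chunk_range(self, chunk):
        start = chunk * PAGES_PER_CHUNK
        if not 0 <= start < PAGES_PER_SECTION:
            raise ValueError("chunk outside section")
        return range(start, start + PAGES_PER_CHUNK)

    def online_chunk(self, chunk):
        # Step 2: only this chunk becomes usable; the rest stays reserved.
        for pfn in self._chunk_range(chunk):
            self.pages[pfn] = PageState.ONLINE

    def offline_chunk(self, chunk):
        # Step 3: give the chunk back to the driver by marking it reserved.
        for pfn in self._chunk_range(chunk):
            self.pages[pfn] = PageState.RESERVED

    def walkable_pfns(self):
        # Step 5: a well-behaved walker skips pages it does not own.
        return [pfn for pfn, state in enumerate(self.pages)
                if state is PageState.ONLINE]

    def can_remove(self):
        # Step 4: the section can go away once no page in it is online.
        return all(state is PageState.RESERVED for state in self.pages)


section = ToySection()
section.online_chunk(3)      # bring a single 4 MiB chunk online
section.offline_chunk(3)     # and take it offline again
```

In this model a second state for "reserved and good to offline" is not
needed, which mirrors the argument that the driver owns reserved pages and
can track any extra state itself.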