From: Daniel Colascione
Date: Tue, 28 May 2019 02:39:03 -0700
Subject: Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
To: Michal Hocko
Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
    Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
    Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190528090821.GU1658@dhcp22.suse.cz>
References: <20190520092801.GA6836@dhcp22.suse.cz>
    <20190521025533.GH10039@google.com>
    <20190521062628.GE32329@dhcp22.suse.cz>
    <20190527075811.GC6879@google.com>
    <20190527124411.GC1658@dhcp22.suse.cz>
    <20190528032632.GF6879@google.com>
    <20190528062947.GL1658@dhcp22.suse.cz>
    <20190528081351.GA159710@google.com>
    <20190528084927.GB159710@google.com>
    <20190528090821.GU1658@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, May 28, 2019 at 2:08 AM Michal Hocko wrote:
>
> On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim wrote:
> > > > > if we went with the per vma fd approach then you would get this
> > > > > feature automatically because map_files would refer to file backed
> > > > > mappings while map_anon could refer only to anonymous mappings.
> > > >
> > > > The reason to add such filter option is to avoid the parsing overhead
> > > > so map_anon wouldn't be helpful.
> > >
> > > Without chiming on whether the filter option is a good idea, I'd like
> > > to suggest that providing an efficient binary interfaces for pulling
> > > memory map information out of processes. Some single-system-call
> > > method for retrieving a binary snapshot of a process's address space
> > > complete with attributes (selectable, like statx?) for each VMA would
> > > reduce complexity and increase performance in a variety of areas,
> > > e.g., Android memory map debugging commands.
> >
> > I agree it's the best we can get *generally*.
> > Michal, any opinion?
>
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want to have
> the remote madvise call process or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far unfortunately.

I don't think the vma fd approach is viable. We have some processes
with a *lot* of VMAs --- system_server had 4204 when I checked just
now (and that's typical) --- and an FD operation per VMA would be
excessive. VMAs also come and go pretty easily depending on changes in
protections and various faults. It's also not entirely clear what the
semantics of vma FDs should be over address space mutations, while the
semantics of address ranges are well-understood. I would much prefer
an interface operating on address ranges to one operating on VMA FDs,
both for efficiency and for consistency with other memory management
APIs.

> An interface to query address range information is a separate but
> although a related topic. We have /proc//[s]maps for that right
> now and I understand it is not a general win for all usecases because
> it tends to be slow for some. I can see how /proc//map_anons could
> provide per vma information in a binary form via a fd based interface.
> But I would rather not conflate those two discussions much - well except
> if it could give one of the approaches more justification but let's
> focus on the madvise part first.

I don't think it's a good idea to focus on one feature in a
multi-feature change when the interactions between features can be
very important for the overall design of the multi-feature system and
the design of each feature.

Here's my thinking on the high-level design:

I'm imagining an address-range system that would work like this: we'd
create some kind of process_vm_getinfo(2) system call [1] that would
accept a statx-like attribute map and a pid/fd parameter as input and
return, on output, two things: 1) an array [2] of VMA descriptors
containing the requested information, and 2) a VMA configuration
sequence number. We'd then have process_madvise() and other
cross-process VM interfaces accept both address ranges and this
sequence number; they'd succeed only if the VMA configuration sequence
number is still current, i.e., the target process hasn't changed its
VMA configuration (implicitly or explicitly) since the call to
process_vm_getinfo().

This way, a process A that wants to perform some VM operation on
process B can slurp B's VMA configuration using process_vm_getinfo(),
figure out what it wants to do, and attempt to do it. If B modifies
its memory map in the meantime, so that A's local knowledge of B's
memory map has become invalid between the process_vm_getinfo() call
and A taking some action based on the result, A can retry [3]. While A
could instead ptrace or otherwise suspend B, *then* read B's memory
map (knowing B is quiescent), *then* operate on B, the optimistic
approach I'm describing would be much lighter-weight in the typical
case. It's also pretty simple, IMHO. If the "operate on B" step is
some kind of vectorized operation over multiple address ranges, this
approach also gets us all-or-nothing semantics.

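To make the snapshot-and-retry flow concrete, here is a toy user-space
model in Python. It is only a sketch of the semantics, not a proposed
implementation: ToyTarget, VmaDesc, StaleSnapshot, vm_getinfo(),
madvise() and advise_anon() are hypothetical names made up for the
example, and the assumption that every VMA layout change bumps the
sequence number is mine. The "race" hook exists only so the example
can simulate B changing its map between A's snapshot and A's
operation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class VmaDesc:
    """Hypothetical VMA descriptor: [start, end) plus one attribute."""
    start: int
    end: int
    anon: bool


class StaleSnapshot(Exception):
    """The caller's sequence number no longer matches the target's."""


class ToyTarget:
    """Stand-in for process B: a VMA list and a sequence number."""

    def __init__(self, vmas: List[VmaDesc]) -> None:
        self.vmas = list(vmas)
        self.seq = 0
        self.advised: List[Tuple[int, int]] = []

    def mutate(self, vmas: List[VmaDesc]) -> None:
        # Assumption: every change to the VMA layout bumps the number.
        self.vmas = list(vmas)
        self.seq += 1

    def vm_getinfo(self) -> Tuple[List[VmaDesc], int]:
        # Models the imagined process_vm_getinfo(): snapshot + number.
        return list(self.vmas), self.seq

    def madvise(self, ranges: List[Tuple[int, int]], seq: int) -> None:
        # Models a sequence-checked process_madvise(): if the number
        # is stale, nothing is applied (all-or-nothing).
        if seq != self.seq:
            raise StaleSnapshot()
        self.advised.extend(ranges)


def advise_anon(
    target: ToyTarget,
    max_tries: int = 3,
    race: Optional[Callable[[ToyTarget], None]] = None,
) -> int:
    """Advise all anonymous ranges of target; return attempts used."""
    for attempt in range(1, max_tries + 1):
        vmas, seq = target.vm_getinfo()
        if race is not None:
            race(target)  # test hook: B may change its map here
        ranges = [(v.start, v.end) for v in vmas if v.anon]
        try:
            target.madvise(ranges, seq)
            return attempt
        except StaleSnapshot:
            continue  # snapshot went stale: take a new one and retry
    raise StaleSnapshot(f"gave up after {max_tries} attempts")
```

In this model, a caller that loses the race simply takes a fresh
snapshot and tries again, and a stale request never applies to only
part of its range list.
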
Or maybe the whole sequence number thing is overkill and we don't need
atomicity? But if there's a concern that A shouldn't operate on B's
memory without knowing what it's operating on, then the scheme I've
proposed above solves this knowledge problem in a pretty lightweight
way.

[1] or some other interface
[2] or something more complicated if we want the descriptors to
    contain variable-length elements, e.g., strings
[3] or override the sequence number check if it's feeling bold?