From: Jann Horn
Date: Mon, 12 Aug 2019 20:14:38 +0200
Subject: Re: [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
To: "Joel Fernandes (Google)"
Cc: kernel list, Alexey Dobriyan, Andrew Morton, Borislav Petkov,
    Brendan Gregg, Catalin Marinas, Christian Hansen, Daniel Colascione,
    fmayer@google.com, "H. Peter Anvin", Ingo Molnar, Joel Fernandes,
    Jonathan Corbet, Kees Cook, kernel-team, Linux API,
    linux-doc@vger.kernel.org, linux-fsdevel, Linux-MM, Michal Hocko,
    Mike Rapoport, Minchan Kim, namhyung@google.com, "Paul E. McKenney",
    Robin Murphy, Roman Gushchin, Stephen Rothwell, Suren Baghdasaryan,
    Thomas Gleixner, Todd Kjos, Vladimir Davydov, Vlastimil Babka,
    Will Deacon
References: <20190807171559.182301-1-joel@joelfernandes.org>
In-Reply-To: <20190807171559.182301-1-joel@joelfernandes.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Aug 7, 2019 at 7:16 PM Joel Fernandes (Google) wrote:
> The page_idle tracking feature currently requires looking up the pagemap
> for a process followed by interacting with /sys/kernel/mm/page_idle.
> Looking up PFN from pagemap in Android devices is not supported by
> unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
>
> This patch adds support to directly interact with page_idle tracking at
> the PID level by introducing a /proc/<pid>/page_idle file. It follows
> the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> looking up PFN through pagemap is not needed since the interface uses
> virtual frame numbers, and at the same time also does not require
> SYS_ADMIN.
>
> In Android, we are using this for the heap profiler (heapprofd) which
> profiles and pin points code paths which allocates and leaves memory
> idle for long periods of time. This method solves the security issue
> with userspace learning the PFN, and while at it is also shown to yield
> better results than the pagemap lookup, the theory being that the window
> where the address space can change is reduced by eliminating the
> intermediate pagemap look up stage. In virtual address indexing, the
> process's mmap_sem is held for the duration of the access.

What happens when you use this interface on shared pages, like memory
inherited from the zygote, library file mappings and so on? If two
profilers ran concurrently for two different processes that both map the
same libraries, would they end up messing up each other's data?

Can this be used to observe which library pages other processes are
accessing, even if you don't have access to those processes, as long as
you can map the same libraries? I realize that there are already a bunch
of ways to do that with side channels and such; but if you're adding an
interface that allows this by design, it seems to me like something that
should be gated behind some sort of privilege check.

If the heap profiler is only interested in anonymous, process-private
memory, that might be an easy way out? Limit (unprivileged) use of this
interface to pages that aren't shared with any other processes?
> +/* Helper to get the start and end frame given a pos and count */
> +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long max_frame;
> +
> +	/* If an mm is not given, assume we want physical frames */
> +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> +
> +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> +		return -EINVAL;
> +
> +	*start = pos * BITS_PER_BYTE;
> +	if (*start >= max_frame)
> +		return -ENXIO;
> +
> +	*end = *start + count * BITS_PER_BYTE;
> +	if (*end > max_frame)
> +		*end = max_frame;
> +	return 0;
> +}

You could add some overflow checks for the multiplications. I haven't
seen any place where it actually matters, but it seems unclean; and in
particular, on a 32-bit architecture where the maximum user address is
very high (like with a 4G:4G split), it looks like this function might
theoretically return with `*start > *end`, which could be confusing to
callers.

[...]

> 	for (; pfn < end_pfn; pfn++) {
> 		bit = pfn % BITMAP_CHUNK_BITS;
> 		if (!bit)
> 			*out = 0ULL;
> -		page = page_idle_get_page(pfn);
> -		if (page) {
> -			if (page_is_idle(page)) {
> -				/*
> -				 * The page might have been referenced via a
> -				 * pte, in which case it is not idle. Clear
> -				 * refs and recheck.
> -				 */
> -				page_idle_clear_pte_refs(page);
> -				if (page_is_idle(page))
> -					*out |= 1ULL << bit;
> -			}
> +		page = page_idle_get_page_pfn(pfn);
> +		if (page && page_idle_pte_check(page)) {
> +			*out |= 1ULL << bit;
> 			put_page(page);
> 		}

The `page && !page_idle_pte_check(page)` case looks like it's missing a
put_page(); you probably intended to write something like this?

	page = page_idle_get_page_pfn(pfn);
	if (page) {
		if (page_idle_pte_check(page))
			*out |= 1ULL << bit;
		put_page(page);
	}

[...]
> +/* page_idle tracking for /proc/<pid>/page_idle */
> +
> +struct page_node {
> +	struct page *page;
> +	unsigned long addr;
> +	struct list_head list;
> +};
> +
> +struct page_idle_proc_priv {
> +	unsigned long start_addr;
> +	char *buffer;
> +	int write;
> +
> +	/* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
> +	struct page_node *page_nodes;
> +	int cur_page_node;
> +	struct list_head *idle_page_list;
> +};

A linked list is a weird data structure to use if the list elements are
just consecutive array elements.

> +/*
> + * Add page to list to be set as idle later.
> + */
> +static void pte_page_idle_proc_add(struct page *page,
> +				   unsigned long addr, struct mm_walk *walk)
> +{
> +	struct page *page_get = NULL;
> +	struct page_node *pn;
> +	int bit;
> +	unsigned long frames;
> +	struct page_idle_proc_priv *priv = walk->private;
> +	u64 *chunk = (u64 *)priv->buffer;
> +
> +	if (priv->write) {
> +		VM_BUG_ON(!page);
> +
> +		/* Find whether this page was asked to be marked */
> +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> +		bit = frames % BITMAP_CHUNK_BITS;
> +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> +		if (((*chunk >> bit) & 1) == 0)
> +			return;

This means that BITMAP_CHUNK_SIZE is UAPI on big-endian systems, right?
My opinion is that it would be slightly nicer to design the UAPI such
that incrementing virtual addresses are mapped to incrementing offsets
in the buffer (iow, either use bytewise access or use little-endian),
but I'm not going to ask you to redesign the UAPI this late.

[...]

> +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> +			       size_t count, loff_t *pos, int write)
> +{
[...]
> +	down_read(&mm->mmap_sem);
[...]
> +
> +	if (!write && !walk_error)
> +		ret = copy_to_user(ubuff, buffer, count);
> +
> +	up_read(&mm->mmap_sem);

I'd move the up_read() above the copy_to_user(); copy_to_user() can
block, and there's no reason to hold the mmap_sem across
copy_to_user().
Sorry about only chiming in at v5 with all this.