Message-ID: <5c3e7de6.1c69fb81.4aebb.3fec@mx.google.com>
Date: Tue, 15 Jan 2019 16:42:08 -0800
From: Josh Snyder
References: <20190108044336.GB27534@dastard> <20190109022430.GE27534@dastard>
 <20190109043906.GF27534@dastard> <20190110004424.GH27534@dastard>
 <20190110070355.GJ27534@dastard> <20190110122442.GA21216@nautica>
To: Linus Torvalds
Cc: Dominique Martinet, Dave Chinner, Jiri Kosina, Matthew Wilcox,
 Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko,
 Linux-MM, kernel list, Linux API
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
X-Mailing-List: linux-kernel@vger.kernel.org

Linus Torvalds wrote on Thu, Jan 10, 2019:
> So right now, I consider the mincore change to be a "try to probe the
> state of mincore users", and we haven't really gotten a lot of
> information back yet.

For Netflix, losing accurate information from the mincore syscall would
lengthen database cluster maintenance operations from days to months. We
rely on cross-process mincore to migrate the contents of a page cache
from machine to machine, and across reboots.

To do this, I wrote and maintain happycache [1], a page cache
dumper/loader tool. It is quite similar in architecture to pgfincore,
except that it is agnostic to workload.
The gist of happycache's operation is "produce a dump of residence
status for each page, do some operation, then reload exactly the same
pages which were present before." happycache is entirely dependent on
accurate reporting of the in-core status of file-backed pages, as
accessed by another process.

We primarily use happycache with Cassandra, which (like Postgres +
pgfincore) relies heavily on OS page cache to reduce disk accesses.
Because our workloads never experience a cold page cache, we are able to
provision hardware for a peak utilization level that is far lower than
the hypothetical "every query is a cache miss" peak.

A database warmed by happycache can be ready for service in seconds
(bounded only by the performance of the drives and the I/O subsystem),
with no period of in-service degradation. By contrast, putting a
database in service without a page cache entails a potentially unbounded
period of degradation (at Netflix, the time to populate a single node's
cache via natural cache misses varies by workload from hours to weeks).
If a single node upgrade were to take weeks, then upgrading an entire
cluster would take months. Since we want to apply security upgrades (and
other things) on a somewhat tighter schedule, we would have to develop
more complex solutions to provide the same functionality already
provided by mincore.

The bottom line is that happycache is designed to benignly exploit the
same information leak documented in the paper [2]. I think it makes
perfect sense to remove cross-process mincore functionality from
unprivileged users, but not to remove it entirely.

Josh Snyder
Netflix Cloud Database Engineering

[1] https://github.com/hashbrowncipher/happycache
[2] https://arxiv.org/abs/1901.01161
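
For readers who want to see the dump-then-reload cycle described above
in concrete terms, here is a minimal sketch. It is not happycache's
actual code. The function names (dump_residency, resident_ranges,
reload_pages) are hypothetical, and the sketch assumes 64-bit Linux,
Python 3 with ctypes, and a file the caller is permitted to query with
mincore. Using posix_fadvise(POSIX_FADV_WILLNEED) for the reload step is
one plausible choice made for this sketch, not something the message
specifies.

```python
import ctypes
import mmap
import os


def dump_residency(path):
    """Return one bool per page of `path`: True if the page is in core."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    # Assumption: libc exposes mmap/mincore/munmap (true on Linux/glibc).
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]
    libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Mapping the file does not fault any pages in by itself.
        addr = libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr is None or addr == ctypes.c_void_p(-1).value:
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            if libc.mincore(addr, size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
        finally:
            libc.munmap(addr, size)
    finally:
        os.close(fd)
    # Only the least significant bit of each byte is meaningful.
    return [bool(b & 1) for b in vec]


def resident_ranges(resident, page_size=mmap.PAGESIZE):
    """Coalesce a residency vector into (offset, length) byte ranges."""
    ranges = []
    start = None
    for i, present in enumerate(resident):
        if present and start is None:
            start = i
        elif not present and start is not None:
            ranges.append((start * page_size, (i - start) * page_size))
            start = None
    if start is not None:
        ranges.append((start * page_size,
                       (len(resident) - start) * page_size))
    return ranges


def reload_pages(path, resident):
    """Ask the kernel to read back the pages recorded as resident."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for offset, length in resident_ranges(resident):
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

A caller would save the result of dump_residency(path) before the
maintenance operation and pass it to reload_pages(path, saved)
afterwards. The dump step is the part that depends on mincore reporting
accurate in-core status for pages cached on behalf of another process.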