Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp6134337yba; Tue, 14 May 2019 02:14:15 -0700 (PDT) X-Google-Smtp-Source: APXvYqwtmiD7CrToP8jQlGEc6/fLAzMMAMhgcvBe312AVfaDMsWxDmtyGGKgBI2/knQCsTw0shBM X-Received: by 2002:a63:530d:: with SMTP id h13mr36265891pgb.77.1557825255124; Tue, 14 May 2019 02:14:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1557825255; cv=none; d=google.com; s=arc-20160816; b=c/PiSkvjW6sOuwTJ4o6HgVNyGl+YyhGK6iVg1tcUyINGz7elZyPBDbhPc1Z+fYaWuL rC7wP8oAJaCQzEQudvPnL49uVb+2u93E99yzViGK2+c4J21p8MjfXWCKp41ozEE+4tnS 8Zx4uTKNhmzrdjVMeEFZN9Yr21EkZPerHeQN3A5LTtfOKE74jB/4EuCX+Z4y7qfr4JHR ouyXDq7je0NOhYZ9msflgdvONebge8Zr3lch6fweD3AyWh1Gv1jLSghToGdshniXYx4j Ua9iEQmfxPGuUiT1YUXFayGDqMg7CMGD5wyCFHOg/UBgYOb6+ASgXJKTsiMXYv70tbtU zhQA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=RB8yoXW30Wow1zTg31J23rJgqH4zq4zDFZziGNsf208=; b=wqGSvc/mOMJsfkOeblyRay3MMSevjeVqL8aVaQexuXgwmbqwnzbj2g8O+pdGKqqejp ZmHw4tmd9EPv7PlcANj84FwCK1pG2GlIeKpVbtqKT97bjL+kphraKN9GWsQmMph1F8Gs UmUfl1hQ8Y1d7Fvz8yUUm+Kr2jiRBBeEhCH+ZNUcaU5hBkCmFdsZQAZtRP6jfQ/fOfz2 k94hko0R8j0IvtUTxTk5oLiH7A2LovIUUK/vju0+8BI4Mn/rCEqpmpiTqysG2npp0B2T BMo2XTlUZPpl8YNBi+3EmjlIUmcb5wp4W194CQlk5AXi6d+Gdcmna1rfYD6f96/Cybgr C95w== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=virtuozzo.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e3si18474674pgc.98.2019.05.14.02.14.00; Tue, 14 May 2019 02:14:15 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=virtuozzo.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726466AbfENJMb (ORCPT + 99 others); Tue, 14 May 2019 05:12:31 -0400 Received: from relay.sw.ru ([185.231.240.75]:58800 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725916AbfENJMa (ORCPT ); Tue, 14 May 2019 05:12:30 -0400 Received: from [172.16.25.169] by relay.sw.ru with esmtp (Exim 4.91) (envelope-from ) id 1hQTTt-0001sf-Bj; Tue, 14 May 2019 12:12:17 +0300 Subject: Re: [PATCH RFC 0/4] mm/ksm: add option to automerge VMAs To: Oleksandr Natalenko Cc: linux-kernel@vger.kernel.org, Vlastimil Babka , Michal Hocko , Matthew Wilcox , Pavel Tatashin , Timofey Titovets , Aaron Tomlin , linux-mm@kvack.org References: <20190510072125.18059-1-oleksandr@redhat.com> <36a71f93-5a32-b154-b01d-2a420bca2679@virtuozzo.com> <20190513113314.lddxv4kv5ajjldae@butterfly.localdomain> <20190514063043.ojhsb6d3ohxx4wur@butterfly.localdomain> From: Kirill Tkhai Message-ID: <8f146863-5963-81b2-ed20-6428d1da353c@virtuozzo.com> Date: Tue, 14 May 2019 12:12:16 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1 MIME-Version: 1.0 In-Reply-To: <20190514063043.ojhsb6d3ohxx4wur@butterfly.localdomain> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 14.05.2019 09:30, Oleksandr Natalenko wrote: > Hi. > > On Mon, May 13, 2019 at 03:37:56PM +0300, Kirill Tkhai wrote: >>> Yes, I get your point. But the intention is to avoid another hacky trick >>> (LD_PRELOAD), thus *something* should *preferably* be done on the >>> kernel level instead. >> >> I don't think so. Does userspace hack introduce some overhead? It does not >> look so. Why should we think about mergeable VMAs in page fault handler?! >> This is the last thing we want to think in page fault handler. >> >> Also, there is difficult synchronization in page fault handlers, and it's >> easy to make a mistake. So, there is a mistake in [3/4], and you call >> ksm_enter() with mmap_sem read locked, while normal way is to call it >> with write lock (see madvise_need_mmap_write()). >> >> So, let's don't touch this path. Small optimization for unlikely case will >> introduce problems in optimization for likely case in the future. > > Yup, you're right, I've missed the fact that write lock is needed there. > Re-vamping locking there is not my intention, so lets find another > solution. > >>> Also, just for the sake of another piece of stats here: >>> >>> $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc >>> 526 >> >> This all requires attentive analysis. The number looks pretty big for me. >> What are the pages you get merged there? This may be just zero pages, >> you have identical. >> >> E.g., your browser want to work fast. It introduces smart schemes, >> and preallocates many pages in background (mmap + write 1 byte to a page), >> so in further it save some time (no page fault + alloc), when page is >> really needed. But your change merges these pages and kills this >> optimization. Sounds not good, does this? >> >> I think, we should not think we know and predict better than application >> writers, what they want from kernel. Let's people decide themselves >> in dependence of their workload. The only exception is some buggy >> or old applications, which impossible to change, so force madvise >> workaround may help. But only in case there are really such applications... >> >> I'd researched what pages you have duplicated in these 526 MB. Maybe >> you find, no action is required or a report to userspace application >> to use madvise is needed. > > OK, I agree, this is a good argument to move decision to userspace. > >>> 2) what kinds of opt-out we should maintain? Like, what if force_madvise >>> is called, but the task doesn't want some VMAs to be merged? This will >>> required new flag anyway, it seems. And should there be another >>> write-only file to unmerge everything forcibly for specific task? >> >> For example, >> >> Merge: >> #echo $task > /sys/kernel/mm/ksm/force_madvise > > Immediate question: what should be actually done on this? I see 2 > options: > > 1) mark all VMAs as mergeable + set some flag for mmap() to mark all > further allocations as mergeable as well; > 2) just mark all the VMAs as mergeable; userspace can call this > periodically to mark new VMAs. > > My prediction is that 2) is less destructive, and the decision is > preserved predominantly to userspace, thus it would be a desired option. Let's see, how we use KSM now. It's good for virtual machines: people install the same distribution in several VMs, and they have the same packages and the same files. When you read a file inside VM, its pages are file cache for the VM, but they are anonymous pages for host kernel. Hypervisor marks VM memory as mergeable, and host KSM merges the same anonymous pages together. Many of file cache inside VM is constant content, so we have good KSM compression on such the file pages. The result we have is explainable and expected. But we don't know anything about pages, you have merged on your laptop. We can't make any assumptions before analysis of applications, which produce such the pages. Let's check what happens before we try to implement some specific design (if we really need something to implement). The rest is just technical details. We may implement everything we need on top of this (even implement a polling of /proc/[pid]/maps and write a task and address of vma to force_madvise or similar file). Kirill