Message-ID: <4BDB2069.4000507@redhat.com>
Date: Fri, 30 Apr 2010 21:24:41 +0300
From: Avi Kivity
To: Jeremy Fitzhardinge
CC: Dan Magenheimer, Dave Hansen, Pavel Machek,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    hugh.dickins@tiscali.co.uk, ngupta@vflare.org, JBeulich@novell.com,
    chris.mason@oracle.com, kurt.hackel@oracle.com,
    dave.mccracken@oracle.com, npiggin@suse.de,
    akpm@linux-foundation.org, riel@redhat.com
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
In-Reply-To: <4BDB18CE.2090608@goop.org>

On 04/30/2010 08:52 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2010 09:16 AM, Avi Kivity wrote:
>
>> Given that whenever frontswap fails you need to swap anyway, it is
>> better for the host to never fail a frontswap request and
>> instead back it with disk storage if needed.  This way you avoid a
>> pointless vmexit when you're out of memory.  Since it's disk backed
>> it needs to be asynchronous and batched.
>>
> I'd argue the opposite.  There's no point in having the host do swapping
> on behalf of guests if guests can do it themselves; it's just a
> duplication of functionality.

The problem with relying on the guest to swap is that it's voluntary.
The guest may not be able to do it.  When the hypervisor needs memory
and guests don't cooperate, it has to swap.

But I'm not suggesting that the host swap on behalf of the guest.
Rather, the guest swaps to (what it sees as) a device with a large
write-back cache; the host simply manages that cache.

> You end up having two IO paths for each
> guest, and the resulting problems in trying to account for the IO,
> rate-limit it, etc.  If you can simply say "all guest disk IO happens
> via this single interface", its much easier to manage.
>

With tmem you have to account for that memory, make sure it's
distributed fairly, claim it back when you need it (requiring guest
cooperation), live migrate and save/restore it.  It's a much larger
change than introducing a write-back device for swapping (which has the
benefit of working with unmodified guests).

> If frontswap has value, it's because its providing a new facility to
> guests that doesn't already exist and can't be easily emulated with
> existing interfaces.
>
> It seems to me the great strengths of the synchronous interface are:
>
>     * it matches the needs of an existing implementation (tmem in Xen)
>     * it is simple to understand within the context of the kernel code
>       it's used in
>
> Simplicity is important, because it allows the mm code to be understood
> and maintained without having to have a deep understanding of
> virtualization.

If we use the existing paths, things are even simpler, and we match more
needs (hypervisors with dma engines, the ability to reclaim memory
without guest cooperation).

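To make the "device with a large write-back cache" idea concrete, here is a toy model of what the host side might look like.  It is only a sketch under stated assumptions: the class and method names (WriteBackSwapCache, write, read, discard, shrink) are hypothetical and do not correspond to any kvm, Xen or tmem interface, and real block IO is replaced by plain dictionaries.

```python
from collections import OrderedDict


class WriteBackSwapCache:
    """Toy model of a host-managed write-back cache in front of a guest
    swap device.  All names here are hypothetical, for illustration only."""

    def __init__(self, capacity_pages):
        self.capacity_pages = capacity_pages
        self.cache = OrderedDict()  # swap slot -> page data, oldest first
        self.disk = {}              # swap slot -> page data written back
        self.writebacks = 0         # number of pages that reached "disk"

    def _evict(self):
        # Write the oldest cached pages back until the cache fits.
        while len(self.cache) > self.capacity_pages:
            slot, data = self.cache.popitem(last=False)
            self.disk[slot] = data
            self.writebacks += 1

    def write(self, slot, data):
        # Guest swap-out: always accepted; overflow goes to disk.
        self.cache.pop(slot, None)
        self.cache[slot] = data
        self._evict()

    def read(self, slot):
        # Guest swap-in: served from host memory when still cached.
        if slot in self.cache:
            return self.cache[slot]
        return self.disk[slot]

    def discard(self, slot):
        # Guest freed the slot (trim/discard): nothing left to write back.
        self.cache.pop(slot, None)
        self.disk.pop(slot, None)

    def shrink(self, capacity_pages):
        # Host reclaims memory on its own, without guest cooperation.
        self.capacity_pages = capacity_pages
        self._evict()


cache = WriteBackSwapCache(capacity_pages=2)
for slot, page in enumerate([b"a", b"b", b"c"]):
    cache.write(slot, page)   # the third write pushes slot 0 to disk
cache.discard(1)              # slot 1 is freed and never written back
cache.shrink(0)               # host takes all of its memory back
print(cache.writebacks, sorted(cache.disk))
```

The guest only ever sees an ordinary block device; a write never fails, and the host can shrink the cache whenever it needs the memory back.
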
> One of the problems with CMM2 was that it puts a lot of
> intricate constraints on the mm code which can be easily broken, which
> would only become apparent in subtle edge cases in a CMM2-using
> environment.  An addition async frontswap-like interface - while not as
> complex as CMM2 - still makes things harder for mm maintainers.
>

No doubt CMM2 is hard to swallow.

> The downside is that it may not match some implementation in which the
> get/put operations could take a long time (ie, physical IO to a slow
> mechanical device).  But a general Linux principle is not to overdesign
> interfaces for hypothetical users, only for real needs.
>
> Do you think that you would be able to use frontswap in kvm if it were
> an async interface, but not otherwise?  Or are you arguing a hypothetical?
>

For kvm (or Xen, with some modifications) all of the benefits of
frontswap/tmem can be achieved with the ordinary swap.  It would need
trim/discard support to avoid writing back freed data, but that's good
for flash as well.

The advantages are:
- just works
- old guests
- <1 exit/page (since it's batched)
- no extra overhead if no free memory
- can use dma engine (since it's asynchronous)

>> At this point we're back with the ordinary swap API.  Simply have your
>> host expose a device which is write cached by host memory, you'll have
>> all the benefits of frontswap with none of the disadvantages, and with
>> no changes to guest code.
>>
> Yes, that's comfortably within the "guests page themselves" model.
> Setting up a block device for the domain which is backed by pagecache
> (something we usually try hard to avoid) is pretty straightforward.  But
> it doesn't work well for Xen unless the blkback domain is sized so that
> it has all of Xen's free memory in its pagecache.
>

Could be easily achieved with ballooning?

> That said, it does concern me that the host/hypervisor is left holding
> the bag on frontswapped pages.
> A evil/uncooperative/lazy can just pump
> a whole lot of pages into the frontswap pool and leave them there.  I
> guess this is mitigated by the fact that the API is designed such that
> they can't update or read the data without also allowing the hypervisor
> to drop the page (updates can fail destructively, and reads are also
> destructive), so the guest can't use it as a clumsy extension of their
> normal dedicated memory.
>

Eventually you'll have to swap frontswap pages, or kill uncooperative
guests.  At which point all of the simplicity is gone.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.