Subject: Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
From: Alexey Kardashevskiy
To: Vlastimil Babka, linux-mm@kvack.org
Cc: Alexander Duyck, Andrew Morton, Benjamin Herrenschmidt, David Gibson,
    Johannes Weiner, Joonsoo Kim, Mel Gorman, Michal Hocko, Paul Mackerras,
    Sasha Levin, linux-kernel@vger.kernel.org, Alex Williamson,
    Alexander Graf, Paolo Bonzini, "Aneesh Kumar K . V", Peter Zijlstra
Date: Mon, 17 Aug 2015 19:11:01 +1000
Message-ID: <55D1A525.5090706@ozlabs.ru>
In-Reply-To: <55D1910C.7070006@suse.cz>
References: <1438762094-17747-1-git-send-email-aik@ozlabs.ru> <55D1910C.7070006@suse.cz>

On 08/17/2015 05:45 PM, Vlastimil Babka wrote:
> On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:
>> This is about VFIO aka PCI passthrough used from QEMU.
>> KVM is irrelevant here.
>>
>> QEMU is a machine emulator. It allocates guest RAM from anonymous memory
>> and these pages are movable which is ok. They may happen to be allocated
>> from the contiguous memory allocation zone (CMA). Which is also ok as
>> long as they are movable.
>>
>> However if the guest starts using VFIO (which can be hotplugged into
>> the guest), in most cases it involves DMA which requires guest RAM pages
>> to be pinned and not move once their addresses are programmed to
>> the hardware for DMA.
>>
>> So we end up in a situation when quite many pages in CMA are not movable
>> anymore. And we get a bunch of these:
>>
>> [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
>> [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
>> [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy
>
> IIRC CMA was for mobile devices and their camera/codec drivers and you
> don't use QEMU on those? What do you need CMA for in your case?

I do not want QEMU to get memory from CMA, this is my point. It just
happens sometimes that the kernel allocates movable pages from there.

>
>> This is a very rough patch to start the conversation about how to move
>> pages properly. mm/page_alloc.c does this and
>> arch/powerpc/mm/mmu_context_iommu.c exploits it.
>
> OK such conversation should probably start by mentioning the VM_PINNED
> effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345
>
> It's a more general approach to dealing with pinned pages, and moving them
> out of CMA area (and compacting them in general) prior to pinning is one of
> the things that should be done within that framework.

And I assume these patches did not go anywhere, right?...

> Then there's the effort to enable migrating pages other than LRU during
> compaction (and thus CMA allocation): https://lwn.net/Articles/650864/
> I don't know if that would be applicable in your use case, i.e. are the
> pins for DMA short-lived and can the isolation/migration code wait a bit
> for the transfer to finish so it can grab them, or something?

Pins for DMA are long-lived, pretty much as long as the guest is running.
So this "compaction" is too late.

>>
>> Please do not comment on the style and code placement,
>> this is just to give some context :)
>>
>> Obviously, this does not work well - it manages to migrate only a few pages
>> and crashes as it is missing locks/disabling interrupts and I probably
>> should not just remove pages from LRU list (normally, I guess, only these
>> can migrate) and a million other things.
>>
>> The questions are:
>>
>> - what is the correct way of telling if the page is in CMA?
>> is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?
>
> Should be.
>
>> - how to tell MM to move page away? I am calling migrate_pages() with
>> a get_new_page callback which allocates a page with GFP_USER but without
>> GFP_MOVABLE which should allocate new page out of CMA which seems ok but
>> there is a little concern that we might want to add MOVABLE back when
>> VFIO device is unplugged from the guest.
>
> Hmm, once the page is allocated, then the migratetype is not tracked
> anywhere (except in page_owner debug data). But the unmovable allocations
> might exhaust available unmovable pageblocks and lead to fragmentation. So
> "add MOVABLE back" would be too late. Instead we would need to tell the
> allocator somehow to give us movable page but outside of CMA.

If it is movable, why do we care if it is in CMA or not?

> CMA's own
> __alloc_contig_migrate_range() avoids this problem by allocating movable
> pages, but the range has been already page-isolated and thus the allocator
> won't see the pages there. You obviously can't take this approach and
> isolate all CMA pageblocks like that. That smells like a new __GFP_FLAG, meh.

I understood (more or less) all of it except the __GFP_FLAG - when/what
would use it?

>> - do I need to isolate pages by using isolate_migratepages_range,
>> reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
>> I dropped them for now and the patch uses only @migratepages from
>> the compact_control struct.
>
> You don't have to do reclaim_clean_pages_from_list(), but the isolation has
> to be careful, yeah.

The isolation here means the whole CMA zone isolation which I "obviously
can't take this approach"? :)

>> - are there any flags in madvise() to address this (could not
>> locate any relevant)?
>
> AFAIK there's no madvise(I_WILL_BE_PINNING_THIS_RANGE)
>
>> - what else is missing? disabled interrupts? locks?
>
> See what isolate_migratepages_block() does.

Thanks for the pointers! I'll have a closer look at Peter's patchset.

-- 
Alexey