From: Laurent Pinchart
To: Robin Murphy
Cc: Doug Anderson, Russell King, Laurent Pinchart,
 "linux-kernel@vger.kernel.org", Pawel Osciak, mike.looijmans@topic.nl,
 Lorenzo Nava, Dmitry Torokhov, Will Deacon, Tomasz Figa, David Rientjes,
 Carlo Caione, Andrew Morton, "linux-arm-kernel@lists.infradead.org",
 Marek Szyprowski
Subject: Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
Date: Mon, 21 Dec 2015 03:26:27 +0200
Message-ID: <1643621.CLgIjY2JrC@avalon>
In-Reply-To: <56746AA8.6010202@arm.com>
References: <1450384253-1067-1-git-send-email-dianders@chromium.org>
 <56746AA8.6010202@arm.com>

Hi Robin,

On Friday 18 December 2015 20:20:56 Robin Murphy wrote:
> On 18/12/15 18:55, Doug Anderson wrote:
> > On Fri, Dec 18, 2015 at 4:41 AM, Robin Murphy wrote:
> >> On 17/12/15 22:31, Doug Anderson wrote:
> >>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson wrote:
> >>>> __iommu_alloc_buffer() is expected to be called to allocate pretty
> >>>> sizeable buffers. In simple video tests I saw it trying to allocate
> >>>> 4,194,304 bytes. The function tries to be efficient about this by
> >>>> starting out allocating large chunks and then moving to smaller and
> >>>> smaller chunk sizes until it succeeds.
> >>>>
> >>>> The current function is very, very slow.
> >>>>
> >>>> One problem is the way it keeps trying and trying to allocate big
> >>>> chunks. Imagine a very fragmented memory that has 4M free but no
> >>>> contiguous pages at all. Further imagine allocating 4M (1024 pages).
> >>>> We'll do the following memory allocations:
> >>>>
> >>>> - For page 1:
> >>>>   - Try to allocate order 10 (no retry)
> >>>>   - Try to allocate order 9 (no retry)
> >>>>   - ...
> >>>>   - Try to allocate order 0 (with retry, but not needed)
> >>>> - For page 2:
> >>>>   - Try to allocate order 9 (no retry)
> >>>>   - Try to allocate order 8 (no retry)
> >>>>   - ...
> >>>>   - Try to allocate order 0 (with retry, but not needed)
> >>>> - ...
> >>>> - ...
> >>>>
> >>>> The total number of alloc() calls for this case is:
> >>>>
> >>>>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025)) => 9228
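As a quick cross-check of that figure, the Python one-liner translates to the
following userspace C (illustrative only, not code from the patch; ilog2u()
is a local stand-in for the kernel's __fls()):

#include <stdio.h>

/* floor(log2(x)) for x >= 1, standing in for the kernel's __fls() */
static unsigned int ilog2u(unsigned int x)
{
        unsigned int l = 0;

        while (x >>= 1)
                l++;
        return l;
}

int main(void)
{
        unsigned int calls = 0;
        unsigned int count;

        /*
         * With count pages left to allocate, orders __fls(count)..0 are
         * tried, i.e. ilog2(count) + 1 attempts per single-page step.
         */
        for (count = 1; count <= 1024; count++)
                calls += ilog2u(count) + 1;

        printf("%u\n", calls);  /* prints 9228 */
        return 0;
}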
> >>>> The above is obviously the worst case, but given how slow alloc can
> >>>> be we really want to try to avoid even somewhat bad cases. I timed the
> >>>> old code with a device under memory pressure and it wasn't hard to see
> >>>> it take more than 24 seconds to allocate 4 megs of memory (!!).
> >>>>
> >>>> A second problem (and maybe even more important) is that allocating
> >>>> big chunks when we don't need them is just not a good idea anyway. The
> >>>> first thing we do with these big chunks is break them into smaller
> >>>> chunks! If we allocate small chunks:
> >>>> - The memory manager doesn't need to work so hard to give us big
> >>>>   chunks.
> >>>> - We can save the big chunks for those that really need them and this
> >>>>   code can make great use of all the small chunks sitting around.
> >>>>
> >>>> Let's simplify by just allocating one page at a time. We may make more
> >>>> total allocation calls, but it works way better. In real-world tests
> >>>> that used to sometimes see a 24-second allocation call I now see at
> >>>> most 250 ms.
> >
> > One thing to note is that while testing yesterday I actually managed to
> > reproduce an allocation taking 120 seconds (!) with the old code.
>
> Yikes! That really is worth avoiding...
>
> >>> Off-list I talked to Dmitry about this a little bit and he pointed out
> >>> that contiguous chunks actually give a benefit to the IOMMU. I don't
> >>> think the benefit outweighs the cost in this case, but I'm happy to
> >>> hear what others have to say. I did some quick printouts and it turns
> >>> out that even when requesting a page at a time the memory manager
> >>> (unsurprisingly) can in many cases still give us pages that are
> >>> contiguous.
> >>>
> >>> Also I'm happy to post up which sorts the array and could possibly
> >>> give us larger chunks of contiguous memory.
> >>
> >> I think sorting individually-allocated pages really isn't worth the
> >> effort - I'm not aware of anything that's going to be capable of using
> >> larger page/section mappings without also having the necessary physical
> >> alignment, and if you _can_ cobble together, say, 2MB worth of
> >> contiguous pages *at 2MB alignment*, then you would have been far
> >> better off just asking the slab allocator for that in the first place.
> >>
> >> That's the key point of the higher-order allocation - not that you get
> >> some contiguous pages, but that the region you get is also naturally
> >> aligned to its size physically. That we break up the CPU page tables
> >> for that region into individual pages is just an inconsequential
> >> implementation detail from the IOMMU side. When you _do_ have plenty of
> >> unfragmented free memory it can really be a big win - here's an
> >> instrumented example of what happens on my Juno with the ARM HDLCD/SMMU
> >> combo setting up a framebuffer at boot time:
> >>
> >> iommu_dma_alloc: alloc size 0x753000, 1875 pages
> >> __iommu_dma_alloc_pages: allocated at order 10
> >> __iommu_dma_alloc_pages: allocated at order 9
> >> __iommu_dma_alloc_pages: allocated at order 8
> >> __iommu_dma_alloc_pages: allocated at order 6
> >> __iommu_dma_alloc_pages: allocated at order 4
> >> __iommu_dma_alloc_pages: allocated at order 1
> >> __iommu_dma_alloc_pages: allocated at order 0
> >> iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
> >> iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
> >> iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
> >> iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
> >> iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
> >> iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
> >> iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
> >> iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
> >> iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
> >> iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
> >> ...
> >>
> >> Since the IOVA region itself is aligned to 8MB (for the total size) and
> >> the physical regions come out in optimal decreasing order, we're able
> >> to map over 80% of the whole buffer with just 3 section entries, with a
> >> corresponding saving on TLB pressure, page table maintenance (cache
> >> flushing), etc.
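To make the alignment point concrete, the page size selection in an IOMMU map
loop boils down to something like the sketch below (illustrative only,
assuming a page-size set of just 4K and 2M; pick_pgsize() is a made-up name,
a simplified take on what the core iommu_map() path does):

#include <stddef.h>

#define SZ_4K 0x00001000UL
#define SZ_2M 0x00200000UL

/*
 * A 2M section entry is only possible when both the IOVA and the
 * physical address are 2M-aligned and at least 2M remains, which is
 * exactly what a naturally size-aligned order-9 or bigger chunk gives.
 */
static size_t pick_pgsize(unsigned long iova, unsigned long paddr, size_t len)
{
        if (!((iova | paddr) & (SZ_2M - 1)) && len >= SZ_2M)
                return SZ_2M;
        return SZ_4K;
}

That matches the log above: the 4MB and 2MB chunks go in as 0x200000 section
entries, while everything below 2MB (from the order-8 chunk onwards) falls
back to 0x1000 mappings.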
> >> That said, unless you're in the middle of some crazy
> >> allocator-thrashing race, then it's probably safe to assume that once
> >> allocation fails at a given order that's going to remain the case in
> >> the near future - would you mind taking the following diff for a spin
> >> under your test conditions to see how it compares?
> >>
> >> Robin.
> >>
> >> ----->8-----
> >> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> >> index dfb5001..95e75c4 100644
> >> --- a/arch/arm/mm/dma-mapping.c
> >> +++ b/arch/arm/mm/dma-mapping.c
> >> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
> >>          int count = size >> PAGE_SHIFT;
> >>          int array_size = count * sizeof(struct page *);
> >>          int i = 0;
> >> +        unsigned int order = MAX_ORDER;
> >>
> >>          if (array_size <= PAGE_SIZE)
> >>                  pages = kzalloc(array_size, GFP_KERNEL);
> >> @@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
> >>          gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
> >>
> >>          while (count) {
> >> -                int j, order;
> >> +                int j;
> >>
> >> -                for (order = __fls(count); order > 0; --order) {
> >> +                for (order = min_t(unsigned int, order, __fls(count));
> >> +                     order > 0; --order) {
> >
> > Yeah, I'd been playing with things like that, though not that exact
> > patch.
> >
> > I just tried it now. As should be obvious, it certainly makes a DRASTIC
> > improvement in things, but it still has some downsides as compared to
> > my patch.
> >
> > 1. It's still pretty easy for arm_iommu_alloc_attrs() to take many
> > seconds. I can no longer reproduce the 24-second or 120-second
> > allocation call, but I still see things like "alloc 4194304 bytes:
> > 3208093877 ns" (AKA an allocation taking > 3 seconds). That's compared
> > with 250 ms max with my patch.
>
> Hmm, I'm no mm expert, but from a look at the flags in gfp.h perhaps
> instead of just __GFP_NORETRY we should go all the way to clearing
> __GFP_RECLAIM for the opportunistic calls so they really fail fast?
>
> > 2. We still have the same problem that we're taking away all the
> > contiguous memory that other users may want. I've got a dwc2 USB
> > controller in my system and it needs to allocate bounce buffers for its
> > DMA. While looking at cat videos on Facebook and running a program to
> > simulate memory pressure (4 userspace programs each walking through 350
> > megs of memory over and over) I start seeing lots of order-3 allocation
> > failures in dwc2. It's true that the USB/network stack is resilient
> > against these allocation failures (other than spamming my log), but
> > performance will decrease. When I switch to WiFi I suddenly start
> > seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail, drop pkt
> > port=28 len=33024". Again, it's robust, but you're affecting
> > performance.
> >
> > I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that we
> > don't try for > 64K chunks. This might be a reasonable compromise. My
> > cat video test still reproduces "alloc 4194304 bytes: 674318751 ns",
> > but maybe ~700 ms is OK? Of course, this still eats all the large
> > chunks of memory that everyone else would like to have.
> >
> > Oh, or how about this: we start allocating at order 4. Upon the first
> > failure we jump to order 1. AKA: if there's no memory pressure we're
> > golden. The moment we have the first bit of memory pressure we fold.
> > That's basically just a slight optimization on Marek's suggestion. I
> > still see 450 ms for an allocation, but that's not too bad. It can
> > still take away large chunks from other users, but maybe that's OK?
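Spelling out that last idea, the fold-on-first-failure scheme could look
something like the sketch below (my own illustration combining it with
Robin's fail-fast flag suggestion; alloc_buffer_chunk() and the exact
descent are assumptions, not a posted patch):

#include <linux/gfp.h>

/*
 * Sketch only: opportunistically try order 4 (64K) without reclaim or
 * warnings, and fold to smaller orders on the first failure. Only
 * order 0 keeps the normal (retrying, reclaiming) gfp flags.
 */
static struct page *alloc_buffer_chunk(gfp_t gfp, unsigned int *order)
{
        for (;;) {
                gfp_t chunk_gfp = gfp;
                struct page *page;

                if (*order > 0)
                        chunk_gfp = (gfp | __GFP_NORETRY | __GFP_NOWARN) &
                                    ~__GFP_RECLAIM;

                page = alloc_pages(chunk_gfp, *order);
                if (page || *order == 0)
                        return page;

                /* First sign of pressure: fold 4 -> 1 -> 0. */
                *order = *order > 1 ? 1 : 0;
        }
}

The point being that large orders are only ever tried with cheap,
non-reclaiming attempts, so the first failure under pressure immediately
folds the allocator back to small pages.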
> That makes sense - there's really no benefit to be had from trying orders
> which don't correspond to our relevant IOMMU page sizes - I'm not sure
> off-hand how many contortions you'd have to go through to actually get at
> those from here, although it might be another argument in favour of moving
> the pgsize_bitmap into the iommu_domain as Will proposed some time ago.

What's the status of that? Do we just need a volunteer to do the job?

> In lieu of an actual lookup, my general inclination would be to go
> 2MB->1MB->64K->4K to cover all the common page sizes, but Marek's probably
> right that the larger two are less relevant in the context of mobile
> graphics stuff, which let's face it is the prime concern for IOMMUs on
> 32-bit ARM.
>
> > Anyway, I'll plan to send that patch up. I'll also do a quick test to
> > see if my "sort()" actually helps anything.
>
> Sounds good. I'm about to disappear off for holidays, but it'll be good
> to see how much you've improved everything when I get back :D

-- 
Regards,

Laurent Pinchart