From: Gioh Kim <gioh.kim@lge.com>
Date: Wed, 01 Apr 2015 21:56:35 +0900
To: Vlastimil Babka
CC: Andrew Morton, Mel Gorman, Rik van Riel, Johannes Weiner, David Rientjes,
    Vladimir Davydov, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFCv3] mm: page allocation for less fragmentation
Message-ID: <551BEB03.1050700@lge.com>
In-Reply-To: <551BE1B1.8060908@suse.cz>
References: <1427359540-14833-1-git-send-email-gioh.kim@lge.com> <551BE1B1.8060908@suse.cz>

On 2015-04-01 9:16 PM, Vlastimil Babka wrote:
> On 03/26/2015 09:45 AM, Gioh Kim wrote:
>> My platform is suffering from the external fragmentation problem.
>> If I run a heavy load test for a few days on a 1GB memory system, I cannot
>> allocate even order=3 pages because of the external fragmentation.
>>
>> I found that my driver is the main reason.
>> It repeatedly allocates 16MB of pages with alloc_page(GFP_KERNEL) and
>> in total consumes 300~400MB of pages of the 1GB system.
>>
>> I thought I needed an anti-fragmentation solution for my driver,
>> but there is no allocation function that takes fragmentation into account.
>> Compaction does not help because it works only on movable pages,
>> not unmovable pages.
>>
>> This patch proposes an allocation function that allocates only pages
>> within the same pageblock.
>>
>> I tested this patch as follows to check that I can get high-order pages
>> with the new allocator.
>>
>> 1. When the driver has allocated about 400MB, run
>>    "cat /proc/pagetypeinfo; cat /proc/buddyinfo":
>>
>> Free pages count per migrate type at order      0     1    2    3    4   5   6   7   8   9  10
>> Node 0, zone Normal, type    Unmovable       3864   728  394  216  129  47  18   9   1   0   0
>> Node 0, zone Normal, type  Reclaimable        902    96   68   17    3   0   1   0   0   0   0
>> Node 0, zone Normal, type      Movable       5146   663  178   91   43  16   4   0   0   0   0
>> Node 0, zone Normal, type      Reserve          1     4    6    6    2   1   1   1   0   1   1
>> Node 0, zone Normal, type          CMA          0     0    0    0    0   0   0   0   0   0   0
>> Node 0, zone Normal, type      Isolate          0     0    0    0    0   0   0   0   0   0   0
>>
>> Number of blocks type    Unmovable  Reclaimable  Movable  Reserve  CMA  Isolate
>> Node 0, zone Normal            135            3      124        2    0        0
>> Node 0, zone Normal    9880  1489  647  332  177  64  24  10  1  1  1
>>
>> 2. The driver allocates pages with alloc_pages_compact, copies the page
>>    contents, and frees the old pages. This is a kind of driver-side
>>    compaction.
>>    The following is the result of "cat /proc/pagetypeinfo; cat /proc/buddyinfo":
>>
>> Free pages count per migrate type at order      0     1    2    3    4   5   6   7   8   9  10
>> Node 0, zone Normal, type    Unmovable          8     5    1  432  272  91  37  11   1   0   0
>> Node 0, zone Normal, type  Reclaimable        901    96   68   17    3   0   1   0   0   0   0
>> Node 0, zone Normal, type      Movable       4790   776  192   91   43  16   4   0   0   0   0
>> Node 0, zone Normal, type      Reserve          1     4    6    6    2   1   1   1   0   1   1
>> Node 0, zone Normal, type          CMA          0     0    0    0    0   0   0   0   0   0   0
>> Node 0, zone Normal, type      Isolate          0     0    0    0    0   0   0   0   0   0   0
>>
>> Number of blocks type    Unmovable  Reclaimable  Movable  Reserve  CMA  Isolate
>> Node 0, zone Normal            135            3      124        2    0        0
>> Node 0, zone Normal    5693  877  266  544  320  108  43  12  1  1  1
>>
>> I found that the number of high-order pages increased.
>
> Again, this test is not a good argument as explained in my reply to v2.
>
>>
>> And I did another test. The following test counts mixed blocks
>> after page allocation.
>
> How is "mixed" defined and determined?

It's my mistake not to describe the details. I turned on the page owner
feature, and "mixed blocks" is the page-owner part of the /proc/pagetypeinfo
output shown below.

>
>> In a VirtualBox system with 4 CPUs and 768MB memory I ran a kernel build
>> and allocated pages with alloc_page and alloc_pages_compact.
>>
>> 1. Kernel build with "make -j8", then cat /proc/pagetypeinfo:
>>
>> Number of mixed blocks    Unmovable  Reclaimable  Movable  Reserve
>> Node 0, zone DMA                  0            0        3        1
>> Node 0, zone Normal               8           10       89        0
>>
>> 2. alloc_pages_compact(GFP_USER, 4096) x 10 times, then cat /proc/pagetypeinfo:
>>
>> Number of mixed blocks    Unmovable  Reclaimable  Movable  Reserve
>> Node 0, zone DMA                  0            0        3        1
>> Node 0, zone Normal               8           10       89        0
>>
>> I found there is no additional fragmentation.
>>
>> The following is the alloc_page test.
>>
>> 1. Kernel build with "make -j8", then cat /proc/pagetypeinfo:
>>
>> Number of mixed blocks    Unmovable  Reclaimable  Movable  Reserve
>> Node 0, zone DMA                  0            0        3        1
>> Node 0, zone Normal               8            7      100        1
>>
>> 2. alloc_page(GFP_USER) x 4096 times, repeated 10 times, then cat /proc/pagetypeinfo:
>>
>> Number of mixed blocks    Unmovable  Reclaimable  Movable  Reserve
>> Node 0, zone DMA                  0            0        3        1
>> Node 0, zone Normal              37            7      105        1
>>
>> It generates fragmentation.
>>
>> With the above two tests I get more high-order pages and fewer mixed blocks.
>
> Please include also data for "more high order pages".

I cannot reproduce the same situation every time because my platform runs
many sub-modules and applications, so the number of high-order pages differs
from run to run and the numbers are not comparable. Therefore I attached
only the number of mixed blocks.

>
>> The new allocator isn't meant to replace the common allocator alloc_pages.
>> It can be applied to certain drivers that allocate many pages and don't
>> need fast allocation.
>
> As Mel said, this seems rather specialized, the benefits seem to be limited
> to a corner case, and similar to CMA, which could have some relaxed mode of
> operation where it doesn't guarantee to be completely contiguous, but with
> some best-effort approach it would give you probably more compact ranges of
> pages than this patch?

For instance, I can apply the new allocator to a GPU driver. The GPU has its
own MMU and shares system memory with the CPU, so the GPU driver allocates
pages one by one via alloc_page because it can map non-contiguous pages into
its own address space. GPU pages are unmovable, and if they are scattered
they cause critical fragmentation. With the new allocator I can migrate GPU
pages faster than with CMA, and it doesn't need contiguous pages.
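To make that more concrete, below is a rough sketch of the driver-side
compaction cycle I have in mind, assuming the alloc_pages_compact() interface
added by this patch. struct gpu_pool and gpu_remap_page() are hypothetical
placeholders for the driver's own page pool and GPU MMU update code, not real
APIs:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/list.h>
#include <linux/mm.h>

/* The patch adds no header declaration yet, so declare the symbol here. */
extern int alloc_pages_compact(gfp_t gfp_mask, int nr_request,
                               struct list_head *freepages);

/* Hypothetical stand-ins for the driver's page pool and GPU MMU code. */
struct gpu_pool {
        struct page **pages;    /* currently mapped, possibly scattered, pages */
        int nr_pages;
};

static void gpu_remap_page(struct gpu_pool *pool, int idx, struct page *page)
{
        /* ...update the GPU page table entry for slot idx... */
}

/* Replace the pool's scattered order-0 pages with pages packed into
 * as few pageblocks as possible, migrating the contents. */
static int gpu_compact_pool(struct gpu_pool *pool)
{
        LIST_HEAD(new_pages);
        struct page *new, *tmp;
        int i = 0, got;

        got = alloc_pages_compact(GFP_KERNEL, pool->nr_pages, &new_pages);
        if (got < pool->nr_pages)
                goto rollback;

        list_for_each_entry_safe(new, tmp, &new_pages, lru) {
                struct page *old = pool->pages[i];

                list_del(&new->lru);
                copy_highpage(new, old);        /* migrate the page contents  */
                gpu_remap_page(pool, i, new);   /* hypothetical GPU MMU update */
                __free_page(old);               /* release the scattered page */
                pool->pages[i++] = new;
        }
        return 0;

rollback:
        /* Not enough compact pages were available; give everything back. */
        list_for_each_entry_safe(new, tmp, &new_pages, lru) {
                list_del(&new->lru);
                __free_page(new);
        }
        return -ENOMEM;
}

Because the GPU maps each page through its own MMU, the new pages do not have
to be physically contiguous; they only have to come from as few pageblocks as
possible.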
>
>> When the system has serious fragmentation you can free pages and allocate
>> them again via alloc_page to reduce fragmentation, but the effect is
>> short-lived and fragmentation soon increases again. The new allocator can
>> work like compaction, so it keeps fragmentation low for a long time.
>>
>>
>> This patch is based on 3.16.
>> allocflags_to_migratetype should be changed to gfpflags_to_migratetype for
>> v4.0.
>>
>>
>> Changelog since v1:
>> - change the argument from page order to page count
>>
>> Changelog since v2:
>> - bug fix
>> - do not allocate pages in a pageblock of a different migratetype
>> - add new test results for mixed block counts
>>
>> Signed-off-by: Gioh Kim
>> CC: Andrew Morton
>> CC: Mel Gorman
>> CC: Rik van Riel
>> CC: Johannes Weiner
>> CC: David Rientjes
>> CC: Vladimir Davydov
>> CC: linux-mm@kvack.org
>> CC: linux-kernel@vger.kernel.org
>> ---
>>  mm/page_alloc.c | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 160 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 86c9a72..826618b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6646,3 +6646,163 @@ void dump_page(struct page *page, const char *reason)
>>  	dump_page_badflags(page, reason, 0);
>>  }
>>  EXPORT_SYMBOL(dump_page);
>> +
>> +static unsigned long alloc_freepages_block(unsigned long start_pfn,
>> +					   unsigned long end_pfn,
>> +					   int count,
>> +					   struct list_head *freelist)
>> +{
>> +	int total_alloc = 0;
>> +	struct page *cursor, *valid_page = NULL;
>> +
>> +	cursor = pfn_to_page(start_pfn);
>> +
>> +	/* Isolate free pages. */
>> +	for (; start_pfn < end_pfn; start_pfn++, cursor++) {
>> +		int alloc, i;
>> +		struct page *page = cursor;
>> +
>> +		if (!pfn_valid_within(start_pfn))
>> +			continue;
>> +
>> +		if (!valid_page)
>> +			valid_page = page;
>> +
>> +		if (!PageBuddy(page))
>> +			continue;
>> +
>> +		/* allocate only low-order pages */
>> +		if (page_order(page) >= 3) {
>> +			start_pfn += (1 << page_order(page)) - 1;
>> +			cursor += (1 << page_order(page)) - 1;
>> +			continue;
>> +		}
>> +
>> +		/* Found free pages, break them into order-0 pages */
>> +		alloc = split_free_page(page);
>> +
>> +		total_alloc += alloc;
>> +		for (i = 0; i < alloc; i++) {
>> +			list_add(&page->lru, freelist);
>> +			page++;
>> +		}
>> +
>> +		if (total_alloc >= count)
>> +			break;
>> +
>> +		if (alloc) {
>> +			start_pfn += alloc - 1;
>> +			cursor += alloc - 1;
>> +			continue;
>> +		}
>> +	}
>> +
>> +	return total_alloc;
>> +}
>> +
>> +static int rmqueue_compact(struct zone *zone, int nr_request,
>> +			   int migratetype, struct list_head *freepages)
>> +{
>> +	unsigned int current_order;
>> +	struct free_area *area;
>> +	struct page *page;
>> +	unsigned long block_start_pfn;	/* start of current pageblock */
>> +	unsigned long block_end_pfn;	/* end of current pageblock */
>> +	int total_alloc = 0;
>> +	unsigned long flags;
>> +	struct page *next;
>> +	int to_free = 0;
>> +	int nr_remain = nr_request;
>> +	int loop_count = 0;
>> +
>> +	spin_lock_irqsave(&zone->lock, flags);
>> +
>> +	/* Find a page of the appropriate size in the preferred list */
>> +	current_order = 0;
>> +	page = NULL;
>> +	while (current_order < 3) {
>> +		int alloc;
>> +
>> +		area = &(zone->free_area[current_order]);
>> +
>> +		if (list_empty(&area->free_list[migratetype]))
>> +			goto next_order;
>> +
>> +		page = list_entry(area->free_list[migratetype].next,
>> +				  struct page, lru);
>> +
>> +		/*
>> +		 * check migratetype of pageblock,
>> +		 * some pages can be set as different migratetype
>> +		 * by rmqueue_fallback
>> +		 */
>> +		if (get_pageblock_migratetype(page) != migratetype) {
>> +			if (list_is_last(&page->lru,
>> +					 &area->free_list[migratetype]))
>> +				goto next_order;
>> +			page = list_next_entry(page, lru);
>> +		}
>> +
>> +		block_start_pfn = page_to_pfn(page) & ~(pageblock_nr_pages - 1);
>> +		block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
>> +				    zone_end_pfn(zone));
>> +
>> +		alloc = alloc_freepages_block(block_start_pfn,
>> +					      block_end_pfn,
>> +					      nr_remain,
>> +					      freepages);
>> +		WARN(alloc == 0, "alloc can be ZERO????");
>> +
>> +		total_alloc += alloc;
>> +		nr_remain -= alloc;
>> +
>> +		if (nr_remain <= 0)
>> +			break;
>> +
>> +		continue;
>> +next_order:
>> +		current_order++;
>> +		loop_count = 0;
>> +	}
>> +	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -total_alloc);
>> +	__count_zone_vm_events(PGALLOC, zone, total_alloc);
>> +
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>> +
>> +	list_for_each_entry_safe(page, next, freepages, lru) {
>> +		if (to_free >= nr_request) {
>> +			list_del(&page->lru);
>> +			atomic_dec(&page->_count);
>> +			__free_pages_ok(page, 0);
>> +		}
>> +		to_free++;
>> +	}
>> +
>> +	list_for_each_entry(page, freepages, lru) {
>> +		arch_alloc_page(page, 0);
>> +		kernel_map_pages(page, 1, 1);
>> +	}
>> +	return total_alloc < nr_request ? total_alloc : nr_request;
>> +}
>> +
>> +int alloc_pages_compact(gfp_t gfp_mask, int nr_request,
>> +			struct list_head *freepages)
>> +{
>> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>> +	struct zone *preferred_zone;
>> +	struct zoneref *preferred_zoneref;
>> +
>> +	preferred_zoneref = first_zones_zonelist(node_zonelist(numa_node_id(),
>> +								gfp_mask),
>> +						 high_zoneidx,
>> +						 &cpuset_current_mems_allowed,
>> +						 &preferred_zone);
>> +	if (!preferred_zone)
>> +		return 0;
>> +
>> +	return rmqueue_compact(preferred_zone, nr_request,
>> +			       allocflags_to_migratetype(gfp_mask), freepages);
>> +}
>> +EXPORT_SYMBOL(alloc_pages_compact);
>>
>
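For completeness, here is a minimal usage sketch of the calling convention as
I read it from the patch above. The caller name, the request size of 256, and
the pr_info() message are only illustrative:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/mm.h>

/* The patch adds no header declaration, so declare the new symbol here. */
extern int alloc_pages_compact(gfp_t gfp_mask, int nr_request,
                               struct list_head *freepages);

/* Sketch: request 256 order-0 pages packed into whole pageblocks,
 * then give every page back individually. */
static int alloc_pages_compact_demo(void)
{
        LIST_HEAD(pages);
        struct page *page, *tmp;
        int got;

        got = alloc_pages_compact(GFP_KERNEL, 256, &pages);
        pr_info("alloc_pages_compact: got %d of 256 pages\n", got);

        /* The returned pages are order-0 and linked through page->lru. */
        list_for_each_entry_safe(page, tmp, &pages, lru) {
                list_del(&page->lru);
                __free_page(page);
        }

        return got == 256 ? 0 : -ENOMEM;
}

The caller gets back at most nr_request order-0 pages on the freepages list
and is responsible for freeing each of them.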