Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1762177Ab3IDHvD (ORCPT ); Wed, 4 Sep 2013 03:51:03 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:42666 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1762057Ab3IDHvA (ORCPT ); Wed, 4 Sep 2013 03:51:00 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v2.0.1 X-SHieldMailCheckerPolicyVersion: FJ-ISEC-20120718-3 Message-ID: <5226E624.9090908@jp.fujitsu.com> Date: Wed, 4 Sep 2013 16:49:56 +0900 From: Yasuaki Ishimatsu User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130801 Thunderbird/17.0.8 MIME-Version: 1.0 To: "Srivatsa S. Bhat" CC: , , , , , , , , , , , , , , , , , , , , , Subject: Re: [RFC PATCH v3 08/35] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists References: <20130830131221.4947.99764.stgit@srivatsabhat.in.ibm.com> <20130830131619.4947.56734.stgit@srivatsabhat.in.ibm.com> In-Reply-To: <20130830131619.4947.56734.stgit@srivatsabhat.in.ibm.com> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit X-SecurityPolicyCheck-GC: OK by FENCE-Mail Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12461 Lines: 359 (2013/08/30 22:16), Srivatsa S. Bhat wrote: > The zones' freelists need to be made region-aware, in order to influence > page allocation and freeing algorithms. So in every free list in the zone, we > would like to demarcate the pageblocks belonging to different memory regions > (we can do this using a set of pointers, and thus avoid splitting up the > freelists). > > Also, we would like to keep the pageblocks in the freelists sorted in > region-order. That is, pageblocks belonging to region-0 would come first, > followed by pageblocks belonging to region-1 and so on, within a given > freelist. Of course, a set of pageblocks belonging to the same region need > not be sorted; it is sufficient if we maintain the pageblocks in > region-sorted-order, rather than a full address-sorted-order. > > For each freelist within the zone, we maintain a set of pointers to > pageblocks belonging to the various memory regions in that zone. > > Eg: > > |<---Region0--->| |<---Region1--->| |<-------Region2--------->| > ____ ____ ____ ____ ____ ____ ____ > --> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> > > ^ ^ ^ > | | | > Reg0 Reg1 Reg2 > > > Page allocation will proceed as usual - pick the first item on the free list. > But we don't want to keep updating these region pointers every time we allocate > a pageblock from the freelist. So, instead of pointing to the *first* pageblock > of that region, we maintain the region pointers such that they point to the > *last* pageblock in that region, as shown in the figure above. That way, as > long as there are > 1 pageblocks in that region in that freelist, that region > pointer doesn't need to be updated. > > > Page allocation algorithm: > ------------------------- > > The heart of the page allocation algorithm remains as it is - pick the first > item on the appropriate freelist and return it. > > > Arrangement of pageblocks in the zone freelists: > ----------------------------------------------- > > This is the main change - we keep the pageblocks in region-sorted order, > where pageblocks belonging to region-0 come first, followed by those belonging > to region-1 and so on. But the pageblocks within a given region need *not* be > sorted, since we need them to be only region-sorted and not fully > address-sorted. > > This sorting is performed when adding pages back to the freelists, thus > avoiding any region-related overhead in the critical page allocation > paths. > > Strategy to consolidate allocations to a minimum no. of regions: > --------------------------------------------------------------- > > Page allocation happens in the order of increasing region number. We would > like to do light-weight page reclaim or compaction (for the purpose of memory > power management) in the reverse order, to keep the allocated pages within > a minimum number of regions (approximately). The latter part is implemented > in subsequent patches. > > ---------------------------- Increasing region number----------------------> > > Direction of allocation---> <---Direction of reclaim/compaction > > Signed-off-by: Srivatsa S. Bhat > --- > > mm/page_alloc.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 138 insertions(+), 16 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index fd6436d0..398b62c 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -514,6 +514,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, > return 0; > } > > +static void add_to_freelist(struct page *page, struct free_list *free_list) > +{ > + struct list_head *prev_region_list, *lru; > + struct mem_region_list *region; > + int region_id, i; > + > + lru = &page->lru; > + region_id = page_zone_region_id(page); > + > + region = &free_list->mr_list[region_id]; > + region->nr_free++; > + > + if (region->page_block) { > + list_add_tail(lru, region->page_block); > + return; > + } > + > +#ifdef CONFIG_DEBUG_PAGEALLOC > + WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__); > +#endif > + > + if (!list_empty(&free_list->list)) { > + for (i = region_id - 1; i >= 0; i--) { > + if (free_list->mr_list[i].page_block) { > + prev_region_list = > + free_list->mr_list[i].page_block; > + goto out; > + } > + } > + } > + > + /* This is the first region, so add to the head of the list */ > + prev_region_list = &free_list->list; > + > +out: > + list_add(lru, prev_region_list); > + > + /* Save pointer to page block of this region */ > + region->page_block = lru; > +} > + > +static void del_from_freelist(struct page *page, struct free_list *free_list) > +{ > + struct list_head *prev_page_lru, *lru, *p; nitpick *p is used only when enabling CONFIG_DEBUG_PAGEALLOC option. When disabling the config option and compiling kernel, the messages are shown. CC mm/page_alloc.o mm/page_alloc.c: In function ‘del_from_freelist’: mm/page_alloc.c:560: 警告: unused variable ‘p’ Thanks, Yasuaki Ishimatsu > + struct mem_region_list *region; > + int region_id; > + > + lru = &page->lru; > + region_id = page_zone_region_id(page); > + region = &free_list->mr_list[region_id]; > + region->nr_free--; > + > +#ifdef CONFIG_DEBUG_PAGEALLOC > + WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__); > + > + /* Verify whether this page indeed belongs to this free list! */ > + > + list_for_each(p, &free_list->list) { > + if (p == lru) > + goto page_found; > + } > + > + WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__); > + > +page_found: > +#endif > + > + /* > + * If we are not deleting the last pageblock in this region (i.e., > + * farthest from list head, but not necessarily the last numerically), > + * then we need not update the region->page_block pointer. > + */ > + if (lru != region->page_block) { > + list_del(lru); > +#ifdef CONFIG_DEBUG_PAGEALLOC > + WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__); > +#endif > + return; > + } > + > + prev_page_lru = lru->prev; > + list_del(lru); > + > + if (region->nr_free == 0) { > + region->page_block = NULL; > + } else { > + region->page_block = prev_page_lru; > +#ifdef CONFIG_DEBUG_PAGEALLOC > + WARN(prev_page_lru == &free_list->list, > + "%s: region->page_block points to list head\n", > + __func__); > +#endif > + } > +} > + > +/** > + * Move a given page from one freelist to another. > + */ > +static void move_page_freelist(struct page *page, struct free_list *old_list, > + struct free_list *new_list) > +{ > + del_from_freelist(page, old_list); > + add_to_freelist(page, new_list); > +} > + > /* > * Freeing function for a buddy system allocator. > * > @@ -546,6 +651,7 @@ static inline void __free_one_page(struct page *page, > unsigned long combined_idx; > unsigned long uninitialized_var(buddy_idx); > struct page *buddy; > + struct free_area *area; > > VM_BUG_ON(!zone_is_initialized(zone)); > > @@ -575,8 +681,9 @@ static inline void __free_one_page(struct page *page, > __mod_zone_freepage_state(zone, 1 << order, > migratetype); > } else { > - list_del(&buddy->lru); > - zone->free_area[order].nr_free--; > + area = &zone->free_area[order]; > + del_from_freelist(buddy, &area->free_list[migratetype]); > + area->nr_free--; > rmv_page_order(buddy); > } > combined_idx = buddy_idx & page_idx; > @@ -585,6 +692,7 @@ static inline void __free_one_page(struct page *page, > order++; > } > set_page_order(page, order); > + area = &zone->free_area[order]; > > /* > * If this is not the largest possible page, check if the buddy > @@ -601,16 +709,22 @@ static inline void __free_one_page(struct page *page, > buddy_idx = __find_buddy_index(combined_idx, order + 1); > higher_buddy = higher_page + (buddy_idx - combined_idx); > if (page_is_buddy(higher_page, higher_buddy, order + 1)) { > - list_add_tail(&page->lru, > - &zone->free_area[order].free_list[migratetype].list); > + > + /* > + * Implementing an add_to_freelist_tail() won't be > + * very useful because both of them (almost) add to > + * the tail within the region. So we could potentially > + * switch off this entire "is next-higher buddy free?" > + * logic when memory regions are used. > + */ > + add_to_freelist(page, &area->free_list[migratetype]); > goto out; > } > } > > - list_add(&page->lru, > - &zone->free_area[order].free_list[migratetype].list); > + add_to_freelist(page, &area->free_list[migratetype]); > out: > - zone->free_area[order].nr_free++; > + area->nr_free++; > } > > static inline int free_pages_check(struct page *page) > @@ -830,7 +944,7 @@ static inline void expand(struct zone *zone, struct page *page, > continue; > } > #endif > - list_add(&page[size].lru, &area->free_list[migratetype].list); > + add_to_freelist(&page[size], &area->free_list[migratetype]); > area->nr_free++; > set_page_order(&page[size], high); > } > @@ -897,7 +1011,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, > > page = list_entry(area->free_list[migratetype].list.next, > struct page, lru); > - list_del(&page->lru); > + del_from_freelist(page, &area->free_list[migratetype]); > rmv_page_order(page); > area->nr_free--; > expand(zone, page, order, current_order, area, migratetype); > @@ -938,7 +1052,8 @@ int move_freepages(struct zone *zone, > { > struct page *page; > unsigned long order; > - int pages_moved = 0; > + struct free_area *area; > + int pages_moved = 0, old_mt; > > #ifndef CONFIG_HOLES_IN_ZONE > /* > @@ -966,8 +1081,10 @@ int move_freepages(struct zone *zone, > } > > order = page_order(page); > - list_move(&page->lru, > - &zone->free_area[order].free_list[migratetype].list); > + old_mt = get_freepage_migratetype(page); > + area = &zone->free_area[order]; > + move_page_freelist(page, &area->free_list[old_mt], > + &area->free_list[migratetype]); > set_freepage_migratetype(page, migratetype); > page += 1 << order; > pages_moved += 1 << order; > @@ -1061,7 +1178,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > struct free_area * area; > int current_order; > struct page *page; > - int migratetype, new_type, i; > + int migratetype, new_type, i, mt; > > /* Find the largest possible block of pages in the other list */ > for (current_order = MAX_ORDER-1; current_order >= order; > @@ -1086,7 +1203,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > migratetype); > > /* Remove the page from the freelists */ > - list_del(&page->lru); > + mt = get_freepage_migratetype(page); > + del_from_freelist(page, &area->free_list[mt]); > rmv_page_order(page); > > /* > @@ -1446,7 +1564,8 @@ static int __isolate_free_page(struct page *page, unsigned int order) > } > > /* Remove page from free list */ > - list_del(&page->lru); > + mt = get_freepage_migratetype(page); > + del_from_freelist(page, &zone->free_area[order].free_list[mt]); > zone->free_area[order].nr_free--; > rmv_page_order(page); > > @@ -6353,6 +6472,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) > int order, i; > unsigned long pfn; > unsigned long flags; > + int mt; > + > /* find the first valid pfn */ > for (pfn = start_pfn; pfn < end_pfn; pfn++) > if (pfn_valid(pfn)) > @@ -6385,7 +6506,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) > printk(KERN_INFO "remove from free list %lx %d %lx\n", > pfn, 1 << order, end_pfn); > #endif > - list_del(&page->lru); > + mt = get_freepage_migratetype(page); > + del_from_freelist(page, &zone->free_area[order].free_list[mt]); > rmv_page_order(page); > zone->free_area[order].nr_free--; > #ifdef CONFIG_HIGHMEM > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/