Message-ID: <53C3A7A5.9060005@suse.cz>
Date: Mon, 14 Jul 2014 11:49:25 +0200
From: Vlastimil Babka
To: Joonsoo Kim
CC: Andrew Morton, "Kirill A. Shutemov", Rik van Riel, Peter Zijlstra, Mel Gorman, Johannes Weiner, Minchan Kim, Yasuaki Ishimatsu, Zhang Yanfei, "Srivatsa S. Bhat", Tang Chen, Naoya Horiguchi, Bartlomiej Zolnierkiewicz, Wen Congyang, Marek Szyprowski, Michal Nazarewicz, Laura Abbott, Heesub Shin, "Aneesh Kumar K.V", Ritesh Harjani, t.stanislaws@samsung.com, Gioh Kim, linux-mm@kvack.org, Lisa Du, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 00/10] fix freepage count problems due to memory isolation
References: <1404460675-24456-1-git-send-email-iamjoonsoo.kim@lge.com> <53B6C947.1070603@suse.cz> <20140707044932.GA29236@js1304-P5Q-DELUXE> <53BAAFA5.9070403@suse.cz> <20140714062222.GA11317@js1304-P5Q-DELUXE>
In-Reply-To: <20140714062222.GA11317@js1304-P5Q-DELUXE>

On 07/14/2014 08:22 AM, Joonsoo Kim wrote:
> On Mon, Jul 07, 2014 at 04:33:09PM +0200, Vlastimil Babka wrote:
>> On 07/07/2014 06:49 AM, Joonsoo Kim wrote:
>>> CCing Lisa, because there was a bug report last Saturday that may be related to this topic.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>
>>> On Fri, Jul 04, 2014 at 05:33:27PM +0200, Vlastimil Babka wrote:
>>>> On 07/04/2014 09:57 AM, Joonsoo Kim wrote:
>>>>> Hello,
>>>>
>>>> Hi Joonsoo,
>>>>
>>>> please CC me on further updates, this is relevant to me.
>>>
>>> Hello, Vlastimil.
>>>
>>> Sorry for missing you. :)
>>>
>>>>> This patchset aims at fixing problems due to memory isolation found by testing my patchset [1].
>>>>>
>>>>> These are really subtle problems, so I may be wrong. If you find anything I am missing, please let me know.
>>>>>
>>>>> Before describing the bugs themselves, I first explain the definition of a freepage.
>>>>>
>>>>> 1. pages on a buddy list are counted as freepages.
>>>>> 2. pages on the isolate migratetype buddy list are *not* counted as freepages.
>>>>
>>>> I think the second point is causing us a lot of trouble. And I wonder if it's really justified! We already have some is_migrate_isolate() checks in the fast path, and now you would add more, mostly just to keep this accounting correct.
>>>
>>> It's not just for keeping the accounting correct. There is another purpose for is_migrate_isolate(). It forces freed pages to go onto the isolate buddy list. Without it, freed pages would go onto other buddy lists and would be used soon. So memory isolation can't work well without the is_migrate_isolate() checks, and the success rate could decrease.
>>
>> Well, I think that could also be solved by doing an lru/pcplists drain right after marking the pageblock as MIGRATE_ISOLATE. After that moment, anything newly put on the pcplists should observe the new migratetype and go to the correct pcplists.
>> As putting stuff on the pcplists is done with disabled interrupts, and draining is done by IPI, I think it should work correctly if we put the migratetype determination under the disabled-irq part of free_hot_cold_page().
>
> Hello, sorry for the late reply.
>
> Yes, this can close the race on the migratetype buddy list, but there is a problem. See below.
>
>>> And, I just added three more is_migrate_isolate() checks in the fast path, but two of them are in the same *unlikely* branch and I can remove another one easily. Therefore it's not much of a problem, I guess. (It even becomes a no-op if MEMORY_ISOLATION is disabled.) Moreover, I removed one unconditional get_pageblock_migratetype() in free_pcppages_bulk(), so from a performance point of view the free path would be improved.
>>
>> I haven't checked the individual patches yet, but I'll try.
>>
>>>> So the question is, does it have to be correct? And (admittedly not after a completely exhaustive analysis) I think the answer is, surprisingly, that it doesn't :)
>>>>
>>>> Well, I of course don't mean that the freepage accounts could go completely random, but what if we allowed them to drift a bit, limiting both the maximum error and the timeframe where errors are possible? After all, watermark checking is already racy, so I don't think it would hurt that much.
>>>
>>> I understand your suggestion. I once thought like you, but gave up on that idea. Watermark checking is already racy, but it's the *only* protection preventing memory allocation. After passing that check, there is no means to prevent us from allocating memory. So it should be as accurate as possible. If we allow someone to get memory without considering memory isolation, free memory could be in a really low state and the system could break occasionally.
>>
>> It would definitely require more thorough review, but I think with the pcplists draining changes outlined above, there shouldn't be any more inaccuracy happening.
>>
>>>> Now, what both CMA and memory hot-remove do is:
>>>>
>>>> 1. Mark a MAX_ORDER-aligned bunch of pageblocks as MIGRATE_ISOLATE through start_isolate_page_range(). As part of this, all free pages in that area are moved onto the isolate freelist through move_freepages_block().
>>>>
>>>> 2. Try to migrate away all non-free pages in the range. Also drain pcplists and lru_add caches.
>>>>
>>>> 3. Check if everything was successfully isolated by test_pages_isolated(). Restart and/or undo pageblock isolation if not.
>>>>
>>>> So my idea is to throw away all special-casing of is_migrate_isolate() in the buddy allocator, which would therefore account free pages on the isolate freelist as normal free pages. The accounting of isolated pages would instead be done only at the top level of CMA/hot-remove in the three steps outlined above, which would be modified as follows:
>>>>
>>>> 1. Calculate N as the target number of pages to be isolated. Perform the actions of step 1 as usual. Calculate X as the number of pages that move_freepages_block() has moved. Subtract X from freepages (this is the same as it is done now), but also *remember the value of X*.
>>>>
>>>> 2. Migrate and drain pcplists as usual. The new free pages will either end up correctly on the isolate freelist, or not. We don't care, they will be accounted as freepages either way. This is where some inaccuracy in accounted freepages would build up.
>>>>
>>>> 3. If test_pages_isolated() checks pass, subtract (N - X) from freepages. The result is that we have an isolated range of N pages that nobody can steal now, as everything is on the isolate freelist and is MAX_ORDER aligned. And we have in total subtracted N pages (first X, then N - X). So the accounting matches reality.
>>>>
>>>> If we have to undo, we undo the isolation, and as part of this we use move_freepages_block() to move pages from the isolate freelist back to the normal ones. But we don't care how many pages were moved. We simply add the remembered value of X to the number of freepages, undoing the change from step 1. Again, the accounting matches reality.
>>>>
>>>> The final point is that if we do this per MAX_ORDER block, the error in accounting can never be larger than 4MB and will be visible only during the time a single MAX_ORDER block is handled.
>>>
>>> The 4MB error in accounting looks serious to me. min_free_kbytes is 4MB in a 1GB system, so this 4MB error could break everything in such systems. Moreover, there are some architectures with pageblock_order larger than MAX_ORDER; in that case, the error will be larger than 4MB.
>>>
>>> In addition, I have a plan to extend CMA to work in parallel. That means there could be parallel memory isolation users rather than just one user at the same time, so we cannot easily keep the error bounded.
>>
>> OK. I had thought that the error would be much smaller in practice, but probably due to how buddy merging works, it would be enough if just the very last freed page in the MAX_ORDER block was misplaced, and that would trigger a cascading merge ending with a single page of MAX_ORDER size becoming misplaced. So let's probably forget this approach.
>>
>>>> As a possible improvement, we can assume during phase 2 that every page freed by migration will end up correctly on the isolate freelist. So we create M free pages by migration, and subtract M from the freepage account. Then in phase 3 we either subtract (N - X - M), or add X + M in the undo case. (Ideally, if we succeed, X + M should be equal to N, but due to pages on pcplists and the possible races it will be less.) I think with this improvement, any error would be negligible.
>>>>
>>>> Thoughts?
>>>
>>> Thanks for the suggestion. :)
>>> It is really a good topic to think about deeply.
>>>
>>> For now, I'd like to fix these problems without the side-effects your suggestion would bring. Your suggestion changes the meaning of freepage so that isolated pages are included in nr_freepage, and there could be a regression in the success rate of memory hotplug and CMA. Possibly it is the way we have to go, but, IMHO, it isn't yet the time to go there. Before going that way, we should fix the current implementation first, so that the fixes can be backported to old kernels if someone needs them.
>>
>> Adding the extra pcplist drains and reordering get_pageblock_migratetype() under disabled IRQs should be small enough to be backportable and not bring any regressions to success rates. There will be some cost for the extra drains, but it is paid only when memory isolation actually happens. Extending the scope of disabled irqs would affect the fast path somewhat, but I guess less than extra checks?
>
> Your approach would close the race on the migratetype buddy list. But I'm not sure how we can count the number of freepages correctly.
> IIUC, unlike my approach, this approach allows merging between the isolate buddy list and the normal buddy list, and that would result in miscounting the number of freepages. You probably think that move_freepages_block() could fix up all such situations, but I don't think so.

No, I didn't think that, I simply overlooked this scenario. Good catch.

> After applying your approach:
>
> CPU1                                  CPU2
> set_migratetype_isolate()             free_one_page()
> - lock
> - set_pageblock_migratetype()
> - unlock
> - drain...
>                                       - lock
>                                       - __free_one_page() with MIGRATE_ISOLATE
>                                       - merged with a normal page and linked
>                                         onto the isolate buddy list
>                                       - skip counting the freepage, because of
>                                         MIGRATE_ISOLATE
>                                       - unlock
> - lock
> - move_freepages_block()
>   tries to fix up the freepage count
> - unlock
>
> We want to fix up the counting in move_freepages_block(), but we can't, because we don't have enough information about whether this page has already been accounted in the number of freepages or not. Could you help me figure out the whole mechanism of your approach?

I can't think of an easy fix to this situation. A non-trivial fix that comes to mind (and I might have overlooked something) is something like:

- distinguish MIGRATETYPE_ISOLATING and MIGRATETYPE_ISOLATED
- CPU1 first sets MIGRATETYPE_ISOLATING before the drain
- when CPU2 sees MIGRATETYPE_ISOLATING, it just puts the page on a special unbounded pcplist and that's it
- CPU1 does the drain as usual, potentially misplacing some pages that move_freepages_block() will then fix. But no wrong merging can occur.
- after move_freepages_block(), CPU1 changes MIGRATETYPE_ISOLATING to MIGRATETYPE_ISOLATED
- CPU2 can then start freeing directly onto the isolate buddy list. There might be some pages still on the special pcplist of CPU2/CPUx, but that means they won't merge yet.
- CPU1 executes a new operation on all CPUs that flushes the special pcplist onto the isolate buddy list, merging as needed.
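
In (pseudo-)code, roughly the sketch below. To be clear, this is only meant to illustrate the ordering: MIGRATETYPE_ISOLATING, MIGRATETYPE_ISOLATED, isolating_pcplist and flush_isolating_pcplist() are made-up names that don't exist anywhere, and the freepage counting, locking details and the undo path are all left out:

/* made-up per-cpu list for pages freed while their pageblock is ISOLATING */
static DEFINE_PER_CPU(struct list_head, isolating_pcplist);

/* made up: splices this CPU's isolating_pcplist onto the isolate buddy list */
static void flush_isolating_pcplist(void *info);

/* CPU1: isolation side, i.e. what set_migratetype_isolate() would become */
static void isolate_pageblock_sketch(struct zone *zone, struct page *page)
{
	/*
	 * Phase 1: tell the freeing side "keep away from the buddy lists of
	 * this block". Pages freed into this block are parked on the special
	 * per-cpu list instead, so they cannot merge with normal buddies.
	 */
	set_pageblock_migratetype(page, MIGRATETYPE_ISOLATING);
	drain_all_pages();	/* IPI; afterwards every CPU sees ISOLATING */

	/*
	 * Pages freed before the drain may still sit on a wrong buddy list;
	 * move_freepages_block() fixes them up. No wrong merging could have
	 * happened in the meantime.
	 */
	move_freepages_block(zone, page, MIGRATETYPE_ISOLATED);

	/* Phase 2: freeing directly onto the isolate buddy list is now safe. */
	set_pageblock_migratetype(page, MIGRATETYPE_ISOLATED);

	/*
	 * Flush whatever was parked on the special per-cpu lists onto the
	 * isolate buddy list, merging as needed (new operation).
	 */
	on_each_cpu(flush_isolating_pcplist, zone, 1);
}

/* CPU2: freeing side, inside a free_one_page()-like path */
static void free_page_sketch(struct zone *zone, struct page *page, int mt)
{
	if (mt == MIGRATETYPE_ISOLATING) {
		/* park it, don't touch the buddy lists at all */
		list_add(&page->lru, this_cpu_ptr(&isolating_pcplist));
		return;
	}

	/* MIGRATETYPE_ISOLATED or a normal type: usual buddy freeing */
	__free_one_page(page, page_to_pfn(page), zone, 0, mt);
}
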
> And, IIUC, your approach can remove only one get_pageblock_migratetype() in free_pcppages_bulk(). free_hot_cold_page() and free_one_page() should still have is_migrate_isolate() branches to free the page onto the correct list and count the number of freepages correctly. Additionally, it extends the scope of disabled irqs. So it doesn't look much better than mine, because my approach also removes the get_pageblock_migratetype() in free_pcppages_bulk().
>
> Please let me know what I am missing. :)
>
> Thanks.
>
>>> Interestingly, last Saturday, Lisa Du reported CMA accounting bugs.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg75741.html
>>>
>>> I haven't looked at it in detail, but maybe it is related to these problems and we should fix it without side-effects.
>>>
>>> So, in conclusion, I think that your suggestion is beyond the scope of this patchset for the following two reasons.
>>>
>>> 1. I'd like to fix these problems without side-effects (possible regression in the success rate of memory hotplug and CMA, and a change in the meaning of nr_freepage) because of backport possibility.
>>> 2. nr_freepage without considering memory isolation is somewhat dangerous and not suitable for some systems.
>>>
>>> If you have any objection, please let me know. But I will go on vacation for a week, so I can't answer your further comments until then. I will reply to them next week. :)
>>>
>>> Thanks.
>>>
>>>>> 3. pages on the cma buddy list are counted as CMA freepages, too.
>>>>> 4. pages used as guard pages are *not* counted as freepages.
>>>>>
>>>>> Now, I describe the problems and the related patches.
>>>>>
>>>>> 1. Patch 2: If guard pages are cleared and merged into the isolate buddy list, we should not increase the freepage count.
>>>>>
>>>>> 2. Patch 3: When a page returns from the pcp to the buddy, we should account it in the freepage counter. In this case, we should check the pageblock migratetype of the page and insert the page into the appropriate buddy list. Although we check it in the current code, we don't insert the page into the appropriate buddy list, so freepage counting can be wrong.
>>>>>
>>>>> 3. Patch 4: There is a race condition whereby some freepages could end up on the isolate buddy list. If so, we can't use these pages until the next isolation attempt on this pageblock.
>>>>>
>>>>> 4. Patch 5: There is a race condition whereby a page in an isolate pageblock can go onto a non-isolate buddy list. If so, the buddy allocator would merge pages on the non-isolate and isolate buddy lists, and the freepage count will be wrong.
>>>>>
>>>>> 5. Patch 9: move_freepages(_block) does *not* return the number of moved pages. Instead, it returns the number of pages linked into that migratetype's buddy list. So accounting with this return value makes the freepage count wrong.
>>>>>
>>>>> 6. Patch 10: the buddy allocator would merge pages on the non-isolate and isolate buddy lists. This leads to a freepage counting problem, so fix it by stopping merging in this case.
>>>>>
>>>>> Without patchset [1], the above problems don't happen in my CMA allocation test, because CMA reserved pages aren't used at all, so there is no chance for the above races.
>>>>>
>>>>> With patchset [1], I did a simple CMA allocation test and got the result below.
>>>>>
>>>>> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
>>>>> - run kernel build (make -j16) in the background
>>>>> - 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
>>>>> - Result: the freepage count is off by more than 5000 pages
>>>>>
>>>>> With patchset [1] and this patchset, I found that no freepage counts are missed, so I conclude that the problems are solved.
>>>>>
>>>>> These problems are also possible for memory hot-remove users, although I didn't check that further.
>>>>>
>>>>> The other patches are either the base for fixing these problems or simple clean-ups. Please see the individual patches for more information.
>>>>>
>>>>> This patchset is based on linux-next-20140703.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> [1]: Aggressively allocate the pages on cma reserved memory
>>>>>      https://lkml.org/lkml/2014/5/30/291
>>>>>
>>>>> Joonsoo Kim (10):
>>>>>   mm/page_alloc: remove unlikely macro on free_one_page()
>>>>>   mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>>>>>   mm/page_alloc: handle page on pcp correctly if it's pageblock is isolated
>>>>>   mm/page_alloc: carefully free the page on isolate pageblock
>>>>>   mm/page_alloc: optimize and unify pageblock migratetype check in free path
>>>>>   mm/page_alloc: separate freepage migratetype interface
>>>>>   mm/page_alloc: store migratetype of the buddy list into freepage correctly
>>>>>   mm/page_alloc: use get_onbuddy_migratetype() to get buddy list type
>>>>>   mm/page_alloc: fix possible wrongly calculated freepage counter
>>>>>   mm/page_alloc: Stop merging pages on non-isolate and isolate buddy list
>>>>>
>>>>>  include/linux/mm.h             |  30 +++++++--
>>>>>  include/linux/mmzone.h         |   5 ++
>>>>>  include/linux/page-isolation.h |   8 +++
>>>>>  mm/page_alloc.c                | 138 +++++++++++++++++++++++++++++-----------
>>>>>  mm/page_isolation.c            |  18 ++----
>>>>>  5 files changed, 147 insertions(+), 52 deletions(-)
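
P.S. Since we keep coming back to it: below is roughly what I mean by putting the migratetype determination under the disabled-irq part of free_hot_cold_page(). This is only an untested sketch of the idea, simplified and based on how the function currently looks; the exact signatures and the pcplist handling details may differ:

void free_hot_cold_page(struct page *page, bool cold)
{
	struct zone *zone = page_zone(page);
	unsigned long pfn = page_to_pfn(page);
	unsigned long flags;
	int migratetype;

	if (!free_pages_prepare(page, 0))
		return;

	local_irq_save(flags);

	/*
	 * Read the pageblock migratetype only after interrupts are disabled.
	 * Together with an IPI-based drain_all_pages() done right after the
	 * pageblock is marked isolate, a page freed here either gets caught
	 * by that drain, or already sees the new (isolate) migratetype.
	 */
	migratetype = get_pfnblock_migratetype(page, pfn);
	set_freepage_migratetype(page, migratetype);
	__count_vm_event(PGFREE);

	if (migratetype >= MIGRATE_PCPTYPES) {
		if (unlikely(is_migrate_isolate(migratetype))) {
			free_one_page(zone, page, pfn, 0, migratetype);
			goto out;
		}
		migratetype = MIGRATE_MOVABLE;
	}

	/* ... put the page on the matching pcplist, as today ... */
out:
	local_irq_restore(flags);
}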