From: "yunfeng zhang" <zyf.zeroos@gmail.com>
To: linux-kernel@vger.kernel.org
Date: Tue, 23 Jan 2007 13:08:25 +0800
Subject: Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Message-ID: <4df04b840701222108o6992933bied5fff8a525413@mail.gmail.com>
In-Reply-To: <4df04b840701222021w5e1aaab2if2ba7fc38d06d64b@mail.gmail.com>
References: <4df04b840701212309l2a283357jbdaa88794e5208a7@mail.gmail.com>
	 <200701222300.41960.a1426z@gawab.com>
	 <4df04b840701222021w5e1aaab2if2ba7fc38d06d64b@mail.gmail.com>

I've re-coded my patch with tab = 8. Sorry!

Signed-off-by: Yunfeng Zhang <zyf.zeroos@gmail.com>

Index: linux-2.6.19/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-01-23 11:32:02.000000000 +0800
@@ -0,0 +1,236 @@
+			Pure Private Page System (pps)
+			    zyf.zeroos@gmail.com
+			    December 24-26, 2006
+
+// Purpose <([{
+This file documents an idea first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html as part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch documented here enhances the performance of the Linux swap subsystem.
+An overview of the idea is given in section <How to Reclaim Pages more
+Efficiently>; section <Pure Private Page System -- pps> shows how I patched
+it into Linux 2.6.19.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+Good ideas originate from overall design and management ability: when you
+look down from a manager's view, you free yourself from disordered code and
+spot problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers:
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA; the architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+It may seem that Page/PTE should be placed in the 3rd layer, but here it is
+placed in the 2nd layer, since it is the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it is natural
+that the swap subsystem should be deployed and implemented on the 2nd layer;
+the sketch below makes the resulting scan order concrete.
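+
+The following stand-alone C sketch (illustrative only, not kernel code)
+models that top-down scan order -- every mm, then every private VMA, then
+every page; shrink_private_vma below implements the real thing on mm_struct
+and vm_area_struct:
+
+	#include <stdio.h>
+
+	struct vma { unsigned long start, end; int pure_private; struct vma *next; };
+	struct mm  { const char *name; struct vma *vmas; struct mm *next; };
+
+	/* 2nd-layer view: statistics are gathered per VMA, not per global LRU */
+	static void scan(struct mm *mms)
+	{
+		struct mm *mm;
+		struct vma *v;
+		unsigned long addr;
+
+		for (mm = mms; mm; mm = mm->next)
+			for (v = mm->vmas; v; v = v->next) {
+				if (!v->pure_private)
+					continue;
+				for (addr = v->start; addr < v->end; addr += 4096)
+					printf("%s: scan page at %#lx\n", mm->name, addr);
+			}
+	}
+
+	int main(void)
+	{
+		struct vma data = { 0x8000, 0xa000, 1, NULL };
+		struct vma stack = { 0x1000, 0x3000, 1, &data };
+		struct mm one = { "proc-a", &stack, NULL };
+
+		scan(&one);
+		return 0;
+	}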
+
+Undoubtedly, this approach has several virtues:
+1) The swap daemon can collect page-access statistics from each process and
+   use them to unmap PTEs. SMP benefits especially, since we can use
+   flush_tlb_range to unmap PTEs in batches instead of firing a TLB IPI per
+   page as the current Linux legacy swap subsystem does.
+2) Page faults can issue better readahead requests, since history shows that
+   related pages cluster together. In contrast, Linux reads ahead only the
+   pages adjacent to the faulting page's position in SwapSpace.
+3) It fits the POSIX madvise API family naturally.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy works top-down; the Linux legacy swap subsystem is maybe
+   the only one that works bottom-up.
+
+Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system built on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note,
+it ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps <([{
+As referred to in the previous section, applying my idea fully requires
+uprooting the page-centered swap subsystem and migrating it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_add_active code anywhere... It's IMPOSSIBLE for me to complete that
+alone. It's also the difference between my design and Linux: in my OS, a page
+is entirely the charge of its new owner, while in Linux the page management
+system keeps tracing it through the PG_active flag.
+
+So I conceived another solution:) That is, set up an independent page-recycle
+system rooted in the Linux legacy page system -- pps: intercept all private
+pages belonging to PrivateVMAs into pps, then let pps recycle them. By the
+way, the whole job consists of two parts; here is the first --
+PrivateVMA-oriented -- and the other is SharedVMA-oriented (it should be
+called SPS), scheduled for the future. Of course, once both are done, the
+Linux legacy page system will be emptied.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages; the whole process is divided into six stages -- see section <Stage
+Definition>. pps uses init_mm::mmlist to enumerate all swappable UserSpaces
+(shrink_private_vma of mm/vmscan.c). The other sections cover the remaining
+aspects of pps:
+1) <Data Definition> -- the basic data definitions.
+2) <Concurrent Racers of Shrinking pps> -- focused on synchronization.
+3) <Private Page Lifecycle of pps> -- how private pages enter and leave pps.
+4) <VMA Lifecycle of pps> -- which VMAs belong to pps.
+5) <Others about pps> -- the new daemon thread kppsd, pps statistics, etc.
+
+I'm also glad to highlight a new idea of mine -- dftlb -- described in
+section <Delay to Flush TLB (dftlb)>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB (dftlb) is introduced to make TLB flushing more efficient.
+In brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to the other CPUs immediately? Since every CPU takes timer
+interrupts anyway, we can insert flushing tasks into the timer interrupt path
+and get TLB flushing almost free of charge. The sketch below models the
+handshake.
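+
+Here is a stand-alone pthread model of that handshake (illustrative only;
+the names, sizes and timings are invented). A scanner posts a flush task
+tagged with a mask of pending CPUs; each simulated CPU flushes from its own
+"timer tick" and clears its bit; the scanner waits for the mask to drain
+before it relies on the flush:
+
+	#include <pthread.h>
+	#include <stdatomic.h>
+	#include <stdio.h>
+	#include <unistd.h>
+
+	#define NCPUS 4
+
+	struct flush_task {
+		atomic_uint cpu_mask;		/* bit n set: CPU n must still flush */
+		unsigned long start, end;	/* range to flush */
+	};
+
+	static struct flush_task task;
+	static atomic_int stop;
+
+	static void *cpu_timer_tick(void *arg)
+	{
+		unsigned int me = 1u << (long)arg;
+
+		while (!atomic_load(&stop)) {
+			if (atomic_load(&task.cpu_mask) & me) {
+				/* a real CPU would run local_flush_tlb() here */
+				printf("cpu%ld flushed [%#lx,%#lx)\n",
+				       (long)arg, task.start, task.end);
+				atomic_fetch_and(&task.cpu_mask, ~me);
+			}
+			usleep(1000);	/* simulated timer-interrupt period */
+		}
+		return NULL;
+	}
+
+	int main(void)
+	{
+		pthread_t cpus[NCPUS];
+		long i;
+
+		for (i = 0; i < NCPUS; i++)
+			pthread_create(&cpus[i], NULL, cpu_timer_tick, (void *)i);
+
+		/* stage 1: clear access bits, then post one flush task */
+		task.start = 0x1000;
+		task.end = 0x9000;
+		atomic_store(&task.cpu_mask, (1u << NCPUS) - 1);
+
+		/* stage 2 may only run after every CPU has flushed */
+		while (atomic_load(&task.cpu_mask))
+			usleep(1000);
+		printf("all CPUs flushed; safe to turn PTEs into UnmappedPTEs\n");
+
+		atomic_store(&stop, 1);
+		for (i = 0; i < NCPUS; i++)
+			pthread_join(cpus[i], NULL);
+		return 0;
+	}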
+
+The trick is implemented in:
+1) TLB flushing tasks are added in fill_in_tlb_tasks of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by the other CPUs to
+   execute pending flushing tasks.
+3) All data are defined in include/linux/mm.h.
+4) dftlb covers stages 1 and 2 of vmscan.c:shrink_pvma_scan_ptes.
+
+The restrictions of dftlb -- the following conditions must be met:
+1) An atomic cmpxchg instruction.
+2) The CPU must atomically set the access bit when it first touches a PTE.
+3) On some architectures the vma parameter of flush_tlb_range may matter. If
+   it does, don't use dftlb, because the vma of a TLB flushing task may
+   already be gone by the time a CPU executes the task in its timer
+   interrupt; instead, combine stage 1 with stage 2 and send the IPI
+   immediately in fill_in_tlb_tasks.
+
+dftlb increases mm_struct::mm_users to prevent the mm from being freed while
+other CPUs are working on it.
+// }])>
+
+// Stage Definition <([{
+The whole process of paging out private pages is divided into six stages in
+shrink_pvma_scan_ptes of mm/vmscan.c; the code groups similar PTEs/pages into
+a series:
+1) PTE to untouched PTE (clear the access bit), appending flushing tasks to
+   dftlb.
+---) The other CPUs run the flushing tasks in their timer interrupts.
+2) Resuming from 1, convert untouched PTEs to UnmappedPTEs (cmpxchg).
+3) Link a SwapEntry to the PrivatePage of every UnmappedPTE.
+4) Flush PrivatePages to their disk SwapPages.
+5) Reclaim the pages and shift UnmappedPTEs to SwappedPTEs.
+6) SwappedPTE stage (null operation).
+// }])>
+
+// Data Definition <([{
+A new VMA flag (VM_PURE_PRIVATE) is added to the VMA flags in
+include/linux/mm.h.
+
+A new PTE type (UnmappedPTE) is added to the PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+	int present : 1;	// must be 0.
+	...
+	int pageNum : 20;
+};
+The new PTE has a useful property: it keeps a link to its PrivatePage while
+preventing the page from being visited by the CPU, so it can serve as an
+intermediate state between a mapped PTE and a SwappedPTE.
+// }])>
+
+// Concurrent Racers of Shrinking pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances. During scanning and reclamation it read-locks
+mm_struct::mmap_sem, which leaves some potential concurrent racers:
+1) mm/swapfile.c pps_swapoff (swapoff API)
+2) mm/memory.c do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
+   do_swap_page (page fault)
+3) mm/memory.c get_user_pages (sometimes the core needs to share a
+   PrivatePage with us)
+
+pps defines no new lock order; it complies with the existing Linux lock
+order.
+// }])>
+
+// Others about pps <([{
+A new kernel thread -- kppsd -- is introduced in mm/vmscan.c. Its task is to
+execute the stages of pps periodically. Note that an appropriate timeout is
+necessary to give applications a chance to re-map their PrivatePages back
+from UnmappedPTE to PTE, that is, to show their clustering affinity.
+
+kppsd is controlled by the new fields scan_control::may_reclaim/reclaim_node.
+may_reclaim = 1 means reclamation (stage 5) is allowed; reclaim_node = (node
+number) is used when a memory node is low. Callers should set them in
+wakeup_sc, then wake up kppsd (vmscan.c:balance_pgdat). Note that if kppsd
+wakes up due to its timeout, it doesn't do stage 5 at all (vmscan.c:kppsd).
+The surviving legacy fields are gfp_mask, may_writepage and may_swap.
+
+PPS statistic data is appended to the /proc/meminfo entry; its prototype is
+in include/linux/mm.h. The sketch after this section shows how the counters
+move.
+// }])>
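+
+The following stand-alone sketch (illustrative, not kernel code) models how
+the four counters move as a page walks through the stages; the kernel updates
+them with atomic operations at exactly the points listed in <Private Page
+Lifecycle of pps> below:
+
+	#include <stdatomic.h>
+	#include <stdio.h>
+
+	static struct {
+		atomic_int total;		/* pages currently owned by pps */
+		atomic_int pte_count;		/* stages 1 and 2 */
+		atomic_int unmapped_count;	/* stages 3 and 4 */
+		atomic_int swapped_count;	/* stage 6 */
+	} pps_info;
+
+	static void page_enters_pps(void)	/* e.g. do_anonymous_page */
+	{
+		atomic_fetch_add(&pps_info.total, 1);
+		atomic_fetch_add(&pps_info.pte_count, 1);
+	}
+
+	static void stage2_unmap(void)		/* PTE -> UnmappedPTE */
+	{
+		atomic_fetch_sub(&pps_info.pte_count, 1);
+		atomic_fetch_add(&pps_info.unmapped_count, 1);
+	}
+
+	static void stage5_reclaim(void)	/* UnmappedPTE -> SwappedPTE */
+	{
+		atomic_fetch_sub(&pps_info.unmapped_count, 1);
+		atomic_fetch_add(&pps_info.swapped_count, 1);
+		atomic_fetch_sub(&pps_info.total, 1);	/* the page leaves pps */
+	}
+
+	int main(void)
+	{
+		page_enters_pps();
+		stage2_unmap();
+		stage5_reclaim();
+		printf("total=%d pte=%d unmapped=%d swapped=%d\n",
+		       atomic_load(&pps_info.total),
+		       atomic_load(&pps_info.pte_count),
+		       atomic_load(&pps_info.unmapped_count),
+		       atomic_load(&pps_info.swapped_count));
+		return 0;
+	}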
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is
+PTE or UnmappedPTE. Note that the Linux fork API can make a PrivatePage
+shared by multiple processes, so such pages are excluded from pps.
+
+IN (note that when a pure private page enters pps, it is also trimmed from
+the Linux legacy page system by commenting out the lru_cache_add_active
+call):
+1) fs/exec.c install_arg_page (argument pages)
+2) mm/memory.c do_anonymous_page, do_wp_page, do_swap_page (page fault)
+3) mm/swap_state.c read_swap_cache_async (swap pages)
+
+OUT:
+1) mm/vmscan.c shrink_pvma_scan_ptes (stage 5, reclaim a private page)
+2) mm/memory.c zap_pte_range (free a page)
+3) kernel/fork.c dup_mmap (if someone uses fork, migrate all pps pages back
+   and let the Linux legacy page system manage them)
+
+While a pure private page is in pps, it can be visited simultaneously by
+page faults and the SwapDaemon.
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, a new flag -- VM_PURE_PRIVATE -- is OR-ed into
+it in memory.c:enter_pps, where you can also find which VMAs qualify for pps.
+The flag is used mainly in shrink_private_vma of mm/vmscan.c. All other
+fields are left untouched.
+
+IN:
+1) fs/exec.c setup_arg_pages (StackVMA)
+2) mm/mmap.c do_mmap_pgoff, do_brk (DataVMA)
+3) mm/mmap.c split_vma, copy_vma (in some cases we need to copy a VMA from
+   an existing VMA)
+
+OUT:
+1) kernel/fork.c dup_mmap (if someone uses fork, return the VMA back to the
+   Linux legacy system)
+2) mm/mmap.c remove_vma, vma_adjust (destroy a VMA)
+3) mm/mmap.c do_mmap_pgoff (delete the VMA when an error occurs)
+
+The VMAs of pps can coexist with madvise, mlock, mprotect, mmap and munmap;
+that is why a new VMA created by mmap.c:split_vma can re-enter pps.
+// }])>
+
+// Postscript <([{
+Note that some configurations are untested due to hardware restrictions,
+e.g. dftlb on SMP, so there is no guarantee for my dftlb code or EVEN the
+idea itself.
+
+Here are some possible improvements to pps:
+1) In fact, I recommend a one-to-one private model -- PrivateVMA, (PTE,
+   UnmappedPTE) and (PrivatePage, DiskSwapPage) -- which is described in my
+   OS and in the Linux kernel mailing list link above. The current Linux
+   core supports a trick -- COW on PrivatePages -- used by the fork API.
+   That API should be used rarely; the POSIX thread library and vfork/execve
+   are enough for applications, but as a result it can make PrivatePages
+   shared, so I think it is unnecessary for Linux: do copy-on-calling if
+   someone really needs it. If you agree, you will find that UnmappedPTE +
+   PrivatePage IS the swap cache of Linux, and swap_info_struct::swap_map
+   should be a bitmap rather than an array of short int. So using the Linux
+   legacy SwapCache in my pps is a compromise; that is why my patch is
+   called pps -- pure private (page) system.
+2) SwapSpace should provide more flexible interfaces. shrink_pvma_scan_ptes
+   needs to allocate swap entries in batches -- precisely, a batch of fake
+   contiguous swap entries, see memory.c:pps_swapin_readahead. In fact, the
+   interface should be overloaded, that is, a swap file should have a
+   different strategy from a swap partition.
+
+If the Linux kernel group cannot schedule a rewrite of their memory code,
+pps is maybe the best solution so far. A toy model of the series grouping
+used by the six stages follows.
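+
+As a closing illustration, here is a tiny stand-alone C model (not kernel
+code) of how shrink_pvma_scan_ptes groups consecutive PTEs that sit in the
+same stage into a series of at most MAX_SERIES_LENGTH entries, so that each
+batch can be handled with one operation:
+
+	#include <stdio.h>
+
+	#define MAX_SERIES_LENGTH 8
+
+	static int find_series(const int *stage, int n, int start, int *len)
+	{
+		int i, s = stage[start];
+
+		for (i = 1; i < MAX_SERIES_LENGTH && start + i < n; i++)
+			if (stage[start + i] != s)
+				break;
+		*len = i;
+		return s;
+	}
+
+	int main(void)
+	{
+		/* stages of 12 consecutive PTEs, as classified above */
+		int stages[12] = { 1, 1, 1, 2, 2, 6, 6, 6, 6, 3, 3, 5 };
+		int pos = 0, len;
+
+		while (pos < 12) {
+			int s = find_series(stages, 12, pos, &len);
+			printf("stage %d series of length %d at offset %d\n",
+			       s, len, pos);
+			pos += len;
+		}
+		return 0;
+	}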
+// }])> +// vim: foldmarker=<([{,}])> foldmethod=marker et Index: linux-2.6.19/fs/exec.c =================================================================== --- linux-2.6.19.orig/fs/exec.c 2007-01-22 13:58:30.000000000 +0800 +++ linux-2.6.19/fs/exec.c 2007-01-23 11:32:30.000000000 +0800 @@ -321,10 +321,11 @@ pte_unmap_unlock(pte, ptl); goto out; } + atomic_inc(&pps_info.total); + atomic_inc(&pps_info.pte_count); inc_mm_counter(mm, anon_rss); - lru_cache_add_active(page); - set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte( - page, vma->vm_page_prot)))); + set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(page, + vma->vm_page_prot)))); page_add_new_anon_rmap(page, vma, address); pte_unmap_unlock(pte, ptl); @@ -437,6 +438,7 @@ kmem_cache_free(vm_area_cachep, mpnt); return ret; } + enter_pps(mm, mpnt); mm->stack_vm = mm->total_vm = vma_pages(mpnt); } Index: linux-2.6.19/fs/proc/proc_misc.c =================================================================== --- linux-2.6.19.orig/fs/proc/proc_misc.c 2007-01-22 13:58:31.000000000 +0800 +++ linux-2.6.19/fs/proc/proc_misc.c 2007-01-22 14:00:00.000000000 +0800 @@ -181,7 +181,11 @@ "Committed_AS: %8lu kB\n" "VmallocTotal: %8lu kB\n" "VmallocUsed: %8lu kB\n" - "VmallocChunk: %8lu kB\n", + "VmallocChunk: %8lu kB\n" + "PPS Total: %8d kB\n" + "PPS PTE: %8d kB\n" + "PPS Unmapped: %8d kB\n" + "PPS Swapped: %8d kB\n", K(i.totalram), K(i.freeram), K(i.bufferram), @@ -212,7 +216,11 @@ K(committed), (unsigned long)VMALLOC_TOTAL >> 10, vmi.used >> 10, - vmi.largest_chunk >> 10 + vmi.largest_chunk >> 10, + K(pps_info.total.counter), + K(pps_info.pte_count.counter), + K(pps_info.unmapped_count.counter), + K(pps_info.swapped_count.counter) ); len += hugetlb_report_meminfo(page + len); Index: linux-2.6.19/include/asm-i386/mmu_context.h =================================================================== --- linux-2.6.19.orig/include/asm-i386/mmu_context.h 2007-01-22 13:58:32.000000000 +0800 +++ linux-2.6.19/include/asm-i386/mmu_context.h 2007-01-23 11:43:00.000000000 +0800 @@ -32,6 +32,10 @@ /* stop flush ipis for the previous mm */ cpu_clear(cpu, prev->cpu_vm_mask); #ifdef CONFIG_SMP + // vmscan.c::end_tlb_tasks maybe had copied cpu_vm_mask before + // we leave prev, so let's flush the trace of prev of + // delay_tlb_tasks. 
+ timer_flush_tlb_tasks(NULL); per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK; per_cpu(cpu_tlbstate, cpu).active_mm = next; #endif Index: linux-2.6.19/include/asm-i386/pgtable-2level.h =================================================================== --- linux-2.6.19.orig/include/asm-i386/pgtable-2level.h 2007-01-22 13:58:32.000000000 +0800 +++ linux-2.6.19/include/asm-i386/pgtable-2level.h 2007-01-23 12:50:09.905950872 +0800 @@ -48,21 +48,22 @@ } /* - * Bits 0, 6 and 7 are taken, split up the 29 bits of offset + * Bits 0, 5, 6 and 7 are taken, split up the 28 bits of offset * into this range: */ -#define PTE_FILE_MAX_BITS 29 +#define PTE_FILE_MAX_BITS 28 #define pte_to_pgoff(pte) \ - ((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 )) + ((((pte).pte_low >> 1) & 0xf ) + (((pte).pte_low >> 8) << 4 )) #define pgoff_to_pte(off) \ - ((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE }) + ((pte_t) { (((off) & 0xf) << 1) + (((off) >> 4) << 8) + _PAGE_FILE }) /* Encode and de-code a swap entry */ -#define __swp_type(x) (((x).val >> 1) & 0x1f) +#define __swp_type(x) (((x).val >> 1) & 0xf) #define __swp_offset(x) ((x).val >> 8) -#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 1) | ((offset) << 8) }) +#define __swp_entry(type, offset) ((swp_entry_t) { ((type & 0xf) << 1) |\ + ((offset) << 8) | _PAGE_SWAPPED }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) Index: linux-2.6.19/include/asm-i386/pgtable.h =================================================================== --- linux-2.6.19.orig/include/asm-i386/pgtable.h 2007-01-22 13:58:32.000000000 +0800 +++ linux-2.6.19/include/asm-i386/pgtable.h 2007-01-23 11:47:00.775687672 +0800 @@ -121,7 +121,11 @@ #define _PAGE_UNUSED3 0x800 /* If _PAGE_PRESENT is clear, we use these: */ -#define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */ +#define _PAGE_UNMAPPED 0x020 /* a special PTE type, hold its page reference + even it's unmapped, see more from + Documentation/vm_pps.txt. */ +#define _PAGE_SWAPPED 0x040 /* swapped PTE. */ +#define _PAGE_FILE 0x060 /* nonlinear file mapping, saved PTE; */ #define _PAGE_PROTNONE 0x080 /* if the user mapped it with PROT_NONE; pte_present gives true */ #ifdef CONFIG_X86_PAE @@ -227,7 +231,12 @@ /* * The following only works if pte_present() is not true. 
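 * With _PAGE_PRESENT clear, bits 5 and 6 now form a two-bit type field
 * (mask 0x60) instead of independent flags: 0x20 marks an UnmappedPTE,
 * 0x40 a SwappedPTE and 0x60 a nonlinear-file PTE. The helpers below
 * therefore compare "pte_low & 0x60" against those constants rather than
 * testing a single bit, which is also why _PAGE_FILE changed value above.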
 */
-static inline int pte_file(pte_t pte)		{ return (pte).pte_low & _PAGE_FILE; }
+static inline int pte_unmapped(pte_t pte)	{ return ((pte).pte_low & 0x60)
+	== _PAGE_UNMAPPED; }
+static inline int pte_swapped(pte_t pte)	{ return ((pte).pte_low & 0x60)
+	== _PAGE_SWAPPED; }
+static inline int pte_file(pte_t pte)		{ return ((pte).pte_low & 0x60)
+	== _PAGE_FILE; }
 
 static inline pte_t pte_rdprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
 static inline pte_t pte_exprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
Index: linux-2.6.19/include/linux/mm.h
===================================================================
--- linux-2.6.19.orig/include/linux/mm.h	2007-01-22 13:58:34.000000000 +0800
+++ linux-2.6.19/include/linux/mm.h	2007-01-23 12:27:56.171419760 +0800
@@ -168,6 +168,9 @@
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
+#define VM_PURE_PRIVATE	0x04000000	/* The vma belongs to only one mm,
+					   see more in
+					   Documentation/vm_pps.txt */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -1166,5 +1169,33 @@
 __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);
 
+struct pps_info {
+	atomic_t total;
+	atomic_t pte_count;		// stage 1 and 2.
+	atomic_t unmapped_count;	// stage 3 and 4.
+	atomic_t swapped_count;		// stage 6.
+};
+extern struct pps_info pps_info;
+
+/* vmscan.c: delayed TLB flush (dftlb) */
+struct delay_tlb_task
+{
+	struct mm_struct* mm;
+	cpumask_t cpu_mask;
+	struct vm_area_struct* vma[32];
+	unsigned long start[32];
+	unsigned long end[32];
+};
+extern struct delay_tlb_task delay_tlb_tasks[32];
+
+// The prototype matches the "func" parameter of "int smp_call_function
+// (void (*func) (void *info), void *info, int retry, int wait);" in
+// include/linux/smp.h of 2.6.16.29. Call it with NULL.
+void timer_flush_tlb_tasks(void* data /* = NULL */);
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma);
+void leave_pps(struct vm_area_struct* vma, int migrate_flag);
+
+#define MAX_SERIES_LENGTH 8
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux-2.6.19/include/linux/swapops.h
===================================================================
--- linux-2.6.19.orig/include/linux/swapops.h	2006-11-30 05:57:37.000000000 +0800
+++ linux-2.6.19/include/linux/swapops.h	2007-01-22 14:00:00.000000000 +0800
@@ -50,7 +50,7 @@
 {
 	swp_entry_t arch_entry;
 
-	BUG_ON(pte_file(pte));
+	BUG_ON(!pte_swapped(pte));
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
@@ -64,7 +64,7 @@
 	swp_entry_t arch_entry;
 
 	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
-	BUG_ON(pte_file(__swp_entry_to_pte(arch_entry)));
+	BUG_ON(!pte_swapped(__swp_entry_to_pte(arch_entry)));
 	return __swp_entry_to_pte(arch_entry);
 }
Index: linux-2.6.19/kernel/fork.c
===================================================================
--- linux-2.6.19.orig/kernel/fork.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/kernel/fork.c	2007-01-22 14:00:00.000000000 +0800
@@ -241,6 +241,7 @@
 		tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (!tmp)
 			goto fail_nomem;
+		leave_pps(mpnt, 1);
 		*tmp = *mpnt;
 		pol = mpol_copy(vma_policy(mpnt));
 		retval = PTR_ERR(pol);
Index: linux-2.6.19/kernel/timer.c
===================================================================
--- linux-2.6.19.orig/kernel/timer.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/kernel/timer.c	2007-01-22 14:00:00.000000000 +0800
@@ -1115,6 +1115,10 @@
 	rcu_check_callbacks(cpu, user_tick);
 	scheduler_tick();
 	run_posix_cpu_timers(p);
+
+#ifdef CONFIG_SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
 }
 
 /*
Index: linux-2.6.19/mm/fremap.c
===================================================================
--- linux-2.6.19.orig/mm/fremap.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/fremap.c	2007-01-22 14:00:00.000000000 +0800
@@ -37,7 +37,7 @@
 			page_cache_release(page);
 		}
 	} else {
-		if (!pte_file(pte))
+		if (pte_swapped(pte))
 			free_swap_and_cache(pte_to_swp_entry(pte));
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
Index: linux-2.6.19/mm/memory.c
===================================================================
--- linux-2.6.19.orig/mm/memory.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/memory.c	2007-01-23 12:47:12.000000000 +0800
@@ -435,7 +435,7 @@
 	/* pte contains position in swap or file, so copy.
*/ if (unlikely(!pte_present(pte))) { - if (!pte_file(pte)) { + if (pte_swapped(pte)) { swp_entry_t entry = pte_to_swp_entry(pte); swap_duplicate(entry); @@ -628,6 +628,9 @@ spinlock_t *ptl; int file_rss = 0; int anon_rss = 0; + int pps_pte = 0; + int pps_unmapped = 0; + int pps_swapped = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); arch_enter_lazy_mmu_mode(); @@ -672,6 +675,13 @@ addr) != page->index) set_pte_at(mm, addr, pte, pgoff_to_pte(page->index)); + if (vma->vm_flags & VM_PURE_PRIVATE) { + if (page != ZERO_PAGE(addr)) { + if (PageWriteback(page)) + lru_cache_add_active(page); + pps_pte++; + } + } if (PageAnon(page)) anon_rss--; else { @@ -691,12 +701,31 @@ */ if (unlikely(details)) continue; - if (!pte_file(ptent)) + if (pte_unmapped(ptent)) { + struct page *page; + page = pfn_to_page(pte_pfn(ptent)); + BUG_ON(page == ZERO_PAGE(addr)); + if (PageWriteback(page)) + lru_cache_add_active(page); + pps_unmapped++; + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + tlb_remove_page(tlb, page); + anon_rss--; + continue; + } + if (pte_swapped(ptent)) { + if (vma->vm_flags & VM_PURE_PRIVATE) + pps_swapped++; free_swap_and_cache(pte_to_swp_entry(ptent)); + } pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); } while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0)); add_mm_rss(mm, file_rss, anon_rss); + atomic_sub(pps_pte + pps_unmapped, &pps_info.total); + atomic_sub(pps_pte, &pps_info.pte_count); + atomic_sub(pps_unmapped, &pps_info.unmapped_count); + atomic_sub(pps_swapped, &pps_info.swapped_count); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(pte - 1, ptl); @@ -955,7 +984,8 @@ if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) set_page_dirty(page); - mark_page_accessed(page); + if (!(vma->vm_flags & VM_PURE_PRIVATE)) + mark_page_accessed(page); } unlock: pte_unmap_unlock(ptep, ptl); @@ -1606,7 +1636,12 @@ ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); - lru_cache_add_active(new_page); + if (!(vma->vm_flags & VM_PURE_PRIVATE)) + lru_cache_add_active(new_page); + else { + atomic_inc(&pps_info.total); + atomic_inc(&pps_info.pte_count); + } page_add_new_anon_rmap(new_page, vma, address); /* Free the old page.. */ @@ -1975,6 +2010,85 @@ } /* + * New read ahead code, mainly for VM_PURE_PRIVATE only. + */ +static void pps_swapin_readahead(swp_entry_t entry, unsigned long addr, struct + vm_area_struct *vma, pte_t* pte, pmd_t* pmd) +{ + struct page* page; + pte_t *prev, *next; + swp_entry_t temp; + spinlock_t* ptl = pte_lockptr(vma->vm_mm, pmd); + int swapType = swp_type(entry); + int swapOffset = swp_offset(entry); + int readahead = 1, abs; + + if (!(vma->vm_flags & VM_PURE_PRIVATE)) { + swapin_readahead(entry, addr, vma); + return; + } + + page = read_swap_cache_async(entry, vma, addr); + if (!page) + return; + page_cache_release(page); + + // read ahead the whole series, first forward then backward. + while (readahead < MAX_SERIES_LENGTH) { + next = pte++; + if (next - (pte_t*) pmd >= PTRS_PER_PTE) + break; + spin_lock(ptl); + if (!(!pte_present(*next) && pte_swapped(*next))) { + spin_unlock(ptl); + break; + } + temp = pte_to_swp_entry(*next); + spin_unlock(ptl); + if (swp_type(temp) != swapType) + break; + abs = swp_offset(temp) - swapOffset; + abs = abs < 0 ? -abs : abs; + swapOffset = swp_offset(temp); + if (abs > 8) + // the two swap entries are too far, give up! 
+ break; + page = read_swap_cache_async(temp, vma, addr); + if (!page) + return; + page_cache_release(page); + readahead++; + } + + swapOffset = swp_offset(entry); + while (readahead < MAX_SERIES_LENGTH) { + prev = pte--; + if (prev - (pte_t*) pmd < 0) + break; + spin_lock(ptl); + if (!(!pte_present(*prev) && pte_swapped(*prev))) { + spin_unlock(ptl); + break; + } + temp = pte_to_swp_entry(*prev); + spin_unlock(ptl); + if (swp_type(temp) != swapType) + break; + abs = swp_offset(temp) - swapOffset; + abs = abs < 0 ? -abs : abs; + swapOffset = swp_offset(temp); + if (abs > 8) + // the two swap entries are too far, give up! + break; + page = read_swap_cache_async(temp, vma, addr); + if (!page) + return; + page_cache_release(page); + readahead++; + } +} + +/* * We enter with non-exclusive mmap_sem (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. * We return with mmap_sem still held, but pte unmapped and unlocked. @@ -2001,7 +2115,7 @@ page = lookup_swap_cache(entry); if (!page) { grab_swap_token(); /* Contend for token _before_ read-in */ - swapin_readahead(entry, address, vma); + pps_swapin_readahead(entry, address, vma, page_table, pmd); page = read_swap_cache_async(entry, vma, address); if (!page) { /* @@ -2021,7 +2135,8 @@ } delayacct_clear_flag(DELAYACCT_PF_SWAPIN); - mark_page_accessed(page); + if (!(vma->vm_flags & VM_PURE_PRIVATE)) + mark_page_accessed(page); lock_page(page); /* @@ -2033,6 +2148,10 @@ if (unlikely(!PageUptodate(page))) { ret = VM_FAULT_SIGBUS; + if (vma->vm_flags & VM_PURE_PRIVATE) { + lru_cache_add_active(page); + mark_page_accessed(page); + } goto out_nomap; } @@ -2053,6 +2172,11 @@ if (vm_swap_full()) remove_exclusive_swap_page(page); unlock_page(page); + if (vma->vm_flags & VM_PURE_PRIVATE) { + atomic_dec(&pps_info.swapped_count); + atomic_inc(&pps_info.total); + atomic_inc(&pps_info.pte_count); + } if (write_access) { if (do_wp_page(mm, vma, address, @@ -2104,8 +2228,13 @@ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); if (!pte_none(*page_table)) goto release; + if (!(vma->vm_flags & VM_PURE_PRIVATE)) + lru_cache_add_active(page); + else { + atomic_inc(&pps_info.total); + atomic_inc(&pps_info.pte_count); + } inc_mm_counter(mm, anon_rss); - lru_cache_add_active(page); page_add_new_anon_rmap(page, vma, address); } else { /* Map the ZERO_PAGE - vm_page_prot is readonly */ @@ -2392,6 +2521,22 @@ old_entry = entry = *pte; if (!pte_present(entry)) { + if (pte_unmapped(entry)) { + BUG_ON(!(vma->vm_flags & VM_PURE_PRIVATE)); + atomic_dec(&pps_info.unmapped_count); + atomic_inc(&pps_info.pte_count); + struct page* page = pte_page(entry); + pte_t temp_pte = mk_pte(page, vma->vm_page_prot); + pte = pte_offset_map_lock(mm, pmd, address, &ptl); + if (unlikely(pte_same(*pte, entry))) { + page_add_new_anon_rmap(page, vma, address); + set_pte_at(mm, address, pte, temp_pte); + update_mmu_cache(vma, address, temp_pte); + lazy_mmu_prot_update(temp_pte); + } + pte_unmap_unlock(pte, ptl); + return VM_FAULT_MINOR; + } if (pte_none(entry)) { if (vma->vm_ops) { if (vma->vm_ops->nopage) @@ -2685,3 +2830,118 @@ return buf - old_buf; } + +static void migrate_back_pte_range(struct mm_struct* mm, pmd_t *pmd, struct + vm_area_struct *vma, unsigned long addr, unsigned long end) +{ + struct page* page; + pte_t entry; + pte_t *pte; + spinlock_t* ptl; + int pps_pte = 0; + int pps_unmapped = 0; + int pps_swapped = 0; + + pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + do { + if (!pte_present(*pte) && pte_unmapped(*pte)) { + page = 
pte_page(*pte); + entry = mk_pte(page, vma->vm_page_prot); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + set_pte_at(mm, addr, pte, entry); + BUG_ON(page == ZERO_PAGE(addr)); + page_add_new_anon_rmap(page, vma, addr); + lru_cache_add_active(page); + pps_unmapped++; + } else if (pte_present(*pte)) { + page = pte_page(*pte); + if (page == ZERO_PAGE(addr)) + continue; + lru_cache_add_active(page); + pps_pte++; + } else if (!pte_present(*pte) && pte_swapped(*pte)) + pps_swapped++; + } while (pte++, addr += PAGE_SIZE, addr != end); + pte_unmap_unlock(pte - 1, ptl); + lru_add_drain(); + atomic_sub(pps_pte + pps_unmapped, &pps_info.total); + atomic_sub(pps_pte, &pps_info.pte_count); + atomic_sub(pps_unmapped, &pps_info.unmapped_count); + atomic_sub(pps_swapped, &pps_info.swapped_count); +} + +static void migrate_back_pmd_range(struct mm_struct* mm, pud_t *pud, struct + vm_area_struct *vma, unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + migrate_back_pte_range(mm, pmd, vma, addr, next); + } while (pmd++, addr = next, addr != end); +} + +static void migrate_back_pud_range(struct mm_struct* mm, pgd_t *pgd, struct + vm_area_struct *vma, unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + + pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + migrate_back_pmd_range(mm, pud, vma, addr, next); + } while (pud++, addr = next, addr != end); +} + +// migrate all pages of pure private vma back to Linux legacy memory management. +static void migrate_back_legacy_linux(struct mm_struct* mm, struct vm_area_struct* vma) +{ + pgd_t* pgd; + unsigned long next; + unsigned long addr = vma->vm_start; + unsigned long end = vma->vm_end; + + pgd = pgd_offset(mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) + continue; + migrate_back_pud_range(mm, pgd, vma, addr, next); + } while (pgd++, addr = next, addr != end); +} + +void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma) +{ + int condition = VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \ + VM_GROWSDOWN | VM_GROWSUP | \ + VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | \ + VM_ACCOUNT | VM_PURE_PRIVATE; + if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) { + vma->vm_flags |= VM_PURE_PRIVATE; + if (list_empty(&mm->mmlist)) { + spin_lock(&mmlist_lock); + if (list_empty(&mm->mmlist)) + list_add(&mm->mmlist, &init_mm.mmlist); + spin_unlock(&mmlist_lock); + } + } +} + +void leave_pps(struct vm_area_struct* vma, int migrate_flag) +{ + struct mm_struct* mm = vma->vm_mm; + + if (vma->vm_flags & VM_PURE_PRIVATE) { + vma->vm_flags &= ~VM_PURE_PRIVATE; + if (migrate_flag) + migrate_back_legacy_linux(mm, vma); + } +} Index: linux-2.6.19/mm/mmap.c =================================================================== --- linux-2.6.19.orig/mm/mmap.c 2007-01-22 13:58:36.000000000 +0800 +++ linux-2.6.19/mm/mmap.c 2007-01-22 14:00:00.000000000 +0800 @@ -229,6 +229,7 @@ if (vma->vm_file) fput(vma->vm_file); mpol_free(vma_policy(vma)); + leave_pps(vma, 0); kmem_cache_free(vm_area_cachep, vma); return next; } @@ -620,6 +621,7 @@ fput(file); mm->map_count--; mpol_free(vma_policy(next)); + leave_pps(next, 0); kmem_cache_free(vm_area_cachep, next); /* * In mprotect's case 6 (see comments on vma_merge), @@ -1112,6 +1114,8 @@ if ((vm_flags & 
(VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT)) vma->vm_flags &= ~VM_ACCOUNT; + enter_pps(mm, vma); + /* Can addr have changed?? * * Answer: Yes, several device drivers can do it in their @@ -1138,6 +1142,7 @@ fput(file); } mpol_free(vma_policy(vma)); + leave_pps(vma, 0); kmem_cache_free(vm_area_cachep, vma); } out: @@ -1165,6 +1170,7 @@ unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end); charged = 0; free_vma: + leave_pps(vma, 0); kmem_cache_free(vm_area_cachep, vma); unacct_error: if (charged) @@ -1742,6 +1748,10 @@ /* most fields are the same, copy all, and then fixup */ *new = *vma; + if (new->vm_flags & VM_PURE_PRIVATE) { + new->vm_flags &= ~VM_PURE_PRIVATE; + enter_pps(mm, new); + } if (new_below) new->vm_end = addr; @@ -1950,6 +1960,7 @@ vma->vm_flags = flags; vma->vm_page_prot = protection_map[flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]; + enter_pps(mm, vma); vma_link(mm, vma, prev, rb_link, rb_parent); out: mm->total_vm += len >> PAGE_SHIFT; @@ -2073,6 +2084,10 @@ get_file(new_vma->vm_file); if (new_vma->vm_ops && new_vma->vm_ops->open) new_vma->vm_ops->open(new_vma); + if (new_vma->vm_flags & VM_PURE_PRIVATE) { + new_vma->vm_flags &= ~VM_PURE_PRIVATE; + enter_pps(mm, new_vma); + } vma_link(mm, new_vma, prev, rb_link, rb_parent); } } Index: linux-2.6.19/mm/rmap.c =================================================================== --- linux-2.6.19.orig/mm/rmap.c 2007-01-22 13:58:36.000000000 +0800 +++ linux-2.6.19/mm/rmap.c 2007-01-22 14:00:00.000000000 +0800 @@ -618,6 +618,7 @@ spinlock_t *ptl; int ret = SWAP_AGAIN; + BUG_ON(vma->vm_flags & VM_PURE_PRIVATE); address = vma_address(page, vma); if (address == -EFAULT) goto out; @@ -676,7 +677,7 @@ #endif } set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); - BUG_ON(pte_file(*pte)); + BUG_ON(!pte_swapped(*pte)); } else #ifdef CONFIG_MIGRATION if (migration) { Index: linux-2.6.19/mm/swap_state.c =================================================================== --- linux-2.6.19.orig/mm/swap_state.c 2006-11-30 05:57:37.000000000 +0800 +++ linux-2.6.19/mm/swap_state.c 2007-01-22 14:00:00.000000000 +0800 @@ -354,7 +354,8 @@ /* * Initiate read into locked page and return. 
*/ - lru_cache_add_active(new_page); + if (vma == NULL || !(vma->vm_flags & VM_PURE_PRIVATE)) + lru_cache_add_active(new_page); swap_readpage(NULL, new_page); return new_page; } Index: linux-2.6.19/mm/swapfile.c =================================================================== --- linux-2.6.19.orig/mm/swapfile.c 2007-01-22 13:58:36.000000000 +0800 +++ linux-2.6.19/mm/swapfile.c 2007-01-23 12:31:38.000000000 +0800 @@ -501,6 +501,166 @@ } #endif +static int pps_test_swap_type(struct mm_struct* mm, pmd_t* pmd, pte_t* pte, int + type, struct page** ret_page) +{ + spinlock_t* ptl = pte_lockptr(mm, pmd); + swp_entry_t entry; + struct page* page; + + spin_lock(ptl); + if (!pte_present(*pte) && pte_swapped(*pte)) { + entry = pte_to_swp_entry(*pte); + if (swp_type(entry) == type) { + *ret_page = NULL; + spin_unlock(ptl); + return 1; + } + } else { + page = pfn_to_page(pte_pfn(*pte)); + if (PageSwapCache(page)) { + entry.val = page_private(page); + if (swp_type(entry) == type) { + page_cache_get(page); + *ret_page = page; + spin_unlock(ptl); + return 1; + } + } + } + spin_unlock(ptl); + return 0; +} + +static int pps_swapoff_scan_ptes(struct mm_struct* mm, struct vm_area_struct* + vma, pmd_t* pmd, unsigned long addr, unsigned long end, int type) +{ + pte_t *pte; + struct page* page; + + pte = pte_offset_map(pmd, addr); + do { + while (pps_test_swap_type(mm, pmd, pte, type, &page)) { + if (page == NULL) { + switch (__handle_mm_fault(mm, vma, addr, 0)) { + case VM_FAULT_SIGBUS: + case VM_FAULT_OOM: + return -ENOMEM; + case VM_FAULT_MINOR: + case VM_FAULT_MAJOR: + break; + default: + BUG(); + } + } else { + wait_on_page_locked(page); + wait_on_page_writeback(page); + lock_page(page); + if (!PageSwapCache(page)) { + unlock_page(page); + page_cache_release(page); + break; + } + wait_on_page_writeback(page); + delete_from_swap_cache(page); + unlock_page(page); + page_cache_release(page); + break; + } + } + } while (pte++, addr += PAGE_SIZE, addr != end); + return 0; +} + +static int pps_swapoff_pmd_range(struct mm_struct* mm, struct vm_area_struct* + vma, pud_t* pud, unsigned long addr, unsigned long end, int type) +{ + unsigned long next; + int ret; + pmd_t* pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + ret = pps_swapoff_scan_ptes(mm, vma, pmd, addr, next, type); + if (ret == -ENOMEM) + return ret; + } while (pmd++, addr = next, addr != end); + return 0; +} + +static int pps_swapoff_pud_range(struct mm_struct* mm, struct vm_area_struct* + vma, pgd_t* pgd, unsigned long addr, unsigned long end, int type) +{ + unsigned long next; + int ret; + pud_t* pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + ret = pps_swapoff_pmd_range(mm, vma, pud, addr, next, type); + if (ret == -ENOMEM) + return ret; + } while (pud++, addr = next, addr != end); + return 0; +} + +static int pps_swapoff_pgd_range(struct mm_struct* mm, struct vm_area_struct* + vma, int type) +{ + unsigned long next; + unsigned long addr = vma->vm_start; + unsigned long end = vma->vm_end; + int ret; + pgd_t* pgd = pgd_offset(mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) + continue; + ret = pps_swapoff_pud_range(mm, vma, pgd, addr, next, type); + if (ret == -ENOMEM) + return ret; + } while (pgd++, addr = next, addr != end); + return 0; +} + +static int pps_swapoff(int type) +{ + struct vm_area_struct* vma; + struct list_head *pos; + struct mm_struct *prev, *mm; + int ret = 
0; + + prev = mm = &init_mm; + pos = &init_mm.mmlist; + atomic_inc(&prev->mm_users); + spin_lock(&mmlist_lock); + while ((pos = pos->next) != &init_mm.mmlist) { + mm = list_entry(pos, struct mm_struct, mmlist); + if (!atomic_inc_not_zero(&mm->mm_users)) + continue; + spin_unlock(&mmlist_lock); + mmput(prev); + prev = mm; + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) { + if (!(vma->vm_flags & VM_PURE_PRIVATE)) + continue; + if (vma->vm_flags & VM_LOCKED) + continue; + ret = pps_swapoff_pgd_range(mm, vma, type); + if (ret == -ENOMEM) + break; + } + up_read(&mm->mmap_sem); + spin_lock(&mmlist_lock); + } + spin_unlock(&mmlist_lock); + mmput(prev); + return ret; +} + /* * No need to decide whether this PTE shares the swap entry with others, * just let do_wp_page work it out if a write is requested later - to @@ -694,6 +854,12 @@ int reset_overflow = 0; int shmem; + // Let's first read all pps pages back! Note, it's one-to-one mapping. + retval = pps_swapoff(type); + if (retval == -ENOMEM) // something was wrong. + return -ENOMEM; + // Now, the remain pages are shared pages, go ahead! + /* * When searching mms for an entry, a good strategy is to * start at the first mm we freed the previous entry from @@ -914,16 +1080,20 @@ */ static void drain_mmlist(void) { - struct list_head *p, *next; + // struct list_head *p, *next; unsigned int i; for (i = 0; i < nr_swapfiles; i++) if (swap_info[i].inuse_pages) return; + /* + * Now, init_mm.mmlist list not only is used by SwapDevice but also is + * used by PPS, see Documentation/vm_pps.txt. spin_lock(&mmlist_lock); list_for_each_safe(p, next, &init_mm.mmlist) list_del_init(p); spin_unlock(&mmlist_lock); + */ } /* Index: linux-2.6.19/mm/vmscan.c =================================================================== --- linux-2.6.19.orig/mm/vmscan.c 2007-01-22 13:58:36.000000000 +0800 +++ linux-2.6.19/mm/vmscan.c 2007-01-23 12:39:48.000000000 +0800 @@ -66,6 +66,10 @@ int swappiness; int all_unreclaimable; + + /* pps control command. See Documentation/vm_pps.txt. */ + int may_reclaim; + int reclaim_node; }; /* @@ -1097,6 +1101,443 @@ return ret; } +// pps fields. +static wait_queue_head_t kppsd_wait; +static struct scan_control wakeup_sc; +struct pps_info pps_info = { + .total = ATOMIC_INIT(0), + .pte_count = ATOMIC_INIT(0), // stage 1 and 2. + .unmapped_count = ATOMIC_INIT(0), // stage 3 and 4. + .swapped_count = ATOMIC_INIT(0) // stage 6. +}; +// pps end. + +struct series_t { + pte_t orig_ptes[MAX_SERIES_LENGTH]; + pte_t* ptes[MAX_SERIES_LENGTH]; + struct page* pages[MAX_SERIES_LENGTH]; + int series_length; + int series_stage; +} series; + +static int get_series_stage(pte_t* pte, int index) +{ + series.orig_ptes[index] = *pte; + series.ptes[index] = pte; + if (pte_present(series.orig_ptes[index])) { + struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index])); + series.pages[index] = page; + if (page == ZERO_PAGE(addr)) // reserved page is exclusive from us. 
+			return 7;
+		if (pte_young(series.orig_ptes[index])) {
+			return 1;
+		} else
+			return 2;
+	} else if (pte_unmapped(series.orig_ptes[index])) {
+		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+		series.pages[index] = page;
+		if (!PageSwapCache(page))
+			return 3;
+		else {
+			if (PageWriteback(page) || PageDirty(page))
+				return 4;
+			else
+				return 5;
+		}
+	} else // pte_swapped -- SwappedPTE
+		return 6;
+}
+
+static void find_series(pte_t** start, unsigned long* addr, unsigned long end)
+{
+	int i;
+	int series_stage = get_series_stage((*start)++, 0);
+	*addr += PAGE_SIZE;
+
+	for (i = 1; i < MAX_SERIES_LENGTH && *addr < end; i++, (*start)++,
+			*addr += PAGE_SIZE) {
+		if (series_stage != get_series_stage(*start, i))
+			break;
+	}
+	series.series_stage = series_stage;
+	series.series_length = i;
+}
+
+struct delay_tlb_task delay_tlb_tasks[32] = { [0 ... 31] = {0} };
+
+void timer_flush_tlb_tasks(void* data)
+{
+	int i;
+#ifdef CONFIG_X86
+	int flag = 0;
+#endif
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL &&
+				cpu_isset(smp_processor_id(),
+					delay_tlb_tasks[i].mm->cpu_vm_mask) &&
+				cpu_isset(smp_processor_id(),
+					delay_tlb_tasks[i].cpu_mask)) {
+#ifdef CONFIG_X86
+			flag = 1;
+#else
+			// smp::local_flush_tlb_range(delay_tlb_tasks[i]);
+#endif
+			cpu_clear(smp_processor_id(), delay_tlb_tasks[i].cpu_mask);
+		}
+	}
+#ifdef CONFIG_X86
+	if (flag)
+		local_flush_tlb();
+#endif
+}
+
+static struct delay_tlb_task* delay_task = NULL;
+static int vma_index = 0;
+
+static struct delay_tlb_task* search_free_tlb_tasks_slot(void)
+{
+	struct delay_tlb_task* ret = NULL;
+	int i;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+				ret = &delay_tlb_tasks[i];
+			}
+		} else
+			ret = &delay_tlb_tasks[i];
+	}
+	if (!ret) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	return ret;
+}
+
+static void init_delay_task(struct mm_struct* mm)
+{
+	cpus_clear(delay_task->cpu_mask);
+	vma_index = 0;
+	delay_task->mm = mm;
+}
+
+/*
+ * We will be working on the mm, so force a flush of it if necessary.
+ */
+static void start_tlb_tasks(struct mm_struct* mm)
+{
+	int i, flag = 0;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm == mm) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+			} else
+				flag = 1;
+		}
+	}
+	if (flag) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	BUG_ON(delay_task != NULL);
+	delay_task = search_free_tlb_tasks_slot();
+	init_delay_task(mm);
+}
+
+static void end_tlb_tasks(void)
+{
+	atomic_inc(&delay_task->mm->mm_users);
+	delay_task->cpu_mask = delay_task->mm->cpu_vm_mask;
+	delay_task = NULL;
+#ifndef CONFIG_SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
+}
+
+static void fill_in_tlb_tasks(struct vm_area_struct* vma, unsigned long addr,
+		unsigned long end)
+{
+	struct mm_struct* mm;
+	// First, try to combine this task with the previous one.
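+	// For example, if the previous slot already covers
+	// { vma, 0x1000, 0x5000 } and this call adds { vma, 0x5000, 0x9000 },
+	// the slot is extended to { vma, 0x1000, 0x9000 } so that one
+	// flush_tlb_range covers both. (The addresses are illustrative.)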
+ if (vma_index != 0 && delay_task->vma[vma_index - 1] == vma && + delay_task->end[vma_index - 1] == addr) { + delay_task->end[vma_index - 1] = end; + return; + } +fill_it: + if (vma_index != 32) { + delay_task->vma[vma_index] = vma; + delay_task->start[vma_index] = addr; + delay_task->end[vma_index] = end; + vma_index++; + return; + } + mm = delay_task->mm; + end_tlb_tasks(); + + delay_task = search_free_tlb_tasks_slot(); + init_delay_task(mm); + goto fill_it; +} + +static void shrink_pvma_scan_ptes(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned long addr, + unsigned long end) +{ + int i, statistic; + spinlock_t* ptl = pte_lockptr(mm, pmd); + pte_t* pte = pte_offset_map(pmd, addr); + int anon_rss = 0; + struct pagevec freed_pvec; + int may_enter_fs = (sc->gfp_mask & (__GFP_FS | __GFP_IO)); + struct address_space* mapping = &swapper_space; + + pagevec_init(&freed_pvec, 1); + do { + memset(&series, 0, sizeof(struct series_t)); + find_series(&pte, &addr, end); + if (sc->may_reclaim == 0 && series.series_stage == 5) + continue; + switch (series.series_stage) { + case 1: // PTE -- untouched PTE. + for (i = 0; i < series.series_length; i++) { + struct page* page = series.pages[i]; + lock_page(page); + spin_lock(ptl); + if (unlikely(pte_same(*series.ptes[i], + series.orig_ptes[i]))) { + if (pte_dirty(*series.ptes[i])) + set_page_dirty(page); + set_pte_at(mm, addr + i * PAGE_SIZE, + series.ptes[i], + pte_mkold(pte_mkclean(*series.ptes[i]))); + } + spin_unlock(ptl); + unlock_page(page); + } + fill_in_tlb_tasks(vma, addr, addr + (PAGE_SIZE * + series.series_length)); + break; + case 2: // untouched PTE -- UnmappedPTE. + /* + * Note in stage 1, we've flushed TLB in fill_in_tlb_tasks, so + * if it's still clear here, we can shift it to Unmapped type. + * + * If some architecture doesn't support atomic cmpxchg + * instruction or can't atomically set the access bit after + * they touch a pte at first, combine stage 1 with stage 2, and + * send IPI immediately in fill_in_tlb_tasks. + */ + spin_lock(ptl); + statistic = 0; + for (i = 0; i < series.series_length; i++) { + if (unlikely(pte_same(*series.ptes[i], + series.orig_ptes[i]))) { + pte_t pte_unmapped = series.orig_ptes[i]; + pte_unmapped.pte_low &= ~_PAGE_PRESENT; + pte_unmapped.pte_low |= _PAGE_UNMAPPED; + if (cmpxchg(&series.ptes[i]->pte_low, + series.orig_ptes[i].pte_low, + pte_unmapped.pte_low) != + series.orig_ptes[i].pte_low) + continue; + page_remove_rmap(series.pages[i], vma); + anon_rss--; + statistic++; + } + } + atomic_add(statistic, &pps_info.unmapped_count); + atomic_sub(statistic, &pps_info.pte_count); + spin_unlock(ptl); + break; + case 3: // Attach SwapPage to PrivatePage. + /* + * A better arithmetic should be applied to Linux SwapDevice to + * allocate fake continual SwapPages which are close to each + * other, the offset between two close SwapPages is less than 8. + */ + if (sc->may_swap) { + for (i = 0; i < series.series_length; i++) { + lock_page(series.pages[i]); + if (!PageSwapCache(series.pages[i])) { + if (!add_to_swap(series.pages[i], + GFP_ATOMIC)) { + unlock_page(series.pages[i]); + break; + } + } + unlock_page(series.pages[i]); + } + } + break; + case 4: // SwapPage isn't consistent with PrivatePage. + /* + * A mini version pageout(). 
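+			 * It walks the series and, for each dirty page not
+			 * already under writeback, starts an asynchronous
+			 * swap_writepage (WB_SYNC_NONE, nonblocking). Pages
+			 * still under writeback are left alone; a later pass
+			 * will see them as stage 5 once the write completes.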
+			 *
+			 * Current swap space can't commit multiple pages
+			 * together:(
+			 */
+			if (sc->may_writepage && may_enter_fs) {
+				for (i = 0; i < series.series_length; i++) {
+					struct page* page = series.pages[i];
+					int res;
+
+					if (!may_write_to_queue(mapping->backing_dev_info))
+						break;
+					lock_page(page);
+					if (!PageDirty(page) || PageWriteback(page)) {
+						unlock_page(page);
+						continue;
+					}
+					clear_page_dirty_for_io(page);
+					struct writeback_control wbc = {
+						.sync_mode = WB_SYNC_NONE,
+						.nr_to_write = SWAP_CLUSTER_MAX,
+						.nonblocking = 1,
+						.for_reclaim = 1,
+					};
+					page_cache_get(page);
+					SetPageReclaim(page);
+					res = swap_writepage(page, &wbc);
+					if (res < 0) {
+						handle_write_error(mapping, page, res);
+						ClearPageReclaim(page);
+						page_cache_release(page);
+						break;
+					}
+					if (!PageWriteback(page))
+						ClearPageReclaim(page);
+					page_cache_release(page);
+				}
+			}
+			break;
+		case 5: // UnmappedPTE -- SwappedPTE, reclaim PrivatePage.
+			statistic = 0;
+			for (i = 0; i < series.series_length; i++) {
+				struct page* page = series.pages[i];
+				if (!(page_to_nid(page) == sc->reclaim_node ||
+						sc->reclaim_node == -1))
+					continue;
+
+				lock_page(page);
+				spin_lock(ptl);
+				if (!pte_same(*series.ptes[i], series.orig_ptes[i]) ||
+						/* We're racing with get_user_pages. */
+						(PageSwapCache(page) ? page_count(page) > 2 :
+						 page_count(page) > 1)) {
+					spin_unlock(ptl);
+					unlock_page(page);
+					continue;
+				}
+				statistic++;
+				swp_entry_t entry = { .val = page_private(page) };
+				swap_duplicate(entry);
+				pte_t pte_swp = swp_entry_to_pte(entry);
+				set_pte_at(mm, addr + i * PAGE_SIZE,
+						series.ptes[i], pte_swp);
+				spin_unlock(ptl);
+				if (PageSwapCache(page) && !PageWriteback(page))
+					delete_from_swap_cache(page);
+				unlock_page(page);
+
+				if (!pagevec_add(&freed_pvec, page))
+					__pagevec_release_nonlru(&freed_pvec);
+			}
+			atomic_add(statistic, &pps_info.swapped_count);
+			atomic_sub(statistic, &pps_info.unmapped_count);
+			atomic_sub(statistic, &pps_info.total);
+			break;
+		case 6:
+			// NULL operation!
+ break; + } + } while (addr < end); + add_mm_counter(mm, anon_rss, anon_rss); + if (pagevec_count(&freed_pvec)) + __pagevec_release_nonlru(&freed_pvec); +} + +static void shrink_pvma_pmd_range(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma, pud_t* pud, unsigned long addr, + unsigned long end) +{ + unsigned long next; + pmd_t* pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next); + } while (pmd++, addr = next, addr != end); +} + +static void shrink_pvma_pud_range(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned long addr, + unsigned long end) +{ + unsigned long next; + pud_t* pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next); + } while (pud++, addr = next, addr != end); +} + +static void shrink_pvma_pgd_range(struct scan_control* sc, struct mm_struct* + mm, struct vm_area_struct* vma) +{ + unsigned long next; + unsigned long addr = vma->vm_start; + unsigned long end = vma->vm_end; + pgd_t* pgd = pgd_offset(mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_or_clear_bad(pgd)) + continue; + shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next); + } while (pgd++, addr = next, addr != end); +} + +static void shrink_private_vma(struct scan_control* sc) +{ + struct vm_area_struct* vma; + struct list_head *pos; + struct mm_struct *prev, *mm; + + prev = mm = &init_mm; + pos = &init_mm.mmlist; + atomic_inc(&prev->mm_users); + spin_lock(&mmlist_lock); + while ((pos = pos->next) != &init_mm.mmlist) { + mm = list_entry(pos, struct mm_struct, mmlist); + if (!atomic_inc_not_zero(&mm->mm_users)) + continue; + spin_unlock(&mmlist_lock); + mmput(prev); + prev = mm; + start_tlb_tasks(mm); + if (down_read_trylock(&mm->mmap_sem)) { + for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) { + if (!(vma->vm_flags & VM_PURE_PRIVATE)) + continue; + if (vma->vm_flags & VM_LOCKED) + continue; + shrink_pvma_pgd_range(sc, mm, vma); + } + up_read(&mm->mmap_sem); + } + end_tlb_tasks(); + spin_lock(&mmlist_lock); + } + spin_unlock(&mmlist_lock); + mmput(prev); +} + /* * For kswapd, balance_pgdat() will work across all this node's zones until * they are all at pages_high. 
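The shrink_private_vma walker above (and pps_swapoff in mm/swapfile.c) share
one pattern worth spelling out: pin the current mm through mm_users so that
mmlist_lock can be dropped during the heavy work, then hand the pin forward
to the next mm. A stand-alone single-threaded model (illustrative only; the
locking is reduced to comments):

	#include <assert.h>
	#include <stdio.h>

	struct mm {
		int users;		/* models mm_struct::mm_users */
		const char *name;
		struct mm *next;	/* models the circular init_mm.mmlist */
	};

	static void work_unlocked(struct mm *mm)
	{
		printf("shrinking %s (users=%d)\n", mm->name, mm->users);
	}

	int main(void)
	{
		struct mm c = { 1, "proc-c", NULL };
		struct mm b = { 1, "proc-b", &c };
		struct mm a = { 1, "proc-a", &b };
		struct mm head = { 1, "init_mm", &a };
		struct mm *prev = &head, *mm;

		c.next = &head;		/* close the circular list */
		prev->users++;		/* atomic_inc(&prev->mm_users) */
		/* spin_lock(&mmlist_lock) */
		for (mm = head.next; mm != &head; mm = mm->next) {
			mm->users++;	/* pin mm before dropping the lock */
			/* spin_unlock(&mmlist_lock) */
			prev->users--;	/* mmput(prev) */
			prev = mm;
			work_unlocked(mm);	/* mm cannot go away here */
			/* spin_lock(&mmlist_lock) */
		}
		/* spin_unlock(&mmlist_lock) */
		prev->users--;		/* mmput(prev) */
		assert(head.users == 1 && a.users == 1 &&
		       b.users == 1 && c.users == 1);
		return 0;
	}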
@@ -1144,6 +1585,11 @@
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
+	wakeup_sc = sc;
+	wakeup_sc.may_reclaim = 1;
+	wakeup_sc.reclaim_node = pgdat->node_id;
+	wake_up_interruptible(&kppsd_wait);
+
 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;
 
@@ -1723,3 +2169,39 @@
 	return __zone_reclaim(zone, gfp_mask, order);
 }
 #endif
+
+static int kppsd(void* p)
+{
+	struct task_struct *tsk = current;
+	int timeout;
+	DEFINE_WAIT(wait);
+	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	struct scan_control default_sc;
+	default_sc.gfp_mask = GFP_KERNEL;
+	default_sc.may_writepage = 1;
+	default_sc.may_swap = 1;
+	default_sc.may_reclaim = 0;
+	default_sc.reclaim_node = -1;
+
+	while (1) {
+		try_to_freeze();
+		prepare_to_wait(&kppsd_wait, &wait, TASK_INTERRUPTIBLE);
+		timeout = schedule_timeout(2000);
+		finish_wait(&kppsd_wait, &wait);
+
+		if (timeout)
+			shrink_private_vma(&wakeup_sc);
+		else
+			shrink_private_vma(&default_sc);
+	}
+	return 0;
+}
+
+static int __init kppsd_init(void)
+{
+	init_waitqueue_head(&kppsd_wait);
+	kthread_run(kppsd, NULL, "kppsd");
+	return 0;
+}
+
+module_init(kppsd_init)
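To see the patch's accounting at runtime, the four counters it adds to
/proc/meminfo can be read with a few lines of user-space C (this assumes the
patch is applied and the fields are formatted exactly as in the proc_misc.c
hunk above):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (strncmp(line, "PPS ", 4) == 0)
				fputs(line, stdout);	/* PPS Total/PTE/Unmapped/Swapped */
		fclose(f);
		return 0;
	}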