Hi Andrew,
I redid the qsbench runs with a bigger page cluster (2^4). It shows
an improvement for both versions, with the patched one still performing
better. Rik hinted to me that we can make the default even bigger once
we are better at avoiding reading unrelated pages. I am currently
testing this. Here are the timings for 2^4 (i.e. twice the default
number of) readahead pages:
vanilla:
1 x 2048M [20 runs] user 101.41/101.06 [1.42] system 11.02/10.83 [0.92] real 368.44/361.31 [48.47]
2 x 1024M [20 runs] user 101.42/101.23 [0.66] system 12.98/13.01 [0.56] real 338.45/338.56 [2.94]
4 x 540M [20 runs] user 101.75/101.62 [1.03] system 10.05/9.52 [1.53] real 371.97/351.88 [77.69]
8 x 280M [20 runs] user 103.35/103.33 [0.63] system 9.80/9.59 [1.72] real 453.48/473.21 [115.61]
16 x 128M [20 runs] user 91.04/91.00 [0.86] system 8.95/9.41 [2.06] real 312.16/342.29 [100.53]
vswapra:
1 x 2048M [20 runs] user 98.47/98.32 [1.33] system 9.85/9.90 [0.92] real 373.95/382.64 [26.77]
2 x 1024M [20 runs] user 96.89/97.00 [0.44] system 9.52/9.48 [1.49] real 288.43/281.55 [53.12]
4 x 540M [20 runs] user 98.74/98.70 [0.92] system 7.62/7.83 [1.25] real 291.15/296.94 [54.85]
8 x 280M [20 runs] user 100.68/100.59 [0.53] system 7.59/7.62 [0.41] real 305.12/311.29 [26.09]
16 x 128M [20 runs] user 88.67/88.50 [1.02] system 6.06/6.22 [0.72] real 205.29/221.65 [42.06]
Furthermore I changed the patch to leave shmem alone for now and added
documentation for the new approach. And I adjusted the changelog a
bit.
Andi, I think the NUMA policy is already taken care of. Can you have
another look at it? Other than that you gave positive feedback - can
I add your acked-by?
Hannes
---
The current swap readahead implementation reads a physically
contiguous group of swap slots around the faulting page to take
advantage of the disk head's position and in the hope that the
surrounding pages will be needed soon as well.
This works as long as the physical swap slot order approximates the
LRU order reasonably well; otherwise it wastes memory and IO bandwidth
on reading in pages that are unlikely to be needed soon.
However, the physical swap slot layout diverges from the LRU order
with increasing swap activity, i.e. under high memory pressure, and
this is exactly the situation where swapin should not waste any
memory or IO bandwidth, as both are the most contended resources at
this point.
Another approximation of the LRU relationship is the VMA order, as
groups of VMA-related pages are usually used together.
This patch combines both the physical and the virtual hint to get a
good approximation of the pages that are sensible to read ahead.
When the two diverge, we either read unrelated data, seek heavily for
related data, or, as this patch does, simply decrease the readahead
effort.
To achieve this, we have essentially two readahead windows of the same
size: one spans the virtual, the other one the physical neighborhood
of the faulting page. We only read where both areas overlap.
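For illustration only (not part of the patch), a standalone userspace
sketch of the window arithmetic, assuming the default page_cluster of 3
and 4K pages:

#include <stdio.h>

#define PAGE_SHIFT	12

int main(void)
{
	unsigned long page_cluster = 3;			/* assumed default */
	unsigned long cluster = 1UL << page_cluster;	/* 8 swap slots */
	unsigned long window = cluster << PAGE_SHIFT;	/* 32K of virtual space */

	unsigned long addr = 0x7f1234567000UL;		/* hypothetical faulting address */
	unsigned long offset = 1234;			/* hypothetical faulting swap slot */

	/* Physical window: aligned block of swap slots around the fault */
	unsigned long pmin = offset & ~(cluster - 1);
	unsigned long pmax = pmin + cluster;

	/* Virtual window: aligned block of pages around the fault */
	unsigned long start = addr & ~(window - 1);
	unsigned long end = start + window;

	printf("swap slots [%lu, %lu], virtual [%#lx, %#lx)\n",
	       pmin, pmax, start, end);

	/*
	 * Readahead candidates are the pages in [start, end) whose swap
	 * slots fall between pmin and pmax; everything else is skipped.
	 */
	return 0;
}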
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andi Kleen <[email protected]>
---
mm/swap_state.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 99 insertions(+), 16 deletions(-)
version 2:
o fall back to physical ra window for shmem
o add documentation to the new ra algorithm
qsbench, 20 runs, 1.7GB RAM, 2GB swap, "mean (standard deviation) median":
             vanilla                     vswapra
 1 x 2048M   391.25 ( 71.76) 384.56      445.55 ( 83.19) 415.41
 2 x 1024M   384.25 ( 75.00) 423.08      290.26 ( 31.38) 299.51
 4 x  540M   553.91 (100.02) 554.57      336.58 ( 52.49) 331.52
 8 x  280M   561.08 ( 82.36) 583.12      319.13 ( 43.17) 307.69
16 x  128M   285.51 (113.20) 236.62      214.24 ( 62.37) 214.15
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -325,27 +325,14 @@ struct page *read_swap_cache_async(swp_e
return found_page;
}
-/**
- * swapin_readahead - swap in pages in hope we need them soon
- * @entry: swap entry of this memory
- * @gfp_mask: memory allocation flags
- * @vma: user vma this address belongs to
- * @addr: target address for mempolicy
- *
- * Returns the struct page for entry and addr, after queueing swapin.
- *
+/*
* Primitive swap readahead code. We simply read an aligned block of
* (1 << page_cluster) entries in the swap area. This method is chosen
* because it doesn't cost us any seek time. We also make sure to queue
* the 'original' request together with the readahead ones...
- *
- * This has been extended to use the NUMA policies from the mm triggering
- * the readahead.
- *
- * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
*/
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_area_struct *vma, unsigned long addr)
+static struct page *swapin_readahead_phys(swp_entry_t entry, gfp_t gfp_mask,
+ struct vm_area_struct *vma, unsigned long addr)
{
int nr_pages;
struct page *page;
@@ -371,3 +358,99 @@ struct page *swapin_readahead(swp_entry_
lru_add_drain(); /* Push any new pages onto the LRU now */
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
+
+/**
+ * swapin_readahead - swap in pages in hope we need them soon
+ * @entry: swap entry of this memory
+ * @gfp_mask: memory allocation flags
+ * @vma: user vma this address belongs to
+ * @addr: target address for mempolicy
+ *
+ * Returns the struct page for entry and addr, after queueing swapin.
+ *
+ * The readahead window is the virtual area around the faulting page,
+ * where the physical proximity of the swap slots is taken into
+ * account as well.
+ *
+ * While the swap allocation algorithm tries to keep LRU-related pages
+ * together on the swap backing, it is not reliable on heavy thrashing
+ * systems where concurrent reclaimers allocate swap slots and/or most
+ * anonymous memory pages are already in swap cache.
+ *
+ * On the virtual side, subgroups of VMA-related pages are usually
+ * used together, which gives another hint to LRU relationship.
+ *
+ * By taking both aspects into account, we get a good approximation of
+ * which pages are sensible to read together with the faulting one.
+ *
+ * This has been extended to use the NUMA policies from the mm
+ * triggering the readahead.
+ *
+ * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
+ */
+struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ unsigned long start, pos, end;
+ unsigned long pmin, pmax;
+ int cluster, window;
+
+ if (!vma || !vma->vm_mm) /* XXX: shmem case */
+ return swapin_readahead_phys(entry, gfp_mask, vma, addr);
+
+ cluster = 1 << page_cluster;
+ window = cluster << PAGE_SHIFT;
+
+ /* Physical range to read from */
+ pmin = swp_offset(entry) & ~(cluster - 1);
+ pmax = pmin + cluster;
+
+ /* Virtual range to read from */
+ start = addr & ~(window - 1);
+ end = start + window;
+
+ for (pos = start; pos < end; pos += PAGE_SIZE) {
+ struct page *page;
+ swp_entry_t swp;
+ spinlock_t *ptl;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(vma->vm_mm, pos);
+ if (!pgd_present(*pgd))
+ continue;
+ pud = pud_offset(pgd, pos);
+ if (!pud_present(*pud))
+ continue;
+ pmd = pmd_offset(pud, pos);
+ if (!pmd_present(*pmd))
+ continue;
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
+ if (!is_swap_pte(*pte)) {
+ pte_unmap_unlock(pte, ptl);
+ continue;
+ }
+ swp = pte_to_swp_entry(*pte);
+ pte_unmap_unlock(pte, ptl);
+
+ if (swp_type(swp) != swp_type(entry))
+ continue;
+ /*
+ * Dont move the disk head too far away. This also
+ * throttles readahead while thrashing, where virtual
+ * order diverges more and more from physical order.
+ */
+ if (swp_offset(swp) > pmax)
+ continue;
+ if (swp_offset(swp) < pmin)
+ continue;
+ page = read_swap_cache_async(swp, gfp_mask, vma, pos);
+ if (!page)
+ continue;
+ page_cache_release(page);
+ }
+ lru_add_drain(); /* Push any new pages onto the LRU now */
+ return read_swap_cache_async(entry, gfp_mask, vma, addr);
+}
On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:
> + *
> + * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> + */
> +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> + unsigned long start, pos, end;
> + unsigned long pmin, pmax;
> + int cluster, window;
> +
> + if (!vma || !vma->vm_mm) /* XXX: shmem case */
> + return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> +
> + cluster = 1 << page_cluster;
> + window = cluster << PAGE_SHIFT;
> +
> + /* Physical range to read from */
> + pmin = swp_offset(entry) & ~(cluster - 1);
Is cluster really properly sign extended on 64bit? Looks a little
dubious. long from the start would be safer
> +
> + /* Virtual range to read from */
> + start = addr & ~(window - 1);
Same.
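For what it's worth, a small standalone userspace sketch (hypothetical
values, not from the patch) of why the int arithmetic happens to work
via sign extension, and how an unsigned 32-bit type would silently
break it:

#include <stdio.h>

int main(void)
{
	unsigned long addr = 0x7f0000123456UL;	/* hypothetical address */
	int window = 1 << 15;			/* int, as in the patch */
	unsigned int uwindow = 1U << 15;	/* the dangerous variant */
	unsigned long lwindow = 1UL << 15;	/* "long from the start" */

	/*
	 * ~(window - 1) is a negative int; converting it to unsigned long
	 * sign-extends, so the high bits of addr survive the mask:
	 */
	printf("int mask:           %#lx\n", addr & ~(window - 1));

	/*
	 * An unsigned 32-bit mask is zero-extended instead and silently
	 * clears the upper 32 bits of addr:
	 */
	printf("unsigned int mask:  %#lx\n", addr & ~(uwindow - 1));

	/* unsigned long from the start avoids the subtlety entirely: */
	printf("unsigned long mask: %#lx\n", addr & ~(lwindow - 1));

	return 0;
}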
> + pgd = pgd_offset(vma->vm_mm, pos);
> + if (!pgd_present(*pgd))
> + continue;
> + pud = pud_offset(pgd, pos);
> + if (!pud_present(*pud))
> + continue;
> + pmd = pmd_offset(pud, pos);
> + if (!pmd_present(*pmd))
> + continue;
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
You could be more efficient here by using the standard mm/* nested loop
pattern that avoids relookup of everything in each iteration. I suppose
it would mainly make a difference with 32bit highpte where mapping a pte
can be somewhat costly. And you would take less locks this way.
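For reference, the rough shape of that pattern, as an untested sketch
along the lines of the walkers in mm/memory.c; the helper names are
made up and the pud/pmd levels are elided:

static void swapra_pte_range(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;

	/* Map and lock the pte page once, then walk all entries in it */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	do {
		if (is_swap_pte(*pte)) {
			/* record pte_to_swp_entry(*pte); issue the actual
			   readahead I/O only after dropping the lock */
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);
}

static void swapra_walk(struct mm_struct *mm, unsigned long addr,
			unsigned long end)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	unsigned long next;

	do {
		next = pgd_addr_end(addr, end);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		/* descend via pud_offset()/pud_addr_end() and
		   pmd_offset()/pmd_addr_end() before calling the
		   pte-level helper above */
	} while (pgd++, addr = next, addr != end);
}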
> + page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> + if (!page)
> + continue;
That's out of memory, break would be better here because prefetch
while oom is usually harmful.
> + page_cache_release(page);
> + }
> + lru_add_drain(); /* Push any new pages onto the LRU now */
> + return read_swap_cache_async(entry, gfp_mask, vma, addr);
Shouldn't that page be already handled in the loop earlier? Why doing that
again? It would be better to remember it from there.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Jun 03, 2009 at 01:34:57AM +0200, Andi Kleen wrote:
> On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:
> > + *
> > + * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> > + */
> > +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > + struct vm_area_struct *vma, unsigned long addr)
> > +{
> > + unsigned long start, pos, end;
> > + unsigned long pmin, pmax;
> > + int cluster, window;
> > +
> > + if (!vma || !vma->vm_mm) /* XXX: shmem case */
> > + return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> > +
> > + cluster = 1 << page_cluster;
> > + window = cluster << PAGE_SHIFT;
> > +
> > + /* Physical range to read from */
> > + pmin = swp_offset(entry) & ~(cluster - 1);
>
> Is cluster really properly sign extended on 64bit? Looks a little
> dubious. long from the start would be safer
Fixed.
> > + /* Virtual range to read from */
> > + start = addr & ~(window - 1);
>
> Same.
Fixed.
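Presumably something along these lines in the next version (a sketch,
not the actual follow-up patch):

	unsigned long cluster, window;	/* long from the start, as suggested */

	cluster = 1UL << page_cluster;
	window = cluster << PAGE_SHIFT;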
> > + pgd = pgd_offset(vma->vm_mm, pos);
> > + if (!pgd_present(*pgd))
> > + continue;
> > + pud = pud_offset(pgd, pos);
> > + if (!pud_present(*pud))
> > + continue;
> > + pmd = pmd_offset(pud, pos);
> > + if (!pmd_present(*pmd))
> > + continue;
> > + pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
>
> You could be more efficient here by using the standard mm/* nested loop
> pattern that avoids relookup of everything in each iteration. I suppose
> it would mainly make a difference with 32bit highpte where mapping a pte
> can be somewhat costly. And you would take less locks this way.
I ran into weird problems here. The above version is actually faster
in the benchmarks than writing a nested level walker or using
walk_page_range(). Still digging but it can take some time. Busy
week :(
> > + page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> > + if (!page)
> > + continue;
>
> That's out of memory, break would be better here because prefetch
> while oom is usually harmful.
It can also happen due to a race with something releasing the swap
slot (i.e. swap_duplicate() fails). But the old version did a break
too and this patch shouldn't do it differently. Fixed.
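So presumably the hunk ends up looking something like this (a sketch,
not the actual follow-up patch):

		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
		if (!page)
			break;	/* OOM or lost a race for the slot: stop prefetching */
		page_cache_release(page);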
> > + page_cache_release(page);
> > + }
> > + lru_add_drain(); /* Push any new pages onto the LRU now */
> > + return read_swap_cache_async(entry, gfp_mask, vma, addr);
>
> Shouldn't that page be already handled in the loop earlier? Why doing that
> again? It would be better to remember it from there.
When doing the nested page table level walker, communicating even more
state back and forth gets pretty ugly. I'll see what I can do.
Thanks for your input Andi,
Hannes
Johannes Weiner wrote:
> On Wed, Jun 03, 2009 at 01:34:57AM +0200, Andi Kleen wrote:
>> On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:
>>> + pgd = pgd_offset(vma->vm_mm, pos);
>>> + if (!pgd_present(*pgd))
>>> + continue;
>>> + pud = pud_offset(pgd, pos);
>>> + if (!pud_present(*pud))
>>> + continue;
>>> + pmd = pmd_offset(pud, pos);
>>> + if (!pmd_present(*pmd))
>>> + continue;
>>> + pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
>> You could be more efficient here by using the standard mm/* nested loop
>> pattern that avoids relookup of everything in each iteration. I suppose
>> it would mainly make a difference with 32bit highpte where mapping a pte
>> can be somewhat costly. And you would take less locks this way.
>
> I ran into weird problems here. The above version is actually faster
> in the benchmarks than writing a nested level walker or using
> walk_page_range(). Still digging but it can take some time. Busy
> week :(
I'm not too worried about not walking the page tables,
because swap is an extreme slow path anyway.
--
All rights reversed.
> I'm not too worried about not walking the page tables,
> because swap is an extreme slow path anyway.
It was more about taking fewer locks and doing fewer mappings.
Highmem pte mappings especially can be quite expensive, because
they have to flush parts of the TLB.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, 3 Jun 2009 00:37:39 +0200
Johannes Weiner <[email protected]> wrote:
> +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> + unsigned long start, pos, end;
> + unsigned long pmin, pmax;
> + int cluster, window;
> +
> + if (!vma || !vma->vm_mm) /* XXX: shmem case */
> + return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> +
> + cluster = 1 << page_cluster;
> + window = cluster << PAGE_SHIFT;
> +
> + /* Physical range to read from */
> + pmin = swp_offset(entry) & ~(cluster - 1);
> + pmax = pmin + cluster;
> +
> + /* Virtual range to read from */
> + start = addr & ~(window - 1);
> + end = start + window;
> +
> + for (pos = start; pos < end; pos += PAGE_SIZE) {
> + struct page *page;
> + swp_entry_t swp;
> + spinlock_t *ptl;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + pgd = pgd_offset(vma->vm_mm, pos);
> + if (!pgd_present(*pgd))
> + continue;
> + pud = pud_offset(pgd, pos);
> + if (!pud_present(*pud))
> + continue;
> + pmd = pmd_offset(pud, pos);
> + if (!pmd_present(*pmd))
> + continue;
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
> + if (!is_swap_pte(*pte)) {
> + pte_unmap_unlock(pte, ptl);
> + continue;
> + }
> + swp = pte_to_swp_entry(*pte);
> + pte_unmap_unlock(pte, ptl);
> +
> + if (swp_type(swp) != swp_type(entry))
> + continue;
> + /*
> + * Dont move the disk head too far away. This also
> + * throttles readahead while thrashing, where virtual
> + * order diverges more and more from physical order.
> + */
> + if (swp_offset(swp) > pmax)
> + continue;
> + if (swp_offset(swp) < pmin)
> + continue;
I wonder (I just wonder..) whether we can add code like the following here?
/* we do _readahead_ here, so we don't want to add too many jobs to the VM/IO */
if (swp != entry)
	gfp_mask &= ~__GFP_WAIT;
> + page = read_swap_cache_async(swp, gfp_mask, vma, pos);
Would that be too slow?
Bye.
-Kame
> + if (!page)
> + continue;
> + page_cache_release(page);
> + }
> + lru_add_drain(); /* Push any new pages onto the LRU now */
> + return read_swap_cache_async(entry, gfp_mask, vma, addr);
> +}
On Thu, Jun 04, 2009 at 10:46:28AM +0900, KAMEZAWA Hiroyuki wrote:
> I wonder (I just wonder..) whether we can add code like the following here?
>
> /* we do _readahead_ here, so we don't want to add too many jobs to the VM/IO */
> if (swp != entry)
> 	gfp_mask &= ~__GFP_WAIT;
> > + page = read_swap_cache_async(swp, gfp_mask, vma, pos);
>
> Would that be too slow?
Good idea, certainly worth evaluating. But not in this patch, I don't
want to change _everything_ at once :-)
Thanks, Hannes
Hi, Hannes.
On Wed, Jun 3, 2009 at 10:27 PM, Johannes Weiner<[email protected]> wrote:
> On Wed, Jun 03, 2009 at 01:34:57AM +0200, Andi Kleen wrote:
>> On Wed, Jun 03, 2009 at 12:37:39AM +0200, Johannes Weiner wrote:
>> > + page = read_swap_cache_async(swp, gfp_mask, vma, pos);
>> > + if (!page)
>> > + continue;
>>
>> That's out of memory, break would be better here because prefetch
>> while oom is usually harmful.
>
> It can also happen due to a race with something releasing the swap
> slot (i.e. swap_duplicate() fails). But the old version did a break
> too and this patch shouldn't do it differently. Fixed.
I think it would be better to read the fault page before the readahead
pages (a rough sketch follows below). That's because:
1) The readahead pages could prevent reading the fault page due to
out-of-memory.
2) If we can't get the fault page, we don't need the extra (readahead)
pages; reading them just wastes memory and IO bandwidth, which is what
you want to avoid.
3) If we read the fault page first and hit OOM, we can also stop the
readahead.
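Something like this untested sketch of the reordering (a fragment,
declarations omitted; tmp is a new local):

	/* Read the faulting entry first; without it, readahead is pointless */
	page = read_swap_cache_async(entry, gfp_mask, vma, addr);
	if (!page)
		return NULL;

	for (pos = start; pos < end; pos += PAGE_SIZE) {
		/* ... same pte lookup and window checks as in the patch ... */
		tmp = read_swap_cache_async(swp, gfp_mask, vma, pos);
		if (!tmp)
			break;		/* under memory pressure, stop prefetching */
		page_cache_release(tmp);
	}
	lru_add_drain();
	return page;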
--
Kind regards,
Minchan Kim
On Wed, 3 Jun 2009, Johannes Weiner wrote:
> Hi Andrew,
>
> I redid the qsbench runs with a bigger page cluster (2^4). It shows
> an improvement for both versions, with the patched one still performing
> better. Rik hinted to me that we can make the default even bigger once
> we are better at avoiding reading unrelated pages. I am currently
> testing this. Here are the timings for 2^4 (i.e. twice the default
> number of) readahead pages:
>
> vanilla:
> 1 x 2048M [20 runs] user 101.41/101.06 [1.42] system 11.02/10.83 [0.92] real 368.44/361.31 [48.47]
> 2 x 1024M [20 runs] user 101.42/101.23 [0.66] system 12.98/13.01 [0.56] real 338.45/338.56 [2.94]
> 4 x 540M [20 runs] user 101.75/101.62 [1.03] system 10.05/9.52 [1.53] real 371.97/351.88 [77.69]
> 8 x 280M [20 runs] user 103.35/103.33 [0.63] system 9.80/9.59 [1.72] real 453.48/473.21 [115.61]
> 16 x 128M [20 runs] user 91.04/91.00 [0.86] system 8.95/9.41 [2.06] real 312.16/342.29 [100.53]
>
> vswapra:
> 1 x 2048M [20 runs] user 98.47/98.32 [1.33] system 9.85/9.90 [0.92] real 373.95/382.64 [26.77]
> 2 x 1024M [20 runs] user 96.89/97.00 [0.44] system 9.52/9.48 [1.49] real 288.43/281.55 [53.12]
> 4 x 540M [20 runs] user 98.74/98.70 [0.92] system 7.62/7.83 [1.25] real 291.15/296.94 [54.85]
> 8 x 280M [20 runs] user 100.68/100.59 [0.53] system 7.59/7.62 [0.41] real 305.12/311.29 [26.09]
> 16 x 128M [20 runs] user 88.67/88.50 [1.02] system 6.06/6.22 [0.72] real 205.29/221.65 [42.06]
>
> Furthermore I changed the patch to leave shmem alone for now and added
> documentation for the new approach. And I adjusted the changelog a
> bit.
>
> Andi, I think the NUMA policy is already taken care of. Can you have
> another look at it? Other than that you gave positive feedback - can
> I add your acked-by?
>
> Hannes
>
> ---
> The current swap readahead implementation reads a physically
> contiguous group of swap slots around the faulting page to take
> advantage of the disk head's position and in the hope that the
> surrounding pages will be needed soon as well.
>
> This works as long as the physical swap slot order approximates the
> LRU order reasonably well; otherwise it wastes memory and IO bandwidth
> on reading in pages that are unlikely to be needed soon.
>
> However, the physical swap slot layout diverges from the LRU order
> with increasing swap activity, i.e. under high memory pressure, and
> this is exactly the situation where swapin should not waste any
> memory or IO bandwidth, as both are the most contended resources at
> this point.
>
> Another approximation of the LRU relationship is the VMA order, as
> groups of VMA-related pages are usually used together.
>
> This patch combines both the physical and the virtual hint to get a
> good approximation of the pages that are sensible to read ahead.
>
> When the two diverge, we either read unrelated data, seek heavily for
> related data, or, as this patch does, simply decrease the readahead
> effort.
>
> To achieve this, we have essentially two readahead windows of the same
> size: one spans the virtual, the other one the physical neighborhood
> of the faulting page. We only read where both areas overlap.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Andi Kleen <[email protected]>
I think this is a great idea, a very promising approach. I like it
so much better than Andrew's and others' proposals to dedicate
areas of swap space to distinct virtual objects: which, as you
rightly pointed out, condemn us to unnecessary seeking when
writing swap (and is even more of an issue if we're writing
to SSD rather than HDD).
It would be nice to get results from a wider set of benchmarks
than just qsbench; but I don't think qsbench is biased in your
favour, and I don't think you need go to too much trouble on
that - let's just aim to get your work into mmotm, then we can
all play around with it for a while. I suppose what I'd most
like is to try it with shmem, which you've understandably left
out for now.
You'll be hating me for the way I made shmem_truncate_range() etc.
nigh incomprehensible when enabling highmem index pages there.
Christoph Rohland's original was much nicer. Again and again and
again I've wanted to throw all that out, and keep swap entries in
the standard radix tree which keeps the struct page pointers; but
again and again and again, I've been unable to justify losing the
highmem index ability - for a while it seemed as if x86_64 was
going to make highmem a thing of the past, and it's certainly helped
us to ignore 32GB 32-bit; but I think 4GB 32-bit is here a long while.
Though I like the way you localized it all into swapin_readahead(),
I'd prefer to keep ptes out of swap_state.c, and think several issues
will fall away if you turn your patch around. You'll avoid the pte code
in swap_state.c, you'll satisfy Andi's concerns about locking/kmapping
overhead, and you'll find shmem much much easier, if instead of peering
back at where you've come from in swapin_readahead(), you make the outer
levels (do_swap_page and shmem_getpage) pass a vector of swap entries to
swapin_readahead()? That vector on stack, and copied from the one page
table or index page (don't bother to cross page table or index page
boundaries) while it was mapped.
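One possible shape of such an interface; the signature and the
entries/nr parameters are hypothetical, just to make the idea concrete,
and the physical window check from the current patch is kept:

/*
 * Untested sketch: the caller (do_swap_page or shmem_getpage) copies the
 * neighbouring swap entries out of the page table / index page while it
 * is mapped and passes them in; no pte walking in here.
 */
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			      struct vm_area_struct *vma, unsigned long addr,
			      swp_entry_t *entries, int nr)
{
	unsigned long pmin, pmax;
	int i;

	pmin = swp_offset(entry) & ~((1UL << page_cluster) - 1);
	pmax = pmin + (1UL << page_cluster);

	for (i = 0; i < nr; i++) {
		struct page *page;

		if (!entries[i].val)
			continue;
		if (swp_type(entries[i]) != swp_type(entry))
			continue;
		if (swp_offset(entries[i]) < pmin ||
		    swp_offset(entries[i]) > pmax)
			continue;
		page = read_swap_cache_async(entries[i], gfp_mask, vma, addr);
		if (!page)
			break;
		page_cache_release(page);
	}
	lru_add_drain();
	return read_swap_cache_async(entry, gfp_mask, vma, addr);
}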
It's probably irrelevant to you, but I've attached an untested patch
of mine which stomps somewhat on this area: I was shocked to notice
shmem_getpage() in a list of deep stack offenders a few months back,
realized it was those unpleasant NUMA mempolicy pseudo-vmas, and made
a patch to get rid of them. I've rebased it to 2.6.30-rc8 and checked
that the resulting kernel runs, but not really tested it; and I think
I didn't even try to get the mpol reference counting right (tends to
be an issue precisely in swapin_readahead, where one mpol ends up used
repeatedly) - mpol refcounting is an arcane art only Lee understands!
I've attached the patch because you may want to glance at it and
decide, either that it's something which is helpful to the direction
you're going in and you'd like to base upon it, or that it's a
distraction and you'd prefer me to keep it to myself until your
changes are in.
But your patch below is incomplete, isn't it? The old swapin_readahead()
is now called swapin_readahead_phys(), and you want shmem_getpage() to be
using that for now: but no prototype for it and no change to mm/shmem.c.
Hugh
> ---
> mm/swap_state.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 99 insertions(+), 16 deletions(-)
>
> version 2:
> o fall back to physical ra window for shmem
> o add documentation to the new ra algorithm
>
> qsbench, 20 runs, 1.7GB RAM, 2GB swap, "mean (standard deviation) median":
>
> vanilla vswapra
>
> 1 x 2048M 391.25 ( 71.76) 384.56 445.55 (83.19) 415.41
> 2 x 1024M 384.25 ( 75.00) 423.08 290.26 (31.38) 299.51
> 4 x 540M 553.91 (100.02) 554.57 336.58 (52.49) 331.52
> 8 x 280M 561.08 ( 82.36) 583.12 319.13 (43.17) 307.69
> 16 x 128M 285.51 (113.20) 236.62 214.24 (62.37) 214.15
>
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -325,27 +325,14 @@ struct page *read_swap_cache_async(swp_e
> return found_page;
> }
>
> -/**
> - * swapin_readahead - swap in pages in hope we need them soon
> - * @entry: swap entry of this memory
> - * @gfp_mask: memory allocation flags
> - * @vma: user vma this address belongs to
> - * @addr: target address for mempolicy
> - *
> - * Returns the struct page for entry and addr, after queueing swapin.
> - *
> +/*
> * Primitive swap readahead code. We simply read an aligned block of
> * (1 << page_cluster) entries in the swap area. This method is chosen
> * because it doesn't cost us any seek time. We also make sure to queue
> * the 'original' request together with the readahead ones...
> - *
> - * This has been extended to use the NUMA policies from the mm triggering
> - * the readahead.
> - *
> - * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> */
> -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> - struct vm_area_struct *vma, unsigned long addr)
> +static struct page *swapin_readahead_phys(swp_entry_t entry, gfp_t gfp_mask,
> + struct vm_area_struct *vma, unsigned long addr)
> {
> int nr_pages;
> struct page *page;
> @@ -371,3 +358,99 @@ struct page *swapin_readahead(swp_entry_
> lru_add_drain(); /* Push any new pages onto the LRU now */
> return read_swap_cache_async(entry, gfp_mask, vma, addr);
> }
> +
> +/**
> + * swapin_readahead - swap in pages in hope we need them soon
> + * @entry: swap entry of this memory
> + * @gfp_mask: memory allocation flags
> + * @vma: user vma this address belongs to
> + * @addr: target address for mempolicy
> + *
> + * Returns the struct page for entry and addr, after queueing swapin.
> + *
> + * The readahead window is the virtual area around the faulting page,
> + * where the physical proximity of the swap slots is taken into
> + * account as well.
> + *
> + * While the swap allocation algorithm tries to keep LRU-related pages
> + * together on the swap backing, it is not reliable on heavy thrashing
> + * systems where concurrent reclaimers allocate swap slots and/or most
> + * anonymous memory pages are already in swap cache.
> + *
> + * On the virtual side, subgroups of VMA-related pages are usually
> + * used together, which gives another hint to LRU relationship.
> + *
> + * By taking both aspects into account, we get a good approximation of
> + * which pages are sensible to read together with the faulting one.
> + *
> + * This has been extended to use the NUMA policies from the mm
> + * triggering the readahead.
> + *
> + * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
> + */
> +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> + unsigned long start, pos, end;
> + unsigned long pmin, pmax;
> + int cluster, window;
> +
> + if (!vma || !vma->vm_mm) /* XXX: shmem case */
> + return swapin_readahead_phys(entry, gfp_mask, vma, addr);
> +
> + cluster = 1 << page_cluster;
> + window = cluster << PAGE_SHIFT;
> +
> + /* Physical range to read from */
> + pmin = swp_offset(entry) & ~(cluster - 1);
> + pmax = pmin + cluster;
> +
> + /* Virtual range to read from */
> + start = addr & ~(window - 1);
> + end = start + window;
> +
> + for (pos = start; pos < end; pos += PAGE_SIZE) {
> + struct page *page;
> + swp_entry_t swp;
> + spinlock_t *ptl;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> +
> + pgd = pgd_offset(vma->vm_mm, pos);
> + if (!pgd_present(*pgd))
> + continue;
> + pud = pud_offset(pgd, pos);
> + if (!pud_present(*pud))
> + continue;
> + pmd = pmd_offset(pud, pos);
> + if (!pmd_present(*pmd))
> + continue;
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
> + if (!is_swap_pte(*pte)) {
> + pte_unmap_unlock(pte, ptl);
> + continue;
> + }
> + swp = pte_to_swp_entry(*pte);
> + pte_unmap_unlock(pte, ptl);
> +
> + if (swp_type(swp) != swp_type(entry))
> + continue;
> + /*
> + * Dont move the disk head too far away. This also
> + * throttles readahead while thrashing, where virtual
> + * order diverges more and more from physical order.
> + */
> + if (swp_offset(swp) > pmax)
> + continue;
> + if (swp_offset(swp) < pmin)
> + continue;
> + page = read_swap_cache_async(swp, gfp_mask, vma, pos);
> + if (!page)
> + continue;
> + page_cache_release(page);
> + }
> + lru_add_drain(); /* Push any new pages onto the LRU now */
> + return read_swap_cache_async(entry, gfp_mask, vma, addr);
> +}
On Sun, Jun 07, 2009 at 06:55:15PM +0100, Hugh Dickins wrote:
> I think this is a great idea, a very promising approach. I like it
> so much better than Andrew's and others' proposals to dedicate
> areas of swap space to distinct virtual objects: which, as you
> rightly pointed out, condemn us to unnecessary seeking when
> writing swap (and is even more of an issue if we're writing
> to SSD rather than HDD).
>
> It would be nice to get results from a wider set of benchmarks
> than just qsbench; but I don't think qsbench is biased in your
> favour, and I don't think you need go to too much trouble on
> that - let's just aim to get your work into mmotm, then we can
> all play around with it for a while. I suppose what I'd most
> like is to try it with shmem, which you've understandably left
> out for now.
I agree, it would be nice to have this in mm soonish and have it
exposed a bit more until .31 or .32. And I'll continue to test other
loads.
> You'll be hating me for the way I made shmem_truncate_range() etc.
> nigh incomprehensible when enabling highmem index pages there.
> Christoph Rohland's original was much nicer. Again and again and
> again I've wanted to throw all that out, and keep swap entries in
> the standard radix tree which keeps the struct page pointers; but
> again and again and again, I've been unable to justify losing the
> highmem index ability - for a while it seemed as if x86_64 was
> going to make highmem a thing of the past, and it's certainly helped
> us to ignore 32GB 32-bit; but I think 4GB 32-bit is here a long while.
Regarding highmem, I have only one 32 bit box with 1G memory and
highmem= seems to be broken right now. The documentation says it will
fix-set the zone. It does not on x86, but instead seems to set up
Normal and then complain about the lack of remaining pages to stuff
into HighMem.
With 128M of HighMem, testing for highpte mapping overhead is a bit
bogus, so I can't do that until I figure something out (either fixing
highmem= or no-opping GFP_HIGHMEM and using GFP_HIGHPTE for the pte
pages).
Or redoing the patch as you suggested below, which is probably what I
will opt for.
> Though I like the way you localized it all into swapin_readahead(),
> I'd prefer to keep ptes out of swap_state.c, and think several issues
> will fall away if you turn your patch around. You'll avoid the pte code
> in swap_state.c, you'll satisfy Andi's concerns about locking/kmapping
> overhead, and you'll find shmem much much easier, if instead of peering
> back at where you've come from in swapin_readahead(), you make the outer
> levels (do_swap_page and shmem_getpage) pass a vector of swap entries to
> swapin_readahead()? That vector on stack, and copied from the one page
> table or index page (don't bother to cross page table or index page
> boundaries) while it was mapped.
That is a nice suggestion.
> It's probably irrelevant to you, but I've attached an untested patch
> of mine which stomps somewhat on this area: I was shocked to notice
> shmem_getpage() in a list of deep stack offenders a few months back,
> realized it was those unpleasant NUMA mempolicy pseudo-vmas, and made
> a patch to get rid of them. I've rebased it to 2.6.30-rc8 and checked
> that the resulting kernel runs, but not really tested it; and I think
> I didn't even try to get the mpol reference counting right (tends to
> be an issue precisely in swapin_readahead, where one mpol ends up used
> repeatedly) - mpol refcounting is an arcane art only Lee understands!
> I've attached the patch because you may want to glance at it and
> decide, either that it's something which is helpful to the direction
> you're going in and you'd like to base upon it, or that it's a
> distraction and you'd prefer me to keep it to myself until your
> changes are in.
I will try to make swapin_readahead() take an array of ptes; then your
patch shouldn't get in my way, as I don't need the vma anymore.
> But your patch below is incomplete, isn't it? The old swapin_readahead()
> is now called swapin_readahead_phys(), and you want shmem_getpage() to be
> using that for now: but no prototype for it and no change to mm/shmem.c.
You probably missed the no-vma or no-vma->vm_mm branch in
swapin_readahead(). shmem either sends in a NULL vma or the dummy vma
that has no vma->vm_mm set. Oops, that is of course a bug: vma->vm_mm
is uninitialized, not NULL. But that will go away as well.
Thanks,
Hannes