Minchan Kim reports that when a system has many swap areas, and tmpfs
swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
back the page cannot locate it, and the read fails with -ENOMEM.
Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
swp_entry_to_pte()) technique for determining maximum usable swap offset,
without stopping to realize that that actually depends upon the pte swap
encoding shifting swap offset to the higher bits and truncating it there.
Whereas our radix_tree swap encoding leaves offset in the lower bits:
it's swap "type" (that is, index of swap area) that was truncated.
Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().
This does not reduce the usable size of a swap area any further, it leaves
it as claimed when making the original commit: no change from 3.0 on x86_64,
nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB per swapfile
on i386 with PAE. It's not a change I would have risked five years ago,
but with x86_64 supported for ten years, I believe it's appropriate now.
Hmm, and what if some architecture implements its swap pte with offset
encoded below type? That would equally break the maximum usable swap
offset check. Happily, they all follow the same tradition of encoding
offset above type, but I'll prepare a check on that for next.
Reported-and-Reviewed-and-Tested-by: Minchan Kim <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] [3.1, 3.2, 3.3, 3.4]
---
include/linux/swapops.h | 8 +++++---
mm/swapfile.c | 12 ++++--------
2 files changed, 9 insertions(+), 11 deletions(-)
--- 3.5-rc2/include/linux/swapops.h 2012-05-20 15:29:13.000000000 -0700
+++ linux/include/linux/swapops.h 2012-06-13 12:01:35.390711624 -0700
@@ -9,13 +9,15 @@
* get good packing density in that tree, so the index should be dense in
* the low-order bits.
*
- * We arrange the `type' and `offset' fields so that `type' is at the five
+ * We arrange the `type' and `offset' fields so that `type' is at the seven
* high-order bits of the swp_entry_t and `offset' is right-aligned in the
- * remaining bits.
+ * remaining bits. Although `type' itself needs only five bits, we allow for
+ * shmem/tmpfs to shift it all up a further two bits: see swp_to_radix_entry().
*
* swp_entry_t's are *never* stored anywhere in their arch-dependent format.
*/
-#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
+#define SWP_TYPE_SHIFT(e) ((sizeof(e.val) * 8) - \
+ (MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT))
#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
/*
--- 3.5-rc2/mm/swapfile.c 2012-06-08 18:48:40.744605221 -0700
+++ linux/mm/swapfile.c 2012-06-13 12:13:56.214729684 -0700
@@ -1916,24 +1916,20 @@ static unsigned long read_swap_header(st
/*
* Find out how many pages are allowed for a single swap
- * device. There are three limiting factors: 1) the number
+ * device. There are two limiting factors: 1) the number
* of bits for the swap offset in the swp_entry_t type, and
* 2) the number of bits in the swap pte as defined by the
- * the different architectures, and 3) the number of free bits
- * in an exceptional radix_tree entry. In order to find the
+ * different architectures. In order to find the
* largest possible bit mask, a swap entry with swap type 0
* and swap offset ~0UL is created, encoded to a swap pte,
* decoded to a swp_entry_t again, and finally the swap
* offset is extracted. This will mask all the bits from
* the initial ~0UL mask that can't be encoded in either
* the swp_entry_t or the architecture definition of a
- * swap pte. Then the same is done for a radix_tree entry.
+ * swap pte.
*/
maxpages = swp_offset(pte_to_swp_entry(
- swp_entry_to_pte(swp_entry(0, ~0UL))));
- maxpages = swp_offset(radix_to_swp_entry(
- swp_to_radix_entry(swp_entry(0, maxpages)))) + 1;
-
+ swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
if (maxpages > swap_header->info.last_page) {
maxpages = swap_header->info.last_page + 1;
/* p->max is an unsigned int: don't overflow it */
On Fri, Jun 15, 2012 at 05:55:50PM -0700, Hugh Dickins wrote:
>Minchan Kim reports that when a system has many swap areas, and tmpfs
>swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
>back the page cannot locate it, and the read fails with -ENOMEM.
>
>Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
>swp_entry_to_pte()) technique for determining maximum usable swap offset,
>without stopping to realize that that actually depends upon the pte swap
>encoding shifting swap offset to the higher bits and truncating it there.
>Whereas our radix_tree swap encoding leaves offset in the lower bits:
>it's swap "type" (that is, index of swap area) that was truncated.
>
>Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
>broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().
>
>This does not reduce the usable size of a swap area any further, it leaves
>it as claimed when making the original commit: no change from 3.0 on x86_64,
>nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB per swapfile
>on i386 with PAE. It's not a change I would have risked five years ago,
>but with x86_64 supported for ten years, I believe it's appropriate now.
>
>Hmm, and what if some architecture implements its swap pte with offset
>encoded below type? That would equally break the maximum usable swap
>offset check. Happily, they all follow the same tradition of encoding
>offset above type, but I'll prepare a check on that for next.
>
>Reported-and-Reviewed-and-Tested-by: Minchan Kim <[email protected]>
>Signed-off-by: Hugh Dickins <[email protected]>
>Cc: [email protected] [3.1, 3.2, 3.3, 3.4]
>---
>
> include/linux/swapops.h | 8 +++++---
> mm/swapfile.c | 12 ++++--------
> 2 files changed, 9 insertions(+), 11 deletions(-)
>
>--- 3.5-rc2/include/linux/swapops.h 2012-05-20 15:29:13.000000000 -0700
>+++ linux/include/linux/swapops.h 2012-06-13 12:01:35.390711624 -0700
>@@ -9,13 +9,15 @@
> * get good packing density in that tree, so the index should be dense in
> * the low-order bits.
> *
>- * We arrange the `type' and `offset' fields so that `type' is at the five
>+ * We arrange the `type' and `offset' fields so that `type' is at the seven
> * high-order bits of the swp_entry_t and `offset' is right-aligned in the
>- * remaining bits.
>+ * remaining bits. Although `type' itself needs only five bits, we allow for
>+ * shmem/tmpfs to shift it all up a further two bits: see swp_to_radix_entry().
> *
> * swp_entry_t's are *never* stored anywhere in their arch-dependent format.
> */
>-#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
>+#define SWP_TYPE_SHIFT(e) ((sizeof(e.val) * 8) - \
>+ (MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT))
Hi Hugh,
Since SHIFT == MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT == 7
and the low two bits used for radix_tree, the available swappages number
based of 32bit architectures reduce to 2^(32-7-2) = 32GB?
Regards,
Wanpeng Li
> #define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
>
> /*
>--- 3.5-rc2/mm/swapfile.c 2012-06-08 18:48:40.744605221 -0700
>+++ linux/mm/swapfile.c 2012-06-13 12:13:56.214729684 -0700
>@@ -1916,24 +1916,20 @@ static unsigned long read_swap_header(st
>
> /*
> * Find out how many pages are allowed for a single swap
>- * device. There are three limiting factors: 1) the number
>+ * device. There are two limiting factors: 1) the number
> * of bits for the swap offset in the swp_entry_t type, and
> * 2) the number of bits in the swap pte as defined by the
>- * the different architectures, and 3) the number of free bits
>- * in an exceptional radix_tree entry. In order to find the
>+ * different architectures. In order to find the
> * largest possible bit mask, a swap entry with swap type 0
> * and swap offset ~0UL is created, encoded to a swap pte,
> * decoded to a swp_entry_t again, and finally the swap
> * offset is extracted. This will mask all the bits from
> * the initial ~0UL mask that can't be encoded in either
> * the swp_entry_t or the architecture definition of a
>- * swap pte. Then the same is done for a radix_tree entry.
>+ * swap pte.
> */
> maxpages = swp_offset(pte_to_swp_entry(
>- swp_entry_to_pte(swp_entry(0, ~0UL))));
>- maxpages = swp_offset(radix_to_swp_entry(
>- swp_to_radix_entry(swp_entry(0, maxpages)))) + 1;
>-
>+ swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
> if (maxpages > swap_header->info.last_page) {
> maxpages = swap_header->info.last_page + 1;
> /* p->max is an unsigned int: don't overflow it */
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to [email protected]. For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Sat, Jun 16, 2012 at 12:56 PM, Wanpeng Li <[email protected]> wrote:
>>-#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
>>+#define SWP_TYPE_SHIFT(e) ((sizeof(e.val) * 8) - \
>>+ (MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT))
>
> Hi Hugh,
>
> Since SHIFT == MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT == 7
> and the low two bits used for radix_tree, the available swappages number
> based of 32bit architectures reduce to 2^(32-7-2) = 32GB?
>
The lower two bits are in the 7 bits you calculated,
so it is 2^(32-7), not 2^(32-7-2)
On Sat, 16 Jun 2012, Cong Wang wrote:
> On Sat, Jun 16, 2012 at 12:56 PM, Wanpeng Li <[email protected]> wrote:
> >>-#define SWP_TYPE_SHIFT(e) (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT)
> >>+#define SWP_TYPE_SHIFT(e) ((sizeof(e.val) * 8) - \
> >>+ (MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT))
> > Since SHIFT == MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT == 7
> > and the low two bits used for radix_tree, the available swappages number
> > based of 32bit architectures reduce to 2^(32-7-2) = 32GB?
>
> The lower two bits are in the 7 bits you calculated,
> so it is 2^(32-7), not 2^(32-7-2)
Correct.
And that is not the limiting condition on available swap pages on 32-bit
without PAE, which is limited more by the pte<->swp conversion: a swap
entry must be distinguished from a present pte, from a PROT_NONE page,
and from a pte_file() entry - see arch/x86/include/asm/pgtable-2level.h
for how i386 in particular arranges that.
Nor is it the limiting condition on 64-bit, where include/linux/swap.h's
use of __u32 and unsigned int for counting swap pages is more limiting.
Hugh