On 01/26/2012 01:12 PM, Dave Hansen wrote:
> On 01/09/2012 02:51 PM, Seth Jennings wrote:
>> + area = &get_cpu_var(zs_map_area);
>> + if (off + class->size <= PAGE_SIZE) {
>> + /* this object is contained entirely within a page */
>> + area->vm_addr = kmap_atomic(page);
>> + } else {
>> + /* this object spans two pages */
>> + struct page *nextp;
>> +
>> + nextp = get_next_page(page);
>> + BUG_ON(!nextp);
>> +
>> + set_pte(area->vm_ptes[0], mk_pte(page, PAGE_KERNEL));
>> + set_pte(area->vm_ptes[1], mk_pte(nextp, PAGE_KERNEL));
>> +
>> + /* We pre-allocated VM area so mapping can never fail */
>> + area->vm_addr = area->vm->addr;
>> + }
>
> This bit appears to be trying to make kmap_atomic() variant that can map
> two pages in to contigious virtual addresses. Instead of open-coding it
> in a non-portable way like this, should we just make a new kmap_atomic()
> variant that does this?
>
> From the way it's implemented, I _think_ you're guaranteed to get two
> contiguous addresses if you do two adjacent kmap_atomics() on the same CPU:
>
> void *kmap_atomic_prot(struct page *page, pgprot_t prot)
> {
> ...
> type = kmap_atomic_idx_push();
> idx = type + KM_TYPE_NR*smp_processor_id();
> vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
>
> I think if you do a get_cpu()/put_cpu() or just a preempt_disable()
> across the operations you'll be guaranteed to get two contiguous addresses.
I'm not quite following here. kmap_atomic() only does this for highmem pages.
For normal pages (all pages for 64-bit), it doesn't do any mapping at all. It
just returns the virtual address of the page since it is in the kernel's address
space.
For this design, the pages _must_ be mapped, even if the pages are directly
reachable in the address space, because they must be virtually contiguous.
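To make that concrete, a quick sketch (mine, not from the patch) of why
back-to-back kmap_atomic() calls don't help for lowmem:

  void *a = kmap_atomic(page);   /* lowmem: just page_address(page) */
  void *b = kmap_atomic(nextp);  /* lowmem: just page_address(nextp) */

  /* b == a + PAGE_SIZE only if nextp happens to be physically
   * adjacent to page, which zsmalloc never guarantees */

  kunmap_atomic(b);  /* unmap in LIFO order */
  kunmap_atomic(a);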
--
Seth
On 02/06/2012 09:26 AM, Seth Jennings wrote:
> On 01/26/2012 01:12 PM, Dave Hansen wrote:
>> void *kmap_atomic_prot(struct page *page, pgprot_t prot)
>> {
>> ...
>> type = kmap_atomic_idx_push();
>> idx = type + KM_TYPE_NR*smp_processor_id();
>> vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
>>
>> I think if you do a get_cpu()/put_cpu() or just a preempt_disable()
>> across the operations you'll be guaranteed to get two contiguous addresses.
>
> I'm not quite following here. kmap_atomic() only does this for highmem pages.
> For normal pages (all pages for 64-bit), it doesn't do any mapping at all. It
> just returns the virtual address of the page since it is in the kernel's address
> space.
>
> For this design, the pages _must_ be mapped, even if the pages are directly
> reachable in the address space, because they must be virtually contiguous.
I guess you could use vmap() for that. It's just going to be slower
than kmap_atomic(). I'm really not sure it's worth all the trouble to
avoid order-1 allocations, though.
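Untested sketch of what I mean, reusing get_next_page() from your patch:

  struct page *pages[2] = { page, get_next_page(page) };
  void *addr;

  addr = vmap(pages, 2, VM_MAP, PAGE_KERNEL);  /* may sleep, may fail */
  if (addr) {
          /* the object is now virtually contiguous at addr + off */
          vunmap(addr);
  }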
> From: Dave Hansen [mailto:[email protected]]
> Subject: Re: [PATCH 1/5] staging: zsmalloc: zsmalloc memory allocation library
>
> On 02/06/2012 09:26 AM, Seth Jennings wrote:
> > On 01/26/2012 01:12 PM, Dave Hansen wrote:
> >> void *kmap_atomic_prot(struct page *page, pgprot_t prot)
> >> {
> >> ...
> >> type = kmap_atomic_idx_push();
> >> idx = type + KM_TYPE_NR*smp_processor_id();
> >> vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> >>
> >> I think if you do a get_cpu()/put_cpu() or just a preempt_disable()
> >> across the operations you'll be guaranteed to get two contiguous addresses.
> >
> > I'm not quite following here. kmap_atomic() only does this for highmem pages.
> > For normal pages (all pages for 64-bit), it doesn't do any mapping at all. It
> > just returns the virtual address of the page since it is in the kernel's address
> > space.
> >
> > For this design, the pages _must_ be mapped, even if the pages are directly
> > reachable in the address space, because they must be virtually contiguous.
>
> I guess you could use vmap() for that. It's just going to be slower
> than kmap_atomic(). I'm really not sure it's worth all the trouble to
> avoid order-1 allocations, though.
Seth, Nitin, please correct me if I am wrong, but...
Dave, your comment makes me wonder if you might be missing
the key value of the new allocator. The zsmalloc allocator can grab
any random* page "A" with X unused bytes at the END of the page,
and any random page "B" with Y unused bytes at the BEGINNING of the page
and "coalesce" them to store any byte sequence with a length** Z
not exceeding X+Y. Presumably this markedly increases
the density of compressed-pages-stored-per-physical-page***. (A toy
sketch of the layout follows the footnotes below.) I don't
see how allowing order-1 allocations helps here, but if I am missing
something clever, please explain further.
(If anyone missed Jonathan Corbet's nice lwn.net article, see:
https://lwn.net/Articles/477067/ )
* Not really ANY random page, just any random page that has been
previously get_free_page'd by the allocator and hasn't been
free'd yet.
** X, Y and Z are all rounded to a multiple of 16 so there
is still some internal fragmentation cost.
*** Would be interesting to see some random and real workload data
    comparing density for zsmalloc and xvmalloc. And also zbud,
    since a goal is to replace zbud with zsmalloc as well.
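To make the layout concrete, a toy sketch of the store path (mine, not
from the patch; vaddrA/vaddrB stand for the two mapped pages):

  /* the first X bytes fill the END of page A, the remaining Z - X
   * bytes the BEGINNING of page B -- which is exactly why the two
   * pages must appear virtually contiguous when reading back */
  static void store_spanning(char *vaddrA, size_t X,
                             char *vaddrB, const char *obj, size_t Z)
  {
          memcpy(vaddrA + PAGE_SIZE - X, obj, X);
          memcpy(vaddrB, obj + X, Z - X);
  }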
On 02/08/2012 09:15 AM, Dan Magenheimer wrote:
> The zsmalloc allocator can grab
> any random* page "A" with X unused bytes at the END of the page,
> and any random page "B" with Y unused bytes at the BEGINNING of the page
> and "coalesce" them to store any byte sequence with a length** Z
> not exceeding X+Y. Presumably this markedly increases
> the density of compressed-pages-stored-per-physical-page***.
Ahh, I did miss that. I assumed it was simply trying to tie two order-0
pages together. I _guess_ the vmap() comment stands, though.
On 02/08/2012 11:39 AM, Dave Hansen wrote:
> On 02/06/2012 09:26 AM, Seth Jennings wrote:
>> On 01/26/2012 01:12 PM, Dave Hansen wrote:
>>> void *kmap_atomic_prot(struct page *page, pgprot_t prot)
>>> {
>>> ...
>>> type = kmap_atomic_idx_push();
>>> idx = type + KM_TYPE_NR*smp_processor_id();
>>> vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
>>>
>>> I think if you do a get_cpu()/put_cpu() or just a preempt_disable()
>>> across the operations you'll be guaranteed to get two contiguous addresses.
>>
>> I'm not quite following here. kmap_atomic() only does this for highmem pages.
>> For normal pages (all pages for 64-bit), it doesn't do any mapping at all. It
>> just returns the virtual address of the page since it is in the kernel's address
>> space.
>>
>> For this design, the pages _must_ be mapped, even if the pages are directly
>> reachable in the address space, because they must be virtually contiguous.
>
> I guess you could use vmap() for that. It's just going to be slower
> than kmap_atomic(). I'm really not sure it's worth all the trouble to
> avoid order-1 allocations, though.
>
vmap() is not just slower but also does memory allocations at various
places. Under memory pressure, this may cause a failure to read a
stored object just because we failed to map it. Also, it allocates a VA
region each time it's called, which is a real waste when we can simply
pre-allocate 2 * PAGE_SIZE'd VA regions (per-cpu).
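For reference, the pre-allocation amounts to just this (a sketch; the
field names match the hunk quoted at the top of the thread, the
function name is made up):

  struct mapping_area {
          struct vm_struct *vm;   /* pre-reserved 2-page VA region */
          pte_t *vm_ptes[2];      /* the two ptes backing it */
          char *vm_addr;
  };

  /* called once per cpu, at cpu-up time */
  static int init_area(struct mapping_area *area)
  {
          /* reserves the VA range *and* its pte pages up front, so
           * the map/unmap fast path never allocates, never fails */
          area->vm = alloc_vm_area(2 * PAGE_SIZE, area->vm_ptes);
          return area->vm ? 0 : -ENOMEM;
  }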
Thanks,
Nitin
On 02/08/2012 09:53 AM, Nitin Gupta wrote:
> vmap() is not just slower but also does memory allocations at various
> places. Under memory pressure, this may cause a failure to read a
> stored object just because we failed to map it. Also, it allocates a VA
> region each time it's called, which is a real waste when we can simply
> pre-allocate 2 * PAGE_SIZE'd VA regions (per-cpu).
Yeah, vmap() is a bit heavy-handed.  I'm just loath to go mucking
around in the low-level pagetables too much. Just seems like there'll
be a ton of pitfalls, like arch-specific TLB flushing, and it _seems_
like one of the existing kernel mechanisms should work.
I guess if you've exhaustively explored all of the existing kernel
mapping mechanisms and found none of them to work, and none of them to
be in any way suitably adaptable to your use, you should go ahead and
roll your own. I guess you do at least use alloc_vm_area(). What made
map_vm_area() unsuitable for your needs? If you're remapping, you
should at least be guaranteed not to have to allocate pte pages.
On 02/08/2012 01:28 PM, Dave Hansen wrote:
> On 02/08/2012 09:53 AM, Nitin Gupta wrote:
>> vmap() is not just slower but also does memory allocations at various
>> places. Under memory pressure, this may cause a failure to read a
>> stored object just because we failed to map it. Also, it allocates a VA
>> region each time it's called, which is a real waste when we can simply
>> pre-allocate 2 * PAGE_SIZE'd VA regions (per-cpu).
>
> Yeah, vmap() is a bit heavy-handed.  I'm just loath to go mucking
> around in the low-level pagetables too much. Just seems like there'll
> be a ton of pitfalls, like arch-specific TLB flushing, and it _seems_
> like one of the existing kernel mechanisms should work.
>
> I guess if you've exhaustively explored all of the existing kernel
> mapping mechanisms and found none of them to work, and none of them to
> be in any way suitably adaptable to your use, you should go ahead and
> roll your own. I guess you do at least use alloc_vm_area(). What made
> map_vm_area() unsuitable for your needs? If you're remapping, you
> should at least be guaranteed not to have to allocate pte pages.
>
map_vm_area() needs a 'struct vm_struct' parameter, but for mapping
kernel-allocated pages within the kernel, what should we pass here? I think
we can instead use map_kernel_range_noflush() -- surprisingly,
unmap_kernel_range_noflush() is exported but this one is not.
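i.e., something like this (untested, helper names made up, and modulo
the missing export):

  static void *map_two(struct mapping_area *area, struct page *pages[2])
  {
          unsigned long addr = (unsigned long)area->vm->addr;

          /* the ptes were pre-allocated by alloc_vm_area(), so this
           * cannot fail with -ENOMEM */
          BUG_ON(map_kernel_range_noflush(addr, 2 * PAGE_SIZE,
                                          PAGE_KERNEL, pages) < 0);
          return area->vm->addr;
  }

  static void unmap_two(struct mapping_area *area)
  {
          unsigned long addr = (unsigned long)area->vm->addr;

          unmap_kernel_range_noflush(addr, 2 * PAGE_SIZE);
          /* the _noflush variants leave TLB invalidation to us */
          flush_tlb_kernel_range(addr, addr + 2 * PAGE_SIZE);
  }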
Nitin
(cc'ing the _real_ GregKH to avoid further bounces... Greg, if
you care, the whole thread is on the various lists)
> From: Nitin Gupta [mailto:[email protected]]
> Subject: Re: [PATCH 1/5] staging: zsmalloc: zsmalloc memory allocation library
>
> On 02/08/2012 01:28 PM, Dave Hansen wrote:
>
> > On 02/08/2012 09:53 AM, Nitin Gupta wrote:
> >> vmap() is not just slower but also does memory allocations at various
> >> places. Under memory pressure, this may cause a failure to read a
> >> stored object just because we failed to map it. Also, it allocates a VA
> >> region each time it's called, which is a real waste when we can simply
> >> pre-allocate 2 * PAGE_SIZE'd VA regions (per-cpu).
> >
> > Yeah, vmap() is a bit heavy-handed.  I'm just loath to go mucking
> > around in the low-level pagetables too much. Just seems like there'll
> > be a ton of pitfalls, like arch-specific TLB flushing, and it _seems_
> > like one of the existing kernel mechanisms should work.
> >
> > I guess if you've exhaustively explored all of the existing kernel
> > mapping mechanisms and found none of them to work, and none of them to
> > be in any way suitably adaptable to your use, you should go ahead and
> > roll your own. I guess you do at least use alloc_vm_area(). What made
> > map_vm_area() unsuitable for your needs? If you're remapping, you
> > should at least be guaranteed not to have to allocate pte pages.
>
> map_vm_area() needs a 'struct vm_struct' parameter, but for mapping
> kernel-allocated pages within the kernel, what should we pass here? I think
> we can instead use map_kernel_range_noflush() -- surprisingly,
> unmap_kernel_range_noflush() is exported but this one is not.
Creating a dependency on a core kernel change (even just an EXPORT_SYMBOL)
is probably not a good idea. Unless Dave vehemently objects, I'd suggest
implementing it both ways, leaving the method that relies on the
kernel change ifdef'd out, and adding this to "the list of things that
need to be done before zcache can be promoted out of staging".
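Concretely, I'm picturing something like this in the map path (the
config symbol is invented):

  #ifdef CONFIG_ZSMALLOC_KERNEL_MAPPING  /* hypothetical Kconfig symbol */
          /* clean route: needs EXPORT_SYMBOL(map_kernel_range_noflush)
           * in the core kernel first */
          map_kernel_range_noflush((unsigned long)area->vm->addr,
                                   2 * PAGE_SIZE, PAGE_KERNEL, pages);
  #else
          /* self-contained route from the current patch */
          set_pte(area->vm_ptes[0], mk_pte(page, PAGE_KERNEL));
          set_pte(area->vm_ptes[1], mk_pte(nextp, PAGE_KERNEL));
  #endif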
On 02/08/2012 01:39 PM, Dan Magenheimer wrote:
>> > map_vm_area() needs a 'struct vm_struct' parameter, but for mapping
>> > kernel-allocated pages within the kernel, what should we pass here? I think
>> > we can instead use map_kernel_range_noflush() -- surprisingly,
>> > unmap_kernel_range_noflush() is exported but this one is not.
> Creating a dependency on a core kernel change (even just an EXPORT_SYMBOL)
> is probably not a good idea. Unless Dave vehemently objects, I'd suggest
> implementing it both ways, leaving the method that relies on the
> kernel change ifdef'd out, and adding this to "the list of things that
> need to be done before zcache can be promoted out of staging".
Seems like a sane approach to me.