Hello,
if get_user_pages() is used on a hugetlb vma and there was no previous write
to the pages, follow_hugetlb_page() will call
ret = hugetlb_fault(mm, vma, vaddr, 0),
even though the page was requested for write access in get_user_pages().
We currently see this when testing Infiniband on ppc64 with ehca +
hugetlbfs.
From reading the code, this should also be an issue on other architectures.
Roland, Adam, are you aware of any issues in this area with Mellanox
InfiniBand cards or other I/O adapters?
Gruss / Regards
Christoph R. + Nam Ng.
Please try this patch and see if it helps.
commit 6decbd17d6fb70d50f6db2c348bb41d7246a67d1
Author: Adam Litke <[email protected]>
Date: Tue Nov 6 06:59:12 2007 -0800
hugetlb: follow_hugetlb_page for write access
When calling get_user_pages(), a write flag is passed in by the caller to
indicate if write access is required on the faulted-in pages. Currently,
follow_hugetlb_page() ignores this flag and always faults pages for
read-only access.
This patch passes the write flag down to follow_hugetlb_page() and makes
sure hugetlb_fault() is called with the right write_access parameter.
Test patch only. Not Signed-off.
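As an illustration only (a toy user-space model, not the kernel code itself --
the real functions live in mm/hugetlb.c and mm/memory.c), the change amounts
to forwarding the caller's write flag instead of hardcoding 0:

```c
#include <assert.h>

/* Toy model of the call chain the patch fixes; records which
 * write_access value reaches the stand-in for hugetlb_fault(). */
static int last_fault_write = -1;

/* stand-in for hugetlb_fault(): remembers the write_access it was given */
static int hugetlb_fault_model(int write_access)
{
	last_fault_write = write_access;
	return 0;
}

/* before the patch: the caller's write flag was dropped, 0 passed instead */
static int follow_hugetlb_page_old(int write)
{
	(void)write;
	return hugetlb_fault_model(0);
}

/* after the patch: the caller's write flag is forwarded unchanged */
static int follow_hugetlb_page_new(int write)
{
	return hugetlb_fault_model(write);
}
```

With the old behavior a write request still faults the page read-only, which
is exactly what sets up the later COW.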
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3a19b03..31fa0a0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -19,7 +19,7 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void __unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eab8c42..b645985 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -621,7 +621,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, int *length, int i)
+ unsigned long *position, int *length, int i,
+ int write)
{
unsigned long pfn_offset;
unsigned long vaddr = *position;
@@ -643,7 +644,7 @@ int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
int ret;
spin_unlock(&mm->page_table_lock);
- ret = hugetlb_fault(mm, vma, vaddr, 0);
+ ret = hugetlb_fault(mm, vma, vaddr, write);
spin_lock(&mm->page_table_lock);
if (!(ret & VM_FAULT_ERROR))
continue;
diff --git a/mm/memory.c b/mm/memory.c
index f82b359..1bcd444 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,7 +1039,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &len, i);
+ &start, &len, i, write);
continue;
}
On Tue, 2007-11-06 at 08:42 +0100, Christoph Raisch wrote:
> Hello,
> if get_user_pages is used on a hugetlb vma, and there was no previous write
> to the pages,
> follow_hugetlb_page will call
> ret = hugetlb_fault(mm, vma, vaddr, 0),
> although the page should be used for write access in get_user_pages.
>
> We currently see this when testing Infiniband on ppc64 with ehca +
> hugetlbfs.
> From reading the code this should also be an issue on other architectures.
> Roland, Adam, are you aware of anything in this area with mellanox
> Infiniband cards or other usages with I/O adapters?
>
> Gruss / Regards
> Christoph R. + Nam Ng.
>
>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
Hello Roland!
> We currently see this when testing Infiniband on ppc64 with ehca +
> hugetlbfs.
> From reading the code this should also be an issue on other architectures.
> Roland, Adam, are you aware of anything in this area with mellanox
> Infiniband cards or other usages with I/O adapters?
Below is a testcase demonstrating this problem. You need to install
libhugetlbfs.so and run it as below:
HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./hugetlb_ibtest 100
This testcase performs the following steps (high-level description):
1. malloc two buffers of 100MB each, for send and recv
2. register them as memory regions (MRs)
3. create a queue pair (QP)
4. send the data in the send buffer through the QP to itself (the target is then the recv buffer)
5. compare the contents of the two buffers
It runs fine without libhugetlbfs. If you run it with libhugetlbfs as
above, step 5 will fail. If you memset() the buffers before step 2
(registering the MRs), it runs without errors.
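A minimal sketch of that workaround (the helper name and the idea of touching
only one byte per chunk are ours -- the testcase simply memset()s the whole
buffer; `step` would be the huge page size): writing one byte per page forces
write faults before the buffer is handed to ibv_reg_mr().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the memset()-style workaround: write one byte per
 * `step`-sized chunk so every page of the buffer has been faulted
 * with write access before registration. */
static void touch_pages(unsigned char *buf, size_t len, size_t step)
{
	size_t off;

	for (off = 0; off < len; off += step)
		buf[off] = 0;
	if (len)
		buf[len - 1] = 0;	/* make sure the last page is hit */
}
```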
It appears that hugetlb_cow() is called when the first write access is
performed after the MRs have been registered. That means the testcase ends up
writing to different pages than the ones registered with the adapter...
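The same read-fault-then-COW divergence can be reproduced in plain user space
with a MAP_PRIVATE file mapping (an analogy only, since hugetlb_cow() is
kernel-internal): the write after the read fault lands in a private copy,
while the file, playing the role of the pages the adapter holds, keeps the
old data.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* User-space analogy for the COW divergence: fault a MAP_PRIVATE page
 * read-only, then write to it.  The write goes to a fresh private page;
 * the file underneath (standing in for the pages the adapter has
 * registered) still holds the original byte.  Returns 1 on divergence. */
static int private_cow_diverges(void)
{
	char path[] = "/tmp/hugetlb_cow_demo_XXXXXX";
	int fd = mkstemp(path);
	long pagesz = sysconf(_SC_PAGESIZE);
	char first, on_disk;
	char *map;

	assert(fd >= 0);
	unlink(path);			/* file lives until fd is closed */
	assert(pwrite(fd, "A", 1, 0) == 1);

	map = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	assert(map != MAP_FAILED);

	first = map[0];			/* read access: fault the page in */
	map[0] = 'B';			/* write: COW to a private page */

	assert(pread(fd, &on_disk, 1, 0) == 1);
	munmap(map, pagesz);
	close(fd);

	return first == 'A' && on_disk == 'A';
}
```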
I was able to reproduce this with mthca on 2.6.23/ppc64 and fc6/intel.
Regards
Nam
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <infiniband/verbs.h>
static unsigned int pagesize;
static unsigned int bufsize=1024*1024*19;
int cmp_data(void *s, void *d, unsigned long len, unsigned long *fail_pos)
{
unsigned char *cs = s, *cd = d;
assert(cs);
assert(cd);
assert(fail_pos);
*fail_pos = 0;
while (len) {
if (*cs < *cd)
return -1;
if (*cs > *cd)
return 1;
len--;
cs++;
cd++;
*fail_pos += 1;
}
return 0;
}
int hugetlb_ibtest(struct ibv_device* device)
{
struct ibv_context *context = NULL;
struct ibv_port_attr port_attr;
struct ibv_pd *pd = NULL;
struct ibv_cq *send_cq = NULL;
struct ibv_cq *recv_cq = NULL;
struct ibv_qp *qp = NULL;
struct ibv_mr *send_mr = NULL;
struct ibv_mr *recv_mr = NULL;
unsigned char *send_buffer = NULL;
unsigned char *recv_buffer = NULL;
int port = 1; // hardcoded for now
int rc = 0;
context = ibv_open_device(device);
assert(context!=NULL);
// query port
memset(&port_attr, 0, sizeof(port_attr));
rc = ibv_query_port(context, port, &port_attr);
assert(rc==0);
// pd
pd = ibv_alloc_pd(context);
assert(pd!=NULL);
// ah
struct ibv_ah_attr ah_attr = {
.is_global = 0,
.dlid = port_attr.lid,
.sl = 0,
.src_path_bits = 0,
.port_num = port,
.static_rate = 3
};
struct ibv_ah *ah = ibv_create_ah(pd, &ah_attr);
assert(ah!=NULL);
// send cq
send_cq = ibv_create_cq(context, 1, NULL, NULL, 0);
assert(send_cq!=NULL);
// recv cq
recv_cq = ibv_create_cq(context, 1, NULL, NULL, 0);
assert(recv_cq!=NULL);
// qp
struct ibv_qp_init_attr attr = {
.send_cq = send_cq,
.recv_cq = recv_cq,
.cap = {
.max_send_wr = 2,
.max_recv_wr = 2,
.max_send_sge = 1,
.max_recv_sge = 1
},
.qp_type = IBV_QPT_RC,
};
qp = ibv_create_qp(pd, &attr);
assert(qp!=NULL);
// qp RESET -> INIT
struct ibv_qp_attr qp_attr;
memset(&qp_attr, 0, sizeof(qp_attr));
qp_attr.qp_state = IBV_QPS_INIT;
qp_attr.pkey_index = 0;
qp_attr.port_num = port;
qp_attr.qp_access_flags = 0;
rc = ibv_modify_qp(qp, &qp_attr,
IBV_QP_STATE |
IBV_QP_PKEY_INDEX |
IBV_QP_PORT |
IBV_QP_ACCESS_FLAGS);
assert(rc==0);
// qp INIT -> RTR
memset(&qp_attr, 0, sizeof(qp_attr));
qp_attr.qp_state = IBV_QPS_RTR;
qp_attr.rq_psn = 0;
qp_attr.max_rd_atomic = 1;
qp_attr.dest_qp_num = qp->qp_num;
qp_attr.path_mtu = IBV_MTU_2048;
qp_attr.ah_attr = ah_attr;
qp_attr.min_rnr_timer = 0;
rc = ibv_modify_qp(qp, &qp_attr,
IBV_QP_STATE | IBV_QP_RQ_PSN |
IBV_QP_MAX_DEST_RD_ATOMIC |
IBV_QP_DEST_QPN | IBV_QP_PATH_MTU |
IBV_QP_AV | IBV_QP_MIN_RNR_TIMER);
assert(rc==0);
// qp RTR -> RTS
memset(&qp_attr, 0, sizeof(qp_attr));
qp_attr.qp_state = IBV_QPS_RTS;
qp_attr.sq_psn = 0;
qp_attr.max_dest_rd_atomic = 1;
qp_attr.timeout = 18;
qp_attr.retry_cnt = 1;
qp_attr.rnr_retry = 1;
rc = ibv_modify_qp(qp, &qp_attr,
IBV_QP_STATE | IBV_QP_SQ_PSN |
IBV_QP_MAX_QP_RD_ATOMIC |
IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
IBV_QP_RNR_RETRY);
assert(rc==0);
// mr recv
recv_buffer = malloc(bufsize);
assert(recv_buffer);
unsigned int i;
recv_mr = ibv_reg_mr(pd, recv_buffer, bufsize,
IBV_ACCESS_LOCAL_WRITE);
assert(recv_mr!=NULL);
for (i = 0; i < bufsize; i++)
recv_buffer[i] = ~(i & 0xff);
// qp post_recv
rc = ibv_req_notify_cq(recv_cq, 0);
assert(rc==0);
struct ibv_sge sge_recv = {
.addr = (uintptr_t) recv_buffer,
.length = bufsize,
.lkey = recv_mr->lkey
};
struct ibv_recv_wr recv_wr = {
.next = NULL,
.wr_id = 0x5003,
.sg_list = &sge_recv,
.num_sge = 1
};
struct ibv_recv_wr *bad_recv_wr = NULL;
rc = ibv_post_recv(qp, &recv_wr, &bad_recv_wr);
assert(rc==0);
// mr send
send_buffer = malloc(bufsize);
assert(send_buffer);
send_mr = ibv_reg_mr(pd, send_buffer, bufsize,
IBV_ACCESS_LOCAL_WRITE);
assert(send_mr!=NULL);
for (i = 0; i < bufsize; i++)
send_buffer[i] = (i & 0xff);
rc = ibv_req_notify_cq(send_cq, 0);
assert(rc==0);
strcpy(send_buffer, "300 lines for one packet");
int slen = strlen(send_buffer);
if (bufsize > slen*2+2)
strcpy(send_buffer+bufsize-slen-1, send_buffer);
struct ibv_sge sge_send = {
.addr = (uintptr_t) send_buffer,
.length = bufsize,
.lkey = send_mr->lkey
};
struct ibv_send_wr send_wr = {
.wr_id = 0x71032,
.sg_list = &sge_send,
.num_sge = 1,
.opcode = IBV_WR_SEND,
.send_flags = IBV_SEND_SIGNALED,
};
struct ibv_send_wr *bad_send_wr = NULL;
rc = ibv_post_send(qp, &send_wr, &bad_send_wr);
assert(rc==0);
// poll send completion
struct ibv_wc wc;
int ne;
memset(&wc, 0, sizeof(wc));
do {
ne = ibv_poll_cq(send_cq, 1, &wc);
} while (ne < 1);
assert(ne==1);
assert(wc.status==IBV_WC_SUCCESS);
// poll recv completion
memset(&wc, 0, sizeof(wc));
do {
ne = ibv_poll_cq(recv_cq, 1, &wc);
} while (ne < 1);
assert(ne==1);
assert(wc.status==IBV_WC_SUCCESS);
// check what we received is what we sent
printf("send: \"%s\"\n", send_buffer);
printf("recv: \"%s\"\n", recv_buffer);
unsigned long fail_pos;
rc = cmp_data(send_buffer, recv_buffer, bufsize, &fail_pos);
if (rc) {
printf("fail_pos=%lx send_buffer=%p recv_buffer=%p "
"%02x<>%02x\n", fail_pos, send_buffer, recv_buffer,
send_buffer[fail_pos], recv_buffer[fail_pos]);
FILE *f = fopen("hugetlb_ibtest.log", "w");
assert(f!=NULL);
fprintf(f, "fail_pos=%lx send_buffer=%p recv_buffer=%p "
"%02x<>%02x\n", fail_pos, send_buffer, recv_buffer,
send_buffer[fail_pos], recv_buffer[fail_pos]);
for (i = 0; i < bufsize; i += 16) {
unsigned int j;
fprintf(f, "%016lx %p ", (unsigned long)i, send_buffer + i);
for (j = 0; j < 16; j++)
fprintf(f, "%02x ", send_buffer[i + j]);
fprintf(f, " %p ", recv_buffer + i);
for (j = 0; j < 16; j++)
fprintf(f, "%02x ", recv_buffer[i + j]);
fprintf(f, "\n");
}
fclose(f);
printf("see log file hugetlb_ibtest.log\n");
}
// clean up
rc = ibv_dereg_mr(recv_mr);
assert(rc==0);
rc = ibv_dereg_mr(send_mr);
assert(rc==0);
rc = ibv_destroy_ah(ah);
assert(rc==0);
rc = ibv_destroy_qp(qp);
assert(rc==0);
rc = ibv_destroy_cq(send_cq);
assert(rc==0);
rc = ibv_destroy_cq(recv_cq);
assert(rc==0);
rc = ibv_dealloc_pd(pd);
assert(rc==0);
rc = ibv_close_device(context);
assert(rc==0);
return rc;
}
int main(int argc, char *argv[])
{
struct ibv_device **dev_array = ibv_get_device_list(NULL);
struct ibv_device *device = NULL;
assert(dev_array!=NULL);
device = dev_array[0]; // take first IB device
assert(device!=NULL);
pagesize = sysconf(_SC_PAGESIZE);
printf("pagesize=0x%x\n", pagesize);
if (argc > 1) {
int l = atoi(argv[1]);
if (l)
bufsize = 1024*1024*l;
}
printf("bufsize=0x%x\n", bufsize);
int rc = hugetlb_ibtest(device);
assert(rc==0);
printf("OK!\n");
return 0;
}
Hi Adam!
On Tuesday 06 November 2007 16:05, aglitke wrote:
> Please try this patch and see if it helps.
Tested on 2.6.22 (don't have the system with 2.6.23 at the moment) and
the testcase ran perfectly.
Thanks!
Nam
On Tue, Nov 06, 2007 at 04:06:04PM +0100, Hoang-Nam Nguyen wrote:
> Hello Roland!
> > We currently see this when testing Infiniband on ppc64 with ehca +
> > hugetlbfs.
> > From reading the code this should also be an issue on other architectures.
> > Roland, Adam, are you aware of anything in this area with mellanox
> > Infiniband cards or other usages with I/O adapters?
> Below is a testcase demonstrating this problem. You need to install
> libhugetlbfs.so and run it as below:
> HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./hugetlb_ibtest 100
>
> This testcase does the following steps (high level desc):
> 1. malloc two buffers each of 100MB for send and recv
> 2. register them as memory regions
> 3. create queue pair QP
> 4. send data in send buffer using QP to itself (target is then recv buffer)
> 5. compare those buffers content
>
> It runs fine without libhugetlbsf. If you call it with libhugetlbfs as
> above, step 5 will fail. If you do memset() of the buffers before step 2
> (register mr), then it runs without errors.
> It appears that hugetlb_cow() is called when first write access is performed
> after mrs have been registered. That means the testcase is seeing other pages
> than the ones registered to the adapter...
>
> I was able reproduce this with mthca on 2.6.23/ppc64 and fc6/intel.
We should cut this down to the bare essentials and fold it into the
libhugetlbfs testsuite.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[email protected] wrote on 06.11.2007 23:31:19:
> We should cut this down to the bare necessary and fold it into the
> libhugetlbfs testsuite.
Well, this testcase is already pretty close to the bare minimum of what's
needed to run IB/RDMA queues.
You can compare it to, for example, ibv_rc_pingpong in libibverbs...
Maybe it's possible to test this with something other than IB?
> --
Gruss / Regards
Christoph R