Subject: Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
To: Eric Dumazet, David Miller
Cc: tariqt@mellanox.com, haakon.bugge@oracle.com, yanjun.zhu@oracle.com, netdev@vger.kernel.org, linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org, gi-oh.kim@profitbricks.com
References: <20180523232246.20445-1-qing.huang@oracle.com> <20180525.102321.858995452200286788.davem@davemloft.net> <7a353b65-6b7f-1aee-1c48-e83c8e02f693@gmail.com>
From: Qing Huang
Message-ID: <145e15de-f4b5-570e-1091-659c464be252@oracle.com>
Date: Wed, 30 May 2018 10:53:58 -0700
In-Reply-To: <7a353b65-6b7f-1aee-1c48-e83c8e02f693@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 5/29/2018 8:34 PM, Eric Dumazet wrote:
>
> On 05/25/2018 10:23 AM, David Miller wrote:
>> From: Qing Huang
>> Date: Wed, 23 May 2018 16:22:46 -0700
>>
>>> When a system is under memory pressure (high usage with fragments),
>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>> memory management to enter slow path doing memory compact/migration
>>> ops in order to complete high order memory allocations.
>>>
>>> When that happens, user processes calling uverb APIs may get stuck
>>> for more than 120s easily even though there are a lot of free pages
>>> in smaller chunks available in the system.
>>>
>>> Syslog:
>>> ...
>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>> ...
>>>
>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>
>>> However in order to support smaller ICM chunk size, we need to fix
>>> another issue in large size kcalloc allocations.
>>>
>>> E.g.
>>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
>>> entry). So we need a 16MB allocation for a table->icm pointer array to
>>> hold 2M pointers which can easily cause kcalloc to fail.
>>>
>>> The solution is to use kvzalloc to replace kcalloc which will fall back
>>> to vmalloc automatically if kmalloc fails.
>>>
>>> Signed-off-by: Qing Huang
>>> Acked-by: Daniel Jurgens
>>> Reviewed-by: Zhu Yanjun
>> Applied, thanks.
>>
> I must say this patch causes regressions here.
>
> KASAN is not happy.
>
> It looks like you guys did not really look at mlx4_alloc_icm()
>
> This function is properly handling high order allocations with fallbacks to order-0 pages
> under high memory pressure.
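
The sizing arithmetic in the quoted commit message can be checked with a small user-space C sketch. The helper names below are hypothetical (they are not mlx4 symbols), and the 8-byte pointer size is an assumption that matches x86_64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the mlx4 driver. */

/* Number of MTT entries that fit in one ICM chunk. */
static inline uint64_t mtt_entries_per_chunk(uint64_t chunk_bytes,
					     uint64_t entry_bytes)
{
	assert(entry_bytes != 0 && chunk_bytes % entry_bytes == 0);
	return chunk_bytes / entry_bytes;
}

/* Number of ICM chunks needed to hold 2^log_num_mtt MTT entries. */
static inline uint64_t icm_chunks_needed(unsigned int log_num_mtt,
					 uint64_t chunk_bytes,
					 uint64_t entry_bytes)
{
	uint64_t entries = (uint64_t)1 << log_num_mtt;
	uint64_t per_chunk = mtt_entries_per_chunk(chunk_bytes, entry_bytes);

	/* Round up so a partially filled last chunk is still counted. */
	return (entries + per_chunk - 1) / per_chunk;
}

/* Bytes needed for an array holding one pointer per ICM chunk. */
static inline uint64_t icm_pointer_array_bytes(uint64_t chunks,
					       uint64_t pointer_bytes)
{
	return chunks * pointer_bytes;
}
```

With chunk_bytes = 4096 and entry_bytes = 8, log_num_mtt=30 gives 512 entries per chunk, 2M chunks and a 16MB pointer array, which matches the numbers in the commit message. With the original 256KB chunks the same table needs 32768 chunks, so the pointer array is only 256KB.
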
>
> BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
> Read of size 4 at addr ffff8817df584f68 by task qp_listing_test/92585
>
> CPU: 38 PID: 92585 Comm: qp_listing_test Tainted: G O
> Call Trace:
>  [] dump_stack+0x4d/0x72
>  [] print_address_description+0x6f/0x260
>  [] kasan_report+0x257/0x370
>  [] __asan_report_load4_noabort+0x19/0x20
>  [] to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
>  [] mlx4_ib_query_qp+0x1213/0x1660 [mlx4_ib]
>  [] qpstat_print_qp+0x13b/0x500 [ib_uverbs]
>  [] qpstat_seq_show+0x4a/0xb0 [ib_uverbs]
>  [] seq_read+0xa9c/0x1230
>  [] proc_reg_read+0xc1/0x180
>  [] __vfs_read+0xe8/0x730
>  [] vfs_read+0xf7/0x300
>  [] SyS_read+0xd2/0x1b0
>  [] do_syscall_64+0x186/0x420
>  [] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x7f851a7bb30d
> RSP: 002b:00007ffd09a758c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 00007f84ff959440 RCX: 00007f851a7bb30d
> RDX: 000000000003fc00 RSI: 00007f84ff60a000 RDI: 000000000000000b
> RBP: 00007ffd09a75900 R08: 00000000ffffffff R09: 0000000000000000
> R10: 0000000000000022 R11: 0000000000000293 R12: 0000000000000000
> R13: 000000000003ffff R14: 000000000003ffff R15: 00007f84ff60a000
>
> Allocated by task 4488:
>  save_stack+0x46/0xd0
>  kasan_kmalloc+0xad/0xe0
>  __kmalloc+0x101/0x5e0
>  ib_register_device+0xc03/0x1250 [ib_core]
>  mlx4_ib_add+0x27d6/0x4dd0 [mlx4_ib]
>  mlx4_add_device+0xa9/0x340 [mlx4_core]
>  mlx4_register_interface+0x16e/0x390 [mlx4_core]
>  xhci_pci_remove+0x7a/0x180 [xhci_pci]
>  do_one_initcall+0xa0/0x230
>  do_init_module+0x1b9/0x5a4
>  load_module+0x63e6/0x94c0
>  SYSC_init_module+0x1a4/0x1c0
>  SyS_init_module+0xe/0x10
>  do_syscall_64+0x186/0x420
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>
> Freed by task 0:
> (stack is not available)
>
> The buggy address belongs to the object at ffff8817df584f40
>  which belongs to the cache kmalloc-32 of size 32
> The buggy address is located 8 bytes to the right of
>  32-byte region [ffff8817df584f40, ffff8817df584f60)
> The buggy address belongs to the page:
> page:ffffea005f7d6100 count:1 mapcount:0 mapping:ffff8817df584000 index:0xffff8817df584fc1
> flags: 0x880000000000100(slab)
> raw: 0880000000000100 ffff8817df584000 ffff8817df584fc1 000000010000003f
> raw: ffffea005f3ac0a0 ffffea005c476760 ffff8817fec00900 ffff883ff78d26c0
> page dumped because: kasan: bad access detected
> page->mem_cgroup:ffff883ff78d26c0

What kind of test case did you run? It looks like a bug somewhere in the code.
Perhaps smaller chunks make it easier to trigger, but we should fix the bug regardless.

>
> Memory state around the buggy address:
>  ffff8817df584e00: 00 03 fc fc fc fc fc fc 00 03 fc fc fc fc fc fc
>  ffff8817df584e80: 00 00 00 04 fc fc fc fc 00 00 00 fc fc fc fc fc
> >ffff8817df584f00: fb fb fb fb fc fc fc fc 00 00 00 00 fc fc fc fc
>                                                           ^
>  ffff8817df584f80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff8817df585000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
> I will test :
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 685337d58276fc91baeeb64387c52985e1bc6dda..4d2a71381acb739585d662175e86caef72338097 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,13 @@
>  #include "fw.h"
>  
>  /*
> - * We allocate in page size (default 4KB on many archs) chunks to avoid high
> - * order memory allocations in fragmented/high usage memory situation.
> + * We allocate in as big chunks as we can, up to a maximum of 256 KB
> + * per chunk. Note that the chunks are not necessarily in contiguous
> + * physical memory.
>   */
>  enum {
> -	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
> -	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
> +	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
> +	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
>  };
>  
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
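
For reference, the two values toggled by the diff above map to quite different page allocation orders. The following is a rough user-space sketch, assuming 4KB pages as on x86_64; order_for_size() is a hypothetical stand-in written for this illustration and only approximates what the kernel's get_order() computes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space helper (not a kernel API): smallest order such
 * that (page_size << order) >= size. Assumes page_size is a power of two.
 */
static inline unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* Double the span until it covers the requested size. */
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With a 4096-byte page, order_for_size(1 << 18, 4096) is 6 (64 contiguous pages), while order_for_size(4096, 4096) is 0. Which order is actually requested at runtime depends on the fallback logic in mlx4_alloc_icm() that Eric mentions, which this sketch does not model.
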