Subject: Re: 4.6, 4.7 slow nfs export with more than one client.
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset=us-ascii
From: Oleg Drokin
In-Reply-To: <1473172215.13234.8.camel@redhat.com>
Date: Tue, 6 Sep 2016 10:58:22 -0400
Cc: linux-nfs@vger.kernel.org
References: <6C329B27-111A-4B16-84F4-7357940EBC01@linuxhacker.ru> <1473172215.13234.8.camel@redhat.com>
To: Jeff Layton
Sender: linux-nfs-owner@vger.kernel.org

On Sep 6, 2016, at 10:30 AM, Jeff Layton wrote:

> On Mon, 2016-09-05 at 00:55 -0400, Oleg Drokin wrote:
>> Hello!
>>
>> I have a somewhat mysterious problem with my nfs test rig that I suspect
>> is something stupid I am missing, but I cannot figure it out and would
>> appreciate any help.
>>
>> The NFS server is Fedora23 with 4.6.7-200.fc23.x86_64 as the kernel.
>> The clients are a bunch of 4.8-rc5 nodes, nfsroot.
>> If I only start one of them, all is fine; if I start all 9 or 10, then
>> suddenly all operations grind to a halt (nfs-wise). On the NFS server
>> side there is very little load.
>>
>> I hit this (or something similar) back in June, when testing 4.6-rcs
>> (and the server was running 4.4.something, I believe), and back then,
>> after some mucking around, I set:
>> net.core.rmem_default=268435456
>> net.core.wmem_default=268435456
>> net.core.rmem_max=268435456
>> net.core.wmem_max=268435456
>>
>> and while I have no idea why, that helped, so I stopped looking into it
>> completely.
>>
>> Fast forward to now: I am back at the same problem and the workaround
>> above does not help anymore.
>>
>> I also have a bunch of "NFSD: client 192.168.10.191 testing state ID
>> with incorrect client ID" in my logs (I also had these in June; I tried
>> disabling nfs 4.2 and 4.1 and that did not help).
>>
>> So anyway, I discovered nfsdcltrack and such, and I noticed that whenever
>> the kernel calls it, it is always with the same hexid of
>> 4c696e7578204e465376342e32206c6f63616c686f7374
>>
>> Naturally, if I try to list the content of the sqlite file, I get:
>> sqlite> select * from clients;
>> Linux NFSv4.2 localhost|1473049735|1
>> sqlite> select * from clients;
>> Linux NFSv4.2 localhost|1473049736|1
>> sqlite> select * from clients;
>> Linux NFSv4.2 localhost|1473049737|1
>> sqlite> select * from clients;
>> Linux NFSv4.2 localhost|1473049751|1
>> sqlite> select * from clients;
>> Linux NFSv4.2 localhost|1473049752|1
>> sqlite>
>>

> Well, not exactly. It sounds like the clients are all using the same
> long-form clientid string. The server sees that and tosses out any
> state that was previously established by the earlier client, because it
> assumes that the client rebooted.
>
> The easiest way to work around this is to use the nfs4_unique_id nfs.ko
> module parm on the clients to give them each a unique string id. That
> should prevent the collisions.

Hm, but it did work OK in the past. What determines the unique id by
default now? The clients do start with different IP addresses, for one,
so that seems to be a much better proxy for a unique id (or the local
ip/server ip pair, as in the centos7 case) than whatever the local
hostname happens to be at some random point during boot (where it might
not be set yet, apparently).
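For reference, a minimal sketch of that workaround (the id value and the
modprobe.d file name below are only placeholders; for nfsroot clients the
kernel command line form is probably the more practical one, since the
parameter has to be in place before the root filesystem is mounted):

    # per client, e.g. in /etc/modprobe.d/nfs-client.conf (any .conf name works)
    options nfs nfs4_unique_id=fedora-1-1-nfsroot

    # or, for nfsroot clients, on the kernel command line
    nfs.nfs4_unique_id=fedora-1-1-nfsroot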
>
>> (the number keeps changing), so it looks like client id detection broke somehow?
>>
>> These same clients (and a bunch more) also mount another nfs server (for
>> crashdump purposes) that is centos7-based; there everything is detected
>> correctly and performance is ok. The select shows:
>> sqlite> select * from clients;
>> Linux NFSv4.0 192.168.10.219/192.168.10.1 tcp|1472868376|0
>> Linux NFSv4.0 192.168.10.218/192.168.10.1 tcp|1472868376|0
>> Linux NFSv4.0 192.168.10.210/192.168.10.1 tcp|1472868384|0
>> Linux NFSv4.0 192.168.10.221/192.168.10.1 tcp|1472868387|0
>> Linux NFSv4.0 192.168.10.220/192.168.10.1 tcp|1472868388|0
>> Linux NFSv4.0 192.168.10.211/192.168.10.1 tcp|1472868389|0
>> Linux NFSv4.0 192.168.10.222/192.168.10.1 tcp|1473035496|0
>> Linux NFSv4.0 192.168.10.217/192.168.10.1 tcp|1473035500|0
>> Linux NFSv4.0 192.168.10.216/192.168.10.1 tcp|1473035501|0
>> Linux NFSv4.0 192.168.10.224/192.168.10.1 tcp|1473035520|0
>> Linux NFSv4.0 192.168.10.226/192.168.10.1 tcp|1473045789|0
>> Linux NFSv4.0 192.168.10.227/192.168.10.1 tcp|1473045789|0
>> Linux NFSv4.1 fedora1.localnet|1473046045|1
>> Linux NFSv4.1 fedora-1-3.localnet|1473046139|1
>> Linux NFSv4.1 fedora-2-4.localnet|1473046229|1
>> Linux NFSv4.1 fedora-1-1.localnet|1473046244|1
>> Linux NFSv4.1 fedora-1-4.localnet|1473046251|1
>> Linux NFSv4.1 fedora-2-1.localnet|1473046342|1
>> Linux NFSv4.1 fedora-1-2.localnet|1473046498|1
>> Linux NFSv4.1 fedora-2-3.localnet|1473046524|1
>> Linux NFSv4.1 fedora-2-2.localnet|1473046689|1
>> sqlite>
>>
>> (The first, nameless bunch is the centos7 nfsroot clients; the fedora*
>> bunch are the ones on 4.8-rc5.)
>> If I try to mount the Fedora23 server from one of the centos7 clients,
>> the record does not appear in the output either.
>>
>> Now, while a theory that "aha, it's nfs 4.2 that is broken with Fedora23"
>> might look plausible, I have another Fedora23 server that is mounted by
>> yet another (single) client, and there things seem to be fine:
>> sqlite> select * from clients;
>> Linux NFSv4.2 xbmc.localnet|1471825025|1
>>
>>
>> So with all of that in the picture, I wonder what it is I am doing wrong
>> just on this server?
>>
>> Thanks.
>>
>> Bye,
>> Oleg

> --
> Jeff Layton
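For reference, a quick way to double-check what the server has recorded,
assuming nfsdcltrack's default database location of
/var/lib/nfs/nfsdcltrack/main.sqlite; the hexid is the one quoted earlier
in the thread, and it decodes to the same "Linux NFSv4.2 localhost" string
that shows up in the clients table:

    # decode the hexid that nfsdcltrack is being called with
    echo 4c696e7578204e465376342e32206c6f63616c686f7374 | xxd -r -p; echo
    # -> Linux NFSv4.2 localhost

    # list the recorded clients without an interactive sqlite session
    sqlite3 /var/lib/nfs/nfsdcltrack/main.sqlite 'select * from clients;'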