Subject: Re: [PATCH 0/9] powerpc: delete duplicated words
To:
 Joe Perches, Christophe Leroy
Cc: linuxppc-dev@lists.ozlabs.org, Paul Mackerras, linux-kernel@vger.kernel.org, Michael Ellerman
References: <20200726162902.Horde.TCqHYaODbkzEpM-rFzDd8A2@messagerie.si.c-s.fr>
From: Randy Dunlap
Message-ID: <4e505c35-8428-89bb-7f9b-bc819382c3cd@infradead.org>
Date: Sun, 26 Jul 2020 12:08:08 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: multipart/mixed; boundary="------------063DA744BA13B2CA39DCBC0B"
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This is a multi-part message in MIME format.
--------------063DA744BA13B2CA39DCBC0B
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

On 7/26/20 10:49 AM, Joe Perches wrote:
> On Sun, 2020-07-26 at 10:23 -0700, Randy Dunlap wrote:
>> On 7/26/20 7:29 AM, Christophe Leroy wrote:
>>> Randy Dunlap wrote:
>>>
>>>> Drop duplicated words in arch/powerpc/ header files.
>>>
>>> How did you detect them? Do you have some script for that, or do you
>>> just read all the comments?
>>
>> Yes, it's a script that finds lots of false positives, so I have to check
>> each and every one of them for validity.
>
> And it's a lot of work too.  (thanks Randy)
>
> It could be something like:
>
> $ grep-2.5.4 -nrP --include=*.[ch] '\b([A-Z]?[a-z]{2,}\b)[ \t]*(?:\n[ \t]*\*[ \t]*|)\1\b' * | \
>   grep -vP '\b(?:struct|enum|union)\s+([A-Z]?[a-z]{2,})\s+\*?\s*\1\b' | \
>   grep -vP '\blong\s+long\b' | \
>   grep -vP '\b([A-Z]?[a-z]{2,})(?:\t+| {2,})\1\b'

Hi Joe,

(what is grep-2.5.4 ?)

It looks like you tried a few iterations of this -- since it drops things
like "long long".  There are lots of data types that are repeated & valid.
And many struct names, like "struct kref kref", "struct completion completion",
and "struct mutex mutex".  I handle (ignore) those manually, although that
could be added to the Perl script.

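
A minimal sketch of how that filtering might be added, under stated
assumptions: the helper names is_type_decl_repeat() and is_valid_repeat()
and the %ok_repeat table are hypothetical and are not in the attached v0.2
script.  It only covers the "long long" and "struct foo foo" cases
mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical additions; none of these names exist in find_dup_words.pl v0.2.

# Words that may legitimately appear twice in a row in C source.
my %ok_repeat = map { $_ => 1 } qw(long);

# Return 1 when $word is doubled only because a struct/enum/union tag and
# the variable declared with it share a name, e.g. "struct kref kref".
sub is_type_decl_repeat {
    my ($line, $word) = @_;
    return 1 if $line =~ /\b(?:struct|enum|union)\s+\Q$word\E\s+\Q$word\E\b/;
    return 0;
}

# Return 1 when a repeated $word found on $line should not be reported.
sub is_valid_repeat {
    my ($line, $word) = @_;
    return 1 if $ok_repeat{lc $word};
    return is_type_decl_repeat($line, $word);
}
```

In the attached script this would be one more test next to is_numeric()
and is_special_chars(), made before report_words() is called.
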
v0.1 of this script also found lots of repeated numbers and strings of
special characters (ASCII art etc.), so now it ignores duplicated numbers
or special characters -- since it is really looking for duplicate words.

Anyway, I might as well attach it.  It's no big deal.
And if someone else wants to tackle using it, go for it.

-- 
~Randy

--------------063DA744BA13B2CA39DCBC0B
Content-Type: application/x-perl; name="find_dup_words.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="find_dup_words.pl"

#! /usr/bin/perl
# find duplicate words on one line
# also finds repeated last word on lineN and first word of lineN+1
#
# For source files (vs. Docs), drop a leading " *" in case it might be
# kernel-doc notation.  This would facilitate catching repeated words
# at the end of one line and the beginning of the next line, after the " *".

# TBD: print entire offending line(s) when a repeated word is found.
# can this use an environment variable instead of ARGV?

$VERSION = "v0.2";

my $infile;
my $line;
my $line_num;
my $last_word;
my $ix;

sub usage() {
    print "find_dup_words {$VERSION}\n";
    exit 1;
}

# test for integer number or hex number (0x0-9a-f)
sub is_numeric($) {
    my $var = shift;
    return 1 if ($var =~ /^[+-]?\d+$/);
    return 1 if ($var =~ /^0x[0-9A-F]+$/i);
    return 0;
}

# true if the word contains any character other than a letter, digit, or space
sub is_special_chars($) {
    my $var = shift;
    return 1 if ($var =~ /[^a-zA-Z0-9 ]/);
    ##return 1 if ($var =~ /[[:punct:]]*/);
    return 0;
}

sub report_words($$$$$) {
    my ($file, $lnum, $crossline, $word1, $word2) = @_;
    # "/=" marks a repeat that spans a line break, "==" one on a single line
    my $crossing = $crossline ? "/=" : "==";
    print "$file:$lnum: '$word1' $crossing '$word2'\n";
}

# debug helper: the last prototype slot is @ so the word list is not
# forced into scalar context
sub dump_line_words($$@) {
    my $lnum = shift(@_);
    my $mx = shift(@_);
    my @wrds = @_;
    print "## $lnum: #wrds=$mx: ";
    print "@wrds\n";
}

# main:
if (int(@ARGV) == 0 || $ARGV[0] eq "-h" || $ARGV[0] eq "--help") {
    usage();
}

foreach $infile (@ARGV) {
    open (INFILE, '<', $infile) or die "cannot open '$infile'\n";
    $line_num = 0;
    $last_word = "";

LINE: while ($line = <INFILE>) {
        $line_num++;
        chomp $line;
        next LINE if $line eq "";
        # drop common punctuation: period, comma, qmark, semi-colon, colon
        $line =~ tr/.,;:?//d;
        @words = split(/\s+/, $line);
        # For a line that begins with " * foobar() does soandso.",
        # words[0] is "" and words[1] eq "*", so ignore both of them.
        if ($words[0] eq "") { shift @words; }
        if ($words[0] eq "*") { shift @words; }
        next LINE if ($last_word eq "" && $words[0] eq "");
        ##dump_line_words($line_num, scalar @words, @words);
        $numwords = scalar @words;
        ##print "## $line_num: #wrds=$numwords:=\n";
        ##print "@words\n";

        # last word of the previous line vs. first word of this line
        if (lc($last_word) eq lc($words[0])) {
            if (is_numeric($last_word) || is_special_chars($last_word)) {}
            else {
                report_words($infile, $line_num, 1, $last_word, $words[0]);
            }
        }

        # note: using /m/ matches succeed on subsets,
        # e.g., "this" matches "is".  Not good.
        # So I am using lc(word1) eq lc(word2) instead.
        for ($ix = 1; $ix < scalar @words; $ix++) {
            if (lc($words[$ix - 1]) eq lc($words[$ix])) {
                if (is_numeric($words[$ix]) || is_special_chars($words[$ix])) {}
                else {
                    report_words($infile, $line_num, 0, $words[$ix - 1], $words[$ix]);
                }
            }
        }
        # take the last word directly; an index saved inside the loop would
        # be stale for lines that have fewer than two words
        $last_word = @words ? $words[-1] : "";
    } # end one infile

    close INFILE;
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
} # end for all infiles

# end find_dup_words.pl;
--------------063DA744BA13B2CA39DCBC0B--