MoonPoint Support Weblog

Find files containing a string and then extract another string from the files

Amazon changed the format it uses for ads months ago, and ads in the old format no longer work. I never got around to fixing all of the links in the PHP files on my Linux system, some of which go back many years, so many pages show a "not found" block where ads for Amazon books related to the article should appear. I've corrected a few pages when I needed to revisit them to recall how I had resolved a problem, but only a small number of the many affected pages. So I decided to determine how many such pages exist and to list the file locations along with the page titles, i.e., the text that appears between <title> and </title> in the HTML code.

To find all the PHP files containing the old ads, I can search for "rcm.amazon", since I know that string is part of the old ad format, but not the new ad format.

$ grep -rwl "rcm.amazon" --include="*.php"
software/database/mysql/field-types.php
software/database/mysql/creating-mysql-db.php
software/database/collectorz/MC-Customization/index.php
security/malware/010210/index.php
security/malware/system_defender/033011/index.php
security/malware/111511/index.php
security/firewalls/netscreen/smtp-vip.php
security/firewalls/netscreen/syslog.php

The -rwl parameters to the grep command have the following meanings:

-r, --recursive
       Read  all  files  under  each  directory, recursively, following
       symbolic links only if they are on the command  line.   This  is
       equivalent to the -d recurse option.

-w, --word-regexp
       Select  only  those  lines  containing  matches  that form whole
       words.  The test is that the matching substring must  either  be
       at  the  beginning  of  the  line,  or  preceded  by  a non-word
       constituent character.  Similarly, it must be either at the  end
       of  the  line  or  followed by a non-word constituent character.
       Word-constituent  characters  are  letters,  digits,   and   the
       underscore.

-l, --files-with-matches
       Suppress  normal  output;  instead  print the name of each input
       file from which output would normally have  been  printed.   The
       scanning  will  stop  on  the  first match.  (-l is specified by
       POSIX.)

The -r parameter performs a recursive search from the directory where I ran the command down through all subdirectories within it. In this case, I could have omitted the -w, but I normally use it when performing such searches. I used -l because I just want to see the file names; I don't want anything else from the grep command.

I used --include="*.php" because I know that the text I'm searching for will be in files that have a file name ending with .php; I don't want the command to waste time searching in other files. That option ensures that the grep command searches only files that have a name ending with .php.
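The effect of those options can be checked safely in a scratch directory. The following is a minimal sketch, not part of the original procedure: the directory layout, file names, and file contents are made up for illustration, and the -F option is an addition that treats the pattern as a literal string, so the "." in "rcm.amazon" cannot match an arbitrary character.

```shell
# Build a throwaway tree so the example does not touch real files.
# All names and contents below are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/a/b"
printf '<iframe src="//rcm.amazon.com/placeholder"></iframe>\n' > "$demo/a/old.php"
printf '<iframe src="//rcm.amazon.com/placeholder"></iframe>\n' > "$demo/a/b/notes.txt"
printf '<p>no ads here</p>\n' > "$demo/a/b/new.php"

# -r descends into subdirectories, -l prints only file names,
# --include limits the search to names ending in .php, and
# -F (an addition) makes the pattern a fixed string, not a regex.
matches=$(cd "$demo" && grep -rlF --include="*.php" "rcm.amazon" .)
echo "$matches"

# Clean up the scratch tree.
rm -rf "$demo"
```

Only a/old.php is listed: notes.txt contains the string but is skipped by --include, and new.php has the right extension but no match.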

If I want a count of the number of files that contain the text for which I'm searching, i.e., "rcm.amazon", I can pipe the output of the grep command into the wc (word count) utility.

$ grep -rwl "rcm.amazon" --include="*.php" | wc -l
215

The -l parameter to the wc command tells the utility that I only want to see a count of the number of lines.

The PHP files are webpages, and I'd like to know the title of each page. The title appears in the HTML code between the <title> and </title> tags, so I want to feed the output of the grep command into another grep command that shows the titles. One way to do that is to put the first grep command within $() and use it in place of the file argument to the outer grep command. The shell runs the inner command and substitutes its output as the file arguments for the outer grep command - see the answer provided by Gilles at How do I pass a list of files to grep.

$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php")
software/database/mysql/field-types.php:<title>MySQL Field Types</title>
software/database/mysql/creating-mysql-db.php:<title>Creating a MySQL Database</title>
software/database/collectorz/MC-Customization/index.php:  <title>Movie Collector 6.4.1 Customization</title>
security/malware/010210/index.php:<title></title>
security/malware/system_defender/033011/index.php:<title>System Defender Infection</title>
security/malware/111511/index.php:<title>AV Security 2012v121.exe Rogue Antivirus Program</title>
security/firewalls/netscreen/smtp-vip.php:<title>Configuring a NetScreen Firewall for an Internal SMTP Server</title>
security/firewalls/netscreen/syslog.php:<title>Configuring a Netscreen Firewall for Syslog Server Support</title>

Note: the above command will fail if any file in the search path has a space in its name, e.g., test me.php, because the shell splits the substituted output on whitespace; in my case I know no such files exist.
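If file names with spaces might exist, one alternative is to pass the names between the two grep commands separated by NUL bytes rather than relying on $(). This is a sketch of that swapped-in technique, not the method used in this post; the scratch directory and the two sample files are hypothetical, and it assumes a grep that supports --null and an xargs that supports -0 (both GNU and BSD versions do).

```shell
# Scratch directory with one file name that contains a space.
# File names and contents are hypothetical.
demo=$(mktemp -d)
printf '<title>Spaced Out</title>\nrcm.amazon.com\n' > "$demo/test me.php"
printf '<title>Plain</title>\nrcm.amazon.com\n' > "$demo/plain.php"

# --null makes the inner grep end each file name with a NUL byte,
# and xargs -0 splits its input only on NUL, so "test me.php"
# reaches the outer grep as a single argument.
# -H forces the file name prefix even if only one file matches.
titles=$(cd "$demo" && grep -rl --null --include="*.php" "rcm.amazon" . \
  | xargs -0 grep -H '<title>' | sort)
echo "$titles"

# Clean up the scratch tree.
rm -rf "$demo"
```

The sort is there only to make the order of the output predictable for the example.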

But I just want the title, not the <title> and </title>, so to eliminate those strings, I can feed the output into the sed command.

$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php") | sed -e 's/<title>//' | sed -e 's/<\/title>//'
software/database/mysql/field-types.php:MySQL Field Types
software/database/mysql/creating-mysql-db.php:Creating a MySQL Database
software/database/collectorz/MC-Customization/index.php:  Movie Collector 6.4.1 Customization
security/malware/010210/index.php:
security/malware/system_defender/033011/index.php:System Defender Infection
security/malware/111511/index.php:AV Security 2012v121.exe Rogue Antivirus Program
security/firewalls/netscreen/smtp-vip.php:Configuring a NetScreen Firewall for an Internal SMTP Server
security/firewalls/netscreen/syslog.php:Configuring a Netscreen Firewall for Syslog Server Support

The s in 's/<title>//' tells sed to search for whatever appears between the first two / (forward slash) characters and substitute in its place whatever appears between the second and third forward slashes - the "s" stands for substitute. Since nothing appears between the second and third forward slashes here, "<title>" is removed from the line with nothing put in its place. I then pipe the output into a second sed command to remove "</title>". Since that string contains a forward slash, and I don't want sed to treat it as the end of the search pattern and search just for "<", I need to "escape" the special meaning the forward slash has for sed in this instance. I can do that by preceding the forward slash with an escape character. The backslash character, i.e., \, is the escape character; it takes away the special meaning of the forward slash before "title", so that the slash is included in the text for which sed will search.
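As a variation on the two piped sed commands, both substitutions can be given to one sed process, and a different delimiter can be chosen so the forward slash needs no escaping. This is only a sketch of that alternative; the sample line is copied from the output above and fed in with printf so nothing on disk is read.

```shell
# One line of the earlier grep output, used as sample input.
line='security/firewalls/netscreen/syslog.php:<title>Configuring a Netscreen Firewall for Syslog Server Support</title>'

# Two -e expressions run in order inside a single sed process.
# The second uses | as the delimiter, so the / in </title>
# does not need a backslash in front of it.
result=$(printf '%s\n' "$line" | sed -e 's/<title>//' -e 's|</title>||')
echo "$result"
```

The character that follows the s becomes the delimiter for that expression, so any character that does not occur in the search text can be used.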

The above string of commands will provide the file name followed by a colon and then the title for the web page. If I just want the title, I can pipe the output from the above commands into the cut utility.

$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php") | sed -e 's/<title>//' | sed -e 's/<\/title>//' | cut -d":" -f2
MySQL Field Types
Creating a MySQL Database
  Movie Collector 6.4.1 Customization

System Defender Infection
AV Security 2012v121.exe Rogue Antivirus Program
Configuring a NetScreen Firewall for an Internal SMTP Server
Configuring a Netscreen Firewall for Syslog Server Support

That command makes it clear that I have some unnecessary spaces or a tab character at the beginning of the title line for the Movie Collector page and a missing title for security/malware/010210/index.php.

The -d":" option to cut specifies that I want to use a colon as the delimiter between fields. The -f2 option instructs cut to just show me the second field, i.e., the one after the colon.
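One thing to watch for with -f2 is a title that itself contains a colon, since cut would then return only the part before that colon. The sketch below uses a made-up file name and title to show the difference between -f2 and -f2-, which asks for the second field and everything after it.

```shell
# Hypothetical grep-style line whose title contains a colon.
line='docs/example.php:Backups: A Short Guide'

# -f2 keeps only the text between the first and second colons.
second_only=$(printf '%s\n' "$line" | cut -d":" -f2)

# -f2- keeps the second field through the end of the line,
# including any later colons.
second_onward=$(printf '%s\n' "$line" | cut -d":" -f2-)

echo "$second_only"
echo "$second_onward"
```

None of the titles listed above contain a colon, so -f2 is sufficient for them. Another option that avoids cut altogether is grep's -h option, which suppresses the file name prefix.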

After inserting the missing title line in the file for which no title was shown, I modified the first sed command to remove any spaces or tabs that precede the opening title tag by using \s, which matches a whitespace character such as a space or tab. I followed the \s with an asterisk, *, which, in the regular expressions understood by many Unix/Linux commands, means zero or more of the preceding character. So in this case sed will remove "<title>" or "  <title>", i.e., it will remove the opening title tag along with any spaces or tabs before it.

$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php") | sed -e 's/\s*<title>//' | sed -e 's/<\/title>//' | cut -d":" -f2 | more
MySQL Field Types
Creating a MySQL Database
Movie Collector 6.4.1 Customization
Malware Scanning on Dell Inspiron 1526
System Defender Infection
AV Security 2012v121.exe Rogue Antivirus Program
Configuring a NetScreen Firewall for an Internal SMTP Server
Configuring a Netscreen Firewall for Syslog Server Support

Note: \s is a GNU extension, so on other POSIX-compliant systems and Mac OS X, you may need to use [[:space:]] instead of \s - see How to match whitespace in sed?.
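A portable form of the same cleanup, using the POSIX character class, might look like the sketch below. The sample line is the Movie Collector entry from the earlier output, fed in with printf; combining the two substitutions into one sed call is a convenience, not something the original pipeline requires.

```shell
# The line with leading spaces before the title tag, from the earlier output.
line='software/database/collectorz/MC-Customization/index.php:  <title>Movie Collector 6.4.1 Customization</title>'

# [[:space:]]* matches zero or more whitespace characters and is
# understood by sed implementations that lack the \s shorthand.
result=$(printf '%s\n' "$line" \
  | sed -e 's/[[:space:]]*<title>//' -e 's/<\/title>//' \
  | cut -d":" -f2)
echo "$result"
```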


Sat, Jan 23, 2016 10:56 pm