Amazon changed its ad format months ago, and ads that use the old format no longer work. I never got around to fixing all of the links in my PHP files on a Linux system, some of which go back many years, so many pages show a "not found" block where ads for Amazon books related to the article should appear. I've corrected a few pages when I needed to look one up again to recall how I had resolved a problem, but only a small number of the many affected pages. So I decided to determine how many such pages exist and to list the file locations along with the titles in the HTML code for the pages, i.e., the text that appears between <title> and </title>.
To find all the PHP files containing the old ads, I can search for "rcm.amazon", since I know that string is part of the old ad format, but not the new ad format.
$ grep -rwl "rcm.amazon" --include="*.php"
software/database/mysql/field-types.php
software/database/mysql/creating-mysql-db.php
software/database/collectorz/MC-Customization/index.php
security/malware/010210/index.php
security/malware/system_defender/033011/index.php
security/malware/111511/index.php
security/firewalls/netscreen/smtp-vip.php
security/firewalls/netscreen/syslog.php
The -rwl parameters to the grep command have the following meanings:

-r, --recursive
    Read all files under each directory, recursively, following symbolic links only if they are on the command line. This is equivalent to the -d recurse option.

-w, --word-regexp
    Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.

-l, --files-with-matches
    Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match. (-l is specified by POSIX.)
The -r parameter performs a recursive search from the directory where I ran the command down through all subdirectories within it. In this case, I could have omitted the -w, but I normally use it when performing such searches. I used -l because I just want to see the file names; I don't want anything else from the grep command.
I used --include="*.php" because I know that the text I'm searching for will be in files whose names end with .php, and I don't want the command to waste time searching other files. With that option, grep searches only files that have a name ending with .php.
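To see how those options interact, here is a minimal sketch that runs the same search against a throwaway directory tree. The file names and the sample iframe line are made up for illustration, and I pass "." explicitly as the starting directory, so the results are prefixed with "./".

```shell
# Build a small scratch tree; every file name and its contents are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/a/b"
printf '<iframe src="http://rcm.amazon.com/e/cm"></iframe>\n' > "$demo/a/old-ad.php"
printf '<iframe src="http://rcm.amazon.com/e/cm"></iframe>\n' > "$demo/a/b/notes.txt"
printf '<p>no ads here</p>\n' > "$demo/a/b/clean.php"
printf 'xrcm.amazon.com\n' > "$demo/a/b/embedded.php"

# -r descends into subdirectories, -l prints only file names,
# --include limits the search to *.php (so notes.txt is skipped), and
# -w rejects "xrcm.amazon" because the match is preceded by a word character.
matches=$(cd "$demo" && grep -rwl "rcm.amazon" --include="*.php" .)
echo "$matches"

rm -rf "$demo"
```

Only a/old-ad.php should be listed: notes.txt is excluded by --include, clean.php has no match, and embedded.php fails the whole-word test.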
If I want a count of the number of files that contain the text for which I'm searching, i.e., "rcm.amazon", I can pipe the output of the grep command into the wc (word count) utility.
$ grep -rwl "rcm.amazon" --include="*.php" | wc -l
215
The -l parameter to the wc command tells the utility that I only want to see a count of the number of lines.
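As a quick check of what that pipeline actually counts, the sketch below runs it against a scratch directory. The file names and contents are made up for illustration.

```shell
# Scratch directory with hypothetical files.
demo=$(mktemp -d)
printf 'rcm.amazon.com\nrcm.amazon.com\n' > "$demo/one.php"   # two matching lines
printf 'rcm.amazon.com\n' > "$demo/two.php"                    # one matching line
printf 'nothing to see\n' > "$demo/three.php"                  # no matching lines

# grep -l prints each matching file once, so wc -l counts files,
# not the number of matching lines.
count=$(cd "$demo" && grep -rwl "rcm.amazon" --include="*.php" . | wc -l)
echo "$count"

rm -rf "$demo"
```

Here the count is 2, even though three lines match, because one.php is listed only once.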
The PHP files are webpages and I'd like to know the title for each page. The title appears within the HTML code between the <title> and </title> tags, so I want to feed the output of the grep command into another grep command to show the titles. One way to do that is to put the first grep command within $() and use that in place of the file argument to the outer grep command. The shell runs the inner command and substitutes its output, the list of file names, for the file parameter of the outer grep command - see the answer provided by Gilles at How do I pass a list of files to grep.
$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php")
software/database/mysql/field-types.php:<title>MySQL Field Types</title>
software/database/mysql/creating-mysql-db.php:<title>Creating a MySQL Database</title>
software/database/collectorz/MC-Customization/index.php: <title>Movie Collector 6.4.1 Customization</title>
security/malware/010210/index.php:<title></title>
security/malware/system_defender/033011/index.php:<title>System Defender Infection</title>
security/malware/111511/index.php:<title>AV Security 2012v121.exe Rogue Antivirus Program</title>
security/firewalls/netscreen/smtp-vip.php:<title>Configuring a NetScreen Firewall for an Internal SMTP Server</title>
security/firewalls/netscreen/syslog.php:<title>Configuring a Netscreen Firewall for Syslog Server Support</title>
Note: the above command will fail if any file in the search path has a space in its name, e.g., test me.php, because the shell splits the substituted output on whitespace, but in my case I know no such files exist.
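If file names with spaces were a possibility, one alternative to $() is to pass the names through a pipe separated by NUL bytes. This is a sketch rather than what I ran; the two sample files are made up, and it assumes a grep that supports --null (GNU grep does, and also accepts -Z as the short form) and an xargs that supports -0.

```shell
# Scratch directory with one awkward file name; both files are hypothetical.
demo=$(mktemp -d)
printf '<title>Spaced Out</title>\nrcm.amazon.com\n' > "$demo/test me.php"
printf '<title>Plain</title>\nrcm.amazon.com\n' > "$demo/plain.php"

# --null ends each file name with a NUL byte instead of a newline;
# xargs -0 splits its input on NUL, so "test me.php" stays one argument.
# -H forces the file name prefix even if only one file is passed.
titles=$(cd "$demo" &&
  grep -rwl --null "rcm.amazon" --include="*.php" . |
  xargs -0 grep -H '<title>' | sort)
echo "$titles"

rm -rf "$demo"
```

The output has the same file:title layout as the $() version, so the sed and cut steps that follow can be applied to it unchanged.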
But I just want the title, not the <title> and </title>, so to eliminate those strings, I can feed the output into the sed command.
$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php") | sed -e 's/<title>//' | sed -e 's/<\/title>//'
software/database/mysql/field-types.php:MySQL Field Types
software/database/mysql/creating-mysql-db.php:Creating a MySQL Database
software/database/collectorz/MC-Customization/index.php: Movie Collector 6.4.1 Customization
security/malware/010210/index.php:
security/malware/system_defender/033011/index.php:System Defender Infection
security/malware/111511/index.php:AV Security 2012v121.exe Rogue Antivirus Program
security/firewalls/netscreen/smtp-vip.php:Configuring a NetScreen Firewall for an Internal SMTP Server
security/firewalls/netscreen/syslog.php:Configuring a Netscreen Firewall for Syslog Server Support
The s in 's/<title>//' tells sed that I want to search for whatever appears between the first two / (forward slash) characters and substitute whatever appears between the second and third forward slashes in its place - the "s" stands for substitute. In this case, since nothing appears between the second and third forward slashes, "<title>" is removed from the line with nothing substituted in its place. I then pipe the output into a second sed command to eliminate the "</title>". Since there is a forward slash within the string I want to search for, I don't want sed to interpret the expression to mean I want it to search just for "<", so I need to "escape" the special meaning the forward slash has for sed in this instance. I can do that by preceding that forward slash with an "escape character". The backslash character, i.e., \, is the escape character that takes away the special meaning of the forward slash before "title", so that it is included in the text for which sed will search.
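Another way to avoid the escaping, shown here only as a sketch, is to pick a different delimiter for the s command; sed accepts other characters, such as |, in place of the forward slash. The example also puts both substitutions in a single sed invocation by repeating -e. The input line is copied from the output above.

```shell
# One line of grep output, taken from the listing above.
line='software/database/mysql/field-types.php:<title>MySQL Field Types</title>'

# With | as the delimiter, the / in </title> needs no backslash.
# Two -e expressions run in order within one sed process.
stripped=$(printf '%s\n' "$line" | sed -e 's|<title>||' -e 's|</title>||')
echo "$stripped"
```

This yields the same file:title line as the two piped sed commands.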
The above string of commands will provide the file name followed by a colon and then the title for the web page. If I just want the title, I can pipe the output from the above commands into the cut utility.
$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php") | sed -e 's/<title>//' | sed -e 's/<\/title>//' | cut -d":" -f2
MySQL Field Types
Creating a MySQL Database
 Movie Collector 6.4.1 Customization

System Defender Infection
AV Security 2012v121.exe Rogue Antivirus Program
Configuring a NetScreen Firewall for an Internal SMTP Server
Configuring a Netscreen Firewall for Syslog Server Support
That command makes it clear that I have some unnecessary spaces or a tab character at the beginning of the title line for the Movie Collector page and a missing title for security/malware/010210/index.php.
The -d":" option to cut specifies that I want to use a colon as the delimiter between fields. The -f2 option instructs cut to just show me the second field, i.e., the one after the colon.
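One caveat with -f2 is that a title containing a colon would be cut short, since everything after the second colon belongs to a third field. None of the titles shown above contain one, so the example line below is hypothetical; the sketch compares -f2 with -f2-, which selects the second field through the end of the line.

```shell
# Hypothetical grep output line whose title contains a colon.
line='security/example.php:<title>Part 2: Cleanup</title>'
cleaned=$(printf '%s\n' "$line" | sed -e 's/<title>//' -e 's/<\/title>//')

# -f2 keeps only the text between the first and second colons.
second_only=$(printf '%s\n' "$cleaned" | cut -d":" -f2)

# -f2- keeps field 2 and everything after it, delimiters included.
second_onward=$(printf '%s\n' "$cleaned" | cut -d":" -f2-)

echo "$second_only"
echo "$second_onward"
```

The first echo prints "Part 2" and the second prints "Part 2: Cleanup".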
After inserting the missing title line in the file for which no title was shown, I modified the first sed command to ignore any spaces or tabs that occur before the title tag by using \s, which matches a whitespace character such as a space or tab. I followed the \s with an asterisk, *, which, for regular expressions understood by many Unix/Linux commands, means zero or more of the preceding character, so in this case sed will remove "<title>" or " <title>", i.e., it will remove the beginning title tag or, if there are any spaces or tabs before the tag, it will remove those and the title tag.
$ grep '<title>' $(grep -rwl "rcm.amazon" --include="*.php") | sed -e 's/\s*<title>//' | sed -e 's/<\/title>//' | cut -d":" -f2 | more
MySQL Field Types
Creating a MySQL Database
Movie Collector 6.4.1 Customization
Malware Scanning on Dell Inspiron 1526
System Defender Infection
AV Security 2012v121.exe Rogue Antivirus Program
Configuring a NetScreen Firewall for an Internal SMTP Server
Configuring a Netscreen Firewall for Syslog Server Support
Note: for POSIX-compliant systems and Mac OS X, you may need to use [[:space:]] instead of \s - see How to match whitespace in sed?.
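A minimal sketch of the [[:space:]] form follows. The input line imitates the Movie Collector entry with a space and a tab before the tag; the shortened file name and the exact whitespace are assumptions for illustration.

```shell
# Imitation of a grep output line with a space and a tab before <title>.
line=$(printf 'index.php: \t<title>Movie Collector 6.4.1 Customization</title>')

# [[:space:]]* matches zero or more whitespace characters (spaces, tabs, ...)
# and is part of POSIX regular expressions, unlike \s.
title=$(printf '%s\n' "$line" |
  sed -e 's/[[:space:]]*<title>//' -e 's/<\/title>//' |
  cut -d":" -f2)
echo "$title"
```

The result is the bare title with no leading whitespace, the same as the \s version produces with GNU sed.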