#!/usr/local/bin/perl 

############################################################################
#
#   Copyright (c) 2000 by the President and Fellows of Harvard College
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or (at
#   your option) any later version.
#
#   This program is distributed in the hope that it will be useful, but
#   WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
#   General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
#   USA.
#
#   Contact information:
#
#   Phil Mitchell
#   Office for Information Systems
#   Harvard University
#   philip_mitchell at harvard.edu
#
#############################################################################
=head1 README

When called without args, this script reads a list of URLs, one per line, from
$INPUT_FILE and tries to access the url using the appropriate protocol. This
includes following redirects until either:
  1. the target page is successfully received; or
  2. a page cycle is detected; or
  3. a bad server or page request is detected; or
  4. a maximum number of redirects ($MAX_REDIRECTS) is exceeded.

Urls that return an error are rechecked in additional passes. The number of
additional passes and the sleep time between passes are configurable. 
The results are recorded in $OUTPUT_FILE. A summary
of the results is emailed to $ADMIN_EMAIL, if it is defined.

When called with a single url as an arg, this script applies the
same procedure (but only one pass) to that url, and prints the HTTP result code to
stdout, along with verbose output.

Protocols supported: http, https, ftp, gopher, file, telnet.

Output file record format: prefix HTTP (or other) status code to input record,
separated by a space. 

Status codes:
  Success: All successful response codes have the form: 2xx. Because we limit
  the size of responses we accept, we get a lot of 206's in addition to 200's.
  UNSUPPORTED_PROTOCOL:
      Urlcheck handles {http, https, ftp, gopher, file, telnet}. Other                                
	 protocols will get this error. More commonly, it is the result of a 	             
	 typo (eg. "thttp://").                                              
  MALFORMED_URL: The url is syntactically incorrect. EG., "http:/www.domain.com".
  TELNET_FAILURE: Couldn't open the requested telnet connection.
  HTTP_0_9_FAIL: Failed HTTP/0.9 connection (0.9 does not return status codes).
  REDIRECT_LIMIT_EXCEEDED:
      Too many redirections. This error code should not normally be
      received, it is in place to catch infinite redirect cycles.
  UNKNOWN_ERROR: Rarely, LWP or HTTP modules will die, reporting an error that
      is not useful to us. This error code should not normally be received; it
      will generally be corrected in subsequent passes.

There are various configurable parameters documented below. In addition to setting the
input and output filenames, the most important ones are those that control the timeout,
the number of retries, and the time between retries. These settings have an important
effect on the accuracy of results. 

Accuracy of results:

Informal tests (results can be found at the end of this script) have shown that: (1) a
timeout of 30 sec is adequate; increasing to 60 sec is not useful; 10 seconds is too
short. (2) The absolute number of recheck passes is less important than spreading them
over time. Reasonable results are obtained with 3 recheck passes, each separated by 8
hours of sleep.

In our set of about 10,000 urls, a first pass produces about 800 (8%) bad
urls. Subsequent passes will reduce that to about 650 (6.5%). The use of telnet retry
will reach another 30% of those apparently bad urls. The estimate of total bad urls in
our sample is thus 4.5%. That list of bad urls is consistent across distinct runs of
the link checker at greater than 99%. Handchecking of a large sample from this final list
indicates a high degree of accuracy.

Notes: 

 - A "page cycle" is the use of a redirect or refresh tag to cycle through
 a list of one or more pages for data refresh purposes. 

Design Notes:

- Cookies: This version accepts all cookies. This allows it to handle some
URLs which require cookies.

- Timeout bug: Due to an apparent bug in the interaction between Solaris and certain web
servers, some http responses come back improperly terminated. As a result, LWP times out
and reports a server error when a (nearly) valid response has been received. To avoid
this, we open a telnet connection to the relevant port (usually 80) and do a manual GET on
the url. Telnet will also time out in this case, but telnet.pm provides a dump of the
partial response received, and we use this.

- WWW unreliability: Any given access to a server on the web is subject to various kinds
of flakiness. To avoid false reports of bad servers, it is essential to re-test all
errors, preferably over a period of hours or days. This script completes a first pass
through all urls, typically taking 8 hours or more on 10,000 urls. Then it performs 
additional ($RECHECKS) passes on all urls that received error codes. It sleeps
($HOURS_TO_SLEEP) between passes to improve the chances of getting a valid return code.

- Redirects and cycles: The challenge is to follow redirects all the way to the end of
the line, but know when to stop. It is complicated by the fact that some sites use the
meta refresh tag for their redirection, and by the fact that some sites have infinite
loop cycles for page refresh purposes. Five distinct cases have been identified:
   1. Proper redirect, using Location header. (Action: Follow redirect.)
   2. Proper meta refresh, on a single page. (Action: Detect cycle and exit.)
   3. Proper meta refresh, on a cycle of pages. (Action:Detect cycle and exit.)
   4. Redirect using meta refresh. (Action: Follow redirect.)
   5. Redirect loop on a single page for setting cookies. (Action: Follow redirect.)

Maintenance and Future Development Notes:

 - 401's and 403's: Currently does not handle authentication; just reports these as errors.

 - Cookie warnings: With perl's -w option, many warnings will be received about
 Cookies.pm. This seems to be due to the fact that Cookies.pm does not cleanly handle
 incorrectly formatted cookies. As far as I know, these warnings may be safely ignored.

Author: Phil Mitchell
Date: 02/22/01
Version: 1.00

=head1 PREREQUISITES

 This script requires: C<LWP>, C<HTTP>, and C<Net::Telnet>.

=pod SCRIPT CATEGORIES

 Web

=cut

use strict;
use LWP::UserAgent;
use HTTP::Response;
use HTTP::Message;
use HTTP::Headers;
use HTTP::Request;
use HTTP::Cookies;
use Net::Telnet;
#use LWP::Debug qw(+);

###########################################
#  Global variables
###########################################

use vars qw(
	    %url_hash
	    $HTTP_DEFAULT_PORT
	    $HTTP_VERSION
	    $ADMIN_EMAIL
	    $MAX_REDIRECTS
	    $RECHECKS
	    $HOURS_TO_SLEEP
	    $AGENT_TIMEOUT
	    $AGENT_MAX_RESPONSE
	    $INPUT_FILE
	    $OUTPUT_FILE
	    $TMP_FILE
	    $TELNET_LOGFILE
	    $ADMIN_LOGFILE
	    $REDIRECT_LIMIT_EXCEEDED
	    $UNSUPPORTED_PROTOCOL
	    $MALFORMED_URL
	    $HTTP_0_9_OKAY
	    $HTTP_0_9_FAIL
	    $UNKNOWN_ERROR
	    $VERBOSE
	    $DEBUG
	    $LOGGING
	    $TELNET_SUCCESS
	    $TELNET_FAILURE
	    $agent
	    $telnetAgent
	    $cookieJar
	    $redirectCount
	    );

###########################################
#  Configurable parameters
###########################################

$ADMIN_EMAIL = ''; # If non-empty, script will send confirmation and result stats.
$AGENT_TIMEOUT = 10; # In seconds, time for http agent to wait. 10 secs is often too
		     # short, leads to spurious reports of server errors. Longer than 
                     # 30 secs not usually helpful.
$AGENT_MAX_RESPONSE = 524288; # In bytes, max response to accept. Mainly want to
			      # avoid being swamped by something huge. 
$MAX_REDIRECTS = 15; # Number of redirects to tolerate before giving up. Should never hit
		     # this limit; it's here to avoid infinite loop.
$RECHECKS = 3; # Number of recheck passes to recheck urls that return error codes. Note
	       # that every server error automatically gets one retry via telnet.
$HOURS_TO_SLEEP = 0; # Number of hours to sleep between recheck passes.
$HTTP_DEFAULT_PORT = 80;
$HTTP_VERSION = 'HTTP/1.0'; # Perl's HTTP module defaults to 0.9
$INPUT_FILE  = "CURRENT.URLS.TXT";
$INPUT_FILE  = "smalltest.txt";
$OUTPUT_FILE = "OUT.URLS.TXT";
$ADMIN_LOGFILE = "admin_logfile.txt"; # Log for result stats.
$VERBOSE = 1; # If 1, print processing status to stdout
$DEBUG = 0; # If 1, provides additional output to stdout; mainly HTTP headers.
$LOGGING = 1; # Enable logging to $ADMIN_LOGFILE.

###########################################
#  Misc. initializations
###########################################

$TMP_FILE = "tmp.txt";
$TELNET_LOGFILE = "telnet_logfile.txt"; # Used internally to buffer data.

# Response codes. All successful response codes have the form: 2xx.
$REDIRECT_LIMIT_EXCEEDED = 'REDIRECT_LIMIT_EXCEEDED'; 
$UNSUPPORTED_PROTOCOL = 'UNSUPPORTED_PROTOCOL'; 
$MALFORMED_URL = 'MALFORMED_URL'; 
$TELNET_FAILURE = 'TELNET_FAILURE';
$HTTP_0_9_FAIL = 'HTTP_0_9_FAIL';
$UNKNOWN_ERROR = 'UNKNOWN_ERROR';
$TELNET_SUCCESS = 299; # Mimic a successful HTTP code
$HTTP_0_9_OKAY = 298;

###########################################
#  Main routine
###########################################

# Set up the web agents and helpers.
$agent = new LWP::UserAgent;
$agent->timeout($AGENT_TIMEOUT);
$agent->max_size($AGENT_MAX_RESPONSE);
$cookieJar = new HTTP::Cookies;
$telnetAgent = new Net::Telnet(Timeout => $AGENT_TIMEOUT,
			       Errmode => 'return');

my ($url, $result, $newResult, %results, $outputStr, $urlCount, $count, $recheckCount,
    %resultSummary);

# SINGLE URL MODE: Call script with url on command line to process
# single url in verbose mode.
if (@ARGV) {
    $VERBOSE = 1;
    $LOGGING = 0;
    $url = $ARGV[0];
    print "Status code: ", follow_url($url), "\n";
    exit;
}

open (INPUT, $INPUT_FILE) || die "Couldn't open $INPUT_FILE: $!\n";
open (OUTPUT, "+>$OUTPUT_FILE") || die "Couldn't open $OUTPUT_FILE: $!\n";
if ($LOGGING) {
    open (ADMINLOG, ">>$ADMIN_LOGFILE") || die "Couldn't open $ADMIN_LOGFILE: $!\n";
    print ADMINLOG ">> ", scalar(localtime), "\n";
}

# FIRST LOOP: 
#    LOOP THROUGH INPUT FILE AND CHECK EACH URL
while (<INPUT>) {
    chomp;
    $url = $_;
    undef %url_hash; # Clear out urls from previous run
    $result = follow_url($url);
    $urlCount++;
    if ($VERBOSE) { print "First pass, total URLs followed: $urlCount\n"; }
    $results{$result}++; # Keep a tally of each result code.
    print OUTPUT "$result $url\n";
    if ($VERBOSE) { print "Result: $result\n\n"; }   
}
if ($VERBOSE) { print "Total of $urlCount urls processed.\n"; }
if ($LOGGING) { print ADMINLOG "First pass: total URLs followed: $urlCount.\n"; }

# RECHECK LOOP: 
#     LOOP THROUGH OUTPUT FILE AND RECHECK ERRORS
$recheckCount = 0;
while (++$recheckCount <= $RECHECKS) {
    if ($VERBOSE) { print "Starting recheck pass \#$recheckCount ...\n\n"; }
    if ($VERBOSE) { print "Going to sleep for $HOURS_TO_SLEEP hours ...\n"; }
    sleep(60*60*$HOURS_TO_SLEEP);
    open (TMP, ">$TMP_FILE") || die "Couldn't open $TMP_FILE: $!\n";
    $count = 0;
    seek(OUTPUT, 0, 0); # Reset OUTPUT_FILE to beginning
    while (<OUTPUT>) {
	chomp;
	($result, $url) = split / /;
	if ($result !~ /2../)  {
	    undef %url_hash; # Clear out urls from previous run
	    $newResult = follow_url($url);
	    $count++;
	    if ($VERBOSE) { print "Recheck pass \#$recheckCount, total URLs followed: $count\n"; }
	    if ($VERBOSE) { print "Result: $newResult\n"; }   
	    if ($result ne $newResult) {
		if ($VERBOSE) { print "Inconsistency found: $result/$newResult\n"; }   
		# As long as the new result is not a server error, we'll use it. This assumes
		# that it's very unlikely to get a 2xx that is false; so the new result can't
		# be any worse than old. It is useful to save both results for the
		# administrator, but we report only the new result into the output file.
		$results{"$result/$newResult"}++;
		if ($newResult !~ /5../) {
		    $result = "$newResult";
		}
	    }
	    if ($VERBOSE) { print "\n"; }   
	}
	print TMP "$result $url\n";
    }
    if ($LOGGING) { print ADMINLOG "Recheck pass \#$recheckCount, total URLs followed: $count\n"; }
    close(TMP) or warn "Can't close $TMP_FILE: $!\n";
    rename($TMP_FILE, $OUTPUT_FILE) or die "Can't rename $TMP_FILE to $OUTPUT_FILE: $!\n";
    # Need to reopen output file (perl will close it first), to flush buffers.
    open (OUTPUT, "<$OUTPUT_FILE") || die "Couldn't open $OUTPUT_FILE: $!\n";

} # end recheck loop

# FINAL PASS:
#    Collect summary stats.
seek(OUTPUT, 0, 0) || warn "Couldn't reset $OUTPUT_FILE: $!\n"; 
while (<OUTPUT>) {
    chomp;
    ($result, $url) = split / /;
    if ($result =~ /(\d)../) { $resultSummary{"$1xx"}++; }
    else  { $resultSummary{"OTHER"}++; }

} # end final pass

if ($LOGGING) { print_admin_log(\%results, \%resultSummary, $urlCount); }
send_admin_mail("$0 succeeded", \%results, \%resultSummary);
close(INPUT) or warn "Can't close $INPUT_FILE: $!\n";
close(OUTPUT) or warn "Can't close $OUTPUT_FILE: $!\n";
if ($LOGGING) {close(ADMINLOG) or warn "Can't close $ADMIN_LOGFILE: $!\n"; }
if (-e $TELNET_LOGFILE) {
  unlink($TELNET_LOGFILE) or warn "Can't remove $TELNET_LOGFILE: $!\n"; }
exit;

###########################################
#  check_for_meta_refresh
###########################################
# Routine that searches input string for something of the form:
#    <meta http-equiv="refresh" [verbiage] url=http://some.url.com">
# It is tolerant of extra whitespace, single or no quotes instead of
# doublequotes, spaces around equals signs, and extra verbiage, and is
# case-insensitive. 
# Call with: String of content to be searched
# Returns: url, if a meta refresh is found; otherwise returns
# empty string.
sub check_for_meta_refresh {
    if ($DEBUG) { print "check_for_meta_refresh()...\n"; }
    my $inputStr = shift;
    if ($inputStr =~
  m{
      <meta \s+ http-equiv \s* = \s* ["']? refresh ["']? [^>]+? url
               \s* = \s* ["']? ([^"' >]+) ["']? [^>]+? >
           }ix)
     {
       return $1;
     }
     else {
       return "";
     }
}#end check_for_meta_refresh

###########################################
#  follow_url
###########################################
# Tries to access a given url. The main case is HTTP protocol, but also handles any
# protocol handled by LWP, plus telnet. For telnet, just tries to open a connection. 
# For HTTP, follows redirects until a final status code is received or until
# $MAX_REDIRECTS is exceeded. Accepts all cookies. To avoid infinite loops, detects
# page refresh cycles.
# Call with: url, and optional second arg of referring url which is used to absolutize url.
# Returns: HTTP status code, or internal response codes (see above).
sub follow_url  {
    my ($url, $referrer) = @_;
    return $MALFORMED_URL unless $url;
    my ($response, $protocol, $host, $port, $ping, $telnetResult, $request, $statusCode,
	$new_url);
    if ($VERBOSE || $DEBUG) { print "follow_url(): $url\n"; }
    $url_hash{$url} = 1; # Track all urls in each run, to detect cycles. 
    # Note: It is crucial to hash this url BEFORE absolutizing it, b/c we will test for
    # cycles before absolutizing.
    if ($referrer) { $url = make_url_absolute($url, $referrer); }
    if (keys(%url_hash) > $MAX_REDIRECTS) {
	if ($VERBOSE) { print "Redirect limit exceeded.\n"; }
	return $REDIRECT_LIMIT_EXCEEDED;
    }

    # EXTRACT PROTOCOL, HOST, AND (OPTIONAL) PORT.
    $url =~ m{ ^\s* ([a-z]+) :// ([^/:]+) }ix; 
    if (!($1 && $2)) {
	if ($VERBOSE) { print "URL not well-formed.\n"; }
	return $MALFORMED_URL;
    }
    else {
	$protocol = $1;
	$host = $2;
    }
    $url =~ m{ \w+ :// [^/]+ : (\d+) }x; # Extract port
    if ($1) { $port = $1; }

    # HANDLE TELNET REQUESTS -- just see if we can open the connection.
    if ($protocol =~ /^telnet$/i) {
	if ($port) {
	    $ping = $telnetAgent->open(Host => $host,
				       Port => $port);
	}
	else {
	    $ping = $telnetAgent->open(Host => $host);
	}
	if (!$ping) { return $TELNET_FAILURE; }
	else { return $TELNET_SUCCESS; }
    }

    # HANDLE ALL OTHER REQUESTS (HTTP, HTTPS, FTP, GOPHER, FILE)
    if (!$agent->is_protocol_supported($protocol))  {
	if ($VERBOSE) { print "Protocol not supported.\n"; }
	return $UNSUPPORTED_PROTOCOL;
    }
    # Use eval to avoid aborting if LWP or HTTP sends "die".
    eval {
	$request = HTTP::Request->new(GET => $url);
	$request->protocol($HTTP_VERSION);
	$cookieJar->add_cookie_header($request);
	if ($DEBUG) { print "\nRequest: \n", $request->as_string; }

	# Use simple_request so we don't follow redirects automatically
	$response = $agent->simple_request($request);
	$cookieJar->extract_cookies($response);
	$statusCode = $response->code;
    };
    if ($@)  {
	if ($VERBOSE) { print "LWP or HTTP error: $@\n"; }
	if ($LOGGING) { print ADMINLOG "LWP or HTTP error: $@\n"; }
	return $UNKNOWN_ERROR;
    }
    if ($DEBUG) { print "Status: $statusCode\n"; }
    if ($DEBUG) { print "\nResponse Header: \n", $response->headers->as_string; }
    
    # Note: In case of timeout, agent sets $statusCode to server error.
    if ($statusCode =~ /2../) { 
	if ($VERBOSE) { print "Good response, checking for meta refresh tag...\n"; }
	$new_url = check_for_meta_refresh($response->content);
	if ($new_url ne "") {
	    if (exists($url_hash{$new_url}))  { 
		if ($VERBOSE) { print "This url already visited ... returning $statusCode.\n"; }
		return $statusCode; }
	    else  {
		if ($VERBOSE) { print "Refresh to: $new_url\n"; }
		return follow_url($new_url, $url);
	    }
	}
	else { return $statusCode;}
    }
    elsif ($statusCode =~ /3../) { 
	if ($VERBOSE) { print "Proper redirect...\n"; }
	# Note that we don't check for page cycles here. Some sites will redirect
	# to the same page while setting cookies, but eventually they'll stop.
	$new_url = $response->headers->header('Location');
	if ($VERBOSE) { print "Redirect to: $new_url\n"; }
	return follow_url($new_url, $url);
    }
    elsif ($statusCode =~ /4../) { 
	if ($VERBOSE) { print "Client error...\n"; }
	return $statusCode;
    }
    elsif ($statusCode =~ /5../) { 
	if ($VERBOSE) { print "Server error...\n"; }
    # You might be tempted to do a retry right here. It is problematic b/c you need to do
    # another follow_url, but that will clash with url_hash -- it will look like a page
    # cycle. But if you do the retry by hand w/ a simple request, you don't handle all the
    # cases properly. What we do is retry once using telnet, and leave other retries to
    # subsequent passes following main loop.
	if ($protocol =~ /^http$/i) { # Only works for HTTP requests.
	    $telnetResult = telnet_http_retry($host, $url, $request, $port);
	    if ($telnetResult ne 'FAIL') {
		$statusCode = $telnetResult;
	    }
	}
	return $statusCode;
    } # end 5xx case.
    else { # Everything else case.
	return $statusCode;
    }

} # end sub follow_url

###########################################
#  get_location_header
###########################################
# Extracts the url from the Location field of an HTTP redirect.
# Call with: ref to array of header lines, w or w/o body at end.
# Returns: URL found in Location header, or empty string.
sub get_location_header {

    if ($VERBOSE || $DEBUG) { print "Looking for location header... \n"; }	
    my ($headersRef) = @_;
    my $line;

    while ($line = shift @$headersRef) {
	if ($DEBUG) { print "Checking line: $line\n"; }
	last if $line =~ /^\s$/;
	if ($line =~ m{^Location: \s* (\S+)}x) {
	if ($DEBUG) { print "Line found: $line\n"; }	    
	    return $1;
	}
    }
    return "";

} # end sub get_location_header

###########################################
#  make_url_absolute
###########################################
# Make a relative url absolute by appending it to path of old url.
# Call with: a fully qualified url as second arg, which will provide
# path info for relative url which is first arg.
# Returns: new absolute url
sub make_url_absolute {

    if ($DEBUG) { print "make_url_absolute()...\n"; }
    my ($new_url, $old_url) = @_;

    # Test to see if it's already absolute (starts w/ a syntactically correct scheme)
    if ($new_url =~ m{^[a-z]+://}i) {
	return $new_url;
    }
    
    if ($VERBOSE) { print "Adding path to relative url: $new_url\n"; }    
    # Case 1: new url is relative to root; it starts with slash, and
    #         should be appended to raw domain name. 
    if ($new_url =~ m{^/} ) {
	$old_url =~ m{ (\w+ :// [^/]+) }x;	
	if ($VERBOSE) { print "Case 1: append to $1\n"; }    
	return $1 . $new_url;
    }
    # For cases 2 & 3, assume new url is relative to current directory;
    # Case 2: old url contains a trailing slash, eg. http://www.fib.com/bigfib/;
    #         may or may not contain trailing filename
    elsif ($old_url =~ m{ (\w+://\S+/) }x ) {
	if ($VERBOSE) { print "Case 2: append to $1\n"; }    
	return $1 . $new_url;
    }
    # Case 3: old url has no trailing slash, eg. http://www.fab.net
    else {
	if ($VERBOSE) { print "Case 3: append to $old_url/\n"; }    
	return "$old_url/$new_url";
    }
    
} # End make_url_absolute

###########################################
#  print_admin_log
###########################################
# Prints results to $ADMIN_LOGFILE.
# Call with: hashrefs to results and result summary, and total url count.
sub print_admin_log  {

    my ($resultsHash, $summaryHash, $count) = @_;
    for my $code (sort keys %$resultsHash) {
	print ADMINLOG "$code: $resultsHash->{$code}\n";
    }
    for my $code (sort keys %$summaryHash) {
	print ADMINLOG "$code:$summaryHash->{$code}\t";
    }
    print ADMINLOG "\n";
    print ADMINLOG ">> ", scalar(localtime), "\n";
    print ADMINLOG "\n";

} # end sub print_admin_log 

###########################################
#  send_admin_mail
###########################################
# Sends administrator ($ADMIN_EMAIL) email confirmation that script
# has succeeded, and includes tally for each result code.
# Call with: msg for subject line, and hashref of result codes
sub send_admin_mail {

return unless $ADMIN_EMAIL;
if ($DEBUG) { print "send_admin_mail()...\n"; }
my ($subject, $resultsHash, $summaryHash) = @_;
my $message = "HTTP Status Code: Number of instances\n";
my $code;
foreach $code (sort keys %$resultsHash) {
    $message = $message . "$code: $resultsHash->{$code}\n";
}
foreach $code (sort keys %$summaryHash) {
    $message = $message . "$code:$summaryHash->{$code}  ";
}
$message = $message . "\n";

if (open(SENDMAIL, "|/usr/lib/sendmail -t -odq")) {
print SENDMAIL <<EOF;
To: $ADMIN_EMAIL
Subject: Urlcheck automail: $subject
$message
EOF
}
else { warn "Can't open sendmail: $!\n"; }
if ($VERBOSE) { print "\n", $message; }
} # End send_admin_mail

###########################################
#  telnet_http_retry
###########################################
# Open a telnet connection to a host and try an HTTP GET for an url. The response is
# processed according to status code similarly to follow_url, and calls follow_url to
# handle redirects. Uses an LWP request object b/c that's a convenient way to stick
# cookies into the request string.
# Note: Handles the Solaris/LWP bug (cf notes above) by reading the telnet.pm input_log if
# telnet times out.
# Call with: hostname, absolute url, LWP request object, and optional port (default is
# $HTTP_DEFAULT_PORT). 
# Returns: status code, or 'FAIL' if can't make telnet connection.
sub telnet_http_retry {

    if ($VERBOSE || $DEBUG) { print "Telnet HTTP retry...\n"; }
    my ($host, $url, $request, $port) = @_;
    my ($telnetAgent, @lines, @buffer, $statusLine, $line, $logfileHandle, $httpVersion,
	$statusCode, $message, $contentStr, $new_url);
    open(LOGFILE, "+>$TELNET_LOGFILE") || warn "Can't open $TELNET_LOGFILE.\n";
    if (!$port || $port !~ /^\d+$/) { $port = $HTTP_DEFAULT_PORT; }
    # Create agent and open connection.
    $telnetAgent = Net::Telnet->new(Host  => $host,
				    Port  => $port,
				    Input_log => $TELNET_LOGFILE,
				    Timeout => $AGENT_TIMEOUT,
				    Errmode => "return");
    return 'FAIL' unless $telnetAgent; # Can't open telnet connection.
    $telnetAgent->max_buffer_length($AGENT_MAX_RESPONSE);

    # Send the request.
    $telnetAgent->print($request->as_string, "\n");
    # Get the response as array of lines.
    while (@buffer = $telnetAgent->getlines) {
	push (@lines, @buffer);
    }
    if ($telnetAgent->timed_out) {
	if ($VERBOSE) { print "Telnet http timed out. Using input log...\n"; }
	    undef @lines; 
	    while (<LOGFILE>) {
		push (@lines, $_);
	    }
	close LOGFILE or warn "Problem closing $TELNET_LOGFILE.\n";
    }
    if (!@lines) { 
	if ($VERBOSE) { print "No data received.\n"; }
	return 'FAIL';
    }
    if ($DEBUG) { print @lines,"\n"; }
    $statusLine = shift @lines; 
    # We can only process status line and headers if the response is HTTP/1.0 or
    # better. This regexp copied from LWP::Protocol::http.pm.
    if ($statusLine =~ /^(HTTP\/\d+\.\d+)[ \t]+(\d+)[ \t]*([^\012]*)\012/) {
	# HTTP/1.0 response or better
	($httpVersion, $statusCode, $message) = ($1, $2, $3);
	chomp $message;
	if ($VERBOSE) { print "Status line: $httpVersion $statusCode $message \n\n"; }

	if  ($statusCode =~ /2../)  {
	    while ($line = shift @lines) { # Flatten array of lines.
		$contentStr .= $line; 
	    }
	    $new_url = check_for_meta_refresh($contentStr);
	    if ($new_url ne "") {
		if (exists($url_hash{$new_url}))  { 
		    if ($VERBOSE) { print "This url already visited ... returning $statusCode.\n"; }
		    return $statusCode; }
		else  {
		    if ($VERBOSE) { print "Refresh to: $new_url\n"; }
		    # Return whatever status code we get from new url
		    return follow_url($new_url, $url);
		}
	    }
	    else { return $statusCode;}
	}
	elsif  ($statusCode =~ /3../)  {
	    if ($VERBOSE) { print "Proper redirect...\n"; }
	    $new_url = get_location_header(\@lines);
	    if ($new_url ne "") {
		if (exists($url_hash{$new_url}))  { 
		    if ($VERBOSE) { print "This url already visited ... returning $statusCode.\n"; }
		    return $statusCode; }
		else  {
		    if ($VERBOSE) { print "Redirect to: $new_url\n"; }
		    # Return whatever status code we get from new url
		    return follow_url($new_url, $url);
		}
	    }
	    else { return $statusCode;}
	}
	elsif ($statusCode =~ m{4.. | 5..}x)  {
	    return $statusCode;
	}
    } # if valid status line
    else { unshift(@lines, $statusLine); }
    # If no status line, could be HTTP/0.9 server, which just sends back content. If
    # it contains a tag like <html...>, assume it's okay.
    if ($VERBOSE) {  print "Assuming HTTP/0.9 or less... \n"; }		   
    while ($line = shift @lines) { # Flatten array of lines.
	$contentStr .= $line; 
    }
    if ($contentStr =~ /<html /i) {
	return $HTTP_0_9_OKAY;
    }
    else { return $HTTP_0_9_FAIL; }

} # end sub telnet_http_retry

###########################################
#  END (Unused snippets and test results, below)
###########################################

# NOTES:
# 1. It would be nice to have a robust facility for absolutizing URLs. I tried using
# URI.pm for this purpose and found it to be not robust. EG., it allows the construction
# of: http:/www.yahoo.com, which is not well-formed.
# 1. Tolerance of meta refresh tag match?
# 3. some duplicate code went from follow_url to the telnet_http_retry; could be factored.

# USEFUL TEST CASE URLS
#$result = follow_url("http://www.yahoo.com");
#$result = follow_url("http://www.nyt.com");
#$result = follow_url("http://www.dofijdlfkj.com");
# meta redirect, absolute url
#$result = follow_url("http://www.cis.upenn.edu/~db/");
# meta redirect, relative url
#$result = follow_url("http://www.island.net/~pboylan/main/lynx/DOCS/test/refresh_test.html");
#proper simple redirect
#$result = follow_url("http://nrs.harvard.edu/nrs:hul.eresource:britanni");
#proper redirect w/ authentication
#$result = follow_url("http://nrs.harvard.edu/nrs:hul.eresource:genetics");
#$result = follow_url("http://nrs.harvard.edu/nrs:hul.eresource:geomarin");
$result = follow_url("http://ldi.harvard.edu/dcpp/AAL4519.html");

# These are needed for init_HULAccess_cookie, etc.
#  	    $HUL_ACCESS_URL
#  	    $HULACCESS_COOKIE_NAME
#  	    $HULACCESS_COOKIE
#  	    $HULACCESS_COOKIE_TRIES
#	    $FIRST_TIME

#  $HUL_ACCESS_URL = 'http://access.hul.harvard.edu';
#  $HULACCESS_COOKIE_NAME = 'hulaccess'; # Name of cookie that HUL Access sets.
#  $HULACCESS_COOKIE = 0; # Flag indicating whether cookie has been set.
#  $HULACCESS_COOKIE_TRIES = 0; # Keep track of number of tries to init cookie.

###########################################
#  init_HULAccess_cookie
###########################################
#UPDATE this: we now test to see if there's a cookie...
# Initialize a cookie that will be acceptable to HUL Access for all future gets. The first
# time a protected resource is requested from NRS, it will redirect to HUL Access. That
# redirect url should be passed to this routine. We assume that the first get to HUL
# Access will result in a test cookie, and a second get is required to receive the valid
# cookie. Once acquired, this cookie can be used from the $cookieJar via HTTP:Cookies.
# Call with: any url that uses HUL Access mechanism
# Returns: 'SUCCESS', or undef if it fails. 
# Notes: I did not want use a fixed url to initialize this, b/c if that url is later
# changed to unrestricted status, it will no longer invoke the access mechanism.
# Also, we keep track of number of attempts to init this cookie and will die if can't do
# it after 3 tries (on separate urls). In this case, email is sent to $ADMIN_EMAIL.
#  sub init_HULAccess_cookie  {

#      my $url = shift;
#      # Initial request to HUL Access
#      my $request = HTTP::Request->new(GET => $url);
#      my $response = $agent->simple_request($request);
#  #    print "\nTest cookie: \n", $response->headers->as_string;
#      print "\nset cookie: \n", $response->headers->header("set-cookie");
#      print "\nset cookie: \n", $response->headers->header("set-cookie2");
#      #needed here: a test to see if the header contains a set-cookie request
#      # if it doesn't, we should just return

# I was starting to modify this routine; then the spec changed and now we just accept
# all cookies.

#  #      if (is_empty($cookieJar)) {
#  #  	return "NO_COOKIE";
#  #      }

#      # First cookie will be a test cookie
#      $HULACCESS_COOKIE_TRIES++;
#      if ($VERBOSE) { print "Trying to initialize HUL Access cookie...\n"; }
#      $cookieJar->extract_cookies($response);
#      my $new_url = $response->headers->header('Location');

#      $request = HTTP::Request->new(GET => $new_url);
#      $cookieJar->add_cookie_header($request);
#      $cookieJar->clear;
#      my $result = $cookieJar->scan(\&cookie_test);
#      $response = $agent->simple_request($request);
#  #    print "\nReal cookie: \n", $response->headers->as_string;

#      # Second cookie is the real cookie
#      $cookieJar->extract_cookies($response);

#      # Test to make sure we've set the cookie, and set a flag
#      $cookieJar->scan(\&cookie_test);
#      if ($HULACCESS_COOKIE) { return 'SUCCESS'; }
#      else {
#  	if ($VERBOSE) { print "Unable to initialize HULAccess Cookie.\n" };
#  	if ($HULACCESS_COOKIE_TRIES > 2) {
#  	    send_dmin_mail("$0 failed! Unable to initialize HULAccess Cookie.");
#  	    die "Unable to initialize HULAccess cookie. Giving up...!\n";
#  	}	
#  	return 'FAIL'; 
#      }

#  }# End init_HULAccess_cookie

###########################################
#  cookie_test
###########################################
#  Callback routine which is passed to cookieJar->scan. For each cookie, it receives a list
#  of cookie attributes; it tests to see whether there is a cookie that has the name
#  $HULACCESS_COOKIE_NAME. If found, it sets a global flag indicating that the HUL Access cookie has
#  been set. (This somewhat ugly solution was chosen b/c HTTP::Cookies does not seem to
#  provide a nice way to verify the existence of cookies.)
#  sub cookie_test {

#      if ($_[1] eq $HULACCESS_COOKIE_NAME) {
#  	$HULACCESS_COOKIE = 1;
#  	if ($VERBOSE) { print "Cookie initialized.\n"; }
#      }
#  }


###########################################
#  TEST RESULTS
###########################################

#  30 sec, 3 rechecks, 0 sleep, telnet retry disabled.
#  >> Thu Feb  8 16:48:24 2001
#  First pass: total URLs followed: 10339.
#  Recheck pass #1, total URLs followed: 792
#  Recheck pass #2, total URLs followed: 685
#  Recheck pass #3, total URLs followed: 662
#  2xx: 9690, 4xx: 256, 5xx: 318, OTHER: 75
#  >> Fri Feb  9 06:37:48 2001

#  60 sec, 1 rechecks, 0 sleep
#  Recheck pass #1, total URLs followed: 662
#  2xx: 9867, 4xx: 254, 5xx: 143, OTHER: 75

#  60 sec, 3 rechecks, 0 sleep
#  >> Sun Feb 11 14:13:40 2001
#  First pass: total URLs followed: 10339.
#  Recheck pass #1, total URLs followed: 463
#  Recheck pass #2, total URLs followed: 449
#  Recheck pass #3, total URLs followed: 440
#  2xx: 9902, 4xx: 257, 5xx: 105, OTHER: 75
#  >> Mon Feb 12 12:53:23 2001

#  30 sec, 3 rechecks, 0 sleep
#  >> Fri Feb  9 16:12:23 2001
#  First pass: total URLs followed: 10339.
#  Recheck pass #1, total URLs followed: 468
#  Recheck pass #2, total URLs followed: 440
#  Recheck pass #3, total URLs followed: 439
#  2xx: 9900, 4xx: 258, 5xx: 106, OTHER: 75
#  >> Sat Feb 10 10:03:04 2001

#  30 sec, 3 rechecks, 8hrs sleep
#  >> Fri Feb  9 16:14:18 2001
#  First pass: total URLs followed: 10339.
#  Recheck pass #1, total URLs followed: 468
#  Recheck pass #2, total URLs followed: 440
#  Recheck pass #3, total URLs followed: 435
#  2xx: 9904, 4xx: 258, 5xx: 102, OTHER: 75
#  >> Sun Feb 11 11:11:15 2001

#  30 sec, 3 rechecks, 8hrs sleep (second time)
#  >> Sun Feb 11 14:26:22 2001
#  First pass: total URLs followed: 10339.
#  Recheck pass #1, total URLs followed: 472
#  Recheck pass #2, total URLs followed: 442
#  Recheck pass #3, total URLs followed: 431
#  2xx: 9908, 4xx: 257, 5xx: 99, OTHER: 75
#  >> Tue Feb 13 05:26:36 2001

#  30 sec, 7 rechecks, 0 sleep
#  >> Fri Feb  9 16:15:13 2001
#  First pass: total URLs followed: 10339.
#  Recheck pass #1, total URLs followed: 466
#  Recheck pass #2, total URLs followed: 439
#  Recheck pass #3, total URLs followed: 439
#  Recheck pass #4, total URLs followed: 439
#  Recheck pass #5, total URLs followed: 439
#  Recheck pass #6, total URLs followed: 438
#  Recheck pass #7, total URLs followed: 438
#  2xx: 9904, 4xx: 258, 5xx: 102, OTHER: 75
#  >> Sat Feb 10 21:09:51 2001

#  30 sec, 7 rechecks, 0 sleep (second time)
#  >> Sun Feb 11 14:20:51 2001
#  First pass: total URLs followed: 10339.
#  Recheck pass #1, total URLs followed: 472
#  Recheck pass #2, total URLs followed: 446
#  Recheck pass #3, total URLs followed: 443
#  Recheck pass #4, total URLs followed: 442
#  Recheck pass #5, total URLs followed: 441
#  Recheck pass #6, total URLs followed: 439
#  Recheck pass #7, total URLs followed: 439
#  2xx: 9902, 4xx: 257, 5xx: 105, OTHER: 75
#  >> Mon Feb 12 10:58:46 2001