LINKCHECKER

Section: LinkChecker commandline usage (1)
Updated: 2010-07-01
 

NAME

linkchecker - command line client to check HTML documents and websites for broken links  

SYNOPSIS

linkchecker [options] [file-or-url]...  

DESCRIPTION

LinkChecker features

recursive and multithreaded checking,
output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats,
support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links,
restriction of link checking with URL filters,
proxy support,
username/password authorization for HTTP, FTP and Telnet,
support for robots.txt exclusion protocol,
support for cookies,
support for HTML5,
HTML and CSS syntax checks,
antivirus check,
a command line, GUI and web interface
 

EXAMPLES

The most common use checks the given domain recursively, plus any URL pointing outside of the domain:
  linkchecker http://www.example.net/
Beware that this checks the whole site which can have thousands of URLs. Use the -r option to restrict the recursion depth.
Don't check mailto: URLs. All other links are checked as usual:
  linkchecker --ignore-url=^mailto: mysite.example.org
Checking a local HTML file on Unix:
  linkchecker ../bla.html
Checking a local HTML file on Windows:
  linkchecker c:\temp\test.html
You can skip the http:// URL part if the domain starts with www.:
  linkchecker www.example.com
You can skip the ftp:// URL part if the domain starts with ftp.:
  linkchecker -r0 ftp.example.org
Generate a sitemap graph and convert it with the graphviz dot utility:
  linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps  

OPTIONS

 

General options

-fFILENAME, --config=FILENAME
Use FILENAME as configuration file. By default LinkChecker uses ~/.linkchecker/linkcheckerrc.
-h, --help
Help me! Print usage information for this program.
--stdin
Read list of white-space separated URLs to check from stdin.
-tNUMBER, --threads=NUMBER
Generate no more than the given number of threads. Default number of threads is 100. To disable threading specify a non-positive number.
-V, --version
Print version and exit.
 

Output options

--check-css
Check syntax of CSS URLs with cssutils. If it's not installed, check with the W3C online validator.
--check-html
Check syntax of HTML URLs with HTML tidy. If it's not installed, check with the W3C online validator.
--complete
Log all URLs, including duplicates. Default is to log duplicate URLs only once.
-DSTRING, --debug=STRING
Print debugging output for the given logger. Available loggers are cmdline, checking, cache, gui, dns and all. Specifying all is an alias for specifying all available loggers. The option can be given multiple times to debug with more than one logger. For accurate results, threading will be disabled during debug runs.
-FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
Output to a file linkchecker-out.TYPE, $HOME/.linkchecker/blacklist for blacklist output, or FILENAME if specified. The ENCODING specifies the output encoding, the default is that of your locale. Valid encodings are listed at http://docs.python.org/library/codecs.html#standard-encodings.
For the none output type the FILENAME and ENCODING parts are ignored. For all other types, an existing file is overwritten. You can specify this option more than once. Valid file output types are text, html, sql, csv, gml, dot, xml, sitemap, none or blacklist. Default is no file output. The various output types are documented below. Note that you can suppress all console output with the option -o none.
--no-status
Do not print check status messages.
--no-warnings
Don't log warnings. Default is to log warnings.
-oTYPE[/ENCODING], --output=TYPE[/ENCODING]
Specify output type as text, html, sql, csv, gml, dot, xml, sitemap, none or blacklist. Default type is text. The various output types are documented below.
The ENCODING specifies the output encoding, the default is that of your locale. Valid encodings are listed at http://docs.python.org/library/codecs.html#standard-encodings.
-q, --quiet
Quiet operation, an alias for -o none. This is only useful with -F.
--scan-virus
Scan content of URLs for viruses with ClamAV.
--trace
Print tracing information.
-v, --verbose
Log all checked URLs. Default is to log only errors and warnings.
-WREGEX, --warning-regex=REGEX
Define a regular expression that triggers a warning when it matches any content of the checked link. This applies only to valid pages, since only their content can be retrieved.
Use this to check for pages that contain some form of error, for example "This page has moved" or "Oracle Application error".
Note that multiple values can be combined in the regular expression, for example "(This page has moved|Oracle Application error)".
See section REGULAR EXPRESSIONS for more info.
--warning-size-bytes=NUMBER
Print a warning if content size info is available and exceeds the given number of bytes.
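
The content matching described for --warning-regex can be illustrated with Python's re module, since LinkChecker accepts Python regular expressions. This is only a minimal sketch: the function name content_warning is hypothetical, and it assumes the pattern is searched anywhere in the page content.

```python
import re

# Pattern taken from the --warning-regex description above.
WARNING_PATTERN = re.compile(r"(This page has moved|Oracle Application error)")


def content_warning(content):
    """Return the matched text if the content should trigger a warning, else None.

    Hypothetical helper; assumes the pattern may match anywhere in the content.
    """
    match = WARNING_PATTERN.search(content)
    return match.group(0) if match else None


print(content_warning("<p>This page has moved to a new address.</p>"))
print(content_warning("<p>Welcome!</p>"))
```

On the command line the same pattern would be passed as -W "(This page has moved|Oracle Application error)".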
 

Checking options

-a, --anchors
Check HTTP anchor references. Default is not to check anchors. This option enables logging of the warning url-anchor-not-found.
-C, --cookies
Accept and send HTTP cookies according to RFC 2109. Only cookies which are sent back to the originating server are accepted. Sent and accepted cookies are provided as additional logging information.
--cookiefile=FILENAME
Read a file with initial cookie data. The cookie data format is explained below.
--ignore-url=REGEX
URLs matching the given regular expression will be ignored and not checked.
This option can be given multiple times.
See section REGULAR EXPRESSIONS for more info.
-NSTRING, --nntp-server=STRING
Specify an NNTP server for news: links. Default is the environment variable NNTP_SERVER. If no host is given, only the syntax of the link is checked.
--no-follow-url=REGEX
Check but do not recurse into URLs matching the given regular expression.
This option can be given multiple times.
See section REGULAR EXPRESSIONS for more info.
-p, --password
Read a password from console and use it for HTTP and FTP authorization. For FTP the default password is anonymous@. For HTTP there is no default password. See also -u.
-PNUMBER, --pause=NUMBER
Pause the given number of seconds between two subsequent connection requests to the same host. Default is no pause between requests.
-rNUMBER, --recursion-level=NUMBER
Check recursively all links up to given depth. A negative depth will enable infinite recursion. Default depth is infinite.
--timeout=NUMBER
Set the timeout for connection attempts in seconds. The default timeout is 60 seconds.
-uSTRING, --user=STRING
Try the given username for HTTP and FTP authorization. For FTP the default username is anonymous. For HTTP there is no default username. See also -p.
--user-agent=STRING
Specify the User-Agent string to send to the HTTP server, for example "Mozilla/4.0". The default is "LinkChecker/X.Y" where X.Y is the current version of LinkChecker.

 

CONFIGURATION FILES

Configuration files can specify all options above. They can also specify some options that cannot be set on the command line. See linkcheckerrc(5) for more info.

 

OUTPUT TYPES

Note that by default only errors and warnings are logged. You should use the --verbose option to get the complete URL list, especially when outputting a sitemap graph format.

text
Standard text logger, logging URLs in keyword: argument fashion.
html
Log URLs in keyword: argument fashion, formatted as HTML. Additionally has links to the referenced pages. Invalid URLs have HTML and CSS syntax check links appended.
csv
Log check result in CSV format with one URL per line.
gml
Log parent-child relations between linked URLs as a GML sitemap graph.
dot
Log parent-child relations between linked URLs as a DOT sitemap graph.
gxml
Log check result as a GraphXML sitemap graph.
xml
Log check result as machine-readable XML.
sitemap
Log check result as an XML sitemap whose protocol is documented at http://www.sitemaps.org/protocol.html.
sql
Log check result as SQL script with INSERT commands. An example script to create the initial SQL table is included as create.sql.
blacklist
Suitable for cron jobs. Logs the check result into a file ~/.linkchecker/blacklist which only contains entries with invalid URLs and the number of times they have failed.
none
Logs nothing. Suitable for debugging or checking the exit code.
 

REGULAR EXPRESSIONS

LinkChecker accepts Python regular expressions. See http://docs.python.org/howto/regex.html for an introduction.

An addition is that a leading exclamation mark negates the regular expression.  
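
The negation rule can be sketched in a few lines of Python. The helper compile_url_filter is hypothetical and not part of LinkChecker; it also assumes that patterns are applied with re.search, which this page does not state.

```python
import re


def compile_url_filter(pattern):
    """Build a predicate from a pattern; a leading '!' negates the result.

    Hypothetical helper. Assumes search semantics (match anywhere unless
    the pattern is anchored with ^).
    """
    negate = pattern.startswith("!")
    regex = re.compile(pattern[1:] if negate else pattern)

    def matches(url):
        found = regex.search(url) is not None
        # With negation, the filter applies to every URL the regex does NOT match.
        return found != negate

    return matches


is_mail = compile_url_filter(r"^mailto:")
not_example = compile_url_filter(r"!^http://www\.example\.com/")

print(is_mail("mailto:calvin@example.org"))
print(not_example("http://www.example.com/index.html"))
print(not_example("http://other.example.org/"))
```

Under these assumptions, --ignore-url='!^http://www\.example\.com/' would ignore every URL outside www.example.com.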

COOKIE FILES

A cookie file contains standard HTTP header (RFC 2616) data with the following possible names:
Scheme (optional)
Sets the scheme the cookies are valid for; default scheme is http.
Host (required)
Sets the domain the cookies are valid for.
Path (optional)
Gives the path the cookies are valid for; default path is /.
Set-cookie (optional)
Set cookie name/value. Can be given more than once.

Multiple entries are separated by a blank line. The example below will send two cookies to all URLs starting with http://example.com/hello/ and one to all URLs starting with https://example.org/:


 Host: example.com
 Path: /hello
 Set-cookie: ID="smee"
 Set-cookie: spam="egg"


 Scheme: https
 Host: example.org
 Set-cookie: baggage="elitist"; comment="hologram"
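
To make the file format concrete, the following sketch parses such a file into one dictionary per entry. It is not LinkChecker's own parser: parse_cookie_file is a hypothetical name, and treating the header names case-insensitively is an assumption borrowed from HTTP header conventions.

```python
def parse_cookie_file(text):
    """Parse cookie file text into a list of entry dictionaries.

    Hypothetical helper illustrating the format described above.
    """
    entries = []
    current = None
    # The extra empty line flushes the last entry.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if not line:
            # A blank line ends the current entry, if there is one.
            if current is not None:
                if "host" not in current:
                    raise ValueError("entry without required Host header")
                entries.append(current)
                current = None
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a header line: %r" % line)
        if current is None:
            # Defaults from the description: scheme http, path /.
            current = {"scheme": "http", "path": "/", "cookies": []}
        key = name.strip().lower()
        if key == "set-cookie":
            current["cookies"].append(value.strip())
        elif key in ("scheme", "host", "path"):
            current[key] = value.strip()
        else:
            raise ValueError("unknown header name: %r" % name)
    return entries


EXAMPLE = """\
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"


Scheme: https
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
"""

for entry in parse_cookie_file(EXAMPLE):
    print(entry)
```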

 

PROXY SUPPORT

To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or $ftp_proxy environment variables to the proxy URL. The URL should be of the form http://[user:pass@]host[:port]. LinkChecker also detects manual proxy settings of Internet Explorer under Windows systems. On a Mac use the Internet Config to select a proxy. You can also set a comma-separated domain list in the $no_proxy environment variable to ignore any proxy settings for these domains. Setting an HTTP proxy on Unix for example looks like this:


  export http_proxy="http://proxy.example.com:8080"

Proxy authentication is also supported:


  export http_proxy="http://user1:mypass@proxy.example.org:8081"

Setting a proxy on the Windows command prompt:


  set http_proxy=http://proxy.example.com:8080
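
The proxy URL form http://[user:pass@]host[:port] can be taken apart with Python's standard urllib.parse module, as in the sketch below. The helpers parse_proxy and bypass_proxy are hypothetical, and the suffix matching used for the no_proxy list is an assumption; this page does not specify how domains are compared.

```python
from urllib.parse import urlsplit


def parse_proxy(url):
    """Split a proxy URL of the form http://[user:pass@]host[:port]."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,      # None when no credentials are given
        "password": parts.password,  # None when no credentials are given
        "host": parts.hostname,
        "port": parts.port,          # None when no port is given
    }


def bypass_proxy(hostname, no_proxy):
    """Return True if hostname is covered by the comma-separated no_proxy list.

    Assumption: an entry matches the domain itself and all of its subdomains.
    """
    host = hostname.lower()
    for entry in no_proxy.split(","):
        domain = entry.strip().lower().lstrip(".")
        if domain and (host == domain or host.endswith("." + domain)):
            return True
    return False


print(parse_proxy("http://user1:mypass@proxy.example.org:8081"))
print(bypass_proxy("intranet.example.com", "localhost, example.com"))
```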

 

PERFORMED CHECKS

All URLs have to pass a preliminary syntax test. Minor quoting mistakes issue a warning; all other syntax problems are errors. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.
HTTP links (http:, https:)
After connecting to the given HTTP server the given path or query is requested. All redirections are followed, and if user/password is given it will be used as authorization when necessary. Permanently moved pages issue a warning. All final HTTP status codes other than 2xx are errors. HTML page contents are checked for recursion.
Local files (file:)
A regular, readable file that can be opened is valid. A readable directory is also valid. All other files, for example device files, unreadable or non-existing files are errors. HTML or other parseable file contents are checked for recursion.
Mail links (mailto:)
A mailto: link eventually resolves to a list of email addresses. If one address fails, the whole list will fail. For each mail address we check the following things:
  1) Check the address syntax, both of the part before and after
     the @ sign.
  2) Look up the MX DNS records. If we found no MX record,
     print an error.
  3) Check if one of the mail hosts accepts an SMTP connection.
     Check hosts with higher priority first.
     If no host accepts SMTP, we print a warning.
  4) Try to verify the address with the VRFY command. If we got
     an answer, print the verified address as an info.
FTP links (ftp:)
For FTP links we do:
  1) Connect to the specified host.
  2) Try to log in with the given user and password. The default
     user is anonymous, the default password is anonymous@.
  3) Try to change to the given directory.
  4) List the file with the NLST command.
Telnet links (telnet:)
We try to connect and, if user/password are given, log in to the given telnet server.
NNTP links (news:, snews:, nntp:)
We try to connect to the given NNTP server. If a news group or article is specified, we try to request it from the server.
Unsupported links (javascript:, etc.)
An unsupported link will only print a warning. No further checking will be made.
The complete list of recognized, but unsupported links can be found in the unknownurl.py source file. The most prominent of them should be JavaScript links.
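
As an illustration of the four FTP steps listed above, here is a minimal sketch using Python's standard ftplib module. It is not LinkChecker's implementation: the function check_ftp and its ftp_factory parameter are hypothetical, and treating the link as valid when the file name appears in the NLST result is an assumption.

```python
import ftplib
import posixpath


def check_ftp(host, path, user="anonymous", password="anonymous@",
              ftp_factory=ftplib.FTP, timeout=60):
    """Return True if the file at path appears to exist on the FTP host.

    Hypothetical helper; ftp_factory exists only so the steps can be
    exercised without a network connection.
    """
    dirname, filename = posixpath.split(path)
    ftp = ftp_factory()
    ftp.connect(host, timeout=timeout)   # 1) connect to the specified host
    try:
        ftp.login(user, password)        # 2) log in, anonymous by default
        if dirname:
            ftp.cwd(dirname)             # 3) change to the given directory
        names = ftp.nlst()               # 4) list the directory with NLST
    finally:
        ftp.close()
    # Assumption: a directory URL (no file name) is valid once steps 1-4 succeed.
    return not filename or filename in names
```

A call such as check_ftp("ftp.example.org", "/pub/file.txt") needs network access; connection and login failures surface as ftplib or socket exceptions.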

 

RECURSION

Before descending recursively into a URL, it has to fulfill several conditions. They are checked in this order:

1. A URL must be valid.

2. A URL must be parseable. This currently includes HTML files,
   Opera bookmarks files, and directories. If a file type cannot
   be determined (for example it does not have a common HTML file
   extension, and the content does not look like HTML), it is assumed
   to be non-parseable.

3. The URL content must be retrievable. This is usually the case
   except for example mailto: or unknown URL types.

4. The maximum recursion level must not be exceeded. It is configured
   with the --recursion-level option and is unlimited by default.

5. It must not match the ignored URL list. This is controlled with
   the --ignore-url option.

6. The Robots Exclusion Protocol must allow links in the URL to be
   followed recursively. This is checked by searching for a
   "nofollow" directive in the HTML header data.

Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.
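
The ordered conditions above can be sketched as a single function that reports the first condition a URL fails. All names here (UrlInfo, first_failed_condition) are hypothetical. The level comparison is also an assumption: a URL given on the command line is taken to have level 0, and a negative maximum means unlimited depth.

```python
import re
from dataclasses import dataclass


@dataclass
class UrlInfo:
    """Hypothetical record of what is known about a checked URL."""
    url: str
    valid: bool
    parseable: bool
    retrievable: bool
    level: int
    nofollow: bool = False


def first_failed_condition(info, max_level=-1, ignore_patterns=()):
    """Return the number (1-6) of the first failed condition, or 0 if all pass."""
    if not info.valid:
        return 1
    if not info.parseable:
        return 2
    if not info.retrievable:
        return 3
    # Assumption: a negative max_level means unlimited recursion.
    if max_level >= 0 and info.level >= max_level:
        return 4
    if any(re.search(p, info.url) for p in ignore_patterns):
        return 5
    if info.nofollow:
        return 6
    return 0


page = UrlInfo("http://www.example.com/", valid=True, parseable=True,
               retrievable=True, level=0)
print(first_failed_condition(page))
print(first_failed_condition(page, max_level=0))
```

Under these assumptions a URL is descended into only when the function returns 0.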

 

NOTES

URLs on the command line starting with ftp. are treated like ftp://ftp., URLs starting with www. are treated like http://www.. You can also give local files as arguments.
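
The prefix rule for command line arguments amounts to the following sketch. The function name guess_url is hypothetical, and comparing the prefixes case-sensitively is an assumption.

```python
def guess_url(arg):
    """Add the scheme that the www. and ftp. prefixes imply.

    Hypothetical helper; anything else (full URLs, local file names)
    is returned unchanged.
    """
    if arg.startswith("www."):
        return "http://" + arg
    if arg.startswith("ftp."):
        return "ftp://" + arg
    return arg


for arg in ("www.example.com", "ftp.example.org", "../bla.html"):
    print(arg, "->", guess_url(arg))
```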

If you have your system configured to automatically establish a connection to the internet (e.g. with diald), it will connect when checking links not pointing to your local host. Use the --ignore-url option to prevent this.

JavaScript links are not supported.

If your platform does not support threading, LinkChecker disables it automatically.

You can supply multiple user/password pairs in a configuration file.

When checking news: links the given NNTP host doesn't need to be the same as the host of the user browsing your pages.  

ENVIRONMENT

NNTP_SERVER - specifies default NNTP server
http_proxy - specifies default HTTP proxy server
ftp_proxy - specifies default FTP proxy server
no_proxy - comma-separated list of domains to not contact over a proxy server
LC_MESSAGES, LANG, LANGUAGE - specify output language  

RETURN VALUE

The return value is 2 when
a program error occurred.

The return value is 1 when

invalid links were found or
link warnings were found and warnings are enabled

Else the return value is zero.  
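
A script that wraps LinkChecker can branch on these return values. The sketch below uses Python's standard subprocess module; describe_exit_code and run_linkchecker are hypothetical helper names, and run_linkchecker assumes the linkchecker executable is on the PATH.

```python
import subprocess


def describe_exit_code(code):
    """Map a LinkChecker return value to a short description."""
    if code == 0:
        return "no invalid links found"
    if code == 1:
        return "invalid links found, or warnings found while warnings are enabled"
    if code == 2:
        return "program error"
    return "unexpected return value %d" % code


def run_linkchecker(args):
    """Run linkchecker with the given argument list and describe the result.

    Assumes the linkchecker executable is installed and on the PATH.
    """
    result = subprocess.run(["linkchecker", *args])
    return describe_exit_code(result.returncode)


print(describe_exit_code(1))
```

For example, run_linkchecker(["-r1", "--no-warnings", "http://www.example.net/"]) would report return value 1 only for invalid links, since warnings are disabled.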

LIMITATIONS

LinkChecker consumes memory for each queued URL to check. With thousands of queued URLs the amount of consumed memory can become quite large. This might slow down the program or even the whole system.  

FILES

~/.linkchecker/linkcheckerrc - default configuration file
~/.linkchecker/blacklist - default blacklist logger output filename
linkchecker-out.TYPE - default logger file output name
http://docs.python.org/library/codecs.html#standard-encodings - valid output encodings
http://docs.python.org/howto/regex.html - regular expression documentation
http://linkchecker.git.sf.net/git/gitweb.cgi?p=linkchecker/linkchecker;a=blob;f=linkcheck/checker/unknownurl.py;hb=HEAD - the unknownurl.py source file

 

SEE ALSO

linkcheckerrc(5)  

AUTHOR

Bastian Kleineidam <bastian.kleineidam@web.de>  

COPYRIGHT

Copyright © 2000-2013 Bastian Kleineidam


 
