This appendix contains basic information about 36 Internet robots.
Maintained by Jonathan Fletcher, e-mail: j.fletcher@stirling.ac.uk.
JumpStation's goal is to generate a Resource Discovery database. The HTTP User-agent field is set to JumpStation-Robot, and the From field is also set. Usually run from *.stir.ac.uk. The Proposed Standard for Robot Exclusion is supported.
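As a concrete illustration of these two identification fields (the URL and addresses below are placeholders, not JumpStation's actual values), a robot written in Python might set them like this:

    import urllib.request

    # Identify the robot to the server: a User-agent naming the robot and
    # a From field giving a contact address. Both values are placeholders.
    request = urllib.request.Request(
        "http://www.example.com/index.html",
        headers={
            "User-agent": "ExampleRobot/1.0",
            "From": "operator@example.com",
        },
    )
    with urllib.request.urlopen(request) as response:
        page = response.read()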
Maintained by David Eichmann, e-mail: eichmann@rbse.jsc.nasa.gov.
RBSE Spider's purpose is to generate a Resource Discovery database and generate statistics. The HTTP User-agent field is set to RBSE Spider v. 1.0, and the From field is also set. Usually run from rbse.jsc.nasa.gov (192.88.42.10). The Proposed Standard for Robot Exclusion is supported.
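The Proposed Standard for Robot Exclusion, referenced throughout this appendix, works by convention: a server publishes a /robots.txt file listing paths that robots should not visit, and a compliant robot checks that file before fetching. A minimal sketch in Python, against a placeholder host:

    # A typical /robots.txt record, excluding all robots from one subtree:
    #
    #   User-agent: *
    #   Disallow: /cgi-bin/
    #
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")  # placeholder host
    rp.read()
    if rp.can_fetch("ExampleRobot", "http://www.example.com/cgi-bin/foo"):
        print("allowed to fetch")
    else:
        print("excluded by robots.txt")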
Run by Brian Pinkerton, e-mail: bp@biotech.washington.edu.
Runs from webcrawler.cs.washington.edu and uses WebCrawler/0.00000001 in the HTTP User-agent field.
Run by Fred Barrie, e-mail: barrie@unr.edu and Billy Barron.
Recent runs will concentrate on textual analysis of the Web versus GopherSpace (from the Veronica data), as well as indexing. Run from frognot.utdallas.edu, possibly other sites in utdallas.edu, and from cnidir.org. It now sets the HTTP From field and sets the User-agent to NorthStar.
Run by Matthew Gray mkgray@mit.edu.
Run initially in June 1993, W4's aim is to measure the growth of the Web. The HTTP User-agent field is set to WWWWanderer v3.0 by Matthew Gray. The Proposed Standard for Robot Exclusion is not supported.
Run by James E. Pitkow pitkow@aries.colorado.edu.
html_analyzer-0.02's aim is to check the validity of Web servers.
Maintained by Roy T. Fielding, fielding@ics.uci.edu.
MOMspider's purpose is to validate links and generate statistics. The HTTP User-agent field is set to MOMspider/1.00 libwww-perl/0.40, and the From field is also set. It can be run from anywhere. The Proposed Standard for Robot Exclusion is supported.
Maintained by Andreas Ley, e-mail: ley@rz.uni-karlsruhe.de.
This is a mirroring robot, configured to stay within a directory and sleep between requests. The next version will use HEAD to check if the entire document needs to be retrieved. The HTTP User-agent is set to HTMLgobble v2.2, and it sets the From field.
This is usually run by the author, from tp70.rz.uni-karlsruhe.de.
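The HEAD technique mentioned above can be sketched as follows: ask the server for headers only, compare its Last-Modified date against the local copy, and issue a full GET only when the document has changed. The URL and filename are illustrative; this is not HTMLgobble's actual code.

    import email.utils
    import os
    import urllib.request

    URL = "http://www.example.com/doc.html"   # placeholder document
    LOCAL = "mirror/doc.html"                 # placeholder local copy

    # A HEAD request returns only the headers, so this check is cheap.
    head = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(head) as response:
        last_modified = response.headers.get("Last-Modified")

    stale = True
    if last_modified and os.path.exists(LOCAL):
        remote = email.utils.parsedate_to_datetime(last_modified).timestamp()
        stale = remote > os.path.getmtime(LOCAL)

    if stale:
        os.makedirs("mirror", exist_ok=True)
        urllib.request.urlretrieve(URL, LOCAL)  # fetch the whole document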
Maintained by Oliver McBryan, e-mail: mcbryan@piper.cs.colorado.edu.
Run from piper.cs.colorado.edu.
Maintained by Christophe Tronche tronche@lri.fr.
W3M2's purpose is to generate a Resource Discovery database, validate links, validate HTML, and generate statistics. The HTTP User-agent field is set to W3M2/x.xxx, and the From field is also set. This is usually run from any host in the lri.fr domain. The Proposed Standard for Robot Exclusion is supported.
Maintained by Charlie Stross charless@sco.com.
Websnarf is a WWW mirror designed for off-line browsing of sections of the Web. It is run from ruddles.london.sco.com.
Run by Lee McLoughlin at L.McLoughlin@doc.ic.ac.uk.
It was first spotted in mid-February 1994 and is run from phoenix.doc.ic.ac.uk.
Owned by Dr. Michael L. Mauldin, fuzzy@cmu.edu at Carnegie Mellon University.
Lycos is a research program providing information retrieval and discovery in the WWW, using a finite memory model of the Web to guide intelligent, directed searches for specific information needs. The HTTP User-agent field is set to Lycos/x.x. This is run from fuzine.mt.cs.cmu.edu. The Proposed Standard for Robot Exclusion is supported.
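This appendix does not detail Lycos's algorithm, but the general shape of a directed search with finite memory can be sketched: keep a frontier of unvisited URLs ordered by an estimated relevance score, and cap the frontier's size so the model of the Web stays bounded. Here score and fetch_links are hypothetical callables supplied by the crawler, and FRONTIER_LIMIT is an assumed bound, not a Lycos parameter.

    import heapq

    FRONTIER_LIMIT = 1000  # assumed bound for illustration

    def directed_crawl(start, score, fetch_links):
        """Best-first crawl whose frontier never exceeds FRONTIER_LIMIT."""
        seen = {start}
        frontier = [(-score(start), start)]  # max-heap via negated scores
        while frontier:
            _, url = heapq.heappop(frontier)
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
            if len(frontier) > FRONTIER_LIMIT:
                # Finite memory: keep only the most promising entries.
                # (The sorted result of nsmallest is still a valid heap.)
                frontier = heapq.nsmallest(FRONTIER_LIMIT, frontier)
        return seen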
Written and run by Fred Johansen fred@nvg.unit.no.
Currently under construction, this spider is a CGI script that searches the Web for keywords given by the user through a form. The HTTP User-agent is set to ASpider/0.09, with a From field. Contact: fredj@nova.pvv.unit.no.
Introduced by Peter Beebee: ptbb@ai.mit.edu, beebee@parc.xerox.com.
This has run since June 27, 1994, for an internal XEROX research project. The HTTP User-agent is set to SG-Scout, with a From field set to the operator. The Proposed Standard for Robot Exclusion is supported. This is run from beta.xerox.com.
Written by Jim McGuire mcguire@eit.com.
Announced on July 12, 1994, this is a combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it from running with no limits or going off-site). From version 0.2 up, the User-agent is
set to EIT-Link-Verifier-Robot/0.2. This can be run by anyone from anywhere.
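In outline, a verifier of this kind needs very little machinery: collect the anchors on the starting page, discard off-site links, and probe each remaining URL. The sketch below (placeholder starting URL, one level deep only) shows the idea; the real script adds recursion and limit controls beyond what is shown here.

    import urllib.error
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    START = "http://www.example.com/"  # placeholder starting point

    class AnchorCollector(HTMLParser):
        """Gather the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    with urllib.request.urlopen(START) as response:
        collector = AnchorCollector()
        collector.feed(response.read().decode("utf-8", "replace"))

    host = urllib.parse.urlparse(START).netloc
    for link in collector.links:
        url = urllib.parse.urljoin(START, link)
        if urllib.parse.urlparse(url).netloc != host:
            continue  # the "don't go off-site" control
        try:
            urllib.request.urlopen(urllib.request.Request(url, method="HEAD"))
            print("ok ", url)
        except urllib.error.URLError as err:
            print("BAD", url, err)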
Maintained by Robert Olson at olson@mcs.anl.gov.
Web Forager's purpose is to generate a Resource Discovery database. The HTTP User-agent field is set to NHSEWalker/3.0, and the From field is also set. This is usually run from *.mcs.anl.gov. The Proposed Standard for Robot Exclusion is supported.
Written and run by James Casey at jcasey@maths.tcd.ie.
WebLinker is a tool that traverses a section of the Web, doing URN to URL conversion. It will be used as a post-processing tool on documents created by automatic converters such as LaTeX2HTML or WebMaker. The HTTP User-agent is set to WebLinker/0.0 libwww-perl/0.1.
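As a rough sketch of that post-processing step (the URN namespace and lookup table here are invented for illustration; a real resolver would consult a name service rather than a hard-coded dict), URN-to-URL conversion amounts to rewriting each urn: reference in a document:

    import re

    # Hypothetical URN-to-URL table for illustration only.
    URN_TABLE = {
        "urn:example:report-42": "http://www.example.com/reports/42.html",
    }

    def resolve_urns(html):
        """Replace each known urn: reference in an href with its URL."""
        def swap(match):
            urn = match.group(1)
            return 'href="%s"' % URN_TABLE.get(urn, urn)
        return re.sub(r'href="(urn:[^"]+)"', swap, html)

    print(resolve_urns('<a href="urn:example:report-42">the report</a>'))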
Maintained by William M. Perry at wmperry@spry.com.
Emacs-w3 Search Engine's purpose is to generate a Resource Discovery database. The HTTP User-agent field is set to Emacs-w3/v[0-9\.]+, and the From field is also set. This is usually run from a variety of machines. The Proposed Standard for Robot
Exclusion is not supported.
Run by Vince Taluskie at taluskie@utpapa.ph.utexas.edu.
Arachnophilia's purpose is to collect approximately 10,000 HTML documents for testing automatic abstract generation. This program will honor the robot exclusion standard and wait one minute between requests to a given server. The HTTP User-agent field is set to Arachnophilia. This is run from halsoft.com.
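A per-server courtesy delay of this sort is simple to implement: remember when each host was last contacted and sleep off the remainder before the next request. A minimal sketch (the fetch loop around it is assumed):

    import time
    import urllib.parse

    DELAY = 60.0       # seconds between requests to any one server
    last_request = {}  # host -> time of the most recent request

    def polite_wait(url):
        """Block until at least DELAY seconds have passed for url's host."""
        host = urllib.parse.urlparse(url).netloc
        elapsed = time.time() - last_request.get(host, 0.0)
        if elapsed < DELAY:
            time.sleep(DELAY - elapsed)
        last_request[host] = time.time()

    # A crawler would call polite_wait(url) immediately before each fetch.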
Written by Sebastien Lemieux at lemieuse@ERE.UMontreal.CA.
Mac WWWWorm is a French keyword-searching robot for the Mac, written in HyperCard. No other information currently available.
Maintained by Justin Yunke at yunke@umich.edu.
This is a URL-checking robot that stays within one step of the local server.
Run by Olaf Schreck at chakl@fu-berlin.de.
This is a mirroring robot. It sets the User-agent field to tarspider along with its version, and the From field to chakl@fu-berlin.de.
This is run by Jim Richardson, jimr@maths.su.oz.au.
This robot, written in Perl V4, commenced operation in August 1994 and is being used to generate an index called MathSearch of documents on Web sites connected with mathematics and statistics. It ignores offsite links, so it does not stray from a list
of servers specified initially. The HTTP User-agent field is set to Peregrinator-Mathematics/0.7. Peregrinator also sets the From field. The Proposed Standard for Robot Exclusion is supported.
Maintained by Hans de Graaff at j.j.degraaff@twi.tudelft.nl.
Checkbot's purpose is to validate links. The HTTP User-agent field is set to checkbot.pl-x.xx, and the From field is also set. This is usually run from dutifp.twi.tudelft.nl. The Proposed Standard for Robot Exclusion is not supported.
Maintained by Rich Testardi at rpt@fc.hp.com.
Its purpose is to generate a Resource Discovery database, validate links, validate HTML, perform mirroring, copy document trees, and generate statistics. The HTTP User-agent field is set to webwalk, and the From field is also set. The Proposed Standard
for Robot Exclusion is supported.
Run by Darren Hardy at hardy@bruno.cs.colorado.edu.
Harvest is a Resource Discovery Robot, part of the Harvest Project. It runs from bruno.cs.colorado.edu and sets User-agent and From fields.
Maintained by Michael Newbery at Michael.Newbery@vuw.ac.nz.
The HTTP User-agent field is set to Katipo/1.0, and the From field is also set. The Proposed Standard for Robot Exclusion is not supported.
Maintained by Steve Kirsch at stk@infoseek.com.
InfoSeek's purpose is to generate a Resource Discovery database. The HTTP User-agent field is set to InfoSeek Robot 1.0, and the From field is also set. This is usually run from corp-gw.infoseek.com. The Proposed
Standard for Robot Exclusion is supported.
Maintained by James Burton at burton@cs.latrobe.edu.au.
GetURL's purpose is to validate links, perform mirroring, and copy document trees. The HTTP User-agent field is set to GetURL.rexx v1.05 by burton@cs.latrobe.edu.au, and the From field is not set. The Proposed Standard for Robot Exclusion is not
supported.
Run by Tim Bray at tbray@opentext.com.
This sets User-agent to OMW/0.1 libwww/217. The Proposed Standard for Robot Exclusion is supported.
Implemented by Scott Spetka at scott@cs.sunyit.edu.
TkWWW is designed to search Web neighborhoods to find pages that might be logically related. The robot returns a list of links that looks like a hot list. The search can be by keyword, or it can return all links within one or two hops.
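A neighborhood search of that shape is a breadth-first walk cut off at a fixed number of hops. The sketch below (placeholder start page, no keyword scoring) collects every link within two hops of the starting point:

    import urllib.parse
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser

    class AnchorCollector(HTMLParser):
        """Gather the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def neighborhood(start, max_hops=2):
        """Return every URL reachable from start in at most max_hops links."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            url, hops = queue.popleft()
            if hops == max_hops:
                continue  # within range, but don't expand further
            try:
                with urllib.request.urlopen(url) as response:
                    collector = AnchorCollector()
                    collector.feed(response.read().decode("utf-8", "replace"))
            except (OSError, ValueError):
                continue
            for link in collector.links:
                child = urllib.parse.urljoin(url, link)
                if child.startswith("http") and child not in seen:
                    seen.add(child)
                    queue.append((child, hops + 1))
        return seen

    # neighborhood("http://www.example.com/") -> the two-hop "hot list"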
Maintained by Laurent Demailly at dl@hplyot.obspm.fr.
Tcl W3's purpose is to validate links and generate statistics. The HTTP User-agent field is set to dlw3robot/x.y, and the From field is also set. This is usually run from hplyot.obspm.fr. The Proposed Standard for Robot Exclusion is supported.
Maintained by Yoshihiko Hayashi at hayashi@nttnly.isl.ntt.jp.
Titan's purpose is to generate a Resource Discovery database and copy document trees. The primary goal is to develop an advanced method for indexing WWW documents. The HTTP User-agent field is set to TITAN/0.1, and the From field is also set. This is usually run from nttnly.isl.ntt.jp. The Proposed Standard for Robot Exclusion is supported.
Maintained by Budi Yuwono at yuwono-b@cs.ust.hk.
CS-HKUST's purpose is to generate a Resource Discovery database and validate HTML. The HTTP User-agent field is set to CS-HKUST-IndexServer/1.0, and the From field is also set. This is usually run from dbx.cs.ust.hk. The Proposed Standard for Robot
Exclusion is supported.
Maintained by Spry at info@spry.com.
WizRobot's purpose is to generate a Resource Discovery database. Neither User-agent nor From HTTP fields are set. This is usually run from tiger.spry.com. The Proposed Standard for Robot Exclusion is not supported.