HarvestMan

Legacy Wiki Page

This page was migrated from the old MoinMoin-based wiki. Information may be outdated or no longer applicable. For current documentation, see python.org.

Description

A WWW crawler (robot) program written in Python.

Information

“Freecode Project Page”

“HarvestMan Home Page” (link gone)

Version: 1.4 (2005-05-27)

Licence: GNU GPL

Python versions: 2.2, 2.3, 2.4

Platforms: Any platform supported by Python

Binaries: None

How it spins its web

  • HarvestMan uses Python threads to download websites from the Internet quickly while remaining highly customizable. It can also be used to download files from intranet servers. It is the first multithreaded, open-source web crawler written in Python.
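The threaded download model described above can be illustrated with a minimal sketch. This is not HarvestMan's own code: the function name `crawl_concurrently`, its parameters, and the injected `fetch` callable are all hypothetical, and the sketch assumes modern Python 3 rather than the 2.x versions listed on this page.

```python
import queue
import threading


def crawl_concurrently(urls, fetch, num_threads=4):
    """Fetch every URL in `urls` using `num_threads` worker threads.

    `fetch` is any callable that takes a URL and returns its content.
    Returns a dict mapping each URL to its content, or to the exception
    that `fetch` raised for it.
    (Hypothetical helper for illustration; not part of HarvestMan.)
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()  # guards writes to the shared results dict

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, so this thread exits
            try:
                outcome = fetch(url)
            except Exception as exc:  # record the failure, keep crawling
                outcome = exc
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

To download real pages, `fetch` could be a small function that wraps `urllib.request.urlopen(url, timeout=10).read()`. Passing the fetcher in as an argument keeps the threading logic testable without network access.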

Features

  • Fully multithreaded

  • Number of threads configurable by user

  • Support for robots exclusion protocol

  • Filtering of URLs using regular expressions

  • Filtering of server names using regular expressions

  • Control download by specifying depth of fetching

  • Limit downloads by number of files

  • Specify timeout for individual threads

  • Control download speed by changing thread/depth options

  • HTTP, FTP, and HTTPS support, including servers on a LAN

  • XML project files which can be re-read

  • Smart reconnection

  • Support for proxies/firewalls

  • File limits, server limits

  • Projects browser page

  • Command line/config file support

  • Use as a program or as a web-spider module

  • Object-oriented architecture
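Several of the features listed above (robots exclusion, regular-expression URL filtering, depth control, and a file limit) can be sketched together in a few lines. This is an illustration only, assuming modern Python 3 and its standard `urllib.robotparser` module; the names `make_robots`, `crawl`, `get_links`, and `ExampleBot` are hypothetical and do not reflect HarvestMan's actual API or configuration options.

```python
import re
from collections import deque
from urllib.robotparser import RobotFileParser


def make_robots(lines):
    """Build a robots.txt checker from an iterable of robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def crawl(start_url, get_links, robots, url_filter=None,
          max_depth=2, max_files=100, agent="ExampleBot"):
    """Breadth-first crawl that returns the list of URLs it would fetch.

    `get_links` is a callable returning the links found at a URL.
    `url_filter` is an optional regular expression; matching URLs are skipped.
    (Hypothetical helper for illustration; not part of HarvestMan.)
    """
    exclude = re.compile(url_filter) if url_filter else None
    seen = {start_url}
    frontier = deque([(start_url, 0)])
    fetched = []
    while frontier and len(fetched) < max_files:  # file limit
        url, depth = frontier.popleft()
        if exclude and exclude.search(url):       # regex URL filter
            continue
        if not robots.can_fetch(agent, url):      # robots exclusion
            continue
        fetched.append(url)
        if depth == max_depth:                    # depth limit: do not expand
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return fetched
```

A breadth-first queue is used here so that the depth limit maps directly onto link distance from the start URL. In a real crawler, `get_links` would download the page and extract its links, and the robots.txt lines would come from the target server.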

Who should use it

  • HarvestMan is written for the desktop user. It can also be used as an Internet spidering module. An API for external users is being written.

Taxonomy

  • Species: HarvestMan; Genus: (Internet) Spiders

Developers

  • Anand B Pillai