HarvestMan
Legacy Wiki Page
This page was migrated from the old MoinMoin-based wiki. Information may be outdated or no longer applicable. For current documentation, see python.org.
Description
A WWW crawler (robot) program written in Python.
Information
Home page: the “HarvestMan Home Page” link is no longer available
Version: 1.4 (2005-05-27)
Licence: GNU GPL
Python versions: 2.2, 2.3, 2.4
Platforms: any platform supported by Python
Binaries: none
How it spins its web
HarvestMan uses a multithreaded model, built on Python threads, to download web sites quickly while remaining highly customizable. It can also be used to download files from intranet servers. It is the first multithreaded, open-source web crawler written in Python.
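As a rough illustration of the threading model described above, here is a minimal sketch of a worker-pool crawler. It is not HarvestMan's code: the names `crawl`, `fetch` and `num_threads` are hypothetical, the sketch targets the Python 3 standard library rather than the Python 2.2 to 2.4 versions listed on this page, and `fetch` is a caller-supplied stand-in for the real download-and-parse step.

```python
import queue
import threading


def crawl(start_urls, fetch, num_threads=4):
    """Visit every URL reachable from start_urls using a pool of worker threads.

    `fetch(url)` is a hypothetical caller-supplied function that returns the
    list of links found at `url`. Returns the set of URLs seen.
    """
    tasks = queue.Queue()
    seen = set(start_urls)
    seen_lock = threading.Lock()      # guards `seen` across workers

    for url in start_urls:
        tasks.put(url)

    def worker():
        while True:
            url = tasks.get()
            if url is None:           # sentinel: no more work
                tasks.task_done()
                return
            try:
                for link in fetch(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    tasks.put(link)   # queued before task_done(), so join() waits for it
            except Exception:
                pass                  # a real crawler would log and possibly retry
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()                      # block until every queued URL is processed
    for _ in threads:
        tasks.put(None)               # ask each worker to exit
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    # Tiny in-memory "web" so the sketch runs without network access.
    site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": ["/"]}
    print(sorted(crawl(["/"], lambda url: site.get(url, []))))
    # prints ['/', '/a', '/b', '/c']
```

The number of threads is a plain parameter here, which mirrors the idea of a user-configurable thread count; a shared queue keeps the workers busy without any of them needing to know about the others.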
Features
Fully multithreaded
Number of threads configurable by user
Support for robots exclusion protocol
Filtering of URLs using regular expressions
Filtering of server names using regular expressions
Control of downloads by specifying the depth of fetching
Limit on the number of files downloaded
Configurable timeout for individual threads
Control of download speed by changing thread/depth options
HTTP/FTP/HTTPS support, plus support for servers on a LAN
XML project files which can be re-read
Smart reconnection
Support for proxies/firewalls
File limits, server limits
Projects browser page
Command line/config file support
Use as a program or as a web-spider module
OO architecture
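Several of the features above (robots exclusion, regular-expression filters for URLs and server names, depth limits, file limits) amount to a yes/no decision per URL. The following is a minimal sketch of how such checks could be combined, assuming Python 3 and its standard library. The class `CrawlRules` and all of its parameters are hypothetical and are not HarvestMan's API or configuration format. For brevity it uses a single robots.txt rule set, whereas a real crawler would keep one per server.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlRules:
    """Hypothetical bundle of crawl limits (illustrative only)."""

    def __init__(self, robots_lines, url_exclude=(), server_exclude=(),
                 max_depth=2, max_files=100, user_agent="example-bot"):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)   # parse rules without a network fetch
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        self.max_depth = max_depth
        self.max_files = max_files
        self.user_agent = user_agent
        self.files_fetched = 0

    def allows(self, url, depth):
        """Return True if `url`, found `depth` links from the start, may be fetched."""
        if depth > self.max_depth or self.files_fetched >= self.max_files:
            return False
        host = urlparse(url).hostname or ""
        if any(p.search(host) for p in self.server_exclude):
            return False
        if any(p.search(url) for p in self.url_exclude):
            return False
        return self.robots.can_fetch(self.user_agent, url)

    def record_fetch(self):
        """Count one downloaded file against the file limit."""
        self.files_fetched += 1


rules = CrawlRules(
    robots_lines=["User-agent: *", "Disallow: /private/"],
    url_exclude=[r"\.(jpg|png|gif)$"],
    server_exclude=[r"^ads\."],
    max_depth=2,
)
print(rules.allows("http://example.com/index.html", depth=1))      # True
print(rules.allows("http://example.com/private/x.html", depth=1))  # False (robots.txt)
print(rules.allows("http://example.com/logo.png", depth=1))        # False (URL filter)
print(rules.allows("http://ads.example.com/index.html", depth=1))  # False (server filter)
print(rules.allows("http://example.com/index.html", depth=3))      # False (too deep)
```

Checking the cheap limits (depth, file count) before the regular expressions and the robots rules keeps the common rejection paths fast.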
Who should use it
HarvestMan is written for the desktop user. It can also be used as an Internet spidering module. An API for external users is being written.
Taxonomy
Species: HarvestMan
Genus: (Internet) Spiders
Developers
Anand B Pillai