PickyBot web crawler
Hi
PickyBot is (will hopefully be) the searchbot/crawler/spider of the site "pickysear.ch".
The site doesn't exist yet as I'm still creating it.
If you accessed this page then it's probably because you noticed (e.g. when analyzing your logs) that your website was accessed by PickyBot.
Example of an Apache webserver log line generated by PickyBot accessing a site:
12.12.12.12 - - [11/Sep/2022:16:59:17 +0200] www.mydomain.com "GET /some_url.html HTTP/1.1" 200 12345 "PickyBot/0.4.0 (http://www.pickysear.ch/pickybot.html)"
How to get rid of PickyBot?
To cut it short, if for any reason you want PickyBot to stop accessing your website then please add the following to your "robots.txt":
User-agent: PickyBot
Disallow: /
PickyBot's behaviour
In general the bot/crawler is supposed to comply with the directives found in your "robots.txt", as long as they can be understood by Rust's "robotparser-rs" parser and the bot's code works as intended.
Additional information:
- Directives
  - "Dis/Allow"
    - Disallowed URLs will not be scanned.
  - "Crawl-delay"
    - The bot will comply as long as the value lies between 0 and 60 (seconds).
    - A value higher than 60 will be reduced to 60.
    - A default of 10 will be used if the value is not specified (or if "robots.txt" doesn't exist, couldn't be downloaded, etc.); see the sketch after this list.
- File "robots.txt"
  - A copy of the site's "robots.txt" will be cached for 24 hours.
    I will therefore refrain from downloading it more than once per day; accordingly, changes will be taken into consideration at the latest after 24 hours.
  - If your "robots.txt" is bigger than 3MiB it will be truncated and, if possible, only the information present in the first 3MiB will be taken into consideration.
  - The bot has a download timeout of 30 seconds (including DNS resolution); after that the download is aborted and the default values are used.
- Other
  - The bot focuses on FQDNs, therefore "Crawl-delay" won't be applied across subdomains.
    This means that if a domain hosts "site1.mydomain.com" and "site2.mydomain.com", the bot might crawl them in parallel (respecting their individual "Crawl-delay", as in the case of unrelated websites).
  - The current default maximum number of pages/files/documents that the bot will crawl per FQDN is 10000.
    This limit will be higher for "interesting" websites (e.g. Wikipedia).
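To make the "Crawl-delay" handling described above a bit more concrete, here is a minimal Rust sketch (not the bot's actual code; the constant and function names are purely illustrative):

use std::time::Duration;

// Illustrative constants matching the behaviour described above.
const DEFAULT_CRAWL_DELAY_SECS: u64 = 10; // used when "Crawl-delay" is missing or "robots.txt" is unavailable
const MAX_CRAWL_DELAY_SECS: u64 = 60; // values above 60 are reduced to 60

// Turns an optional parsed "Crawl-delay" value into the delay actually used between requests.
fn effective_crawl_delay(parsed: Option<u64>) -> Duration {
    let secs = match parsed {
        None => DEFAULT_CRAWL_DELAY_SECS,
        Some(s) if s > MAX_CRAWL_DELAY_SECS => MAX_CRAWL_DELAY_SECS,
        Some(s) => s, // values from 0 to 60 are honoured as-is
    };
    Duration::from_secs(secs)
}

fn main() {
    assert_eq!(effective_crawl_delay(None), Duration::from_secs(10));
    assert_eq!(effective_crawl_delay(Some(5)), Duration::from_secs(5));
    assert_eq!(effective_crawl_delay(Some(300)), Duration::from_secs(60));
}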
Info about PickyBot and its (potential) future website
I'm trying to build something similar to one of the usual public search engines, but which would focus on the precision of the search and its results.
I'm experimenting: e.g. right now I'm also indexing pure binary data (e.g. pics, videos, compressed files, ...), therefore you might notice that the bot won't ignore such files (the current download limit is ~4MiB).
In 2019/2020 I created an initial proof-of-concept using a few million pages downloaded from "commoncrawl.org"; as the results were interesting, it's now time to see what happens when I try to index real/live data.
The future site will not try to keep users away from your website, quite the opposite: it will just show the best exact matches for a search (like a classical/simple search engine) together with a link to the original content hosted on your site.
I therefore don't intend to "steal" any of your content or users, the motivations being:
- Technical:
already an "exact index" needs a huge amount of resources => I have to keep the ratio of "indexed data vs. storage used" as low as possible to keep costs as low as possible => cannot integrate fancy stuff (e.g. a locally stored page clone) without increasing those costs.
- Ethical:
it's not great to directly serve 3rd-party content through an intermediate website. Personally, I love seeing users hit my own website for stuff that I wrote.
- Balance:
it's a "win-win-win"-situation for everybody: for who offers the content, for who indexes it, and for who searches (and hopefully finds) it.
Contact/Complaints
If PickyBot doesn't impact your limits/resources then please don't ban it by default 😛
If you have complaints/questions/requests/remarks/want_to_say_hello/anything you can contact me by submitting this form (no spam please - I'm not interested in anything).
Log
- 13.Aug.2023: changed selection criteria for FQDNs to be crawled (0.6.1)
- Crawling priority is given to FQDNs that appear in URLs which previously returned text written in the least-indexed languages (at the time when the FQDNs are selected to be crawled); see the sketch at the end of this entry.
Reason:
By using a pure FIFO approach (the oldest URL extracted from any document was also the first one to be crawled) I ended up with an extremely skewed index containing docs mostly written in English.
(currently English is in 1st place with 15M URLs crawled & indexed successfully, 2nd place German with 0.8M, French 0.6M, Spanish 0.5M, Japanese 0.4M, ..., in the absolute last place "Northern Frisian" with 1 doc 😀)
I am aware that this new approach might break (there might be very few FQDNs that host docs written e.g. in "Latin", therefore that particular language might make the crawler loop on them indefinitely) => I'll try to implement something more sophisticated in the future; for the time being this seems to be good enough (indirectly also because of some other FQDN selection settings).
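As a rough illustration of this selection criterion (this is not the bot's actual code; the data structures and names are assumptions made for this example), candidate FQDNs could simply be ordered by how few documents are already indexed for the language in which they were discovered:

use std::collections::HashMap;

// A candidate FQDN together with the language of the document(s) in which its URLs were found.
struct Candidate {
    fqdn: String,
    language: String,
}

// Orders candidates so that FQDNs associated with the least-indexed languages come first.
fn prioritize(mut candidates: Vec<Candidate>, indexed_docs_per_language: &HashMap<String, u64>) -> Vec<Candidate> {
    candidates.sort_by_key(|c| indexed_docs_per_language.get(&c.language).copied().unwrap_or(0));
    candidates
}

fn main() {
    let mut counts = HashMap::new();
    counts.insert("English".to_string(), 15_000_000u64);
    counts.insert("German".to_string(), 800_000u64);
    let candidates = vec![
        Candidate { fqdn: "site1.example.com".to_string(), language: "English".to_string() },
        Candidate { fqdn: "site2.example.com".to_string(), language: "German".to_string() },
    ];
    for c in prioritize(candidates, &counts) {
        println!("{} ({})", c.fqdn, c.language); // the German-language FQDN is printed first
    }
}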
- 01.Apr-21.May.2023: changed IP address(es) from which the bot's requests originate (0.6.0)
- My dedicated server provider told me that they weren't happy about the many requests issued by my bot/crawler (they were fair & kind & I can understand them, we're still on very good terms).
I therefore had to stop the crawls originating from their server.
I considered and then tested multiple alternatives (many pros & cons; it turned out not to be that simple to get a public IP for crawling purposes).
The result is that I'm now using 50 dedicated proxies (each having its own dedicated IP address) provided by IPRoyal.
So far things seem to work; I hope that this does not cause problems for you (in theory my proxies are dedicated, therefore they should hopefully not be used by bad actors to perform illegal/unethical activities).
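For reference, routing a request through one of those dedicated proxies with "reqwest" can look roughly like this (a sketch only: the proxy URL and credentials are placeholders, and "reqwest"'s "blocking" feature is assumed to be enabled):

use reqwest::blocking::Client;
use reqwest::Proxy;

fn main() -> Result<(), reqwest::Error> {
    // Placeholder proxy endpoint; the real crawler would pick one of its 50 dedicated proxies.
    let proxy = Proxy::all("http://user:password@proxy.example.com:12323")?;

    let client = Client::builder()
        .user_agent("PickyBot/0.6.0 (http://www.pickysear.ch/pickybot.html)")
        .proxy(proxy)
        .build()?;

    let response = client.get("https://www.example.com/robots.txt").send()?;
    println!("status: {}", response.status());
    Ok(())
}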
- 30.Jan.2023: new feature (0.6.0)
- The bot now sets the "Referer" request header to the URL of the document from which the requested URL was first extracted.
This is relevant only for the "first contact", i.e. when the contents of a URL are downloaded for the first time. Later, when the bot checks whether a URL's contents have changed (most probably by querying the "etag" value), this info might not be provided, as it might no longer be accurate (the contents of the document from which the URL was originally extracted/discovered might themselves have changed).
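A minimal sketch of how such a header can be set with "reqwest" (illustrative only, not the bot's actual code; the URLs are placeholders and the "blocking" feature is assumed):

use reqwest::blocking::Client;
use reqwest::header::REFERER;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::builder()
        .user_agent("PickyBot/0.6.0 (http://www.pickysear.ch/pickybot.html)")
        .build()?;

    // "First contact" with a URL: send as "Referer" the URL of the document
    // from which this URL was originally extracted (both URLs are placeholders).
    let response = client
        .get("https://www.example.com/some_page.html")
        .header(REFERER, "https://www.example.org/page_that_linked_to_it.html")
        .send()?;

    println!("status: {}", response.status());
    Ok(())
}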
- 20.Jan.2023: bugfix (0.5.2)
- Multiple changes in chained programs (e.g. to extract links contained in downloaded data, to lower CPU & disk usage, etc.).
- 04.Nov.2022: bugfix (0.5.1)
- Finished swapping Rust HTTP-client from "isahc" to "reqwest".
- 27.Oct.2022: new version (0.5.0)
- Partially swapped Rust HTTP-client from "isahc" to "reqwest".
- 11.Sep.2022: new version (0.4.0)
- Added default & max "Crawl-delay", other improvements.
- 15.May.2022: new version (0.3.0)
- Rewrote the code, migrating from Python to Rust.
- Now taking into consideration "Crawl-delay".
- 25.Apr.2021: new version (0.2)
- Initial code (Python, based on "scrapy").