PickyBot web crawler

Hi

PickyBot is (or will hopefully be) the searchbot/crawler/spider of the site "pickysear.ch".
The site doesn't exist yet as I'm still creating it.

If you accessed this page then it's probably because you noticed (e.g. when analyzing your logs) that your website was accessed by PickyBot.
Example of an Apache webserver log line generated by PickyBot accessing the site:

12.12.12.12 - - [11/Sep/2022:16:59:17 +0200] www.mydomain.com "GET /some_url.html HTTP/1.1" 200 12345 "PickyBot/0.4.0 (http://www.pickysear.ch/pickybot.html)"
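
If you want to check how often PickyBot shows up in your own logs, here is a minimal Python sketch. It assumes your log lines follow exactly the layout of the example above (including the virtual host column and the user agent as the last quoted field); the names `LOG_LINE`, `pickybot_hits` and the second sample line are made up for illustration, so adapt the pattern to your own log format.

```python
import re

# Pattern for the layout of the example line: client IP, two placeholder
# fields, timestamp, virtual host, request line, status, response size,
# and the user agent as the last quoted field.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] (?P<host>\S+) '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) "(?P<agent>[^"]*)"$'
)


def pickybot_hits(lines):
    """Yield the parsed fields of every line whose user agent starts with 'PickyBot/'."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("agent").startswith("PickyBot/"):
            yield match.groupdict()


# The first line is the example from this page; the second is a made-up
# browser request that should be skipped.
sample = [
    '12.12.12.12 - - [11/Sep/2022:16:59:17 +0200] www.mydomain.com '
    '"GET /some_url.html HTTP/1.1" 200 12345 '
    '"PickyBot/0.4.0 (http://www.pickysear.ch/pickybot.html)"',
    '10.0.0.1 - - [11/Sep/2022:17:00:00 +0200] www.mydomain.com '
    '"GET /index.html HTTP/1.1" 200 512 "Mozilla/5.0"',
]

hits = list(pickybot_hits(sample))
for hit in hits:
    print(hit["time"], hit["request"], hit["status"])
```

To run it against a real file, pass an open file object instead of `sample`, e.g. `pickybot_hits(open("access.log"))` (the file name is just a placeholder).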

How to get rid of PickyBot?

To cut it short, if for any reason you want PickyBot to stop accessing your website then please add the following to your "robots.txt":

User-agent: PickyBot
Disallow: /
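
If you'd like to double-check the rule before deploying it, the following is a small sketch using Python's standard `urllib.robotparser`. Please note that this is not the parser PickyBot itself uses, so it only shows how a typical robots.txt parser reads the two lines above; the URL and the name "SomeOtherBot" are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The two lines suggested above, exactly as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: PickyBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL on your own site.
url = "http://www.mydomain.com/some_url.html"

# PickyBot is matched by name and blocked from every path.
pickybot_allowed = parser.can_fetch("PickyBot", url)

# A crawler that is not named in the file stays unaffected by this rule.
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("PickyBot allowed:", pickybot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

The same check works for narrower rules: replace `Disallow: /` in `ROBOTS_TXT` with the paths you actually want to protect and test a few of your URLs.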

PickyBot's behaviour

In general, the bot/crawler is supposed to comply with the directives found in your "robots.txt", as long as they can be understood by Rust's "robotparser-rs" parser and the bot's code works as intended.
Additional information:

Information about PickyBot and its (potential) future website

I'm trying to build something similar to one of the usual public search engines, but which would focus on the precision of the search and its results.
I'm experimenting: e.g. right now I also index pure binary data (e.g. pictures, videos, compressed files, ...), so you might notice that the bot doesn't ignore such files (the current download limit is ~4 MiB).
In 2019/2020 I created an initial proof of concept using some millions of pages downloaded from "commoncrawl.org"; as the result was interesting, it's now time to see what happens when I try to index real/live data.

The future site will not try to keep users away from your website, quite the opposite: it will just show the best exact matches for a search (like a classical/simple search engine) together with the link to the original content hosted on your site.
I therefore don't intend to "steal" any of your content or users, my motivations being:

  1. Technical:
    an "exact index" already needs a huge amount of resources => I have to keep the storage used per unit of indexed data as low as possible to keep costs as low as possible => I cannot integrate fancy stuff (e.g. a locally stored page clone) without increasing those costs.
  2. Ethical:
    it's not great to offer third-party content directly through an intermediate website. Personally, I love seeing users hit my own website for stuff that I wrote.
  3. Balance:
    it's a "win-win-win" situation for everybody: for whoever offers the content, for whoever indexes it, and for whoever searches for (and hopefully finds) it.

Contact/Complaints

If PickyBot doesn't impact your limits/resources then please don't ban it by default 😛

If you have complaints/questions/requests/remarks/want_to_say_hello/anything you can contact me by submitting this form (no spam please - I'm not interested in anything).


Log