I considered it, but decided against it for a few reasons:
1. the main closure is useless unless the .php file is included manually with require(_once)/include(_once) ( https://github.com/EmanueleMinotto/crystal/wiki/Tip ), because otherwise it returns immediately at line 144
2. the autoloader becomes useless too, because Composer already does that job very well
3. the PHP world is full of great frameworks, microframeworks, routers and other components; Crystal would quickly become just another common library, and not a good one, because it doesn't follow development best practices, only a set of rules (defined in the CONTRIBUTING.md file)
Crystal can't be packaged as a Composer package, but Composer can be used in a project that includes Crystal.
They're not strange/crazy enough to make it a fun programming challenge, and for something that's intended for use (even non-production use) they make the code less usable.
"No namespaces" and "no new functions or classes" in particular don't make any sense. Anonymous functions are just like functions, except they're harder to identify in a stack trace.
You're right, I forgot to say so explicitly in the article, but if someone follows the instructions (extract all <a> tags and add them to the index), that method is implicit.
If a URL isn't covered by the robots.txt rules and doesn't have a canonical link, that's not a problem: the bot can't know whether those pages are different or not, so it has to treat the pages you linked as different.
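For what it's worth, the "extract all <a> tags and add them to the index" step could look roughly like this. It's only a sketch using Python's standard library; `LinkExtractor` and `extract_links` are names I made up for illustration, not something from the article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Drop the #fragment: it does not change the fetched document.
                self.links.append(urldefrag(absolute).url)


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href="..."> tags, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Note that query strings are kept as-is, so /a and /a?x=1 end up as two separate entries in the index.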
This is the reason why crawlers can't try to fill forms.
If you are really sure that these pages are the same, try checking the body content (if two or more pages have the same MD5 of the content, those pages are the same) or look for a form that generates those URLs.
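A minimal sketch of the MD5 check, assuming you already have the fetched bodies in memory as bytes. The function names and the `pages` dict are hypothetical, just for illustration.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """MD5 hex digest of the raw page body."""
    return hashlib.md5(body).hexdigest()


def group_duplicates(pages):
    """pages maps URL -> body bytes.

    Returns a list of URL groups whose bodies are byte-for-byte identical.
    """
    by_hash = {}
    for url, body in pages.items():
        by_hash.setdefault(content_fingerprint(body), []).append(url)
    return [urls for urls in by_hash.values() if len(urls) > 1]
```

This only catches byte-identical bodies. If the page embeds anything that changes per request (a timestamp, a session token), the digests will differ even though the pages are "the same" to a human.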
A NoSQL solution is good because it's a DBMS (allows you to order collections). :)
The filesystem is not good because you would need to keep the links' files ordered by visit time, descending (which most filesystems don't support), and to check whether a URL is in the index you would have to store it with its MD5 as the file name.
A small DBMS like SQLite is not good for obvious reasons.
I would highly recommend storing the working set of links in RAM (with checkpointing to write it out to disk periodically). A Redis Set (for visited links) + Sorted Set (for unvisited links, ordered by priority) is perfect for this, since it lets you take up one full machine's RAM and does checkpointing automatically. If your crawl is too big to fit in RAM, get more machines and shard by URL hash. As others have pointed out, the file content itself should go in files, ideally ones that you can write to with straight appends.
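To make the data structure concrete, here is a rough in-process stand-in for the Set + Sorted Set pair, plus the shard-by-URL-hash step. This is not Redis code: `Frontier` and `shard_for` are hypothetical names, and a plain Python set and heap play the role of the two Redis structures.

```python
import hashlib
import heapq


class Frontier:
    """Visited set plus a priority queue of unvisited URLs."""

    def __init__(self):
        self.visited = set()   # stands in for the Redis Set
        self._heap = []        # stands in for the Sorted Set
        self._queued = set()   # URLs currently waiting in the heap
        self._counter = 0      # tie-breaker: equal priorities pop in insertion order

    def add(self, url, priority):
        """Queue url unless it is already visited or queued. Returns True if queued."""
        if url in self.visited or url in self._queued:
            return False
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        self._queued.add(url)
        return True

    def pop(self):
        """Return the highest-priority unvisited URL and mark it visited, or None."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        self._queued.discard(url)
        self.visited.add(url)
        return url


def shard_for(url, num_shards):
    """Pick a shard from a stable hash of the URL (the same URL always maps to the same shard)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

One difference from a real Sorted Set: re-adding a queued URL here is ignored, whereas ZADD would update its score. The sketch also does no checkpointing, which is the part Redis gives you for free.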
The reason you don't want to hit the disk with each link (as both MySQL and PostgreSQL usually do, barring caching) is that there can be hundreds to thousands of links on a page. A disk hit takes ~10ms; if you need to run hundreds of those, it's well over a second per page just to figure out which links on it are unvisited. Accessing main memory is about 100,000 times faster; even with sharding and RPC overhead for a distributed memory cache, you end up way ahead.
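The arithmetic, spelled out as a back-of-the-envelope script. The 10 ms and 100,000x figures are the ones quoted above; the link counts are arbitrary examples, and it assumes one uncached disk hit per link.

```python
DISK_HIT_SECONDS = 0.010   # ~10 ms per disk hit (figure from the comment)
MEMORY_SPEEDUP = 100_000   # "about 100,000 times faster" (figure from the comment)


def lookup_cost_seconds(links_per_page, seconds_per_lookup):
    """Total time to check every link on one page, one lookup per link."""
    return links_per_page * seconds_per_lookup


for links in (100, 500, 1000):
    disk = lookup_cost_seconds(links, DISK_HIT_SECONDS)
    ram = lookup_cost_seconds(links, DISK_HIT_SECONDS / MEMORY_SPEEDUP)
    print(f"{links} links: disk {disk:.1f} s, RAM {ram * 1e6:.0f} microseconds")
```

So a page with 500 links costs about 5 seconds of disk time versus tens of microseconds in RAM, which leaves a lot of headroom for network round trips to a sharded cache.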
The reason to write the crawl text to an append-only log file is that disk seek times are bound by the mechanics of the disk (head movement and rotation speed), which haven't changed much recently, while disk bandwidth scales with rotation speed multiplied by the data density of the platter, which has gone way up. So appends are much more efficient on disk than seeks are.
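A sketch of what such an append-only crawl log might look like. The record layout (two 4-byte big-endian lengths, then the URL, then the body) is my own assumption for illustration, as are the function names.

```python
import struct

# Record header: URL length and body length, each a 4-byte big-endian unsigned int.
HEADER = struct.Struct(">II")


def append_page(log_path, url, body):
    """Append one (url, body) record to the end of the log; never seeks backwards."""
    url_bytes = url.encode("utf-8")
    with open(log_path, "ab") as log:
        log.write(HEADER.pack(len(url_bytes), len(body)))
        log.write(url_bytes)
        log.write(body)


def read_pages(log_path):
    """Yield (url, body) records by scanning the log sequentially from the start."""
    with open(log_path, "rb") as log:
        while True:
            header = log.read(HEADER.size)
            if len(header) < HEADER.size:
                return  # clean end of file (or a truncated trailing header)
            url_len, body_len = HEADER.unpack(header)
            url = log.read(url_len).decode("utf-8")
            body = log.read(body_len)
            yield url, body
```

A real crawler would keep the file open and batch writes rather than reopening it per page, but the access pattern is the point: writes only ever go to the end, and reads are one sequential pass.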