path: root/crawler/src/main.rs
Commit message (Author, Age, Files, Lines)
* Crawler: Fix bad error handling with match handling (Baitinq, 2022-10-25, 1 file, -6/+9)
* Crawler: Use async Client (Baitinq, 2022-10-25, 1 file, -6/+11)
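A minimal sketch of what the async fetch might look like, assuming the reqwest crate's async Client (the commit subject only says "async Client"; the function name and error handling here are illustrative):

```rust
use reqwest::Client;

// Fetch a page body asynchronously; awaiting instead of blocking lets
// many crawls be in flight on a single thread.
async fn crawl_url(client: &Client, url: &str) -> Result<String, String> {
    let response = client.get(url).send().await.map_err(|e| e.to_string())?;
    response.text().await.map_err(|e| e.to_string())
}
```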
* Crawler: Shuffle crawled urls (Baitinq, 2022-10-25, 1 file, -3/+2)
* Crawler: Add "correct" error handling (Baitinq, 2022-10-25, 1 file, -21/+23)
* Crawler: Parse urls with the "url" crate (Baitinq, 2022-10-25, 1 file, -25/+24)
  This fixes relative urls, improves url filtering and validation, and brings many other improvements.
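A sketch of why the "url" crate helps with relative urls: Url::join resolves absolute, root-relative, and protocol-relative hrefs uniformly against the page they were found on (the function name is illustrative, not from the commit):

```rust
use url::Url;

// Resolve a possibly-relative href against the url of the page it was
// scraped from. join() also rejects input that fails to parse.
fn resolve_href(page_url: &str, href: &str) -> Option<Url> {
    let base = Url::parse(page_url).ok()?;
    base.join(href).ok()
}
```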
* Crawler: Add crawled url filter (Baitinq, 2022-10-24, 1 file, -1/+8)
  This filters out hrefs such as "/", "#" or "javascript:".
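A sketch of such a filter, using only the examples named in the commit message (the real predicate may check more cases):

```rust
// Reject hrefs that don't lead anywhere new: the page itself ("/"),
// fragments ("#...") and javascript: pseudo-urls.
fn is_crawlable(href: &str) -> bool {
    !(href == "/" || href.starts_with('#') || href.starts_with("javascript:"))
}
```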
* Crawler: Set queue size to 2222 (Baitinq, 2022-10-24, 1 file, -1/+1)
* Crawler+Indexer: Rust cleanup (Baitinq, 2022-10-23, 1 file, -3/+2)
  Getting more familiar with the language, so fixed some non-optimal into_iter() usage, unnecessary .clone()s, and an unnecessary hack where we could just take a &mut for inserting into the indexer url database.
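A sketch of the "&mut instead of a hack" point, assuming the url database is map-like (the types and names are illustrative, not from the commit):

```rust
use std::collections::HashMap;

// Mutate the entry in place through a &mut borrow instead of cloning
// the value, modifying the copy, and re-inserting it.
fn record_url(db: &mut HashMap<String, Vec<String>>, word: &str, url: &str) {
    db.entry(word.to_string())
        .or_insert_with(Vec::new)
        .push(url.to_string());
}
```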
* Crawler: Replace println! with dbg! (Baitinq, 2022-10-23, 1 file, -7/+7)
* Crawler: Remove prepending of https:// to each url (Baitinq, 2022-10-23, 1 file, -5/+5)
  We now prepend it to the top-1000-urls list instead. This fixes crawled urls ending up with a doubled https:// prefix.
* Crawler: Only crawl 2 urls per url (Baitinq, 2022-10-23, 1 file, -0/+6)
  This makes it so that we don't get rate limited by websites.
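A sketch of the cap described here; only the limit of 2 comes from the commit, the rest is illustrative:

```rust
// Follow at most two of the links found on a page so no single site
// gets flooded with requests from the crawler.
fn urls_to_crawl(parsed_urls: Vec<String>) -> Vec<String> {
    parsed_urls.into_iter().take(2).collect()
}
```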
* Crawler: Change blockingqueue to channels (Baitinq, 2022-10-23, 1 file, -11/+11)
  We now use the async-channel crate's channel implementation. This allows us to have bounded async channels.
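A sketch of the bounded async channel from the async-channel crate (the capacity matches the "Set queue size to 2222" commit above; the task layout is illustrative):

```rust
async fn queue_demo() {
    // bounded() gives backpressure: send() suspends the task (rather
    // than blocking a thread) when the channel is full.
    let (tx, rx) = async_channel::bounded::<String>(2222);

    tx.send("https://example.com".to_string()).await.unwrap();

    // recv() suspends until an element is available.
    if let Ok(url) = rx.recv().await {
        println!("crawling {url}");
    }
}
```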
* Crawler: Implement basic async functionality (Baitinq, 2022-10-22, 1 file, -39/+45)
* Crawler: Add basic indexer communication (Baitinq, 2022-10-21, 1 file, -10/+46)
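A sketch of what indexer communication plausibly looks like: POSTing each crawled page to an indexer HTTP endpoint. The endpoint, port, and payload shape are assumptions, not taken from the commit:

```rust
// Hypothetical push of a crawled page to the indexer; requires
// reqwest's "json" feature and the serde_json crate.
async fn push_to_indexer(client: &reqwest::Client, url: &str, content: &str) -> Result<(), String> {
    client
        .post("http://127.0.0.1:4444/resource") // assumed endpoint
        .json(&serde_json::json!({ "url": url, "content": content }))
        .send()
        .await
        .map_err(|e| e.to_string())?;
    Ok(())
}
```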
* Crawler: Add Err string in the craw_url method (Baitinq, 2022-10-20, 1 file, -3/+3)
* Crawler: Add indexer interaction skeleton (Baitinq, 2022-10-20, 1 file, -1/+5)
* Crawler: Wrap crawl response in Result type (Baitinq, 2022-10-20, 1 file, -18/+23)
* Crawler: Normalise relative urls (Baitinq, 2022-10-20, 1 file, -2/+17)
  We now normalise urls starting with "/" (relative to root) and "//" (relative to protocol).
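A sketch of hand-rolled normalisation as described (this commit predates the switch to the "url" crate, so plain string handling; the https default and the names are illustrative):

```rust
fn normalise_url(origin: &str, href: &str) -> String {
    if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative: //example.com/a -> https://example.com/a
        format!("https://{}", rest)
    } else if href.starts_with('/') {
        // Root-relative: /a -> <origin>/a
        format!("{}{}", origin.trim_end_matches('/'), href)
    } else {
        href.to_string()
    }
}
```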
* Crawler: Remove duplicate parsed urls (Baitinq, 2022-10-20, 1 file, -0/+3)
* Crawler: Add basic html parsing and link-following (Baitinq, 2022-10-20, 1 file, -9/+34)
  Extremely basic implementation. Still needs a max queue size, error handling, and formatting of parsed links.
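A sketch of basic link extraction; the parsing crate (scraper here) is an assumption, since the commit only says "basic html parsing":

```rust
use scraper::{Html, Selector};

// Parse the document and collect the href attribute of every <a> tag.
fn extract_hrefs(html: &str) -> Vec<String> {
    let document = Html::parse_document(html);
    let anchors = Selector::parse("a").unwrap();
    document
        .select(&anchors)
        .filter_map(|a| a.value().attr("href"))
        .map(|href| href.to_string())
        .collect()
}
```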
* Crawler: Add skeleton crawler implementation (Baitinq, 2022-10-20, 1 file, -0/+40)
  Starts by filling a queue with the top 1000 most visited sites, "crawls" each one (empty fn), and blocks waiting for new elements on the queue.
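A sketch of the skeleton described: seed a queue, then loop, blocking on it for new urls and "crawling" each one with a still-empty function. A plain mpsc channel stands in for whatever queue the commit actually used, and the seed list is truncated for illustration:

```rust
use std::sync::mpsc;

fn crawl_url(_url: &str) {
    // Intentionally empty at this stage.
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // Seed with the top most-visited sites (a stand-in for the real
    // top-1000 list).
    for url in ["google.com", "youtube.com", "facebook.com"] {
        tx.send(url.to_string()).unwrap();
    }

    // Block for new elements on the queue; with tx still alive this
    // loop waits forever once the seed list is drained, as described.
    while let Ok(url) = rx.recv() {
        crawl_url(&url);
    }
}
```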
* Misc: Separate OSSE into components (Baitinq, 2022-10-19, 1 file, -0/+3)
  We now have a cargo workspace with the Crawler, Client and Indexer packages.