path: root/crawler
Commit message | Author | Age | Files | Lines
* Crawler: Only accept HTTP_STATUS_CODE: 200 as success in crawl_url() | Baitinq | 2022-10-28 | 1 | -3/+4
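A minimal sketch of the idea behind the entry above, assuming the crawler fetches pages with reqwest (the log does not show the actual function body): only a 200 OK response counts as a successful crawl, everything else becomes an Err.

```rust
// Hedged sketch: treat only HTTP 200 as success when crawling a url.
async fn crawl_url(client: &reqwest::Client, url: &str) -> Result<String, String> {
    let response = client.get(url).send().await.map_err(|e| e.to_string())?;

    // Redirect, client-error and server-error statuses are all treated as failed crawls.
    if response.status() != reqwest::StatusCode::OK {
        return Err(format!("{} returned status {}", url, response.status()));
    }

    response.text().await.map_err(|e| e.to_string())
}
```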
* Misc: Add TODOs | Baitinq | 2022-10-28 | 1 | -1/+0
* Crawler: Replace String::from with .to_string() | Baitinq | 2022-10-27 | 1 | -3/+6
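For readers unfamiliar with this refactor, the two forms build the same String; the change is purely stylistic.

```rust
fn main() {
    let before = String::from("https://example.com"); // old style
    let after = "https://example.com".to_string();    // new style: same String, reads left to right
    assert_eq!(before, after);
}
```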
* Crawler: Fix bad error handling with match handling | Baitinq | 2022-10-25 | 1 | -6/+9
* Crawler: Use async Client | Baitinq | 2022-10-25 | 2 | -7/+12
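A sketch of what "async Client" likely means in practice, assuming reqwest and a tokio runtime (neither crate is named in this log): build one shared client and reuse it for every request instead of creating a blocking client per call.

```rust
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // One shared asynchronous client, reused across all requests.
    let client = reqwest::Client::new();

    let body = client.get("https://example.com").send().await?.text().await?;
    println!("fetched {} bytes", body.len());
    Ok(())
}
```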
* Crawler: Shuffle crawled urls | Baitinq | 2022-10-25 | 2 | -4/+4
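A small sketch of shuffling the crawl frontier, assuming the rand crate; randomising the order spreads consecutive requests across different hosts.

```rust
use rand::seq::SliceRandom;
use rand::thread_rng;

fn main() {
    let mut urls = vec![
        "https://example.com/a".to_string(),
        "https://example.org/b".to_string(),
        "https://example.net/c".to_string(),
    ];

    // Randomise the crawl order so we do not hammer one host with consecutive requests.
    urls.shuffle(&mut thread_rng());
    println!("{urls:?}");
}
```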
* Crawler: Add "correct" error handlingBaitinq2022-10-251-21/+23
|
* Crawler: Parse urls with the "url" crate | Baitinq | 2022-10-25 | 2 | -25/+25
  This fixes relative urls, improves url filtering and validation, and brings many other improvements.
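The url crate is named in the commit itself; a short example of why it helps with relative urls: Url::join() resolves root-relative, protocol-relative and plain relative hrefs against the page they came from.

```rust
use url::Url;

fn main() -> Result<(), url::ParseError> {
    let base = Url::parse("https://example.com/blog/post")?;

    // join() resolves an href relative to the page it appeared on.
    assert_eq!(base.join("/about")?.as_str(), "https://example.com/about");
    assert_eq!(base.join("page2")?.as_str(), "https://example.com/blog/page2");
    assert_eq!(base.join("//cdn.example.com/x.js")?.as_str(), "https://cdn.example.com/x.js");

    Ok(())
}
```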
* Crawler: Add crawled url filter | Baitinq | 2022-10-24 | 1 | -1/+8
  This filters hrefs such as "/", "#" or "javascript:".
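A rough sketch of such a filter; the predicate below only covers the hrefs mentioned in the commit message and is not the project's exact rule set.

```rust
// Hedged sketch: skip hrefs that cannot be crawled.
fn should_crawl(href: &str) -> bool {
    !(href.is_empty() || href == "/" || href.starts_with('#') || href.starts_with("javascript:"))
}

fn main() {
    for href in ["/", "#top", "javascript:void(0)", "https://example.com/page"] {
        if should_crawl(href) {
            println!("crawl: {href}");
        } else {
            println!("skip:  {href}");
        }
    }
}
```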
* Crawler: Set queue size to 2222 | Baitinq | 2022-10-24 | 1 | -1/+1
* Crawler+Indexer: Rust cleanup | Baitinq | 2022-10-23 | 1 | -3/+2
  Getting more familiar with the language, so this fixes some non-optimal into_iter() usage, unnecessary .clone()s, and an unnecessary hack where we could simply take a &mut reference for inserting into the indexer url database.
* Crawler: Replace println! with dbg! | Baitinq | 2022-10-23 | 1 | -7/+7
* Crawler: Remove prepending of https:// to each url | Baitinq | 2022-10-23 | 2 | -1006/+1006
  We now prepend it to the top-1000-urls list instead. This fixes crawled urls ending up with a duplicated https:// prefix.
* Crawler: Only crawl 2 urls per url | Baitinq | 2022-10-23 | 1 | -0/+6
  This makes it so that we don't get rate-limited by websites.
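A sketch of the idea, not the project's exact code: cap how many of a page's discovered links get enqueued so a single site is not hit with a burst of requests. The constant name is invented for illustration.

```rust
// Hypothetical cap; the commit uses 2 links per crawled page.
const MAX_LINKS_PER_PAGE: usize = 2;

fn main() {
    let discovered = ["https://a.example/", "https://b.example/", "https://c.example/"];

    // Only the first few discovered links are pushed back onto the crawl queue.
    for link in discovered.iter().take(MAX_LINKS_PER_PAGE) {
        println!("enqueue {link}");
    }
}
```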
* Crawler: Change blockingqueue to channels | Baitinq | 2022-10-23 | 2 | -12/+12
  We now use the async-channel crate's implementation. This allows us to have bounded async channels.
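The async-channel crate is named in the commit; below is a minimal sketch of a bounded channel used as the crawl queue, with the 2222 capacity from the earlier "Set queue size" commit. The tokio runtime is an assumption.

```rust
#[tokio::main]
async fn main() {
    // A bounded channel gives back-pressure: send().await parks when the queue is full.
    let (tx, rx) = async_channel::bounded::<String>(2222);

    tokio::spawn(async move {
        tx.send("https://example.com".to_string()).await.unwrap();
        // tx is dropped here, which lets the receiver loop below finish.
    });

    while let Ok(url) = rx.recv().await {
        println!("crawl {url}");
    }
}
```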
* Crawler: Implement basic async functionality | Baitinq | 2022-10-22 | 2 | -39/+46
* Crawler: Add basic indexer communication | Baitinq | 2022-10-21 | 2 | -11/+48
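A heavily hedged sketch of what "indexer communication" could look like: the crawler pushes each crawled page to the indexer over HTTP. The endpoint, port and payload shape below are placeholders invented for illustration, not taken from this repository.

```rust
use std::collections::HashMap;

// Hypothetical helper: report one crawled page to the indexer.
// The url, port and JSON fields are placeholders, not the project's real API.
async fn push_to_indexer(client: &reqwest::Client, url: &str, content: &str) -> Result<(), String> {
    let mut payload = HashMap::new();
    payload.insert("url", url);
    payload.insert("content", content);

    client
        .post("http://127.0.0.1:4444/resource") // placeholder indexer endpoint
        .json(&payload) // requires reqwest's "json" feature
        .send()
        .await
        .map_err(|e| e.to_string())?;

    Ok(())
}
```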
* Crawler: Add Err string in the crawl_url method | Baitinq | 2022-10-20 | 1 | -3/+3
* Crawler: Add indexer interaction skeleton | Baitinq | 2022-10-20 | 1 | -1/+5
* Crawler: Wrap crawl response in Result type | Baitinq | 2022-10-20 | 1 | -18/+23
* Crawler: Normalise relative urls | Baitinq | 2022-10-20 | 1 | -2/+17
  We now normalise urls starting with / (relative to root) and // (relative to protocol).
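A rough sketch of that normalisation written as a plain string transform; the later "Parse urls with the url crate" commit higher up replaces this approach.

```rust
// Hedged sketch of normalising root-relative and protocol-relative hrefs.
fn normalise_href(origin: &str, href: &str) -> String {
    if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative: keep the host, reuse our scheme.
        format!("https://{rest}")
    } else if href.starts_with('/') {
        // Root-relative: append to the site origin.
        format!("{origin}{href}")
    } else {
        href.to_string()
    }
}

fn main() {
    let origin = "https://example.com";
    println!("{}", normalise_href(origin, "//cdn.example.com/x.js")); // https://cdn.example.com/x.js
    println!("{}", normalise_href(origin, "/about"));                 // https://example.com/about
    println!("{}", normalise_href(origin, "https://other.example/")); // unchanged
}
```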
* Crawler: Remove duplicate parsed urls | Baitinq | 2022-10-20 | 2 | -0/+4
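One way to do that de-duplication (a sketch, not necessarily how the commit implements it): keep a HashSet of urls already seen and drop repeats before they reach the queue.

```rust
use std::collections::HashSet;

fn main() {
    let parsed = vec![
        "https://example.com/a".to_string(),
        "https://example.com/b".to_string(),
        "https://example.com/a".to_string(), // duplicate
    ];

    // insert() returns false for urls we have already seen, so duplicates are filtered out.
    let mut seen = HashSet::new();
    let unique: Vec<String> = parsed.into_iter().filter(|u| seen.insert(u.clone())).collect();

    println!("{unique:?}"); // ["https://example.com/a", "https://example.com/b"]
}
```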
* Crawler: Add basic html parsing and link-following | Baitinq | 2022-10-20 | 2 | -9/+36
  Extremely basic implementation. Needs max queue size, error handling, formatting of parsed links.
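A sketch of basic link extraction using the scraper crate; the crate choice is an assumption, since the log does not say which HTML parser the crawler uses.

```rust
use scraper::{Html, Selector};

// Collect the href attribute of every anchor tag in a page.
fn extract_hrefs(body: &str) -> Vec<String> {
    let document = Html::parse_document(body);
    let anchors = Selector::parse("a[href]").expect("static selector is valid");

    document
        .select(&anchors)
        .filter_map(|a| a.value().attr("href"))
        .map(str::to_string)
        .collect()
}

fn main() {
    let html = r#"<html><body><a href="/about">About</a> <a href="#top">Top</a></body></html>"#;
    println!("{:?}", extract_hrefs(html)); // ["/about", "#top"]
}
```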
* Crawler: Add skeleton crawler implementation | Baitinq | 2022-10-20 | 3 | -0/+1042
  Starts by filling a queue with the top 1000 most visited sites. "Crawls" each one (empty fn), and blocks for new elements on the queue.
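A compressed sketch of that flow using a std mpsc channel in place of the blocking queue (a stand-in, not the crate the commit used): seed the queue, then block on recv() waiting for new urls.

```rust
use std::sync::mpsc;

// Empty in the skeleton commit; later commits fetch the page and return discovered links.
fn crawl_url(url: &str) -> Vec<String> {
    println!("crawling {url}");
    Vec::new()
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // The real skeleton seeds this with the top-1000 most visited sites.
    for seed in ["https://example.com", "https://example.org"] {
        tx.send(seed.to_string()).unwrap();
    }

    // recv() blocks until a new element arrives, so this loop waits indefinitely for more
    // work once the seeds are exhausted, mirroring the blocking queue described above.
    while let Ok(url) = rx.recv() {
        for found in crawl_url(&url) {
            tx.send(found).unwrap();
        }
    }
}
```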
* Misc: Separate OSSE into components | Baitinq | 2022-10-19 | 2 | -0/+15
  We now have a cargo workspace with the Crawler, Client and Indexer packages.