path: root/crawler
Commit message (Author, Date; Files changed, Lines -/+)
* Crawler: Add crawled url filter (Baitinq, 2022-10-24; 1 file, -1/+8)
    This filters out hrefs such as "/", "#" or "javascript:".
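As an illustration of such a filter, here is a minimal sketch in Rust. The function name `should_crawl` and the exact predicate are assumptions for this example, not the crawler's actual implementation:

```rust
// Hypothetical sketch of a crawled-url filter: reject hrefs that are
// not crawlable pages, such as "/", "#..." fragments or "javascript:" links.
fn should_crawl(href: &str) -> bool {
    !(href == "/" || href.starts_with('#') || href.starts_with("javascript:"))
}

fn main() {
    assert!(!should_crawl("/"));
    assert!(!should_crawl("#top"));
    assert!(!should_crawl("javascript:void(0)"));
    assert!(should_crawl("https://example.com/page"));
}
```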
* Crawler: Set queue size to 2222 (Baitinq, 2022-10-24; 1 file, -1/+1)
* Crawler+Indexer: Rust cleanup (Baitinq, 2022-10-23; 1 file, -3/+2)
    Getting more familiar with the language, so fixed some non-optimal into_iter() usage, unnecessary .clone()s, and an unnecessary hack where we could just take a &mut for inserting into the indexer url database.
* Crawler: Replace println! with dbg! (Baitinq, 2022-10-23; 1 file, -7/+7)
* Crawler: Remove prepending of https:// to each url (Baitinq, 2022-10-23; 2 files, -1006/+1006)
    We now prepend it to the top-1000-urls list instead. This fixes crawled urls ending up with a doubled https:// prefix.
* Crawler: Only crawl 2 urls per url (Baitinq, 2022-10-23; 1 file, -0/+6)
    This prevents us from getting rate-limited by websites.
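A per-page cap like this can be sketched as follows. The constant and function names here are hypothetical stand-ins, not the crawler's own identifiers:

```rust
// Hypothetical sketch: cap how many discovered links are followed per
// crawled page, so a single page cannot flood the queue and we avoid
// hammering hosts hard enough to get rate-limited.
const MAX_LINKS_PER_PAGE: usize = 2;

fn links_to_follow(parsed_links: Vec<String>) -> Vec<String> {
    parsed_links.into_iter().take(MAX_LINKS_PER_PAGE).collect()
}

fn main() {
    let links = vec!["a".to_string(), "b".to_string(), "c".to_string(), "d".to_string()];
    assert_eq!(links_to_follow(links), vec!["a".to_string(), "b".to_string()]);
}
```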
* Crawler: Change blockingqueue to channels (Baitinq, 2022-10-23; 2 files, -12/+12)
    We now use the async-channel implementation, which gives us bounded async channels.
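The commit uses the async-channel crate; as a self-contained illustration of the same bounded-queue behaviour, here is a sketch using the standard library's synchronous bounded channel instead (the urls and capacity are made up for the example):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Bounded channel: send() blocks once `capacity` items are in flight,
    // which is the property the crawler gets from bounded async channels.
    let (tx, rx) = sync_channel::<String>(2);

    let producer = thread::spawn(move || {
        for url in ["https://example.com/a", "https://example.com/b"] {
            tx.send(url.to_string()).unwrap();
        }
        // tx is dropped here, so the receiver's iterator terminates.
    });
    producer.join().unwrap();

    let received: Vec<String> = rx.iter().collect();
    assert_eq!(received.len(), 2);
}
```

The boundedness matters because an unbounded crawl frontier can grow without limit; a full channel applies backpressure to producers instead.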
* Crawler: Implement basic async functionality (Baitinq, 2022-10-22; 2 files, -39/+46)
* Crawler: Add basic indexer communication (Baitinq, 2022-10-21; 2 files, -11/+48)
* Crawler: Add Err string in the craw_url method (Baitinq, 2022-10-20; 1 file, -3/+3)
* Crawler: Add indexer interaction skeleton (Baitinq, 2022-10-20; 1 file, -1/+5)
* Crawler: Wrap crawl response in Result type (Baitinq, 2022-10-20; 1 file, -18/+23)
* Crawler: Normalise relative urls (Baitinq, 2022-10-20; 1 file, -2/+17)
    We now normalise urls starting with "/" (relative to the root) and "//" (relative to the protocol).
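The two normalisation cases can be sketched like this. The function name `normalise_url` and its string-splitting approach are assumptions for illustration; a production crawler would more likely use a dedicated url-parsing library:

```rust
// Hypothetical sketch: resolve root-relative ("/...") and
// protocol-relative ("//...") hrefs against the page they came from.
fn normalise_url(base: &str, href: &str) -> String {
    if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative: reuse the base url's scheme.
        let scheme = base.split("://").next().unwrap_or("https");
        format!("{}://{}", scheme, rest)
    } else if href.starts_with('/') {
        // Root-relative: join with the base url's origin (scheme + host).
        let origin: String = base.splitn(4, '/').take(3).collect::<Vec<_>>().join("/");
        format!("{}{}", origin, href)
    } else {
        // Already absolute (or some other form): leave untouched.
        href.to_string()
    }
}

fn main() {
    assert_eq!(
        normalise_url("https://example.com/page", "/about"),
        "https://example.com/about"
    );
    assert_eq!(
        normalise_url("https://example.com/page", "//cdn.example.com/img.png"),
        "https://cdn.example.com/img.png"
    );
    assert_eq!(
        normalise_url("https://example.com/page", "https://other.com/"),
        "https://other.com/"
    );
}
```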
* Crawler: Remove duplicate parsed urls (Baitinq, 2022-10-20; 2 files, -0/+4)
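Deduplicating parsed urls while preserving their order can be done with a HashSet; this sketch (with a hypothetical `dedup_urls` helper) shows the idea:

```rust
use std::collections::HashSet;

// Hypothetical sketch: drop repeated urls from a parsed-links list,
// keeping the first occurrence of each and preserving order.
fn dedup_urls(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    // HashSet::insert returns false if the value was already present,
    // so the filter keeps only first occurrences.
    urls.into_iter().filter(|u| seen.insert(u.clone())).collect()
}

fn main() {
    let urls = vec![
        "https://a.example".to_string(),
        "https://b.example".to_string(),
        "https://a.example".to_string(),
    ];
    assert_eq!(
        dedup_urls(urls),
        vec!["https://a.example".to_string(), "https://b.example".to_string()]
    );
}
```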
* Crawler: Add basic html parsing and link-following (Baitinq, 2022-10-20; 2 files, -9/+36)
    Extremely basic implementation. Needs a max queue size, error handling, and formatting of parsed links.
* Crawler: Add skeleton crawler implementation (Baitinq, 2022-10-20; 3 files, -0/+1042)
    Starts by filling a queue with the top 1000 most visited sites, "crawls" each one (an empty fn for now), and blocks waiting for new elements on the queue.
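The skeleton's crawl loop can be sketched as below. Note the simplifications: the seed list, `crawl_url` stub, and loop structure are illustrative assumptions, and this version drains a local queue rather than blocking on a shared one as the real skeleton does:

```rust
use std::collections::VecDeque;

// Stub "crawl": in the skeleton this is an empty fn that discovers nothing.
fn crawl_url(_url: &str) -> Vec<String> {
    Vec::new()
}

fn main() {
    // Stand-in for seeding the queue with the top-1000 most visited sites.
    let seed = ["https://example.com", "https://example.org"];
    let mut queue: VecDeque<String> = seed.iter().map(|s| s.to_string()).collect();

    let mut crawled = 0;
    // Pop a url, "crawl" it, and push any urls it discovers back on the queue.
    while let Some(url) = queue.pop_front() {
        for found in crawl_url(&url) {
            queue.push_back(found);
        }
        crawled += 1;
    }
    assert_eq!(crawled, 2);
}
```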
* Misc: Separate OSSE into components (Baitinq, 2022-10-19; 2 files, -0/+15)
    We now have a cargo workspace with the Crawler, Client and Indexer packages.