path: root/crawler

Commit log (each entry lists the commit message, author, date, files changed, and lines removed/added):
* Crawler: Add logging with env_logger (Baitinq, 2022-11-06, 2 files, -15/+14)
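
A minimal sketch of how env_logger-based logging is typically wired up in a Rust binary; the actual log statements in the crawler are not shown in this log, so the ones below are illustrative.

    use log::{debug, info};

    fn main() {
        // env_logger reads the RUST_LOG environment variable,
        // e.g. RUST_LOG=debug ./crawler
        env_logger::init();

        info!("crawler starting");
        debug!("only printed when RUST_LOG allows the debug level");
    }
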
* Indexer: Switch back to not serving frontend with actix (Baitinq, 2022-11-05, 1 file, -1/+1)
    This previously caused the frontend to be unresponsive when the crawler was passing results to the indexer. Now the frontend is again served independently by trunk and the API by actix, which makes them separate processes and lets the frontend remain responsive.
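
A rough sketch of the resulting split, assuming actix-web 4: actix serves only the HTTP API, while trunk serves the frontend as a separate process. The route and port below are illustrative, not taken from the indexer.

    use actix_web::{web, App, HttpResponse, HttpServer, Responder};

    // Hypothetical search endpoint; the indexer's real routes are not shown in this log.
    async fn search(query: web::Path<String>) -> impl Responder {
        HttpResponse::Ok().body(format!("results for {}", query.into_inner()))
    }

    #[actix_web::main]
    async fn main() -> std::io::Result<()> {
        // Only the API runs in this process; the frontend is served by trunk.
        HttpServer::new(|| App::new().route("/search/{query}", web::get().to(search)))
            .bind(("127.0.0.1", 4444))?
            .run()
            .await
    }
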
* Indexer+Frontend: Integrate with actix (Baitinq, 2022-11-05, 1 file, -1/+1)
* Misc: Cargo fmt (Baitinq, 2022-10-30, 1 file, -6/+6)
* Crawler: Set 4 as the maximum "crawl depth" (Baitinq, 2022-10-30, 1 file, -0/+1)
    It's not really crawl depth, as we just count the path segments.
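
The depth check can be approximated with the url crate as below; the helper name and where it is called from are assumptions.

    use url::Url;

    const MAX_CRAWL_DEPTH: usize = 4;

    // "Crawl depth" here is just the number of path segments, as the commit notes.
    fn within_crawl_depth(url: &Url) -> bool {
        let segments = url.path_segments().map(|s| s.count()).unwrap_or(0);
        segments <= MAX_CRAWL_DEPTH
    }
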
* Crawler: Accept max_queue_size as an argument for crawler() (Baitinq, 2022-10-30, 1 file, -3/+5)
    We also now set the max queue size to the larger of the root url list length and max_queue_size. This is useful because, previously, if someone added more entries to the root url list than max_queue_size, the crawler would hang.
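
A sketch of that sizing rule, assuming the queue is an async-channel bounded channel; the crawler() signature below is illustrative.

    use std::cmp::max;

    async fn crawler(root_urls: Vec<String>, max_queue_size: usize) {
        // Never size the queue below the seed list, otherwise filling it
        // with the root urls would block forever.
        let queue_size = max(root_urls.len(), max_queue_size);
        let (tx, rx) = async_channel::bounded::<String>(queue_size);

        for url in root_urls {
            tx.send(url).await.expect("queue closed");
        }

        // ... crawl loop consuming rx ...
        let _ = rx;
    }
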
* Frontend: Move app-specific code to app.rs (Baitinq, 2022-10-30, 1 file, -0/+1)
* Misc: Remove unneeded dependencies (Baitinq, 2022-10-30, 1 file, -1/+0)
* Misc: Add local lib crate to share common structs (Baitinq, 2022-10-30, 2 files, -7/+2)
* Crawler+Indexer+Frontend: Rename structs to follow logical relations (Baitinq, 2022-10-29, 1 file, -2/+2)
    Resource is now CrawledResource, as it is created by the crawler, and the previous CrawledResource is now IndexedResource, as it is created by the indexer.
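
A rough sketch of the relationship the rename encodes; only the struct names come from the commit, the fields shown are assumptions.

    // Produced by the crawler and handed to the indexer.
    pub struct CrawledResource {
        pub url: String,
        pub content: String,
    }

    // Produced by the indexer and served to the frontend.
    pub struct IndexedResource {
        pub url: String,
        pub title: String,
    }
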
* Crawler: Only accept HTTP_STATUS_CODE: 200 as success in crawl_url() (Baitinq, 2022-10-28, 1 file, -3/+4)
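
A sketch of that success check, assuming a reqwest client; the error type and the surrounding structure of crawl_url() are assumptions.

    use reqwest::StatusCode;

    async fn crawl_url(client: &reqwest::Client, url: &str) -> Result<String, String> {
        let response = client.get(url).send().await.map_err(|e| e.to_string())?;

        // Anything other than 200 OK is treated as a failed crawl.
        if response.status() != StatusCode::OK {
            return Err(format!("non-200 status code for {}", url));
        }

        response.text().await.map_err(|e| e.to_string())
    }
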
* Misc: Add TODOs (Baitinq, 2022-10-28, 1 file, -1/+0)
* Crawler: Replace String::from with .to_string() (Baitinq, 2022-10-27, 1 file, -3/+6)
* Crawler: Fix bad error handling by using match (Baitinq, 2022-10-25, 1 file, -6/+9)
* Crawler: Use async Client (Baitinq, 2022-10-25, 2 files, -7/+12)
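
A minimal sketch of a single shared async reqwest Client; the tokio runtime and the call sites are assumptions.

    use reqwest::Client;

    async fn fetch(client: &Client, url: &str) -> reqwest::Result<String> {
        // Reusing one Client lets reqwest pool connections across requests.
        client.get(url).send().await?.text().await
    }

    #[tokio::main]
    async fn main() -> reqwest::Result<()> {
        let client = Client::new();
        let body = fetch(&client, "https://example.com").await?;
        println!("fetched {} bytes", body.len());
        Ok(())
    }
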
* Crawler: Shuffle crawled urls (Baitinq, 2022-10-25, 2 files, -4/+4)
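
A sketch of shuffling the queued urls with the rand crate; the motivation (spreading requests across hosts) is an assumption, as the commit itself does not say.

    use rand::seq::SliceRandom;

    fn shuffle_urls(urls: &mut [String]) {
        urls.shuffle(&mut rand::thread_rng());
    }
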
* Crawler: Add "correct" error handlingBaitinq2022-10-251-21/+23
|
* Crawler: Parse urls with the "url" crate (Baitinq, 2022-10-25, 2 files, -25/+25)
    This fixes relative urls, improves url filtering and validation, and brings many other improvements.
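
The main win is Url::join(), which resolves relative hrefs against the page's base url; a sketch follows, where the scheme filter is an assumption.

    use url::Url;

    fn resolve_href(base: &Url, href: &str) -> Option<Url> {
        match base.join(href) {
            // Keep only http(s) results, dropping mailto:, javascript:, etc.
            Ok(url) if url.scheme() == "http" || url.scheme() == "https" => Some(url),
            _ => None,
        }
    }

For example, joining "../about.html" against "https://example.com/blog/post.html" yields "https://example.com/about.html".
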
* Crawler: Add crawled url filter (Baitinq, 2022-10-24, 1 file, -1/+8)
    This filters out hrefs such as "/", "#" or "javascript:".
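
A sketch of such a filter; beyond the three examples listed above, the exact predicate used by the crawler is an assumption.

    fn should_crawl(href: &str) -> bool {
        // Drop self-links, fragment-only links and javascript: pseudo-urls.
        !(href == "/" || href.starts_with('#') || href.starts_with("javascript:"))
    }
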
* Crawler: Set queue size to 2222 (Baitinq, 2022-10-24, 1 file, -1/+1)
* Crawler+Indexer: Rust cleanup (Baitinq, 2022-10-23, 1 file, -3/+2)
    Getting more familiar with the language, so this fixes some non-optimal into_iter() usage, unnecessary .clone()s, and an unnecessary hack where we could simply take a &mut for inserting into the indexer url database.
* Crawler: Replace println! with dbg! (Baitinq, 2022-10-23, 1 file, -7/+7)
* Crawler: Remove prepending of https:// to each url (Baitinq, 2022-10-23, 2 files, -1006/+1006)
    We now prepend it to the top-1000-urls list instead. This fixes crawled urls ending up with two https:// prefixes.
* Crawler: Only crawl 2 urls per url (Baitinq, 2022-10-23, 1 file, -0/+6)
    This makes it so that we don't get rate limited by websites.
* Crawler: Change blockingqueue to channels (Baitinq, 2022-10-23, 2 files, -12/+12)
    We now use the async-channel crate's channel implementation. This allows us to have bounded async channels.
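
A minimal sketch of a bounded async-channel queue; the capacity below mirrors the 2222 queue size that the 2022-10-24 commit sets, and the send/receive sites are illustrative.

    use async_channel::{bounded, Receiver, Sender};

    #[tokio::main]
    async fn main() {
        // Bounded async channel: send().await yields when the queue is full
        // instead of blocking a thread.
        let (tx, rx): (Sender<String>, Receiver<String>) = bounded(2222);

        tx.send("https://example.com".to_string()).await.unwrap();

        while let Ok(url) = rx.try_recv() {
            println!("would crawl {}", url);
        }
    }
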
* Crawler: Implement basic async functionality (Baitinq, 2022-10-22, 2 files, -39/+46)
* Crawler: Add basic indexer communication (Baitinq, 2022-10-21, 2 files, -11/+48)
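
A hypothetical sketch of the crawler reporting a crawled page to the indexer over HTTP; the endpoint, port and payload shape are assumptions, not taken from the indexer's API.

    async fn report_to_indexer(
        client: &reqwest::Client,
        url: &str,
        content: &str,
    ) -> Result<(), reqwest::Error> {
        // Hypothetical endpoint; the real indexer routes are not shown in this log.
        client
            .post("http://127.0.0.1:4444/resource")
            .json(&serde_json::json!({ "url": url, "content": content }))
            .send()
            .await?;
        Ok(())
    }
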
* Crawler: Add Err string in the crawl_url() method (Baitinq, 2022-10-20, 1 file, -3/+3)
* Crawler: Add indexer interaction skeleton (Baitinq, 2022-10-20, 1 file, -1/+5)
* Crawler: Wrap crawl response in Result type (Baitinq, 2022-10-20, 1 file, -18/+23)
* Crawler: Normalise relative urls (Baitinq, 2022-10-20, 1 file, -2/+17)
    We now normalise urls starting with / (relative to the root) and // (relative to the protocol).
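
A sketch of that normalisation (before the url crate was adopted); hardcoding https for protocol-relative urls is a simplification made here.

    // origin is e.g. "https://example.com"; href comes from a parsed <a> tag.
    fn normalise_url(origin: &str, href: &str) -> String {
        if let Some(rest) = href.strip_prefix("//") {
            // Protocol-relative: "//example.org/page" (https assumed here).
            format!("https://{}", rest)
        } else if href.starts_with('/') {
            // Root-relative: "/page" is joined onto the origin.
            format!("{}{}", origin.trim_end_matches('/'), href)
        } else {
            href.to_string()
        }
    }
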
* Crawler: Remove duplicate parsed urls (Baitinq, 2022-10-20, 2 files, -0/+4)
* Crawler: Add basic html parsing and link-following (Baitinq, 2022-10-20, 2 files, -9/+36)
    Extremely basic implementation. Still needs a max queue size, error handling, and formatting of parsed links.
* Crawler: Add skeleton crawler implementation (Baitinq, 2022-10-20, 3 files, -0/+1042)
    Starts by filling a queue with the top 1000 most visited sites. "Crawls" each one (an empty fn for now) and blocks for new elements on the queue.
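
A sketch of that skeleton, using a std mpsc channel in place of whatever blocking queue the crawler used at this point; names and structure are illustrative.

    use std::sync::mpsc;

    fn crawler(root_urls: Vec<String>) {
        let (tx, rx) = mpsc::channel::<String>();

        // Seed the queue with the top-1000 most visited sites.
        for url in root_urls {
            tx.send(url).unwrap();
        }

        // recv() blocks until a new element appears on the queue.
        while let Ok(url) = rx.recv() {
            crawl_url(&url, &tx);
        }
    }

    fn crawl_url(_url: &str, _queue: &mpsc::Sender<String>) {
        // Empty for now: later this fetches the page and pushes discovered links.
    }
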
* Misc: Separate OSSE into components (Baitinq, 2022-10-19, 2 files, -0/+15)
    We now have a cargo workspace with the Crawler, Client and Indexer packages.