Commit message | Author | Age | Files | Lines
---|---|---|---|---
* | Crawler: Add logging with env_logger | Baitinq | 2022-11-06 | 2 | -15/+14 |
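For reference, a minimal sketch of env_logger wiring (not taken from the repository; the log lines are placeholders):

```rust
// Minimal env_logger setup; verbosity is controlled at runtime via the
// RUST_LOG environment variable, e.g. `RUST_LOG=debug cargo run`.
fn main() {
    env_logger::init();

    log::info!("crawler starting");                 // placeholder message
    log::debug!("visible only with RUST_LOG=debug");
}
```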
* | Indexer: Switch back to not serving frontend with actix | Baitinq | 2022-11-05 | 1 | -1/+1 |
  This previously caused the frontend to be unresponsive while the crawler was passing results to the indexer. Now the frontend is again served independently by trunk and the API by actix, so they run as separate processes and the frontend stays responsive.
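A hedged sketch of the split this describes: actix-web serves only the API, while trunk runs the frontend as a separate process. The route and port are assumptions:

```rust
// Hypothetical sketch: the indexer exposes only its API through actix-web,
// while trunk serves the frontend from a separate process.
use actix_web::{get, App, HttpResponse, HttpServer, Responder};

#[get("/search")] // route path is an assumption
async fn search() -> impl Responder {
    HttpResponse::Ok().body("results placeholder")
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().service(search))
        .bind(("127.0.0.1", 4444))? // port is an assumption
        .run()
        .await
}
```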
* | Indexer+Frontend: Integrate with actix | Baitinq | 2022-11-05 | 1 | -1/+1 |
* | Misc: Cargo fmt | Baitinq | 2022-10-30 | 1 | -6/+6 |
* | Crawler: Set 4 as the maximum "crawl depth" | Baitinq | 2022-10-30 | 1 | -0/+1 |
  It's not really crawl depth, as we just count the path segments.
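A sketch of the path-segment counting described here; the function name and the filtering of empty segments are assumptions:

```rust
// Hypothetical sketch: approximate "crawl depth" by counting path segments.
use url::Url;

const MAX_CRAWL_DEPTH: usize = 4; // value taken from the commit message

fn within_depth(url: &Url) -> bool {
    // Count non-empty path segments; this is not a true link-follow depth.
    let depth = url
        .path_segments()
        .map(|segments| segments.filter(|s| !s.is_empty()).count())
        .unwrap_or(0);
    depth <= MAX_CRAWL_DEPTH
}
```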
* | Crawler: Accept max_queue_size as an argument for crawler() | Baitinq | 2022-10-30 | 1 | -3/+5 |
  We now set the max queue size to the larger of the root url list length and max_queue_size. Previously, if someone grew the root url list beyond max_queue_size, the crawler would hang.
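The sizing rule amounts to something like this (names are hypothetical):

```rust
// Hypothetical sketch: the queue must at least fit the whole root url list,
// otherwise filling it at startup would block forever.
fn queue_size(root_urls_len: usize, max_queue_size: usize) -> usize {
    std::cmp::max(root_urls_len, max_queue_size)
}

// e.g.: let (tx, rx) = async_channel::bounded(queue_size(root_urls.len(), 2222));
```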
* | Frontend: Move app-specific code to app.rs | Baitinq | 2022-10-30 | 1 | -0/+1 |
* | Misc: Remove unneeded dependencies | Baitinq | 2022-10-30 | 1 | -1/+0 |
* | Misc: Add local lib crate to share common structs | Baitinq | 2022-10-30 | 2 | -7/+2 |
* | Crawler+Indexer+Frontend: Rename structs to follow logical relations | Baitinq | 2022-10-29 | 1 | -2/+2 |
  Resource is now CrawledResource, as it is created by the crawler, and the previous CrawledResource is now IndexedResource, as it's created by the indexer.
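Roughly, the rename gives shapes like these; only the struct names come from the commit, the fields are assumptions:

```rust
// Hypothetical sketch of the renamed structs; fields are assumptions.
#[derive(Clone, Debug)]
pub struct CrawledResource {
    pub url: String,     // assumed field
    pub content: String, // assumed field
}

#[derive(Clone, Debug)]
pub struct IndexedResource {
    pub url: String,  // assumed field
    pub word: String, // assumed field
}
```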
* | Crawler: Only accept HTTP_STATUS_CODE: 200 as success in crawl_url() | Baitinq | 2022-10-28 | 1 | -3/+4 |
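A hedged sketch of such a check using reqwest; the signature and string error type are assumptions:

```rust
// Hypothetical sketch: reject anything that is not HTTP 200.
async fn crawl_url(client: &reqwest::Client, url: &str) -> Result<String, String> {
    let response = client
        .get(url)
        .send()
        .await
        .map_err(|e| e.to_string())?;

    if response.status() != reqwest::StatusCode::OK {
        return Err(format!("non-200 status: {}", response.status()));
    }

    response.text().await.map_err(|e| e.to_string())
}
```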
* | Misc: Add TODOs | Baitinq | 2022-10-28 | 1 | -1/+0 |
* | Crawler: Replace String::from with .to_string() | Baitinq | 2022-10-27 | 1 | -3/+6 |
* | Crawler: Fix bad error handling with match handling | Baitinq | 2022-10-25 | 1 | -6/+9 |
* | Crawler: Use async Client | Baitinq | 2022-10-25 | 2 | -7/+12 |
* | Crawler: Shuffle crawled urls | Baitinq | 2022-10-25 | 2 | -4/+4 |
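A sketch using the rand crate (helper name is hypothetical); shuffling presumably spreads consecutive requests across different hosts:

```rust
use rand::seq::SliceRandom;

// Hypothetical helper: randomise crawl order before queueing.
fn shuffle_urls(urls: &mut Vec<String>) {
    urls.shuffle(&mut rand::thread_rng());
}
```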
* | Crawler: Add "correct" error handling | Baitinq | 2022-10-25 | 1 | -21/+23 |
* | Crawler: Parse urls with the "url" crate | Baitinq | 2022-10-25 | 2 | -25/+25 |
  This fixes relative urls, improves url filtering and validation, and brings many other improvements.
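A sketch of the resolution the url crate enables (helper name is hypothetical):

```rust
use url::Url;

// Hypothetical helper: join() resolves "page.html", "/abs/path" and "../up"
// against the base url, which is what fixes relative hrefs.
fn resolve_href(base: &Url, href: &str) -> Option<Url> {
    base.join(href).ok()
}
```

For example, joining "b" onto a base of https://example.com/a/ yields https://example.com/a/b, while "/c" yields https://example.com/c.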
* | Crawler: Add crawled url filter | Baitinq | 2022-10-24 | 1 | -1/+8 |
  This filters out hrefs such as "/", "#" or "javascript:".
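The filter could be as simple as this hypothetical sketch:

```rust
// Hypothetical sketch of the href filter.
fn is_crawlable_href(href: &str) -> bool {
    !(href == "/" || href.starts_with('#') || href.starts_with("javascript:"))
}
```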
* | Crawler: Set queue size to 2222 | Baitinq | 2022-10-24 | 1 | -1/+1 |
* | Crawler+Indexer: Rust cleanup | Baitinq | 2022-10-23 | 1 | -3/+2 |
  Getting more familiar with the language, so fixed some non-optimal into_iter() usage, unnecessary .clone()s, and an unnecessary hack where we could just take a &mut for inserting into the indexer url database.
* | Crawler: Replace println! with dbg! | Baitinq | 2022-10-23 | 1 | -7/+7 |
* | Crawler: Remove prepending of https:// to each url | Baitinq | 2022-10-23 | 2 | -1006/+1006 |
  We now prepend it to the top-1000-urls list instead. This fixes crawled urls having two https:// prefixes.
* | Crawler: Only crawl 2 urls per url | Baitinq | 2022-10-23 | 1 | -0/+6 |
  This makes it so that we don't get rate-limited by websites.
* | Crawler: Change blockingqueue to channels | Baitinq | 2022-10-23 | 2 | -12/+12 |
  We now use the async-channel implementation, which gives us bounded async channels.
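A self-contained sketch of bounded async-channel usage; the capacity echoes the "queue size 2222" commit above, and the tokio runtime is an assumption:

```rust
#[tokio::main] // runtime choice is an assumption
async fn main() {
    // bounded() provides back-pressure: send().await parks the sender
    // (without blocking the thread) once the queue is full.
    let (tx, rx) = async_channel::bounded::<String>(2222);

    tx.send("https://example.com".to_string()).await.unwrap();
    drop(tx); // close the channel so the receive loop below terminates

    while let Ok(url) = rx.recv().await {
        println!("crawling {url}");
    }
}
```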
* | Crawler: Implement basic async functionality | Baitinq | 2022-10-22 | 2 | -39/+46 |
* | Crawler: Add basic indexer communication | Baitinq | 2022-10-21 | 2 | -11/+48 |
* | Crawler: Add Err string in the crawl_url method | Baitinq | 2022-10-20 | 1 | -3/+3 |
* | Crawler: Add indexer interaction skeleton | Baitinq | 2022-10-20 | 1 | -1/+5 |
* | Crawler: Wrap crawl response in Result type | Baitinq | 2022-10-20 | 1 | -18/+23 |
* | Crawler: Normalise relative urls | Baitinq | 2022-10-20 | 1 | -2/+17 |
  We now normalise urls starting with / (relative to the root) and // (relative to the protocol).
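A sketch of the string-based normalisation this describes; argument names are hypothetical, and the url-crate commit above (2022-10-25) later replaced this approach:

```rust
// Hypothetical sketch of string-based normalisation; argument names are
// assumptions. The url crate's join() later replaced this approach.
fn normalise(href: &str, scheme: &str, origin: &str) -> String {
    if let Some(rest) = href.strip_prefix("//") {
        // protocol-relative: "//example.com/x" -> "https://example.com/x"
        format!("{scheme}://{rest}")
    } else if href.starts_with('/') {
        // root-relative: "/x" -> "https://example.com/x"
        format!("{origin}{href}")
    } else {
        href.to_string()
    }
}
```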
* | Crawler: Remove duplicate parsed urls | Baitinq | 2022-10-20 | 2 | -0/+4 |
* | Crawler: Add basic html parsing and link-following | Baitinq | 2022-10-20 | 2 | -9/+36 |
  Extremely basic implementation. Needs a max queue size, error handling, and formatting of parsed links.
* | Crawler: Add skeleton crawler implementation | Baitinq | 2022-10-20 | 3 | -0/+1042 |
  Starts by filling a queue with the top 1000 most visited sites, "crawls" each one (empty fn for now), and blocks waiting for new elements on the queue.
* | Misc: Separate OSSE into components | Baitinq | 2022-10-19 | 2 | -0/+15 |
  We now have a cargo workspace with the Crawler, Client and Indexer packages.
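The workspace manifest would look roughly like this; the member paths are assumptions based on the package names in the message:

```toml
# Hypothetical root Cargo.toml; member names are assumptions based on the
# packages named in the commit message.
[workspace]
members = [
    "crawler",
    "client",
    "indexer",
]
```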