Let's start by looking at the existing process and how it can be improved.
Previous problems:
1. Data Completeness
Some clues rely on video, image, or audio material and become unanswerable with text alone. In a perfect world we could use all of these clues, but at a minimum I want to be able to identify and exclude them in text-only use cases.
Most clues with supplementary material have that material linked, which both provides a way to use the material in the future and a way to identify which clues are unusable right now.
Some clues require supplementary material that was never archived. These clues are not only unusable now and in the future, they are also hard to find: they are not denoted in any standardized way, usually just with [editorial notes] or an ellipsis..., indistinguishable from other irrelevant commentary that takes the same form.
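As a first pass at flagging both cases, a small heuristic over each clue's HTML can sort clues into usable, media-dependent, and suspicious buckets. This is a minimal sketch in Python: the idea that archived media shows up as links inside the clue text comes from what I described above, but the extension list and the classify_clue helper are my own assumptions, not j-archive's documented structure.

```python
from bs4 import BeautifulSoup

# File extensions that suggest archived supplementary material (assumed list).
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp3", ".wav", ".mov", ".wmv", ".mp4")

def classify_clue(clue_html: str) -> str:
    """Roughly classify a single clue's text cell.

    Returns one of:
      "text"        - plain text clue, usable as-is
      "media"       - links to archived supplementary material
      "suspicious"  - bracketed notes or an ellipsis suggest missing material
    """
    soup = BeautifulSoup(clue_html, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].lower().endswith(MEDIA_EXTENSIONS):
            return "media"
    text = soup.get_text(" ", strip=True)
    if "[" in text or "..." in text:
        return "suspicious"  # markers aren't standardized, so this bucket needs manual review
    return "text"
```

The "suspicious" bucket will over-flag harmless commentary, but that's the point: it narrows the set of clues that need a human look.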
2. Coupled Scrape/Load process
Previously, scraping was coupled with parsing and loading into the db. Changing any parsing/loading/formatting logic meant downloading every existing HTML page all over again, which takes hours under any rate limit respectful of the fragile j-archive servers.
Ideally I would like a "data lake" of raw HTML that separate processes can process and build upon. That will also go a long way toward solving the challenges in the actual load process, which will be the subject of the next blog entry.
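A small building block for that data lake is a deterministic mapping from page URL to local path, so the download step and any later parsing steps agree on where raw HTML lives. A minimal sketch; the data_lake/ directory name and the naming scheme are placeholders I'm assuming here, not a final layout.

```python
from pathlib import Path
from urllib.parse import urlparse

DATA_LAKE = Path("data_lake")  # assumed local root for raw HTML

def lake_path(url: str) -> Path:
    """Map a page URL to a stable location on disk, e.g.
    https://j-archive.com/showgame.php?game_id=7000
      -> data_lake/j-archive.com/showgame.php_game_id=7000.html
    """
    parsed = urlparse(url)
    name = parsed.path.lstrip("/") or "index"
    if parsed.query:
        name += "_" + parsed.query  # keep the query string, but avoid "?" in filenames
    return DATA_LAKE / parsed.netloc / f"{name}.html"
```

Everything downstream (parsing, loading, reformatting) then reads only from data_lake/, so reworking those steps never triggers another multi-hour crawl.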
Plan:
Basic web crawler/scraper BFS approach:
Start at the root of the site. Download the root page, and grab all links from the page. Then, for each link, download the page, and repeat.
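In code, that is just a queue and a visited set. A minimal sketch, assuming the requests and BeautifulSoup libraries, a single-domain crawl, and a placeholder politeness delay rather than a measured limit for j-archive:

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(root_url: str, delay_seconds: float = 5.0):
    """Breadth-first crawl starting at root_url, yielding (url, html) pairs."""
    root_host = urlparse(root_url).netloc
    queue = deque([root_url])
    seen = {root_url}
    while queue:
        url = queue.popleft()
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        html = response.text
        yield url, html
        # Enqueue same-site links we haven't visited yet.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == root_host and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay_seconds)  # stay well under any rate limit
```

The caller decides what to do with each (url, html) pair -- for this project, just write the raw HTML to disk and move on.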
Dev progress:
Currently the focus is on keeping good raw HTML data:
- Downloading season list HTML
- Scraping season list HTML for season links, then downloading those HTML files
- Scraping each season page for game links, then downloading those HTML files (the whole pipeline is sketched below)
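Because j-archive is a strict three-level hierarchy (season list → season pages → game pages), the crawl doesn't even need the general BFS above; a fixed nested loop covers it. A rough sketch: the listseasons.php / showseason.php / showgame.php URL patterns are what the site's links look like, but the fetch_and_save helper, file names, and delay value are assumptions.

```python
import re
import time
from pathlib import Path

import requests

BASE = "https://j-archive.com/"
RAW_DIR = Path("data_lake/j-archive.com")  # assumed raw-HTML directory
DELAY = 5.0  # placeholder politeness delay

def fetch_and_save(url: str, name: str) -> str:
    """Download url, save the raw HTML under RAW_DIR/name, and return the HTML."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    html = requests.get(url, timeout=30).text
    (RAW_DIR / name).write_text(html, encoding="utf-8")
    time.sleep(DELAY)
    return html

# Level 1: the season list.
season_list_html = fetch_and_save(BASE + "listseasons.php", "listseasons.html")

# Level 2: every season page linked as showseason.php?season=...
for season in dict.fromkeys(re.findall(r'showseason\.php\?season=([^"&]+)', season_list_html)):
    season_html = fetch_and_save(f"{BASE}showseason.php?season={season}", f"season_{season}.html")

    # Level 3: every game page in that season, linked as showgame.php?game_id=...
    for game_id in dict.fromkeys(re.findall(r'showgame\.php\?game_id=(\d+)', season_html)):
        fetch_and_save(f"{BASE}showgame.php?game_id={game_id}", f"game_{game_id}.html")
```

Parsing here is limited to pulling out links; everything about clues and games stays untouched in the saved HTML for later processing.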
The current problems I'm running into are related to the process being long and flaky. Right now the flakiness mostly comes from me starting and stopping the process during active development, but network and file operations are inherently flaky anyway, so a long-running process that depends on their stability is asking for trouble.
Constantly starting and stopping the process raises the following concerns:
- Data integrity: How do we know a file was fully and properly written? How do we know a request completed successfully and returned the full HTML we want? (See the sketch after this list.)
- Processing status: How do we know we've already processed a link? Can we skip files we already have and just pick back up where we left off? If we do, how do we know when a file is out of date and actually needs to be updated?
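For the data-integrity concern, two cheap safeguards cover most of it: write atomically (to a temp file, then rename into place) so a killed process never leaves a half-written file behind, and sanity-check the response before saving. A minimal sketch; the "ends with </html>" test is a heuristic assumption about what a complete page looks like, not a guarantee.

```python
import os
import tempfile
from pathlib import Path

import requests

def looks_complete(html: str) -> bool:
    """Heuristic: a fully delivered page should close its <html> element."""
    return "</html>" in html[-2000:].lower()

def save_atomically(path: Path, html: str) -> None:
    """Write to a temp file in the same directory, then rename into place.

    os.replace is atomic on the same filesystem, so a reader sees either the
    old file, the complete new file, or no file -- never a partial write.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(html)
    os.replace(tmp_name, path)

def download(url: str, path: Path) -> None:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors instead of saving an error page
    if not looks_complete(response.text):
        raise ValueError(f"response from {url} looks truncated")
    save_atomically(path, response.text)
```

With this in place, "the file exists on disk" becomes a much stronger signal that the page was fetched and written correctly.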
TODO: add some validation checks on HTML files. Most of the
The only intelligence right now is that it will not overwrite an HTML file that already exists
- Is this how it should work? What if a page exists but is only partially complete? (This happens every night!)
- Maybe it won't matter if there's a scheduled "refresh" that ignores the overwrite protection
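Putting those ideas together, the overwrite protection could become a small policy function: skip a page only when we already have it and it looks complete, and let a scheduled refresh run bypass the check entirely. This is a sketch of one possible policy, not a settled design; the force_refresh flag and the completeness heuristic are assumptions carried over from the earlier sketches.

```python
from pathlib import Path

def should_download(path: Path, force_refresh: bool = False) -> bool:
    """Decide whether to (re)download the page whose raw HTML would live at path."""
    if force_refresh:
        return True  # the scheduled "refresh" run ignores overwrite protection
    if not path.exists():
        return True
    html = path.read_text(encoding="utf-8", errors="replace")
    # A saved file that never closes its <html> element was probably grabbed
    # mid-update (e.g. the page for a game that was still airing) -- redo it.
    return "</html>" not in html[-2000:].lower()
```

A normal crawl would check should_download before each fetch so interrupted runs pick up where they left off, while a nightly or weekly job passes force_refresh=True to sweep up pages that were partial when first scraped.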