Cluebase - A Case Study
by Luke Lavin, published October 30th, 2024

It's been about 5 years since I created Cluebase, a general-use API for retrieving Jeopardy! data of all sorts--clues, episodes, contestants, and more. Since then, I've received more questions, requests, and even success stories (Congrats, Kate!) than I could have ever expected. I'm always impressed by the internet's ability to help even the most niche projects find their target audience. Thank you to everyone who ever tried it out!

Unfortunately, Cluebase's usefulness was at times in spite of its own quality (or lack thereof). Since its creation, it has been limited by oversights, incompleteness, and just plain bugs.

Now, years later, with a fresh set of eyes and professional experience in data engineering, I'm going to attempt to rebuild Cluebase from the ground up, to hopefully learn some do's and don'ts and give the project the love it deserves.

A vision for Cluebase 2.0

Part of what hindered Cluebase's development was its lack of direction and commitment. It was just a toy project for practice. Cluebase's primary purpose was not to be a quality API, or to be useful for general-public consumer applications, but rather to help me gain some practical software engineering experience during a college summer when I had no internship. At some point, I had the idea of using Cluebase after its creation to build a Jeopardy! practice app. Conveniently, the prospect of developing this higher-level consumer would provide an opportunity for me to revisit Cluebase and soften its rough edges. This turned a lot of "I should do this better" moments into "I'll save that for later" moments. Then, for one reason or another, that follow-up project fell by the wayside, leaving Cluebase without any updates or improvements since its release.

If Cluebase was held back by its lack of direction, then it makes sense to start "Cluebase 2.0" with an examination of what exactly it should accomplish:

Clue data should be the focus of Cluebase. Ideally, Cluebase is useful as a source of truth for Jeopardy! clues. If supplementary data like episodes and contestants is to be accessible at all, it should be only where it doesn't complicate maintaining good clue data.

Maintenance of Cluebase should be simple. It is likely that this project could fall by the wayside for years yet again. The project needs to be positioned to need little intervention and, when intervention is necessary, to be easy to jump back into at any moment. Infrastructure setup should be scripted/containerized where possible. Design should be documented (more simply than a wordy blog post 😶). All processes should have robust error handling and recovery. Ingestion and transformation of new data should be automated, transparent, and modular--when working properly it should be hands-off, and when it fails it should be easy to pick back up at the point of failure. APIs should be versioned so changes can be deployed without breaking consumer services. Code should be open-source so committed users can maintain Cluebase for their own personal use in my absence.
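To make "easy to pick back up at the point of failure" a bit more concrete, here is a minimal sketch of a checkpointed pipeline runner. Everything in it is an assumption for illustration: the `run_pipeline` function, the stage names, and the JSON checkpoint file are hypothetical and not part of Cluebase. The idea is only that each completed stage gets recorded, so a rerun after a failure skips what already succeeded.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical: a stage is just a name plus a zero-argument function to run.
Stage = tuple[str, Callable[[], None]]


def run_pipeline(stages: list[Stage], checkpoint: Path) -> list[str]:
    """Run stages in order, skipping any that an earlier run already completed.

    Returns the names of the stages that actually ran this time.
    """
    # The checkpoint file holds a JSON list of completed stage names.
    done: list[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    ran: list[str] = []
    for name, step in stages:
        if name in done:
            continue  # finished in an earlier run, so don't repeat it
        step()  # if this raises, the checkpoint still reflects the last good stage
        done.append(name)
        checkpoint.write_text(json.dumps(done))
        ran.append(name)
    return ran
```

If a hypothetical "transform" stage blows up, fixing the bug and calling `run_pipeline` again with the same checkpoint file would resume at "transform" rather than re-scraping everything.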

Clients (human users or other services) should be able to easily use Cluebase to get clues for targeted research or practice. The primary application of Cluebase should be retrieving clues by category and/or by difficulty. Additional, more client-tailored applications of the clue data are to be expected, but should be handled as a secondary data layer on top of the base-level clue data, with processes and concerns separate from the base data processing. The base Cluebase processing should, however, do its best to not complicate such processes.
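As a rough illustration of that primary use case, here is a sketch of filtering clues by category and difficulty. The `Clue` fields and the `find_clues` function are hypothetical, not Cluebase's actual schema or API, and using the dollar value as a stand-in for difficulty is my own simplifying assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Clue:
    # Hypothetical fields, for illustration only.
    category: str
    value: int  # dollar value, used here as a rough proxy for difficulty
    clue: str
    response: str


def find_clues(
    clues: Iterable[Clue],
    category: Optional[str] = None,
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> list[Clue]:
    """Return clues matching every filter that was provided, in input order."""
    matches = []
    for clue in clues:
        # Category comparison ignores case; a filter left as None is skipped.
        if category is not None and clue.category.casefold() != category.casefold():
            continue
        if min_value is not None and clue.value < min_value:
            continue
        if max_value is not None and clue.value > max_value:
            continue
        matches.append(clue)
    return matches
```

A practice app could then ask for something like `find_clues(all_clues, category="science", min_value=800)` to drill only the harder science clues.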

Cluebase data should be useful for training/tuning natural language models, or doing other machine-learning-tailored tasks. Jeopardy's hundreds of thousands of question/answer pairs should be made useful even beyond the scope of trivia games. It should be possible to export the Cluebase dataset in a practical format for ML use cases.
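One plausible "practical format" is JSON Lines, where each line is one standalone JSON record. The sketch below assumes clues arrive as dictionaries with `clue`, `response`, and `category` keys, and that a `prompt`/`completion` layout is what the consumer wants. Both are assumptions for illustration, not a decided export format.

```python
import json
from typing import Iterable, TextIO


def export_jsonl(clues: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line and return how many records were written."""
    count = 0
    for clue in clues:
        # Hypothetical record layout: the clue text is the prompt,
        # the contestant's response is the completion.
        record = {
            "prompt": clue["clue"],
            "completion": clue["response"],
            "category": clue["category"],
        }
        # ensure_ascii=False keeps accented characters readable in the file.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```

Because it writes to any text stream, the same function works with an open file, `sys.stdout`, or an in-memory `io.StringIO` for testing.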

Cluebase should utilize tools that, all else being equal, are used in the industry. Although I still want to keep the project uncomplicated and its upkeep straightforward, I would still like to be gaining experience I can talk about, be it on a resume, in interviews, or even on this blog. Plus, it's just more fun for me if I get to learn new technology along the way. 😅

Moving from 1.0 to 2.0

So, if that's what 2.0 should look like, how do I get there from 1.0?

Well, to start, we should look at Cluebase 1.0's implementation and capabilities.

A simplified architecture diagram of Cluebase 1.0

Cluebase 1.0 was an API offering alone. Cluebase was separate from the scraping process, and did not offer any user-facing app to consume the API.

The project was entirely contained in an EC2 instance, composed (pun intended) of 4 Docker images--the Flask service, PostgreSQL, Redis, and nginx. Deployment was done by running now-defunct Docker Machine commands on my local machine with the latest changes.

Even beyond that, the J! Archive scraping was done only once ahead of time. Clues were scraped into a local database, and the database image was then built pre-loaded with that data.

Although several users reported that they were able to use it to practice trivia, those who did had to write their own consumers, ranging from command-line tools to Discord bots.

For an inexperienced developer like me, this format let me focus on just the API development. However, the missing pieces here mean there's a decent amount of work to do to upgrade Cluebase to its full potential and support its continued existence. Thankfully, it also means that there's a lot of development to write about, so let's get started!