Goodreads + SQL = Fun

January 12, 2020 books reading SQL

I've been learning SQL (and particularly SQLite) slowly and casually over the last couple months and as I mentioned in my New Year's post, I want to continue that and use it in more ways and in more projects. Thus far I've read through a lot of SQLite documentation, done a decent number of Hackerrank problems, gone through the W3Schools questions, and more importantly used SQLite in some random Python scripts and added a SQLite based implementation to CPIM2.

I realized soon after starting with SQL that one dataset that would be fun to play with would be my Goodreads library. Using their export function which gives you a csv file and a relatively simple Python script, I could create a SQLite3 database. Since I had another few books I wanted to read before the end of the year and plenty more SQL exercises and such to occupy me I decided to wait so I could have all of 2019 available for analysis and statistics, and finally it's time.

I've been using Goodreads for several years now and while I've probably missed a book here and there, I've been reasonably consistent about adding books since I first created an account. When I first started I added 100's of books immediately in a futile attempt to create a definitive list of every book (or at least every fictional one) I've ever read.

Given that, take my ratings for the first couple hundred or so books with a grain of salt. I have gone back and adjusted some occasionally to be more inline with my current ratings but even if I remembered all of them it's impossible to rate things perfectly, especially years or decades after you experienced it.

There are a few multi-book items in there, so items is not quite equivalent to books.

One last thing; unfortunately the goodreads export system is not perfect and for some reason does not export date_read reliably at all (just leaves it blank most of the time) so that precludes some more interesting analysis I wanted to do for the last few years.

Without further ado, here are some statistics:

Statistic Result
total_items 515
total_pages 197367.0
avg(pages) 383.982490272374

And here is a break down by binding (aka format):

binding count(binding)
Paperback 211
Kindle Edition 152
Hardcover 70
Mass Market Paperback 58
ebook 20
Audible Audio 3
1

If you want to see the data and perform some SQL queries of your own (in the browser!), I put the project up on REPL so you can see it here. Just select which repl "Goodreads + SQL".


© Robert Winkler, 2010 - 2022. Site design based on an old version of Michael Fogelman's awesome site.