Instead of downloading the full Wikipedia dump, extracting it, and then running a ragel script over the XML file, can we just do it all in memory? Pseudocode: curl -s http://dumps.wikimedia.org/.../enwiki-20170220-pages-articles-multistream.xml.bz2 | bzcat | ./extract-movies enwiki
Rationale: Having 100 GB of free space is a rare occurrence for me.
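
A minimal sketch of what this could look like, assuming ./extract-movies can read the XML from stdin and that the dump lives under the usual /enwiki/<date>/ path (the exact URL below is a guess, not taken from the repo):

    # Hypothetical URL; adjust the date/path to the dump you actually want.
    # curl streams the compressed dump, bzcat decompresses it on the fly,
    # and extract-movies consumes the XML from stdin -- nothing touches the disk.
    curl -sL "https://dumps.wikimedia.org/enwiki/20170220/enwiki-20170220-pages-articles-multistream.xml.bz2" \
      | bzcat \
      | ./extract-movies enwiki

The trade-off is that an interrupted or failed run has to re-download the whole stream, whereas a saved .bz2 can be re-read locally.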