I like reading papers on arXiv, but I like discovering them more through Andrej Karpathy’s arxiv-sanity-lite. The little challenge with the latter is that there is no way to get those papers in an RSS feed that I can then hook up to an RSS reader, like Feedly or NetNewsWire if you’re on a macOS machine.
So, as a starting step, I thought I’d try and fix this with an open-source project called arxiv-sanity-feeds. The idea is very simple - there is a GitHub Actions job that runs daily and scrapes the content from arxiv-sanity-lite. The scraper is a Python script that reads the content and re-structures it into an RSS feed. That feed is then published on my DigitalOcean Spaces bucket, from where it can be consumed by anyone around the globe.
But hold on - does that mean that there is an API that I can tap into? Kind of. What Andrej did for his project is expose every paper that is in the view in JSON format. So, let’s say you see this page:
If you view the source through your browser, you might notice this little snippet that will end up being very helpful for any future exploration:
The entire page is represented as structured JSON, so I don’t really need to fiddle with HTML parsing, which is a can of worms I try to delay opening as much as possible. If I beautify this JSON, it will look like this (I trimmed the full text a bit):
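To make the idea concrete, here is a minimal sketch of how that embedded JSON could be pulled out of the page source and mapped to RSS nodes. It is not the code from the repo. It assumes the page assigns the list to a JavaScript variable named `papers` inside a `<script>` tag, and that each entry carries `id`, `title`, and `summary` keys; the `extract_papers` and `build_rss` helpers are hypothetical names used only for illustration.

```python
import json
import re
import xml.etree.ElementTree as ET

# Assumption: the page source contains a line like `var papers = [...];`.
PAPERS_MARKER = re.compile(r"var\s+papers\s*=\s*")


def extract_papers(html: str) -> list:
    """Return the JSON list that follows the `var papers =` marker."""
    match = PAPERS_MARKER.search(html)
    if match is None:
        raise ValueError("no embedded paper list found in page source")
    # raw_decode parses one JSON value and ignores whatever follows it.
    papers, _ = json.JSONDecoder().raw_decode(html, match.end())
    return papers


def build_rss(papers: list, title: str, link: str) -> str:
    """Map each paper dict to an <item> node in an RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for paper in papers:
        item = ET.SubElement(channel, "item")
        # Assumed keys: "id", "title", "summary".
        ET.SubElement(item, "title").text = paper.get("title", "")
        ET.SubElement(item, "link").text = "https://arxiv.org/abs/" + paper["id"]
        ET.SubElement(item, "description").text = paper.get("summary", "")
    return ET.tostring(rss, encoding="unicode")
```

Using `raw_decode` instead of a regular expression over the whole array avoids guessing where the JSON ends, which is the part that usually goes wrong when scraping data out of inline scripts.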
This is wonderful, because my parser can now extract structured data and simply assign it to the proper nodes that make up the final RSS feed. You can see the code in action on GitHub. When that part of the process is complete, I use the boto3 library (DigitalOcean Spaces mirrors the AWS S3 API surface) to push the content to a pre-defined bucket that is already connected to a CDN and my custom domain.
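For the upload step, a rough sketch could look like the following. Because Spaces speaks the S3 protocol, a regular boto3 S3 client pointed at a Spaces endpoint should be enough. The region, bucket name, object key, and environment variable names here are placeholders I made up for illustration, and the `make_spaces_client` and `upload_feed` helpers are hypothetical, not the functions used in the repo.

```python
def make_spaces_client(region: str, access_key: str, secret_key: str):
    """Create an S3-compatible client pointed at a DigitalOcean Spaces region."""
    import boto3  # third-party: pip install boto3

    return boto3.client(
        "s3",
        region_name=region,
        endpoint_url=f"https://{region}.digitaloceanspaces.com",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def upload_feed(client, bucket: str, object_key: str, feed_xml: str) -> None:
    """Upload the feed as a publicly readable object with an RSS content type."""
    client.put_object(
        Bucket=bucket,
        Key=object_key,
        Body=feed_xml.encode("utf-8"),
        ACL="public-read",
        ContentType="application/rss+xml",
    )


# Example usage (placeholder names, assumed to be set as CI secrets):
#   import os
#   client = make_spaces_client("nyc3", os.environ["SPACES_KEY"], os.environ["SPACES_SECRET"])
#   upload_feed(client, "my-bucket", "feeds/home.xml", feed_xml)
```

Passing the client in as an argument keeps the upload function easy to exercise with a stub, so the scheduled job does not need real credentials just to test the feed-building logic.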
If you view the arxiv-sanity-feeds README file, you’ll see quick links to all available feeds that you can subscribe to:
Or you can just click them from this blog:
- Home page feed - this is what gets rendered when you go to arxiv-sanity-lite.
- Most recent papers for the week
- Random papers from last week
As part of the process, I also wanted to make sure that I validate the feeds. W3C offers a validator, but there is no API, so I wrapped the call to the web service in a GitHub Action that boils down to this Bash script:
It’s a crude way to check if the response contains markers of a valid feed, and fail or succeed based on this data. This job is also scheduled to make sure that every generated feed is automatically verified for issues.
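The same marker-based check can be sketched in Python for anyone who prefers it over Bash. This is an illustration of the approach rather than the script used in the Action. The validator URL and the success marker string are assumptions about how the W3C Feed Validation Service responds, and `check_feed`, `looks_valid`, and `validator_query` are hypothetical helper names.

```python
import urllib.parse
import urllib.request

# Assumptions: the validator accepts a `url` query parameter and prints this
# sentence on its result page when the feed passes.
VALIDATOR_URL = "https://validator.w3.org/feed/check.cgi"
VALID_MARKER = "This is a valid RSS feed"


def validator_query(feed_url: str) -> str:
    """Build the validator request URL for a given feed."""
    return VALIDATOR_URL + "?" + urllib.parse.urlencode({"url": feed_url})


def looks_valid(response_html: str) -> bool:
    """Crude check: the success marker appears somewhere in the response."""
    return VALID_MARKER in response_html


def check_feed(feed_url: str, timeout: float = 30.0) -> bool:
    """Fetch the validator result page and look for the success marker."""
    with urllib.request.urlopen(validator_query(feed_url), timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return looks_valid(body)
```

In a scheduled job, the script would exit with a non-zero status when `check_feed` returns `False`, which is what makes the workflow run show up as failed.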
Any feedback on the tool is welcome - just open an issue in the repo. At some point I might consider extending this API with a serverless component that enables passing through query parameters (e.g., if you are interested in just ML papers), but that’s for another blog post.