My commute these days is long. I mean 1.5 - 2 hours each way long. Yeah, I know.
Since it's that long, I wanted to look into what content I could bring along for the journey each day. I checked out podcasts, and those are cool. Audible was fine, but I didn't want to pay (yeah, yeah, I'm cheap). And coding in the car produces some stomach-churning results. So instead, I usually end up reading blogs or articles on some of my favorite sites. But just as I can preload podcasts, I wanted to do the same with my reading.
Now I know Pocket and Instapaper exist, but the dev in me said, why don’t you do something about it?
Pythoning my way
Python has been on the back burner of my mind since college. I fooled around with it a bit for a little analytics project back then, but not much since. So when I decided I wanted to make a site scraper, I figured this was my opportunity to mess with it again.
What makes my project trickier than just hitting an RSS feed is that certain sites (e.g. relevantmagazine.com) don't provide the actual content in their feeds, just info about and a link to each article. So the goal was to take a feed, use it to grab the articles themselves, and scrape out the info I wanted.
What I Used
- Python
- Newspaper by codelucas
- FeedParser by kurtmckee
- BeautifulSoup
The code
Overall, the code is pretty short. That's because I found three open source packages that significantly reduced my effort and proved to be powerful when combined.
Feedparser is simple enough. It pulls the RSS feed from a URL you give it and lets you explore it. This is how I got the URLs for my articles.
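Just to show how little there is to it, grabbing the links out of a feed looks roughly like this (the feed URL is just a placeholder):

import feedparser

# Any RSS/Atom feed URL works here; this one is just a placeholder
feed = feedparser.parse("https://example.com/feed.rss")

print(feed.feed.title)
for entry in feed.entries:
    # Each entry carries metadata plus the link to the full article
    print(entry.title, entry.link)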
Next was Newspaper. This package specifically tries to be an open source Instapaper. It doesn't help with sites whose RSS feeds are contentless, so my project still had a reason to exist. However, it was fantastic at grabbing the full HTML of a page and parsing it into the data I wanted (author, title, text, and top image URL in this case).
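On its own, Newspaper's flow is roughly this (the article URL is made up for the example):

from newspaper import Article

# The URL here is made up; point it at any real article page
article = Article("https://example.com/some-article")
article.download()          # fetch the full HTML of the page
article.parse()             # extract the fields from that HTML

print(article.title)
print(article.authors)      # list of author names
print(article.top_image)    # URL of the lead image
print(article.text[:200])   # Newspaper's own plain-text extraction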
The last package, BeautifulSoup, was necessary because, as much as Newspaper has its own HTML parser for the body of an article, it wasn't always reliable. BeautifulSoup proved more powerful, and in a single line it parsed the articles wonderfully.
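That single line boils down to something like this, with html_string standing in for whatever article HTML you hand it:

from bs4 import BeautifulSoup

# Stand-in for the article HTML pulled out by Newspaper
html_string = "<div><p>Some article text.</p><p>More of it.</p></div>"

soup = BeautifulSoup(html_string, 'html.parser')
print(soup.get_text())      # just the text, tags stripped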
So with these three packages and some built-in Python, I was able to create a script that scrapes all the articles and outputs them as JSON, ready to be consumed in an app or inserted into a database.
import sys
import time
import pprint

import feedparser
from bs4 import BeautifulSoup
from lxml import etree as ElementTree
from newspaper import Article

# The feed URL comes in as the first command-line argument
feed = feedparser.parse(str(sys.argv[1]))
print(feed.feed.title)
print(repr(len(feed.entries)) + " articles compiled on " + time.strftime("%m/%d/%Y %H:%M:%S") + "\n\n")

for entry in feed.entries:
    # Fetch and parse the full article behind each feed entry
    article = Article(entry.link)
    article.download()
    article.parse()

    try:
        # clean_top_node is Newspaper's cleaned lxml node for the article body
        html_string = ElementTree.tostring(article.clean_top_node)
    except Exception:
        html_string = "Error converting html to string."

    # Hand that HTML to BeautifulSoup and pull out just the text
    soup = BeautifulSoup(html_string, 'html.parser')

    a = {
        'authors': str(', '.join(article.authors)),
        'title': article.title,
        'text': soup.get_text(),
        'top_image': article.top_image
    }

    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(a)
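If you want strict JSON instead of pretty-printed dicts (say, to pipe straight into an app or a database import), swapping pprint for the json module is a tiny change. Here's a rough sketch with made-up values:

import json

# Stand-in for the per-article dict built in the loop above
article_data = {
    'authors': 'Jane Doe',
    'title': 'An example title',
    'text': 'The article body as plain text.',
    'top_image': 'https://example.com/image.jpg'
}

# json.dumps gives real JSON (double quotes, proper escaping), not a Python repr
print(json.dumps(article_data, indent=4))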
To see the full code and clone it, check it out on GitHub here!