Getting started with RSS FeedParser
Welcome to the first post in the RSS FeedParser series. In this post we will be getting started with FeedParser. We will start by installing the FeedParser module, then look at how RSS works, and then parse the feed of a news website. For the parsing part, we will first inspect the structure of the feed and then get the title of the first article on the news website.
By the end of this post, you will have installed the FeedParser module on your computer, understood how FeedParser works, and used it to read the structure of a website's feed and pull some details from it.
Installing FeedParser
Before we look at what FeedParser is and how to get started with it, the first step is installing FeedParser on your computer. To install FeedParser, follow these steps.
- Open Terminal (if you are using a Mac) or Command Prompt (if you are using Windows).
- Make sure you have pip installed on your computer, since it is needed to install FeedParser.
- After checking the requirements, go ahead and run the following command: pip install feedparser
- After entering the command, pip will print its output.
Since I already have FeedParser installed on my computer, pip only reports that it is already installed. In your case, the output will show more details about the installation.
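If you want to confirm the installation from Python itself, here is a minimal sketch using only the standard library (Python 3.8 or newer). The helper name installed_version is my own and not part of FeedParser; it simply asks Python's package metadata whether a package is installed.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical helper used here to check for the feedparser package.
version = installed_version("feedparser")
if version is None:
    print("feedparser is not installed; run: pip install feedparser")
else:
    print("feedparser", version, "is installed")
```

If the script reports that feedparser is not installed, run the pip command again and check that pip belongs to the same Python you are running the script with.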
After installing FeedParser, it's time to check out what FeedParser is and how it works!
What is RSS FeedParser?
RSS (Rich Site Summary, also commonly expanded as Really Simple Syndication) is a format for delivering regularly changing web content. Many news-related sites, weblogs and other online publishers syndicate their content as an RSS feed to whoever wants it. That's where FeedParser comes in: it is a Python library for reading and processing these feeds, and we will use it to get blogs and articles from various websites.
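To make the format a bit more concrete, here is a rough sketch of what a tiny RSS 2.0 document looks like, read with Python's built-in XML parser rather than FeedParser. The feed content, the example.com links and the names SAMPLE_RSS and item_titles are made up for illustration; real feeds usually carry more fields per item.

```python
import xml.etree.ElementTree as ET

# A hand-written, made-up RSS 2.0 document: one channel holding two items.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(SAMPLE_RSS)
channel = root.find("channel")

# Each <item> is one post; collect the text of its <title> element.
item_titles = [item.findtext("title") for item in channel.findall("item")]

print("Feed:", channel.findtext("title"))
print("Posts:", item_titles)
```

FeedParser saves us from walking the XML by hand like this, and it also copes with the different feed formats that sites publish.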
In this series I will be showing you how to build a complete feed parser that gets blogs and articles from websites using Python.
At the end of the FeedParser series, you will be able to save the latest articles from bloggers on Medium into a TXT file.
Build a basic FeedParser
Before we go deeper into building a feed parser, let's make a simple one that reads a website's feed and gets data from it.
Get the Structure of the website
Before we get the articles from the website, we need a better view of how its feed is organized. We are going to do that by getting the structure of the feed.
import feedparser
url = "https://timesofindia.indiatimes.com/rssfeedstopstories.cms"
NewsFeed = feedparser.parse(url)
Here I have started by importing the feedparser library and declaring the URL of the feed I want to parse and get the latest articles from. After declaring the URL, we parse the feed using the feedparser.parse function.
entry = NewsFeed.entries[0]
print(entry.keys())
After parsing the feed, we take the first entry from NewsFeed.entries so that we can see the structure of an entry and decide which parts we want to get from the website.
We then print the entry's keys, which are shown as a list-like collection of the field names available on that entry.
Our code to get the structure of the website is complete. Let's go ahead and run the code.
On running the code, you will see that the structure of the website has been printed. In the output you can see the various fields that are present for an entry. Since this is a news website, an entry has a title, a summary, a publishing date and more.
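One thing to watch for: the code above indexes entries[0] directly, which raises an IndexError if the feed comes back with no entries (this can happen, for example, if the URL cannot be reached). Here is a small defensive sketch. The helper name first_entry_keys and the sample data are my own assumptions; sample_feed is a plain dictionary standing in for the result of feedparser.parse(url), which supports the same dictionary-style access.

```python
def first_entry_keys(feed):
    """Return the sorted field names of the first entry, or [] if there are no entries."""
    entries = feed.get("entries", [])
    if not entries:
        return []
    return sorted(entries[0].keys())


# Made-up stand-in for what feedparser.parse(url) would return.
sample_feed = {
    "entries": [
        {
            "title": "Example headline",
            "link": "https://example.com/first",
            "summary": "A short made-up summary.",
            "published": "Mon, 01 Jan 2024 09:00:00 GMT",
        }
    ]
}

print(first_entry_keys(sample_feed))      # ['link', 'published', 'summary', 'title']
print(first_entry_keys({"entries": []}))  # []
```

With the real library you would pass the parsed feed instead, for example first_entry_keys(NewsFeed).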
Get the Title of Posts
Now that we know the structure of the website, it's time to get the titles of the posts on the news website.
import feedparser
url = "https://timesofindia.indiatimes.com/rssfeedstopstories.cms"
NewsFeed = feedparser.parse(url)
entry = NewsFeed.entries[0]
print("Post Title: ", entry.title)
Here we start by importing the feedparser library and declaring the URL that we are going to parse.
After declaring the url variable, we pass it as a parameter to the feedparser.parse function. After parsing the feed, we get the first entry using NewsFeed.entries[0]. (We use index 0 since we want the first post on the website.)
As discussed, we want the title of the news article. To get it, we print the entry.title attribute, which holds the title of the first article on the website.
Our code is complete and good to go. Let's go ahead and run it.
On running the code, we can see that we have successfully fetched the title of the first post from the website.
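The example above prints only the first title. As a small extension, here is a sketch that collects the titles of the first few posts and falls back to a placeholder when an entry has no title. The helper name first_titles, the limit parameter, the placeholder text and the sample data are assumptions for illustration; sample_feed again stands in for the result of feedparser.parse(url).

```python
def first_titles(feed, limit=3):
    """Return the titles of up to `limit` entries, in the order the feed lists them."""
    entries = feed.get("entries", [])
    # entry.get avoids an error when an entry has no "title" field.
    return [entry.get("title", "(no title)") for entry in entries[:limit]]


# Made-up stand-in for what feedparser.parse(url) would return.
sample_feed = {
    "entries": [
        {"title": "First headline"},
        {"summary": "An entry without a title"},
        {"title": "Third headline"},
        {"title": "Fourth headline"},
    ]
}

for number, title in enumerate(first_titles(sample_feed), start=1):
    print("Post", number, "Title:", title)
```

With the real library you would call first_titles(NewsFeed) after parsing the feed.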
That is all for this post. I hope you have understood how to install FeedParser and get basic details out of a website using RSS, FeedParser and Python. In the next part of the RSS FeedParser series, we will be building a feed parser for Medium and getting the blog posts of a Medium blogger using Python and the FeedParser module.
Feel free to reach out to me if you have any issues/feedback at aryanirani123@gmail.com