The Python html.parser module provides the HTMLParser class, which can be subclassed to parse HTML-formatted text. The same logic can easily be adapted to process the HTML returned by an HTTP request, for example one made with an HTTP client.
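As a rough sketch of the HTTP case: a response body is essentially a file-like object of bytes, so it can be decoded and passed to feed() chunk by chunk. In the example below, io.BytesIO stands in for a real response (for instance the object returned by urllib.request.urlopen), and the TitleParser class and parse_stream helper are hypothetical names made up for this illustration, not part of the library.

```python
import codecs
import io
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_title:
            self.title += data


def parse_stream(response, parser, encoding="utf-8", chunk_size=16):
    """Hypothetical helper: feed a file-like object of bytes in chunks."""
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        parser.feed(decoder.decode(chunk))
    parser.feed(decoder.decode(b"", final=True))
    parser.close()  # flush any buffered text


# Assumption: io.BytesIO stands in for an HTTP response body.
fake_response = io.BytesIO(
    b"<html><head><title>Example page</title></head></html>")
title_parser = TitleParser()
parse_stream(fake_response, title_parser)
print(title_parser.title)
```

Because feed() can be called repeatedly, the whole document never has to be held in memory at once; close() tells the parser that no more data is coming.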
The class definition for HTMLParser looks like:
class html.parser.HTMLParser(*, convert_charrefs=True)
In this lesson, we will subclass HTMLParser to observe how its handler methods behave and experiment with them. Let’s get started.
Python HTML Parser
As the class definition shows, HTMLParser accepts a convert_charrefs argument. When it is True (the default), all character references (except the ones in script/style elements) are automatically converted to the corresponding Unicode characters.
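To make the effect of convert_charrefs concrete, here is a minimal sketch. The DataCollector class is a hypothetical helper written for this illustration, not part of html.parser; it only records what handle_data receives.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper that records every chunk passed to handle_data."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = DataCollector()  # convert_charrefs=True by default
collector.feed("<p>Fish &amp; chips</p>"
               "<script>var s = 'a &amp; b';</script>")
collector.close()
for chunk in collector.chunks:
    print(repr(chunk))
```

With the default setting, the paragraph text should arrive with &amp; already converted to &, while the script body keeps its literal &amp;.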
The handler methods of this class (covered in the next section) are called automatically whenever the parser instance encounters start tags, end tags, text, comments, and other markup elements in the HTML string passed to it.
To use this class, we subclass it and override its handler methods to provide our own behaviour. Before we look at an example, here are the methods that are available for customisation:
handle_startendtag: This method is called for XHTML-style empty tags such as <img ... />. By default it simply passes control to the start-tag and end-tag handlers, as its definition shows:
def handle_startendtag(self, tag, attrs):
    self.handle_starttag(tag, attrs)
    self.handle_endtag(tag)
handle_starttag: This method is called when the parser encounters a start tag. The attrs argument is a list of (name, value) pairs:
def handle_starttag(self, tag, attrs):
    pass
handle_endtag: This method is called when the parser encounters an end tag in the HTML string:
def handle_endtag(self, tag):
    pass
handle_charref: This method handles numeric character references of the form &#NNN; and &#xNNN;. It is never called when convert_charrefs is True. Its definition is:
def handle_charref(self, name):
    pass
handle_entityref: This method handles named character references of the form &name; (for example &gt;). It is never called when convert_charrefs is True. Its definition is:
def handle_entityref(self, name):
    pass
handle_data: This method receives the text content of the HTML string and is one of the most important methods in this class. Its definition is:
def handle_data(self, data):
    pass
handle_comment: This method is called for comments in the HTML. Its definition is:
def handle_comment(self, data):
    pass
handle_pi: This method is called for processing instructions in the HTML. Its definition is:
def handle_pi(self, data):
    pass
handle_decl: This method is called for declarations in the HTML, such as the doctype declaration. Its definition is:
def handle_decl(self, decl):
    pass
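For a quick feel of some of the less common handlers in this list, the sketch below records which handler fires for a declaration, a processing instruction and a self-closing tag. The EventLogger class and the sample markup are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records which handler fired, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Overrides the default, which would call handle_starttag
        # followed by handle_endtag.
        self.events.append(("startend", tag))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))


logger = EventLogger()
logger.feed('<!DOCTYPE html><?proc color="red"><p>line<br/>break</p>')
logger.close()
for event in logger.events:
    print(event)
```

Since handle_startendtag is overridden here, the <br/> tag is reported once as a "startend" event instead of as a start tag followed by an end tag.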
Let’s now write a subclass of HTMLParser to see some of these methods in action.
Making a sub-class for HTMLParser
In this example, we will create a subclass of HTMLParser and see how its most common handler methods are called. Here is a sample program which subclasses the HTMLParser class:
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Found a start tag:", tag)

    def handle_endtag(self, tag):
        print("Found an end tag :", tag)

    def handle_data(self, data):
        print("Found some data :", data)


parser = MyHTMLParser()
parser.feed('<title>JournalDev HTMLParser</title>'
            '<h1>Python html.parser module</h1>')
Let’s see the output for this program:
Found a start tag: title
Found some data : JournalDev HTMLParser
Found an end tag : title
Found a start tag: h1
Found some data : Python html.parser module
Found an end tag : h1
The three handler methods shown above are the ones most commonly customised, but they are not the only methods that can be overridden. The next example covers more of the overridable methods.
Overriding HTMLParser methods
In this example, we will override most of the handler methods of the HTMLParser class. Let’s look at a code snippet of the class:
from html.parser import HTMLParser
from html.entities import name2codepoint


class JDParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)

    def handle_endtag(self, tag):
        print("End tag :", tag)

    def handle_data(self, data):
        print("Data :", data)

    def handle_comment(self, data):
        print("Comment :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent :", c)

    def handle_decl(self, data):
        print("Decl :", data)


parser = JDParser()
We will now use this class to parse various pieces of HTML. Let’s begin with a doctype declaration:
parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
            '"https://www.w3.org/TR/html4/strict.dtd">')
Let’s see the output for this program:
Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "https://www.w3.org/TR/html4/strict.dtd"
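Let’s look at a code snippet which passes an img tag:
parser.feed('<img src="https://cdn.journaldev.com/wp-content/uploads/2014/05/Final-JD-Logo.png" alt="The Python logo">')
Let’s see the output for this program:
Start tag: img
 attr: ('src', 'https://cdn.journaldev.com/wp-content/uploads/2014/05/Final-JD-Logo.png')
 attr: ('alt', 'The Python logo')
Notice how the tag name was separated from its attributes, and each attribute was extracted as a (name, value) pair.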
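Since attributes arrive as a list of (name, value) pairs, a common pattern is to convert them to a dict and look up the ones of interest. The sketch below collects image sources that way; the ImageCollector class and the sample file names are hypothetical.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper that gathers the src and alt of every <img>."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are passed in lower case.
        if tag == "img":
            attributes = dict(attrs)  # attrs is a list of (name, value) pairs
            self.images.append(
                (attributes.get("src"), attributes.get("alt")))


images = ImageCollector()
images.feed('<p><IMG SRC="logo.png" alt="Site logo">'
            '<img src="photo.jpg"></p>')
images.close()
print(images.images)
```

Using dict.get means a missing attribute, such as the absent alt on the second image, shows up as None instead of raising an error.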
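Let’s try the script/style tags as well, whose contents are not parsed as markup:
parser.feed('<script type="text/javascript">'
            'alert("<strong>JournalDev Python</strong>");</script>')
parser.feed('<style type="text/css">#python { color: green }</style>')
Let’s see the output for this program:
Start tag: script
 attr: ('type', 'text/javascript')
Data : alert("<strong>JournalDev Python</strong>");
End tag : script
Start tag: style
 attr: ('type', 'text/css')
Data : #python { color: green }
End tag : style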
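Because script and style bodies are delivered to handle_data as plain text, a parser that only wants the visible text has to skip them itself. Here is one possible sketch; the TextExtractor class is a hypothetical helper, and tracking a skip counter is just one way to do it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps visible text, drops script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore text inside script/style and whitespace-only chunks.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


extractor = TextExtractor()
extractor.feed('<h1>Title</h1><script>alert("<b>hi</b>");</script>'
               '<style>h1 { color: green }</style><p>Body text</p>')
extractor.close()
print(" ".join(extractor.parts))
```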
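Comments can be parsed with this instance as well:
parser.feed('<!-- This marks the beginning of samples. -->'
            '<!--[if IE 9]>IE-specific content<![endif]-->')
Conditional comments, which older versions of Internet Explorer understood, are reported like any other comment, so the same handler can be used to detect whether a page carries IE-specific content:
Comment :  This marks the beginning of samples. 
Comment : [if IE 9]>IE-specific content<![endif]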
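As a small sketch of how the comment handler could separate conditional comments from ordinary ones, consider the hypothetical CommentCollector class below. Treating any comment that starts with "[if " as conditional is a simplifying assumption for illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper that sorts comments into two lists."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.conditional = []

    def handle_comment(self, data):
        # data is the text between '<!--' and '-->'.
        # Assumption: conditional comments start with '[if '.
        if data.startswith("[if "):
            self.conditional.append(data)
        else:
            self.comments.append(data.strip())


comments = CommentCollector()
comments.feed('<!-- This marks the beginning of samples. -->'
              '<!--[if IE 9]>IE-specific content<![endif]-->')
comments.close()
print("Ordinary   :", comments.comments)
print("Conditional:", comments.conditional)
```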
Parsing Named and Numeric references
Here is a sample program that parses named and numeric character references and converts them to the corresponding character at runtime (all three references below stand for the > character):
parser.feed('&gt;&#62;&#x3E;')
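Let’s see the output for this program:
Data : >>>
Since convert_charrefs is True by default, character references are converted before they reach handle_data. The handle_entityref and handle_charref methods are only called when the parser is created with convert_charrefs=False.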
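To see the two reference handlers fire, the parser has to be created with convert_charrefs=False. The following sketch does that with a hypothetical ReferenceParser class; the fallback for unknown entity names is an illustrative choice, not library behaviour.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class ReferenceParser(HTMLParser):
    """Hypothetical helper that decodes references in its own handlers."""

    def __init__(self):
        # convert_charrefs=False is what makes the two handlers below fire.
        super().__init__(convert_charrefs=False)
        self.decoded = []

    def handle_entityref(self, name):
        # .get avoids a KeyError for an unknown entity name.
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            self.decoded.append("&" + name + ";")
        else:
            self.decoded.append(chr(codepoint))

    def handle_charref(self, name):
        # name is e.g. '62' (decimal) or 'x3E' (hexadecimal).
        if name[0] in "xX":
            self.decoded.append(chr(int(name[1:], 16)))
        else:
            self.decoded.append(chr(int(name)))


refs = ReferenceParser()
refs.feed("&gt;&#62;&#x3E;")
refs.close()
print(refs.decoded)
```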
Parsing Invalid HTML
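To an extent, we can also pass invalid HTML to the feed function. Here is a sample program in which the h1 and a elements are closed in the wrong order:
parser.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')
Let’s see the output for this program:
Start tag: h1
Start tag: a
 attr: ('class', 'link')
 attr: ('href', '#main')
Data : Invalid HTML
End tag : h1
End tag : a
The parser does not complain about the wrong nesting; it simply reports the tags in the order in which they appear.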
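Since the parser reports tags without validating them, any nesting check has to live in the subclass. A minimal sketch, assuming a simple stack of open tags is enough (the NestingChecker class is hypothetical, and void elements such as br are not handled):

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper that reports end tags closing the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []  # (closed tag, tag that was expected)

    def handle_starttag(self, tag, attrs):
        # Simplification: every start tag is assumed to need an end tag.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        expected = self.open_tags[-1] if self.open_tags else None
        self.problems.append((tag, expected))
        if tag in self.open_tags:
            # Drop the most recent matching open tag so later checks
            # stay meaningful.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]


checker = NestingChecker()
checker.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')
checker.close()
print(checker.problems)
```

Each entry in problems pairs the end tag that was found with the tag that was expected to close first.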
That’s all for parsing HTML data in Python using the html.parser module.
Reference: API Doc