Python HTML Parser


Python's html.parser module provides the HTMLParser class, which can be subclassed to parse HTML-formatted text. The same logic can also process HTML received from an HTTP request, for example one made with an HTTP client.
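As a rough sketch of the HTTP case, the snippet below parses the title out of a response body. It assumes that "HTTP Client" refers to the standard library's http.client module; TitleParser, title_from_bytes and fetch_title are hypothetical names introduced only for this illustration.

```python
import http.client
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the pieces.
        if self.in_title:
            self.title_parts.append(data)


def title_from_bytes(body, charset="utf-8"):
    """Decode a response body and return the page title."""
    parser = TitleParser()
    parser.feed(body.decode(charset, errors="replace"))
    parser.close()
    return "".join(parser.title_parts).strip()


def fetch_title(host, path="/"):
    """Fetch a page over HTTPS and return its title (needs network access)."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        charset = response.headers.get_content_charset() or "utf-8"
        return title_from_bytes(response.read(), charset)
    finally:
        conn.close()


# Offline demonstration with a literal body instead of a real request.
print(title_from_bytes(b"<html><head><title>Example page</title></head></html>"))
```

Keeping the parsing step separate from the network call means the parser can be exercised on any bytes, whether they come from a file, a test fixture, or a live response.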

The class definition for HTMLParser looks like:


class html.parser.HTMLParser(*, convert_charrefs=True)

In this lesson, we will subclass the HTMLParser class to observe how its handler methods behave and experiment with them. Let’s get started.

Python HTML Parser

As we saw in the class definition of HTMLParser, when the value for convert_charrefs is True, all of the character references (except the ones in script/style elements) are converted to the respective Unicode characters.
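The exception for script/style elements can be seen in a small experiment. The following is only a sketch; the TextByTag class and the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Hypothetical helper: groups text by the tag it appears in."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.current = None
        self.text = {}

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        # Append, because text may arrive in more than one chunk.
        self.text[self.current] = self.text.get(self.current, "") + data


parser = TextByTag()
parser.feed('<p>fish &amp; chips</p><script>var s = "&amp;";</script>')
parser.close()

print(parser.text["p"])       # the reference is converted to "&"
print(parser.text["script"])  # the reference is left as written
```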

The handler methods of this class (which we will see in the next section) are called automatically when an instance of the class encounters start tags, end tags, text, comments, and other markup elements in the HTML string passed to it.

When we want to use this class, we should subclass it to provide our own functionality. Before we present an example, let us list the functions of the class that are available for customisation. Here they are:

  • handle_startendtag: This function is called for XHTML-style empty tags such as <img ... />. By default it simply passes control to the start tag and end tag handlers, as its definition shows:
    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        self.handle_endtag(tag)
    
  • handle_starttag: This function is called when a start tag is encountered. The attrs argument is a list of (name, value) pairs:
    def handle_starttag(self, tag, attrs):
        pass
    
  • handle_endtag: This function is called when an end tag is encountered in the HTML string:
    def handle_endtag(self, tag):
        pass
    
  • handle_charref: This function handles numeric character references in the string passed to the parser. It is never called when convert_charrefs is True. Its definition is given as:
    def handle_charref(self, name):
        pass
    
  • handle_entityref: This function handles named character references (entity references). It is never called when convert_charrefs is True. Its definition is given as:
    def handle_entityref(self, name):
        pass
    
  • handle_data: This function receives the text data in the HTML string and is one of the most important functions in this class. Its definition is given as:
    def handle_data(self, data):
        pass
    
  • handle_comment: This function manages the comments in the HTML. Its definition is given as:
    def handle_comment(self, data):
        pass
    
  • handle_pi: This function manages the processing instructions in the HTML. Its definition is given as:
    def handle_pi(self, data):
        pass
    
  • handle_decl: This function manages the declarations (such as the doctype) in the HTML. Its definition is given as:
    def handle_decl(self, decl):
        pass
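To make the default behaviour of handle_startendtag concrete, here is a minimal sketch. EventRecorder is a hypothetical class that only overrides the start and end tag handlers, so a self-closing tag reaches them through the default handle_startendtag shown above.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records the order of tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


recorder = EventRecorder()
recorder.feed("<br/><br>")
recorder.close()

# "<br/>" yields a start and an end event; "<br>" yields only a start event.
print(recorder.events)
```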
    

Let’s get started by providing a sub-class of HTMLParser to see some of these functions in action.

Making a sub-class for HTMLParser

In this example, we will create a subclass of HTMLParser and see how the most common handler methods of this class are called. Here is a sample program which subclasses the HTMLParser class:


from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Found a start tag:", tag)

    def handle_endtag(self, tag):
        print("Found an end tag :", tag)

    def handle_data(self, data):
        print("Found some data  :", data)

parser = MyHTMLParser()
parser.feed('<title>JournalDev HTMLParser</title>'
            '<h1>Python html.parser module</h1>')

Let’s see the output for this program:

[Figure: Subclassing HTMLParser class (program output)]


The three handler functions shown above are the ones most commonly customised, but they are not the only functions which can be overridden. In the next example, we will cover most of the overridable functions.

Overriding HTMLParser methods

In this example, we will override most of the handler functions of the HTMLParser class. Let’s look at a code snippet of the class:


from html.parser import HTMLParser
from html.entities import name2codepoint

class JDParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = JDParser()

We will now use this class to parse various parts of an HTML document. We begin with a doctype string:


parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
             '"http://www.w3.org/TR/html4/strict.dtd">')

Let’s see the output for this program:

[Figure: HTMLParser Doctype Parsing (program output)]
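As a smaller illustration of what handle_decl receives, the sketch below records declarations in a list instead of printing them. DeclRecorder and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class DeclRecorder(HTMLParser):
    """Hypothetical helper: stores every declaration it sees."""

    def __init__(self):
        super().__init__()
        self.declarations = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        self.declarations.append(decl)


decl_parser = DeclRecorder()
decl_parser.feed("<!DOCTYPE html><p>Hello</p>")
decl_parser.close()

print(decl_parser.declarations)
```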

Let’s look at a code snippet which passes an img tag:


parser.feed('<img src="https://cdn.journaldev.com/wp-content/uploads/2014/05/Final-JD-Logo.png" alt="The Python logo">')

Let’s see the output for this program:

[Figure: HTML parser output for the img tag]

Notice how the tag was broken down and its attributes were extracted as well.
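Because attrs arrives as a list of (name, value) pairs, it converts directly into a dictionary. The sketch below assumes we only care about img tags; ImageCollector and the file names are invented for the example.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper: gathers the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) tuples.
            self.images.append(dict(attrs))


collector = ImageCollector()
collector.feed('<img src="logo.png" alt="The Python logo"><img src="icon.png" hidden>')
collector.close()

for image in collector.images:
    print(image)
```

An attribute written without a value, such as hidden above, is reported with the value None.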

Let’s try the script and style tags as well, whose contents are not parsed as markup:


parser.feed('<script type="text/javascript">'
             'alert("<strong>JournalDev Python</strong>");</script>')
parser.feed('<style type="text/css">#python { color: green }</style>')

Let’s see the output for this program:

[Figure: HTML parser output for the style and script tags]
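The same behaviour can be checked without relying on printed output. In this sketch, EventLog is a hypothetical class that records start tags and accumulates text, so we can confirm that the strong tag inside the script never triggers a start tag event.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: logs start tags and collects text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = ""

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text += data


log = EventLog()
log.feed('<script>alert("<strong>hi</strong>");</script>')
log.close()

print(log.start_tags)  # only the script tag itself is reported
print(log.text)        # the markup inside the script arrives as plain text
```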

Parsing comments is also possible with this instance:


parser.feed('<!-- This marks the beginning of samples. -->'
            '<!--[if IE 9]>IE-specific content<![endif]-->')

With this method, we can also pick up IE conditional comments and check whether a webpage contains IE-specific content:

[Figure: Parsing Comments (program output)]
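For instance, conditional comments could be filtered out of all comments roughly as follows. This is a sketch only; CommentCollector is a hypothetical class, and the startswith("[if") test is a simple heuristic chosen for the example.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: stores the body of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->".
        self.comments.append(data)


comment_parser = CommentCollector()
comment_parser.feed('<!-- This marks the beginning of samples. -->'
                    '<!--[if IE 9]>IE-specific content<![endif]-->')
comment_parser.close()

# Heuristic: conditional comments start with "[if".
conditional = [c for c in comment_parser.comments if c.startswith("[if")]
print(conditional)
```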

Parsing Named and Numeric references

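Here is a sample program that parses named and numeric character references and converts them to the correct character at runtime. Note that handle_entityref and handle_charref are called only when the parser is created with convert_charrefs=False; with the default value of True, the converted characters are delivered through handle_data instead: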


parser.feed('&gt;&#62;&#x3E;')

Let’s see the output for this program:

[Figure: Parsing Character references (program output)]
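The two modes of convert_charrefs can be compared side by side. The sketch below uses a hypothetical RefRecorder class and feeds three references that all stand for the ">" character.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Hypothetical helper: records decoded references and plain text."""

    def __init__(self, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.refs = []
        self.text = ""

    def handle_entityref(self, name):
        # Named reference such as "gt" (only with convert_charrefs=False).
        self.refs.append(chr(name2codepoint[name]))

    def handle_charref(self, name):
        # Numeric reference such as "62" or "x3E" (only with convert_charrefs=False).
        if name.lower().startswith("x"):
            self.refs.append(chr(int(name[1:], 16)))
        else:
            self.refs.append(chr(int(name)))

    def handle_data(self, data):
        self.text += data


manual = RefRecorder(convert_charrefs=False)
manual.feed("&gt;&#62;&#x3E;")
manual.close()

automatic = RefRecorder()  # default: convert_charrefs=True
automatic.feed("&gt;&#62;&#x3E;")
automatic.close()

print(manual.refs, repr(manual.text))
print(automatic.refs, repr(automatic.text))
```

With convert_charrefs=False the three references reach the dedicated handlers; with the default setting they arrive already converted as ordinary text.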

Parsing Invalid HTML

To an extent, we can also pass invalid HTML to the feed function. Here is a sample program in which the closing tags of the heading and the anchor are in the wrong order:


parser.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')

Let’s see the output for this program:

[Figure: Parsing Invalid HTML (program output)]
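HTMLParser reports the tags in the order it meets them and does not repair the nesting, so any validation is left to the subclass. The following is a deliberately naive sketch of such a check; NestingChecker is a hypothetical class, and it ignores void elements such as br and img.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper: flags end tags that do not match the open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # The end tag does not close the most recently opened element.
            self.problems.append(tag)


checker = NestingChecker()
checker.feed('<h1><a class="link" href="#main">Invalid HTML</h1></a>')
checker.close()

print("Mismatched end tags:", checker.problems)
print("Still open:", checker.open_tags)
```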

That’s all for parsing HTML data in Python using the html.parser module.

Reference: API Doc
