The goal of this session is to build a program that is able to analyse a website and output what it believes the website is about.
Installation & Setup
As always, we're using Python 3. If you have not already downloaded and installed Python 3, please let one of the committee know - you will need it for future sessions, so it's best we sort out any installation problems now.
For this session, we will also need to grab a few libraries. For this, we are going to need to open a Command Prompt or Terminal (depending on your Operating System) and run the following commands individually:
pip install nltk
pip install beautifulsoup4
pip install html5lib
Again, if any of these commands do not work please let the committee know now so that we can resolve these issues. We're going to be using pip installs in the majority of sessions as it is an incredibly simple and useful tool so we need to ensure everyone has it up and working.
Step by Step guide
Once you've installed the necessary libraries (see the Installation & Setup section), it's time to start programming. First, open up a Python IDE of your choice and create a new Python file.
At the top of the page, we need to import all of the libraries that we just installed into our file so that we can use them later on. Go ahead and copy the following at the top of the file:
import urllib.request
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import nltk
import operator
Downloading the web page
With the libraries imported, we need to start using them. The first task is to take a page from the internet and turn it into a format that our code can read.
To do this, we're going to use URLLib. URLLib is a built-in Python library that has a range of different functions related to accessing internet pages - one of which being downloading HTML pages. For those of you unfamiliar, most internet pages use a "markup language" called HTML (or HyperText Markup Language) to determine the contents of a web page. It works on a system of tags, such as <p> and </p> to mark where a paragraph starts and ends.
Copy and paste the following code into your program underneath the import statements. Run it.
# Get HTML document from URL
# returns HTML file string
def getPage(url: str):
    response = urllib.request.urlopen(url)
    return response.read()

print(getPage('https://en.wikipedia.org/wiki/Tesla,_Inc.'))
From the output, you can see a rough outline of the page. This isn't in a particularly useful format as it still has all of the tags and other "extra" code in it - we're only interested in the actual text content of the page.
To clean up this output, we're going to use the library "Beautiful Soup". The idea is that Beautiful Soup will take the HTML file and take out all of the tags and code, returning a "Beautiful Soup" of text.
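To get a feel for what this tag-stripping does before we use Beautiful Soup, here is a minimal sketch using only Python's built-in html.parser module. The HTML string is a made-up example - Beautiful Soup's get_text() does the same job far more robustly on real pages:

```python
from html.parser import HTMLParser

# Minimal illustration of tag-stripping using only the standard library.
# We collect the text between tags and discard the tags themselves.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags, never for the tags themselves
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed('<html><body><h1>Tesla</h1><p>An electric car maker.</p></body></html>')
print(' '.join(extractor.parts))  # → Tesla An electric car maker.
```

Real pages also contain scripts, styles and comments, which is exactly why we reach for a proper library next.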
First, remove the line
print(getPage('https://en.wikipedia.org/wiki/Tesla,_Inc.')) from your code. We don't need that any more.
Next, copy and paste the following code into your program underneath the code from before. Make sure to leave a couple of lines between your old code and this new code. Run it.
# Cleanup HTML to just get the text
# pip install beautifulsoup4
# pip install html5lib
def toSoup(html):
    soup = BeautifulSoup(html, 'html5lib')
    return soup.get_text(strip=True)

print(toSoup(getPage('https://en.wikipedia.org/wiki/Tesla,_Inc.')))
Looking at the output, we can see that most of the HTML has been cleaned up. There's still some extra code in there but it's good enough for today. In reality, you should always get your input data as clean as possible when doing any form of machine learning or language processing.
One last thing we should do is convert this wall of text into an even more usable format. We can do this by breaking the output into individual words - or "tokens" - and storing those in a list.
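As a quick illustration of what this tokenisation step does (the sentence below is a made-up example, not real page output):

```python
# Splitting on whitespace turns a wall of text into a list of tokens
text = "Tesla is an electric vehicle company"  # example string only
tokens = [t for t in text.split()]
print(tokens)  # → ['Tesla', 'is', 'an', 'electric', 'vehicle', 'company']
```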
First, delete the line
print(toSoup(getPage('https://en.wikipedia.org/wiki/Tesla,_Inc.'))) from your code.
Then, replace it with the code below. Run it.
text = toSoup(getPage('https://en.wikipedia.org/wiki/Tesla,_Inc.'))
tokens = [t for t in text.split()]
print(tokens)
This may look worse to you for now, but bear with me. We now need to write a function that counts how many times each of those tokens occurs in the list.
It's also helpful to remove 'stopwords' from the list. Stopwords are words like 'and', 'the' etc.
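To see the counting-plus-stopwords idea on its own, here is a small sketch. It uses Python's built-in collections.Counter and a tiny hand-written stopword set rather than NLTK's FreqDist and full English stopword list, which is what the real program below uses:

```python
from collections import Counter

# Toy token list and a tiny hand-written stopword set - the real program
# uses NLTK's full English stopword list instead
tokens = ['the', 'car', 'and', 'the', 'battery', 'and', 'car', 'car']
stop = {'and', 'the'}

# Drop stopwords, then count how often each remaining token occurs
clean = [t for t in tokens if t not in stop]
counts = Counter(clean)
print(counts.most_common(2))  # → [('car', 3), ('battery', 1)]
```

With the stopwords removed, the most frequent remaining token is a much better clue to what the text is about.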
Remove the code that you just inserted and replace it with the following to complete the program. Run it.
# Count the frequency of tokens within the page
# Returns first X most frequent tokens
# pip install nltk
def countWords(tokens, num_to_get):
    # Remove 'stopwords' from the page - words such as 'and', 'the', etc.
    # (You may need to run nltk.download('stopwords') once first)
    sr = stopwords.words('english')
    clean_tokens = tokens[:]
    for token in tokens:
        if token in sr:
            clean_tokens.remove(token)

    # Use NLTK to return a dictionary of each token and its number of occurrences
    freq = nltk.FreqDist(clean_tokens)
    items = freq.items()

    # Sort the dictionary by the number of occurrences, in descending order
    sorted_items = sorted(items, key=operator.itemgetter(1))
    sorted_items.reverse()

    # Add the X most common words to a list
    ret = []
    sorted_list = list(sorted_items)
    for i in range(num_to_get):
        ret.append(sorted_list[i])

    # Return that list
    return ret

if __name__ == '__main__':
    # Get the HTML page, with HTML included
    html = getPage('https://en.wikipedia.org/wiki/Tesla,_Inc.')

    # Cleanup HTML to just get the text
    text = toSoup(html)

    # Convert text into tokens
    tokens = [t for t in text.split()]

    # Count the frequency of words
    words = countWords(tokens, 10)
    for key, val in words:
        print(str(key) + ': ' + str(val))

    # Print statement about what the page is about
    print("\nI'm thinking that this page is about " + str(words[0][0]) + ".")
Ta da! The program now outputs the 10 most frequent tokens on the page along with how many times they occur. It then takes the token that occurs most frequently and declares that it believes the page is about that token.
Try playing around with different web pages or check out the challenges below.
If you want to download my code in full, you can go over to https://github.com/PortAISociety/nlp-semantic-analysis.
Feeling confident? Feel free to tweak this code or try it on a different webpage. Here are some ideas to get you started:
- Tokens are case-sensitive at the moment. Does changing them to not be case-sensitive make a difference? (hint: convert all of the tokens into lower-case)
- There are still some extra tokens in the dataset that shouldn't be there - can you remove those? (hint: try using a different parser, or write your own parser algorithm)
- Is there a way to quantify how confident our program is that the outputted value is correct?
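For the first challenge, a possible starting point might look like this sketch (one way of doing it, not the only way):

```python
# Case-sensitive counting treats 'Tesla' and 'tesla' as different tokens;
# lower-casing every token first merges them into a single count
tokens = ['Tesla', 'tesla', 'TESLA', 'car']
lowered = [t.lower() for t in tokens]
print(lowered)  # → ['tesla', 'tesla', 'tesla', 'car']
```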
If you do any of the above challenges let us know. We want to see them done!