Making a Chatbot

You will need the following txt file for this tutorial. Please download it now by clicking here.

Objective

The goal of this session is to build a chatbot: we will take an input from the user and have the bot return relevant information when given a cue.

Installation & Setup

As always, we're using Python 3. If you have not already downloaded and installed Python 3, please let one of the committee know - you will need it for future sessions, so it's best we sort out any installation problems now.

For this session, we will also need to grab a few libraries. To install them, open a Command Prompt or Terminal (depending on your operating system) and run the following commands one at a time:

pip install nltk
pip install numpy
pip install scikit-learn

Again, if any of these commands do not work, please let the committee know now so that we can resolve the issues. We're going to be using pip in the majority of sessions, as it is an incredibly simple and useful tool, so we need to ensure everyone has it up and working.
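
If you want to double-check that everything installed correctly, one quick (and entirely optional) way is to run a tiny Python script like the one below; if it prints three version numbers without errors, you are good to go.

# optional sanity check - confirms the three libraries can be imported
import nltk
import numpy
import sklearn

print("nltk:", nltk.__version__)
print("numpy:", numpy.__version__)
print("scikit-learn:", sklearn.__version__)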

Guide to Making the Chatbot

Importing libraries

Once you've installed the necessary libraries (see the Installation & Setup section), it's time to start programming. First, open up a Python IDE of your choice and create a new Python file.

At the top of the page, we need to import all of the libraries that we just installed into our file so that we can use them later on. Go ahead and copy the following at the top of the file:


import io # to read / write to file system
import random
import string # to process standard python strings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore') # ignore unnecessary warnings
import nltk
from nltk.stem import WordNetLemmatizer
            

Natural Language Toolkit Setup

Now that we have imported the required libraries, we need to use nltk to download some datasets that we will be using later.

Copy and paste the following code into your program underneath the import statements.


nltk.download('popular', quiet=True) # for downloading packages

# the 2 lines below only need to be run once, comment them out after this
nltk.download('punkt')
nltk.download('wordnet')
            

Reading in the Data

Now that we have imported everything we need, it's time to start creating our chatbot!

The first thing we need to do is read in our data. For this, we will be using a plain-text version of the Tesla Wikipedia page, which can be downloaded from the link at the top of the page.

Next, copy and paste the following code into your program underneath the code from before. Make sure to leave a couple lines between your old code and this new code.


#Reading in the text
with open('chatbot.txt','r', encoding='utf8', errors ='ignore') as fin:
    raw = fin.read().lower()

            

We now have a variable, raw, which contains the full text of chatbot.txt. So what can we do with it now?

Well, first we need to tokenize - that is, split up - the file into sentences and words. These tokens will be used later when we build our model.

Next, copy and paste the following code into your program underneath the code from before. Make sure to leave a couple lines between your old code and this new code.


#Tokenisation
sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences 
word_tokens = nltk.word_tokenize(raw)# converts to list of words
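
If you want to reassure yourself that the tokenisation worked, you can temporarily add a quick print like the one below (the exact counts will depend on the text file you downloaded):

print(len(sent_tokens), "sentences,", len(word_tokens), "words")
print(sent_tokens[:2])  # peek at the first two sentences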
            

Pre-processing the data

If you recall from last week, we covered a lot of the concepts used within NLP. One of those was lemmatization, which we will be using to reduce the words in our input data to their base (dictionary) forms.

For example, given the following words, lemmatization can be used to infer that (there is a quick standalone demo after this list):

  • rocks becomes rock
  • corpora becomes corpus
  • better becomes good (when treated as an adjective)
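
If you'd like to see this for yourself, here is a quick standalone demo of nltk's WordNetLemmatizer. It assumes the wordnet data from the download step above; note that mapping "better" to "good" needs the part-of-speech hint pos='a', because the lemmatizer treats words as nouns by default.

from nltk.stem import WordNetLemmatizer

lemmer = WordNetLemmatizer()
print(lemmer.lemmatize("rocks"))            # rock
print(lemmer.lemmatize("corpora"))          # corpus
print(lemmer.lemmatize("better", pos="a"))  # good - 'a' tells it the word is an adjective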

Copy and paste the following code into your program underneath the code from before. Make sure to leave a couple lines between your old code and this new code.


# Preprocessing
lemmer = WordNetLemmatizer()

def LemTokens(tokens):
    # lemmatize every token in a list of tokens
    return [lemmer.lemmatize(token) for token in tokens]

# maps every punctuation character to None, so str.translate() strips it out
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # lower-case, strip punctuation, tokenize and lemmatize a piece of text
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
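
To get a feel for what LemNormalize does, you could try it on a short sentence of your own. Assuming the WordNet data downloaded earlier, the output should look roughly like the comment below (exact results can vary slightly between NLTK versions).

print(LemNormalize("The rocks are heavier than the rock!"))
# e.g. ['the', 'rock', 'are', 'heavier', 'than', 'the', 'rock']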
            

Generating a response

Okay, now that we have written all the preprocessing code we can begin to formulate the responses of our chatbot. For this we will be using the TfidfVectorizer from sklearn.

The TfidfVectorizer transforms text into feature vectors, which we can then use as input when estimating the best response. We are also removing any stop words, such as "and" and "the", at this stage.

These feature vectors are then compared using a measure called cosine similarity, which scores how alike two vectors are. Don't worry, you don't need to know exactly how it works, but you can find out more here.

All we need to know is that we are using it to pick our responses.
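
If you'd like to see what that means in practice, here is a tiny self-contained sketch (the sentences are made up purely for illustration) showing that sentences which share words end up with a higher cosine similarity score:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "tesla builds electric cars",
    "the roadster is an electric car made by tesla",
    "bananas are yellow",
]

vec = TfidfVectorizer(stop_words='english')
tfidf = vec.fit_transform(sentences)   # one TF-IDF vector per sentence

# compare the first sentence against all three
print(cosine_similarity(tfidf[0], tfidf))
# it scores 1.0 against itself, and higher against the other car sentence than the banana one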

Copy and paste the following code into your program underneath the code from before. Make sure to leave a couple lines between your old code and this new code.


# Generating response
def response(user_response):
    robot_response = ''
    # temporarily add the user's input to the list of sentences
    sent_tokens.append(user_response)
    # build TF-IDF vectors for every sentence, including the user's input
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # compare the user's input (the last vector) against every sentence
    vals = cosine_similarity(tfidf[-1], tfidf)
    # index of the most similar sentence (the top match is the input itself, so take the second)
    idx = vals.argsort()[0][-2]
    # second-highest similarity score
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if req_tfidf == 0:
        # nothing in the text file overlaps with the input
        robot_response = robot_response + "I am sorry! I don't understand you"
        return robot_response
    else:
        robot_response = robot_response + sent_tokens[idx]
        return robot_response

            

Pulling it all together

Great, we are nearly finished - we just need to get the chatbot to actually print these responses.

This is accomplished using a while loop, which waits for an input and then prints a response from the function we just wrote.

Copy and paste the following code into your program underneath the code from before. Make sure to leave a couple lines between your old code and this new code.


flag = True
print("TeslaBot: Hello! I am a Chatbot which will try to tell you stuff about Tesla! If you want to quit, type 'bye' \n")
while flag:
    user_response = input().lower()
    if user_response != 'bye':
        print("\nTeslaBot: ", end="")
        print(response(user_response))
        # remove the user's input again so it doesn't stay in the corpus
        sent_tokens.remove(user_response)
        print("\n")
    else:
        flag = False
        print("\nTeslaBot: Bye! take care..")
            

Now that we have a chatbot, run the code. When prompted, enter some cues, such as "elon musk" or "roadster", and the bot should return information pertaining to those topics!

Further adaptation

Okay great, we have written a very simple chatbot that, while it works, is not the most accurate. Why not try making it more accurate?

  • The dataset we have provided is quite limited. Try including more Tesla-related content and see if this improves the model.
  • As it stands, you need to create a txt file by hand. Why not incorporate web scraping and pull the text directly from Wikipedia instead? (There is a minimal sketch after this list.)
  • Take this code and build a web application that looks and feels more like an actual chatbot! (hint: use a Flask web server and HTTP requests to pass the data)
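
For the second idea, a minimal sketch along the lines below might be a starting point. It assumes you install the third-party wikipedia package (pip install wikipedia); the page title is just a placeholder, and you may want to add error handling for ambiguous or missing pages.

import nltk
import wikipedia  # third-party package: pip install wikipedia

# pull the plain-text content of a page straight from Wikipedia
raw = wikipedia.page("Tesla, Inc.").content.lower()

# from here, the rest of the tutorial carries on unchanged
sent_tokens = nltk.sent_tokenize(raw)
word_tokens = nltk.word_tokenize(raw)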

If you do any of the above challenges let us know. We want to see them done!