The answer to life, the universe and everything is 42. – deepThought


At this year’s F8 developer conference, Facebook talked up its vision for the future of chatbots. With these chatbots, users can complete many tasks inside a conversation, such as shopping online, inquiring about flights, organizing meetings and more. Instead of downloading a bunch of apps, you open a simple text dialog box and say, ‘Oh, my god, my third wish is three more wishes.’

You might be thinking: isn’t that just “Hi Siri”?

Perhaps this is another reshuffle of the user entry point, which would explain the rush of the big tech companies.

Origin

I’ve always been interested in natural language processing (NLP), I’ve been into machine learning/deep learning for the better part of a year, and chatbots are a combination of both.

My earliest interest in chatbots probably dates back to college. At the time I followed the “little yellow chicken” bot that swept Renren for a while, but I later found that it just called a closed-source cloud service, so I turned to playing with AIML instead.

Recently I like to drop by Starbucks after work to take a course (right now I’m working through Udacity’s Deep Learning) and write my blog, which is what I’m doing today. I expect to spend even more time on deep learning going forward (I’m ESPECIALLY interested in RNNs).

Chatbot & Open Source framework

There are plenty of cloud services out there; everyone from Facebook to Microsoft has its own framework. Open-source projects, by contrast, look less glamorous, perhaps because it is still early days for them, and they are still holding back their big moves.

A look around GitHub turned up ChatterBot, which looked cool: an active project, clean documentation and clean code.

Given the small size of the project, the source code is easy to read, making it a good scaffold for building your own smart chatbot

ChatterBot

ChatterBot is a machine learning-based chatbot engine built in Python that can learn from existing conversations. The project is designed to work with any language.

How it works

An untrained ChatterBot does not have the knowledge it needs to talk to its users. Each time the user types a sentence, the bot stores it along with the sentence given in reply. As the bot receives more input, the number of questions it can answer, and the accuracy of its answers, increases. How does the program respond to user input? First, it matches the sentence closest to the user’s input among the known sentences (how to measure that similarity is worth thinking about), then it finds the most likely response. And how is the most likely response chosen? By the frequency with which each known response was given to the matched question by all the people who have communicated with the machine.
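To make the idea concrete, here is a toy sketch of that matching-plus-frequency loop in plain Python. It is only an illustration of the principle, not ChatterBot’s actual code, and the names (known_conversations, respond) are invented for the example:

from collections import Counter
from difflib import SequenceMatcher

# Toy "memory": each known statement maps to the replies people have given to it.
known_conversations = {
    "how are you": ["fine, thanks", "fine, thanks", "not bad"],
    "what is your name": ["I am deepThought"],
}

def similarity(a, b):
    # One possible similarity score in [0, 1]; ChatterBot's default relies on edit distance instead.
    return SequenceMatcher(None, a, b).ratio()

def respond(user_input):
    # 1. Match the known statement closest to the user's input.
    closest = max(known_conversations, key=lambda s: similarity(s, user_input))
    # 2. Return the most frequent reply ever recorded for that statement.
    return Counter(known_conversations[closest]).most_common(1)[0][0]

print(respond("how are you today"))  # -> fine, thanks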

Installation and use

Installation

pip install chatterbot

Basic usage

from chatterbot import ChatBot
from chatterbot.training.trainers import ChatterBotCorpusTrainer

chatbot = ChatBot("myBot")
chatbot.set_trainer(ChatterBotCorpusTrainer)

# Train it using the English corpus
chatbot.train("chatterbot.corpus.english")

# Start a conversation
chatbot.get_response("Hello, how are you today?")

Using a Chinese corpus

I have contributed the Chinese corpus, and the author has merged my submission into master; it has not yet been packaged and published to PyPI, so if you want to train with the default Chinese corpus you need to do this:

git clone https://github.com/gunthercox/ChatterBot
pip3 install ./ChatterBot
# python3 must be used, otherwise there will be unicode problems; I haven't had time to make it python2-compatible yet

Using a Chinese corpus to train the bot

from chatterbot import ChatBot
from chatterbot.training.trainers import ChatterBotCorpusTrainer

deepThought = ChatBot("deepThought")
deepThought.set_trainer(ChatterBotCorpusTrainer)

# Train it using the Chinese corpus
deepThought.train("chatterbot.corpus.chinese")

To start playing

print(deepThought.get_response("Nice to meet you"))
print(deepThought.get_response("Hi, how are you?"))
print(deepThought.get_response("Complex is better than obscure"))  # Resist the temptation to guess.
print(deepThought.get_response("What is the ultimate answer to life, the universe, and everything?"))

FAQ (Unofficial)

The default configuration

By default, ChatterBot uses JsonDatabaseAdapter as the storage adapter, ClosestMatchAdapter as the logic adapter, and VariableInputTypeAdapter as the input adapter.
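If you want to spell these defaults out explicitly, the ChatBot constructor accepts the adapters as keyword arguments with dotted import paths. The paths below are my recollection of the version discussed in this post and may differ between releases, so treat this as a sketch and check the source if it fails to import:

from chatterbot import ChatBot

chatbot = ChatBot(
    "myBot",
    # Assumed module paths; in some versions the input adapter lives under chatterbot.adapters.io instead.
    storage_adapter="chatterbot.adapters.storage.JsonDatabaseAdapter",
    logic_adapters=["chatterbot.adapters.logic.ClosestMatchAdapter"],
    input_adapter="chatterbot.adapters.input.VariableInputTypeAdapter",
)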

Read-only mode

chatbot = ChatBot("wwjtest", read_only=True)  # otherwise the bot learns from every input

Create your own training classes

Refer to the existing classes in chatterbot/training.

Create your own Adapters

Refer to the ClosestMatchAdapter and VariableInputTypeAdapter used by default

For example, we could write input/output adapters for connecting to WeChat (I prefer WeRoBot).

An existing example of an IO adapter is chatterbot-voice, which lets us talk to the robot by voice. It’s very simple.
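Rather than writing a formal IO adapter, the quickest way I can think of to wire the bot into WeChat through WeRoBot would be a plain text-message handler that forwards input to get_response. This follows WeRoBot’s basic handler example; the token is a placeholder and the whole thing is an untested sketch:

import werobot
from chatterbot import ChatBot

robot = werobot.WeRoBot(token="your-wechat-token")  # placeholder token
bot = ChatBot("deepThought", read_only=True)

@robot.text
def reply(message):
    # Forward every incoming WeChat text message to the chatbot and send back its answer.
    return str(bot.get_response(message.content))

robot.run()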

Examples

The examples already include many kinds of bots.

How are trained models distributed

You distribute the generated database.db file (see jsondatabase.py); it is not an SQLite database but a jsondb database, which simply encapsulates JSON (see jsondb/db.py).
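Since the file reportedly just encapsulates JSON, you should be able to peek at what the bot has learned with the standard library alone. This assumes the default file name database.db in the working directory and that the jsondb file is plain JSON:

import json

with open("database.db") as f:
    data = json.load(f)

# Print a handful of learned statements; the exact structure depends on the jsondb version.
for key in list(data)[:5]:
    print(key)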

Algorithm related

By default, the ClosestMatchAdapter is used as the logic adapter to find the sentence that is closest to the user’s input

The core code is:

from fuzzywuzzy import process

closest_match, confidence = process.extract(
            input_statement.text,
            text_of_all_statements,
            limit=1
)[0]

Here we use fuzzywuzzy; refer to fuzzywuzzy#process.

fuzzywuzzy is used to calculate the similarity between sentences, and the string-similarity algorithm it adopts is Levenshtein distance (the edit distance algorithm).

Levenshtein Distance

Levenshtein distance, also known as edit distance, is the minimum number of edit operations required to convert one string into the other. The greater the distance, the more different the two strings are. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. (Quoted from Wikipedia)
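For intuition, the definition above translates into a short dynamic-programming routine. This is a standalone illustration, not code from ChatterBot or fuzzywuzzy:

def levenshtein(a, b):
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3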

From the description above, we can see that this algorithm applies to any text, so using Chinese with process.extract does not hurt the accuracy of the similarity measure. We can also see the algorithm’s shortcoming: it cannot capture semantic similarity, and it cannot even handle synonyms. This is an obvious weakness, and it would be worth re-implementing a logic adapter with a better measure of text similarity.

from fuzzywuzzy import fuzz

fuzz.ratio(u"Hello", u"Hello!")  # high, but not 100
fuzz.ratio(u"Hello", u"Hello")   # 100
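As a small step toward the better logic adapter mentioned above, the similarity could at least be computed over words instead of raw characters, for example a simple Jaccard overlap on jieba tokens. This is just one illustrative metric, not something the project ships with, and it still does not understand synonyms:

import jieba

def jaccard_similarity(a, b):
    # Tokenise with jieba so whole Chinese words, not single characters, are compared.
    ta, tb = set(jieba.lcut(a)), set(jieba.lcut(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(jaccard_similarity(u"今天天气怎么样", u"今天天气如何"))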

Other algorithms

time_adapter.py uses naive Bayes: from textblob.classifiers import NaiveBayesClassifier. This is currently the only place where textblob is referenced.

The training data looks like [("what time is it", 1), xxx, xxx…]
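For reference, training such a classifier directly with TextBlob looks roughly like this. The tuples are made-up samples in the same (sentence, label) shape; label 1 stands for “asking the time”:

from textblob.classifiers import NaiveBayesClassifier

train = [
    ("what time is it", 1),
    ("do you know the time", 1),
    ("hi, how are you", 0),
    ("tell me a joke", 0),
]

classifier = NaiveBayesClassifier(train)
print(classifier.classify("what time is it now"))        # most likely 1
print(classifier.prob_classify("hello there").prob(1))   # confidence that this asks for the time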

Use of NLTK

Currently, NLTK’s word_tokenize, wordnet and stopwords are the main pieces used.
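All three are standard NLTK; a minimal reminder of what each piece does (the corpora need a one-time nltk.download first):

from nltk import word_tokenize
from nltk.corpus import stopwords, wordnet

# One-time downloads: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
print(word_tokenize("What time is it?"))          # ['What', 'time', 'is', 'it', '?']
print(stopwords.words('english')[:5])             # a few English stop words
print(wordnet.synsets('robot')[0].definition())   # dictionary-style gloss for 'robot'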

todo

  • Make this project more suitable for training on Chinese corpora
  • Write a logic adapter using other text-similarity algorithms
  • Add Chinese stop words, etc. (instead of NLTK's stop words)
  • Use SnowNLP and jieba to replace the existing dependencies (NLTK and TextBlob)
  • Fork the project and use its architecture to rewrite one better suited to Chinese

Chat corpus

Chat corpora involve privacy, so there is almost no publicly available Chinese chat corpus on the Internet. Let our imagination run:

  • Siri talking to Xiaoice (with the WeChat API the dialogue becomes programmable)
  • Plato's Dialogues
  • The Analects of Confucius

Pitfalls

ChatterBot itself supports python2/python3, but at the moment only python3 works if you want to use Chinese.

The Chinese encoding problem under Python 2:

statement_list = self.context.storage.get_response_statements()

The resulting statement_list is a list of incorrectly encoded sentences (codec problem)

For a solution, see my blog: Notes on coding

Conclusion

The project provides a nice bot skeleton and a plug-in design that makes it very convenient to plug in more powerful functionality, which is my favorite part of the project. As a chatbot, its functionality is still relatively simple and straightforward.