Friday, June 19, 2015

A Gentle Introduction to TextBlob for NLP

TextBlob is a Python (2 and 3) library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more. TextBlob objects can be treated as if they were Python strings that learned how to do natural language processing. TextBlob depends heavily on NLTK and on the pattern module by CLIPS, and the corpora used by NLTK are the default corpora for TextBlob as well. Installation instructions are available in the TextBlob documentation.
Today we will take an overview of the library. The main focus of this part is the properties and methods of the BaseBlob class.

Basics of TextBlob and Tokenization


First we import the TextBlob class, which is arguably the most important class in the library.
In [2]:
# We import the most important class TextBlob
from textblob import TextBlob

We will apply TextBlob's methods and properties to the following paragraph, and occasionally to some additional sentences.

In [4]:
data = """
Hello, My name is Animesh Shaw and I am an undergraduate and studying Computer Science (upcoming Graduation in 2015). I Love programming and Computer science subjects of topics. 
My field of interest include Computational Linguistics. I Love watching anime specially One Piece and Naruto Shippunden. Animes specially those two shows
an extravagant amount of dedication, passion love, and amibition towards achieving one's goal and aim's in life. These have always inpired me a lot.
Giving up on your own dreams to fulfill others and the same feeling that other carry along with friends or something which I call as an eternal bond.
I recommend everyone to watch Naruto and One Piece. I have learnt a lot from there. "People are not always born intelligent or powerfull but with hard work
great heights can be achieved in life." Yes its true, dedication and hard work are the key goals to success less than 1% people are born with the blessing
of being a prodigy. The world was shaken mostly by those non-prodigy people which have had a massive impact in every individuals lives. 
With great goals and constant dedication and passion you can achieve 
the unachievable.
"""

To use any of TextBlob's methods, we first create a TextBlob object.

In [5]:
tblob = TextBlob(data)
We will store all the words of the paragraph along with their POS tags in a variable named tags. tags is a property of the TextBlob class that returns a list of tuples, each of the form (word, POS tag). All strings returned by TextBlob are Unicode strings, which is why the output below shows the u prefix under Python 2.
In [6]:
tags = tblob.tags #We have stored the words of the text with the respective parts of speech tags
In [8]:
tags[:6] # Now that the tags are stored we will display the first 6 tags.
Out[8]:
[(u'Hello', u'UH'),
 (u'My', u'PRP$'),
 (u'name', u'NN'),
 (u'is', u'VBZ'),
 (u'Animesh', u'NNP'),
 (u'Shaw', u'NNP')]

NNP stands for a singular proper noun and is used for names of people, places and so on. PRP stands for personal pronoun, and PRP$ (as in "My") stands for possessive pronoun.
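The tag names come from the Penn Treebank tag set. As a reading aid, here is a minimal sketch of a lookup table for some of the tags seen in this post. The dictionary TAG_MEANINGS and the helper describe_tag are names made up for this illustration; they are not part of TextBlob.

```python
# Hypothetical helper, not part of TextBlob: maps a few Penn Treebank
# tags seen in the output above to human-readable descriptions.
TAG_MEANINGS = {
    "UH": "interjection",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "JJ": "adjective",
}


def describe_tag(tag):
    """Return a description for a tag, or 'unknown tag' if it is not listed."""
    return TAG_MEANINGS.get(tag, "unknown tag")


# The first six (word, tag) pairs, copied from Out[8] above.
sample = [("Hello", "UH"), ("My", "PRP$"), ("name", "NN"),
          ("is", "VBZ"), ("Animesh", "NNP"), ("Shaw", "NNP")]

for word, tag in sample:
    print("{:<8} {:<5} {}".format(word, tag, describe_tag(tag)))
```

The table is deliberately incomplete; any tag that is not listed falls back to "unknown tag".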

Now let's print the tags that we stored earlier in the tags variable. We will print just the first 20 tags.
In [12]:
for tag in tags[:20]:
    print(str(tag[1]) + " ")
UH 
PRP$ 
NN 
VBZ 
NNP 
NNP 
CC 
PRP 
VBP 
DT 
JJ 
CC 
VBG 
NNP 
NNP 
JJ 
NNP 
IN 
CD 
PRP 

We can do the same with a single line of code.
In [15]:
print("\n".join([tag[1] for tag in tags[:20]]))
UH
PRP$
NN
VBZ
NNP
NNP
CC
PRP
VBP
DT
JJ
CC
VBG
NNP
NNP
JJ
NNP
IN
CD
PRP
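The same comprehension style also works for filtering. The sketch below is a small illustration on a hard-coded list copied from Out[8], so it runs without TextBlob; with the library installed, tblob.tags would take the place of sample_tags (a name chosen here for the example).

```python
# (word, tag) pairs copied from Out[8]; with TextBlob installed you would
# use tblob.tags instead of this hard-coded list.
sample_tags = [("Hello", "UH"), ("My", "PRP$"), ("name", "NN"),
               ("is", "VBZ"), ("Animesh", "NNP"), ("Shaw", "NNP")]

# Keep only the words whose tag is NNP (proper noun).
proper_nouns = [word for word, tag in sample_tags if tag == "NNP"]
print(proper_nouns)

# zip(*pairs) splits the pairs into a tuple of words and a tuple of tags.
words, pos = zip(*sample_tags)
print(" ".join(pos))
```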

Now let us have a look at all the tags in the data above
In [16]:
print(" ".join([tag[1] for tag in tags]))
UH PRP$ NN VBZ NNP NNP CC PRP VBP DT JJ CC VBG NNP NNP JJ NNP IN CD PRP NNP NN CC NNP NN NNS IN NNS PRP$ NN IN NN VBP NNP NNP PRP NNP VBG NN RB CD NNP CC NNP NNP NNP RB DT CD VBZ DT JJ NN IN NN NN NN CC NN IN VBG CD POS PRP NN CC NN POS PRP IN NN DT VBP RB VBN PRP DT NN VBG IN IN PRP$ JJ NNS TO VB NNS CC DT JJ NN IN JJ VB IN IN NNS CC NN WDT PRP VB IN DT JJ NN PRP VB NN TO VB NNP CC CD NNP PRP VBP NN DT NN IN EX NNS VBP RB RB VBN JJ CC NN CC IN JJ NN JJ NNS MD VB VBN IN NN UH PRP$ JJ NN CC JJ NN VBP DT JJ NNS TO NN JJR IN CD NNS VBP VBN IN DT NN IN VBG DT NN DT NN VBD VBN RB IN DT JJ NNS WDT VBP VBD DT JJ NN IN DT NNS NNS IN JJ NNS CC JJ NN CC NN PRP MD VB DT JJ

In [17]:
# Now let us have a look at the total no of tags. We store all the tags in a variable named pos_tags
pos_tags = [tag[1] for tag in tags]
#Now we will print the length
print("No. of tags : " + str(len(pos_tags)))
No. of tags : 199

In [21]:
# As you may have noticed in entry no. 16, a lot of tags repeat. We would like to get all the unique tags.
# We can simply use the set() data structure, which removes the duplicates.
unique_poses = set(pos_tags)
print(" ".join(unique_poses))
print("\nNo of unique POSes : " + str(len(unique_poses)))
PRP$ VBG VBD VBN VBP WDT JJ VBZ DT NN POS TO PRP RB NNS NNP VB CC CD EX IN MD JJR UH

No of unique POSes : 24
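
A set tells us which tags occur but not how often. If the frequencies matter, collections.Counter from the standard library is one option. This is a minimal sketch that uses only the first 20 tags, copied from the output of In [15], so that it runs on its own.

```python
from collections import Counter

# The first 20 tags, copied from the output of In [15] above.
first_tags = ("UH PRP$ NN VBZ NNP NNP CC PRP VBP DT "
              "JJ CC VBG NNP NNP JJ NNP IN CD PRP").split()

tag_counts = Counter(first_tags)

# The three most frequent tags with their counts.
for tag, count in tag_counts.most_common(3):
    print(tag, count)

# len() of a Counter is the number of distinct keys, like len(set(...)).
print("distinct tags:", len(tag_counts))
```

With the full data, Counter(pos_tags) would do the same for all 199 tags.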

So only 24 distinct POS tags are used; the rest are repetitions. Using TextBlob we can also print all the noun phrases in the text.
In [23]:
# print all the noun phrases
tblob.noun_phrases
Out[23]:
WordList(['hello', u'animesh shaw', 'computer', 'graduation', 'love', 'computer', u'science subjects', u'computational linguistics', 'love', 'piece', u'naruto shippunden', 'animes', u"'s goal", u"aim 's", u'own dreams', u'eternal bond', 'naruto', 'piece', u'hard work', u'great heights', u'hard work', u'key goals', u'% people', u'non-prodigy people', u'massive impact', u'great goals', u'constant dedication'])
We can also get all the words by using the words property, which returns them as a WordList. A WordList is a list-like collection of words: it behaves like a Python list but has additional methods.
In [26]:
tblob.words #returns the data as word tokenized form in a list.
Out[26]:
WordList(['Hello', 'My', 'name', 'is', 'Animesh', 'Shaw', 'and', 'I', 'am', 'an', 'undergraduate', 'and', 'studying', 'Computer', 'Science', 'upcoming', 'Graduation', 'in', '2015', 'I', 'Love', 'programming', 'and', 'Computer', 'science', 'subjects', 'of', 'topics', 'My', 'field', 'of', 'interest', 'include', 'Computational', 'Linguistics', 'I', 'Love', 'watching', 'anime', 'specially', 'One', 'Piece', 'and', 'Naruto', 'Shippunden', 'Animes', 'specially', 'those', 'two', 'shows', 'an', 'extravagant', 'amount', 'of', 'dedication', 'passion', 'love', 'and', 'amibition', 'towards', 'achieving', 'one', "'s", 'goal', 'and', 'aim', "'s", 'in', 'life', 'These', 'have', 'always', 'inpired', 'me', 'a', 'lot', 'Giving', 'up', 'on', 'your', 'own', 'dreams', 'to', 'fulfill', 'others', 'and', 'the', 'same', 'feeling', 'that', 'other', 'carry', 'along', 'with', 'friends', 'or', 'something', 'which', 'I', 'call', 'as', 'an', 'eternal', 'bond', 'I', 'recommend', 'everyone', 'to', 'watch', 'Naruto', 'and', 'One', 'Piece', 'I', 'have', 'learnt', 'a', 'lot', 'from', 'there', 'People', 'are', 'not', 'always', 'born', 'intelligent', 'or', 'powerfull', 'but', 'with', 'hard', 'work', 'great', 'heights', 'can', 'be', 'achieved', 'in', 'life', 'Yes', 'its', 'true', 'dedication', 'and', 'hard', 'work', 'are', 'the', 'key', 'goals', 'to', 'success', 'less', 'than', '1', 'people', 'are', 'born', 'with', 'the', 'blessing', 'of', 'being', 'a', 'prodigy', 'The', 'world', 'was', 'shaken', 'mostly', 'by', 'those', 'non-prodigy', 'people', 'which', 'have', 'had', 'a', 'massive', 'impact', 'in', 'every', 'individuals', 'lives', 'With', 'great', 'goals', 'and', 'constant', 'dedication', 'and', 'passion', 'you', 'can', 'achieve', 'the', 'unachievable'])
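Since a WordList behaves like a list of strings, ordinary list techniques apply to it. The sketch below uses a plain Python list holding the first 25 words from the output above, together with a hypothetical helper count_word that counts matches regardless of case. It only illustrates the idea and is not TextBlob's own implementation.

```python
# The first 25 words, copied from the WordList output above.
words = ['Hello', 'My', 'name', 'is', 'Animesh', 'Shaw', 'and', 'I', 'am',
         'an', 'undergraduate', 'and', 'studying', 'Computer', 'Science',
         'upcoming', 'Graduation', 'in', '2015', 'I', 'Love', 'programming',
         'and', 'Computer', 'science']


def count_word(word_list, target):
    """Count occurrences of target in word_list, ignoring case."""
    target = target.lower()
    return sum(1 for word in word_list if word.lower() == target)


print(count_word(words, "science"))  # matches both 'Science' and 'science'
print(words.count("science"))        # list.count is case-sensitive
```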
See how easy it is. Noun phrases give us a lot of important and relevant information that can be used to analyse the meaning further. tblob.noun_phrases returns the noun phrases as a WordList, the class TextBlob uses to store words and operate on them. Let's do something more.

Language Detection and Translation


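Suppose that you want to detect the language used in the text above. TextBlob provides a detect_language() method for this. The method uses the Google Translate API, so it needs an Internet connection. Note that later TextBlob releases deprecated detect_language() and translate(), so these calls may not work with a current installation.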
In [26]:
tblob.detect_language()
Out[26]:
u'en'

Let's try some more, in different ways.

In [31]:
#Okay lets try some more.
TextBlob("Bonjour").detect_language()
Out[31]:
u'fr'
In [36]:
#Another one
TextBlob("Ciao").detect_language()
Out[36]:
u'it'
In the last two outputs, "fr" stands for French and "it" stands for Italian. Now that you know TextBlob can detect a language, you might wonder whether it can translate as well. As a matter of fact, it can. Let's take a simple example and translate "Thanks" from English to Japanese.
In [38]:
TextBlob("Thanks").translate(to="ja")
Out[38]:
TextBlob("感謝")
You can see we got the translated text in Japanese.
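
The codes used so far ("en", "fr", "it", "ja") are two-letter ISO 639-1 language codes. As a small aid, the sketch below maps them to readable names. LANGUAGE_NAMES and language_name are illustrative names chosen for this post, not part of TextBlob.

```python
# ISO 639-1 codes that appear in this post. LANGUAGE_NAMES and
# language_name are illustrative names, not part of TextBlob.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "it": "Italian",
    "ja": "Japanese",
}


def language_name(code):
    """Return the English name for a two-letter code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code.lower(), code)


for code in ["en", "fr", "it", "ja"]:
    print(code, "->", language_name(code))
```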

Let's try another example with a longer sentence.

In [41]:
TextBlob("Hello, My name is Animesh P Shaw. I will become the Programming King").translate(to="fr")
Out[41]:
TextBlob("Bonjour , Mon nom est Animesh P. Shaw . Je vais devenir le roi de programmation")
You might doubt whether these translations are correct. Well, you can always check with Google. Let's see whether the French sentence returned above is detected as French.
In [42]:
TextBlob("Bonjour , Mon nom est Animesh P. Shaw . Je vais devenir le roi de programmation").detect_language()
Out[42]:
u'fr'
Ta da! The language of the above sentence has been detected as French, since fr is the French language code.

Raw Text Handling


Let's explore more and see what we have got. We will print the complete text in raw form, which means that escape characters such as \n will also be shown. There is a built-in property for that purpose.
In [44]:
tblob.raw
Out[44]:
'\nHello, My name is Animesh Shaw and I am an undergraduate and studying Computer Science (upcoming Graduation in 2015). I Love programming and Computer science subjects of topics. \nMy field of interest include Computational Linguistics. I Love watching anime specially One Piece and Naruto Shippunden. Animes specially those two shows\nan extravagant amount of dedication, passion love, and amibition towards achieving one\'s goal and aim\'s in life. These have always inpired me a lot.\nGiving up on your own dreams to fulfill others and the same feeling that other carry along with friends or something which I call as an eternal bond.\nI recommend everyone to watch Naruto and One Piece. I have learnt a lot from there. "People are not always born intelligent or powerfull but with hard work\ngreat heights can be achieved in life." Yes its true, dedication and hard work are the key goals to success less than 1% people are born with the blessing\nof being a prodigy. The world was shaken mostly by those non-prodigy people which have had a massive impact in every individuals lives. \nWith great goals and constant dedication and passion you can achieve \nthe unachievable.\n'
raw_sentences is another property, which returns a list of raw sentences. Again, escape characters such as \n will be shown.
In [45]:
tblob.raw_sentences
Out[45]:
['\nHello, My name is Animesh Shaw and I am an undergraduate and studying Computer Science (upcoming Graduation in 2015).',
 'I Love programming and Computer science subjects of topics.',
 'My field of interest include Computational Linguistics.',
 'I Love watching anime specially One Piece and Naruto Shippunden.',
 "Animes specially those two shows\nan extravagant amount of dedication, passion love, and amibition towards achieving one's goal and aim's in life.",
 'These have always inpired me a lot.',
 'Giving up on your own dreams to fulfill others and the same feeling that other carry along with friends or something which I call as an eternal bond.',
 'I recommend everyone to watch Naruto and One Piece.',
 'I have learnt a lot from there.',
 '"People are not always born intelligent or powerfull but with hard work\ngreat heights can be achieved in life."',
 'Yes its true, dedication and hard work are the key goals to success less than 1% people are born with the blessing\nof being a prodigy.',
 'The world was shaken mostly by those non-prodigy people which have had a massive impact in every individuals lives.',
 'With great goals and constant dedication and passion you can achieve \nthe unachievable.\n']
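
The raw sentences keep the line breaks of the original text. If you want them on a single line, one common approach is to split on whitespace and join again. The sketch below does this for two entries copied from Out[45]; normalize_whitespace is a hypothetical helper name, not a TextBlob function.

```python
# Two entries copied from Out[45]; note the embedded and trailing newlines.
raw_sentences = [
    "\nHello, My name is Animesh Shaw and I am an undergraduate and studying "
    "Computer Science (upcoming Graduation in 2015).",
    "With great goals and constant dedication and passion you can achieve "
    "\nthe unachievable.\n",
]


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


clean_sentences = [normalize_whitespace(s) for s in raw_sentences]
for sentence in clean_sentences:
    print(repr(sentence))
```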
Let us look at another property, sentences. It differs from raw_sentences: instead of plain strings, sentences returns a list of Sentence objects. Let's have a look at it.
In [46]:
tblob.sentences
Out[46]:
[Sentence("
Hello, My name is Animesh Shaw and I am an undergraduate and studying Computer Science (upcoming Graduation in 2015)."),
 Sentence("I Love programming and Computer science subjects of topics."),
 Sentence("My field of interest include Computational Linguistics."),
 Sentence("I Love watching anime specially One Piece and Naruto Shippunden."),
 Sentence("Animes specially those two shows
an extravagant amount of dedication, passion love, and amibition towards achieving one's goal and aim's in life."),
 Sentence("These have always inpired me a lot."),
 Sentence("Giving up on your own dreams to fulfill others and the same feeling that other carry along with friends or something which I call as an eternal bond."),
 Sentence("I recommend everyone to watch Naruto and One Piece."),
 Sentence("I have learnt a lot from there."),
 Sentence(""People are not always born intelligent or powerfull but with hard work
great heights can be achieved in life.""),
 Sentence("Yes its true, dedication and hard work are the key goals to success less than 1% people are born with the blessing
of being a prodigy."),
 Sentence("The world was shaken mostly by those non-prodigy people which have had a massive impact in every individuals lives."),
 Sentence("With great goals and constant dedication and passion you can achieve 
the unachievable.
")]

Sentiment Analysis with TextBlob


TextBlob is especially helpful for sentiment analysis: its built-in methods and properties can be configured and extended with different taggers or analyzers.

What is Sentiment Analysis?

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. With TextBlob we can see both the Polarity and Subjectivity of the information in a sentence or data.
Now let's do something interesting and important: measuring the polarity of a sentence. What is polarity? Polarity is a numerical value that tells us whether a sentence is positive or negative. It is much like everyday life: when someone speaks badly of you, you feel sad, which corresponds to negative polarity; when someone praises you, you feel joy, which corresponds to positive polarity.
In [20]:
for sent in tblob.sentences:
    print(sent.sentiment.polarity)
0.0
0.5
0.0
0.428571428571
0.428571428571
0.0
0.158333333333
0.0
0.0
0.436111111111
0.0383333333333
0.25
0.4

A value of 0.0 indicates a neutral sentence and 0.5 a clearly positive one. Note that the word "Love" contributes positiveness. Values between 0.4 and 0.5 are moderately positive. Consider the second-to-last value, 0.25: it is lower, probably because words such as "shaken" or "massive impact" lend a negative sense.
Now let's display both the polarity and the subjectivity. The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
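To turn such scores into labels, one simple option is to bucket them around zero. The sketch below does this for the polarity values printed above. The helper polarity_label and its 0.1 threshold are assumptions made for this illustration; TextBlob itself does not prescribe any cut-off.

```python
# Polarity scores copied from the output of In [20] above.
polarities = [0.0, 0.5, 0.0, 0.428571428571, 0.428571428571, 0.0,
              0.158333333333, 0.0, 0.0, 0.436111111111, 0.0383333333333,
              0.25, 0.4]


def polarity_label(score, threshold=0.1):
    """Bucket a polarity score; the 0.1 threshold is an arbitrary assumption."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


labels = [polarity_label(p) for p in polarities]
print(labels)
print("mean polarity:", sum(polarities) / len(polarities))
```

A different threshold would move borderline sentences, such as the one scored 0.158, from one bucket to another, so the value is worth tuning on your own data.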
In [21]:
for sent in tblob.sentences:
    print(sent.sentiment)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.5, subjectivity=0.6)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.4285714285714286, subjectivity=0.5857142857142856)
Sentiment(polarity=0.4285714285714286, subjectivity=0.5857142857142856)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.15833333333333333, subjectivity=0.5)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.4361111111111111, subjectivity=0.7305555555555555)
Sentiment(polarity=0.03833333333333332, subjectivity=0.45166666666666666)
Sentiment(polarity=0.25, subjectivity=0.75)
Sentiment(polarity=0.4, subjectivity=0.5416666666666666)
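
Because the result is a namedtuple, its fields can be read by name or unpacked like a regular tuple. The sketch below uses a stand-in Sentiment type built with collections.namedtuple and three results copied from the output above, so that it runs without TextBlob. The 0.5 cut-off for "subjective" is an arbitrary assumption.

```python
from collections import namedtuple

# Stand-in with the same field names as the tuples printed above;
# TextBlob builds the real objects for you.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

# Three results copied from the output of In [21].
results = [
    Sentiment(polarity=0.0, subjectivity=0.0),
    Sentiment(polarity=0.5, subjectivity=0.6),
    Sentiment(polarity=0.25, subjectivity=0.75),
]

# Fields can be read by name or unpacked like an ordinary tuple.
polarity, subjectivity = results[1]
print(polarity, subjectivity)

# Keep only the results that lean subjective (0.5 is an arbitrary cut-off).
subjective = [r for r in results if r.subjectivity > 0.5]
print(subjective)
```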


Dumping Data Properties as JSON

Suppose that you want all the properties together in a format that is efficient and easy to parse. For such cases TextBlob provides a way to dump all the properties as a JSON string. For this example we will create a TextBlob instance with a shorter sentence: "Nico Robin is the most sexy anime character I have ever encountered."
In [39]:
blob = TextBlob("Nico Robin is the most sexy anime character I have ever encountered.")
print(blob.json)
[{"polarity": 0.5, "stripped": "nico robin is the most sexy anime character i have ever encountered", "noun_phrases": ["nico robin", "sexy anime character"], "raw": "Nico Robin is the most sexy anime character I have ever encountered.", "subjectivity": 0.75, "end_index": 68, "start_index": 0}]
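
The json property gives back a string, so it has to be parsed before individual fields can be used. The sketch below feeds the text printed above to json.loads from the standard library. The string is hard-coded here so that the example runs on its own; with TextBlob you would pass blob.json instead.

```python
import json

# The JSON text printed by In [39], copied from the output above.
json_text = (
    '[{"polarity": 0.5, '
    '"stripped": "nico robin is the most sexy anime character i have ever encountered", '
    '"noun_phrases": ["nico robin", "sexy anime character"], '
    '"raw": "Nico Robin is the most sexy anime character I have ever encountered.", '
    '"subjectivity": 0.75, "end_index": 68, "start_index": 0}]'
)

records = json.loads(json_text)  # a list with one dict per sentence
first = records[0]
print(first["polarity"], first["subjectivity"])
print(first["noun_phrases"])

# start_index and end_index delimit the sentence within the raw text.
print(first["end_index"] - first["start_index"], len(first["raw"]))
```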

If you want the same data as Python objects (a list of dictionaries) rather than as a JSON string, use the serialized property.
In [25]:
blob.serialized
Out[25]:
[{u'end_index': 68,
  u'noun_phrases': WordList([u'nico robin', u'sexy anime character']),
  u'polarity': 0.5,
  u'raw': 'Nico Robin is the most sexy anime character I have ever encountered.',
  u'start_index': 0,
  u'stripped': 'nico robin is the most sexy anime character i have ever encountered',
  u'subjectivity': 0.75}]
Now let's test something nice. Suppose we add " and beautiful lady " after "sexy" in In [39]. What change do you expect? Since "beautiful" is a positive word, it increases the polarity value. This technique can be used in different ways in research.
In [40]:
blob = TextBlob("Nico Robin is the most sexy and beautiful lady anime character I have ever encountered.")
print(blob.json)
[{"polarity": 0.6166666666666667, "stripped": "nico robin is the most sexy and beautiful lady anime character i have ever encountered", "noun_phrases": ["nico robin", "beautiful lady anime character"], "raw": "Nico Robin is the most sexy and beautiful lady anime character I have ever encountered.", "subjectivity": 0.8333333333333334, "end_index": 87, "start_index": 0}]
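
To make the effect of the added words explicit, the two results can be compared directly. This is a minimal sketch with the scores copied from the two JSON dumps above; the names before, after and delta are chosen for this illustration.

```python
# Scores copied from the two JSON dumps above (In [39] and In [40]).
before = {"polarity": 0.5, "subjectivity": 0.75}
after = {"polarity": 0.6166666666666667, "subjectivity": 0.8333333333333334}

# Change in each score after adding "and beautiful lady".
delta = {key: after[key] - before[key] for key in before}
for key, change in delta.items():
    print("{}: {:+.4f}".format(key, change))
```

Both scores rise: the sentence is rated as more positive and also as more subjective.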


Summary

So that's all; we have come to the end of our first encounter with TextBlob. As you can see, I have explained things in a detailed manner, definitely in more depth than the official tutorials. Stay tuned: I will continue this series and explain almost everything in the TextBlob library. Next time we will discuss spelling correction, n-grams, taggers and maybe lemmatization.
Thank you for reading and I hope you have had a nice read.

Read this Article in IPython-Notebook format here.

 
