Analyze your Facebook messages


Creation date: 2016-01-18

Tags: analyze, facebook

What are your favorite words in a chat, and what are your chat partners' favorite words?

Facebook allows us to download our archive, so we can analyze the messages without an Internet connection or any kind of API. Unfortunately, the archive has no real structure. It's a mess! How do you download your Facebook archive?

After waiting a while you will get an email with the zipped archive, which looks like this:

photos/
  ...
  profile.png
html/
  videos.htm
  ads.htm
  ...
  messages.htm
index.htm

Yeah, all of your messages (probably thousands of them) are in that single messages.htm file. The first step is to parse the messages and write them into several CSV files, one per chat group.

File: parse.py

from bs4 import BeautifulSoup
import re
import csv
from datetime import datetime
import pandas as pd

BeautifulSoup (BS) helps us parse the HTML file.

After parsing we want to save the csv files:

def save_obj(obj):
    for key in obj:
        # build a safe filename: replace spaces and commas with underscores
        name = re.sub(r"[ ,]+", "_", key)
        # messages.htm is not ordered, so sort each chat by timestamp
        obj[key] = obj[key].sort_values(['timestamp'])
        obj[key].to_csv('msg/' + name + '.csv', index=False)

obj will be a dictionary of pandas DataFrames. Unfortunately, messages.htm isn't sorted in any way, so we need to sort by the timestamp column, which we fill in a different function.

I think the first part isn't that relevant for this project; if you want to learn how to parse with BS, have a look at their documentation.

def parse_messages(soup, messages_dict):
    divOfThreads = list(soup.select('.contents > div'))

    # for each group of divs containing <div class="thread">
    for div in range(0, len(divOfThreads)):
        # for each thread of messages inside each div
        threads = divOfThreads[div].select('.thread')
        for threadDiv in range(0, len(threads)):

The following part is a bit more interesting, because here we will parse the messages and fill the messages_dict.

"""select all the messages inside each thread"""
p = list(threads[threadDiv])
users = p[0]
if len(users) < 100:
    if users not in messages_dict:
        messages_dict[users] =  pd.DataFrame(columns=['user','timestamp','message'])

        for x in range(1,len(p),2):

p holds the messages in the following structure. The first element lists the names of the chat partners, including your own. All elements after that are split into meta information (user and datetime) and the actual message, so we can iterate over them using range(1,len(p),2) and access index and index+1. For my analysis only chats with a single partner are relevant, but the script works for more than one partner as well. I just limited the length of the participant list to 100 characters, because it isn't possible to store files with a very long filename.
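The iteration pattern can be sketched on a made-up thread (the names, timestamps and texts below are invented, and plain strings stand in for the parsed HTML elements):

```python
# Hypothetical thread contents: p[0] is the participant list,
# then meta information and message text alternate.
p = ["Alice, Bob",
     "Alice Thursday, April 12, 2012 at 9:31pm UTC+2", "Hi Bob!",
     "Bob Thursday, April 12, 2012 at 9:35pm UTC+2", "Hi Alice!"]

for x in range(1, len(p), 2):
    meta, text = p[x], p[x + 1]  # index and index+1 belong to one message
    print(meta, "->", text)
```

Stepping by two keeps each meta element paired with its message.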

The second parameter of our function is the dictionary of pandas DataFrames, and we have to check whether there already is a DataFrame for the current chat. If there isn't, we create one with three columns: one for the user who wrote the message, one for the timestamp, and one for the actual text message.

The next step is to fill the DataFrame. As mentioned before, the first element of each pair holds the user and timestamp information.

user = p[x].select('.user')[0].text
meta = p[x].select('.meta')[0].text

The meta timestamp looks like the following example: Thursday, April 12, 2012 at 9:31pm UTC+2

We don't need the UTC part, and we want to change the format to a sortable structure: 2012-04-12 21:31

Now we append the new row to the existing dataframe.

meta = meta[:meta.find("UTC") - 1]
date_object = datetime.strptime(meta, '%A, %B %d, %Y at %I:%M%p')
meta = "{:%Y-%m-%d %H:%M}".format(date_object)
text = p[x + 1].text
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
new_row = pd.DataFrame([[user, meta, text]], columns=['user', 'timestamp', 'message'])
messages_dict[users] = pd.concat([messages_dict[users], new_row], ignore_index=True)

At the end we want to return the dictionary.

return messages_dict

Our main just calls the function and saves the dataframes afterwards.

if __name__ == "__main__":

    soup = BeautifulSoup(open("messages_en.htm"), "html.parser")


    msg_dict = dict()
    msg_dict = parse_messages(soup,msg_dict)
    save_obj(msg_dict)

Before you can run the script, you need to create a folder msg next to it.
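Alternatively, the script could create the folder itself at startup; a minimal sketch using the standard library:

```python
import os

# create the output folder if it doesn't exist yet
os.makedirs('msg', exist_ok=True)
```

With exist_ok=True the call is harmless if the folder is already there.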

The next step will be to read one CSV file and analyze the messages.

What are the words that are used most by you and your partner?

File: all_words.py

import sys
import csv
import pandas as pd
import matplotlib.pyplot as plt
from collections import OrderedDict
from stop_words import get_stop_words

We'll use a simple sys.argv[1] to get the chat we want to analyze.

name = sys.argv[1]
messages = pd.read_csv("msg/"+name+".csv").fillna(" ")

to_be_removed = ".,!?"

stop_words = get_stop_words('english')

If you chat in a different language, you might want to change the stop words. Let's fill a dictionary that holds how often each word is used.

In some languages it makes sense to lowercase all messages first. We created the variable to_be_removed at the top to strip .,!? so we are left with real words or smileys like :).
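A quick sanity check of that cleaning step on a made-up message shows that smileys survive while punctuation is stripped:

```python
to_be_removed = ".,!?"
message = "Hello, World! :)"

# lowercase, then strip the punctuation characters one by one
message = message.lower()
for c in to_be_removed:
    message = message.replace(c, '')

print(message.split())  # ['hello', 'world', ':)']
```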

For each user and word we want to make a tally sheet inside our dictionary.

count_dict = dict()
for index, row in messages.iterrows():
    if row['user'] not in count_dict:
        count_dict[row['user']] = dict()

    # normalize: lowercase and strip punctuation
    row['message'] = row['message'].lower()
    for c in to_be_removed:
        row['message'] = row['message'].replace(c, '')

    # tally every word for this user
    list_of_words = row['message'].split()
    for word in list_of_words:
        if word not in count_dict[row['user']]:
            count_dict[row['user']][word] = 0
        count_dict[row['user']][word] += 1

At the end we want to create a bar chart of the words used by each user. To keep it readable, we will only show the 50 most used words.

Dictionaries in Python don't keep their entries ordered by value, but we want the most common words first, so we sort each dictionary by value and build an OrderedDict.

noWords = 51
for user in count_dict:
    ord_dict = OrderedDict(sorted(count_dict[user].items(), key=lambda t: t[1], reverse=True))
    keys = []
    values = []
    counterUniqueWords = len(ord_dict.items())
    counterAllWords = 0
    counter = 0

For each word we add the word itself to a list of keys (the x labels of our chart) and the number of occurrences to a list of values. To compare the charts, it's also interesting to see the total number of words used and the number of distinct words. The plot wouldn't look very interesting if we drew the stop words as well, so we don't add them to the lists.

for i, (key, value) in enumerate(ord_dict.items()):
    counterAllWords += value
    if key in stop_words:
        continue
    counter += 1
    if counter >= noWords:
        continue
    keys.append(key)
    values.append(value)

Now we can draw the plots (one for each user):

fig = plt.figure()
plt.title('User: '+user+' words: '+str(counterAllWords)+', different: '+str(counterUniqueWords))
plt.bar(range(len(keys)), values, align='center')
plt.xticks(range(len(keys)), keys)
locs, labels = plt.xticks()
plt.setp(labels, rotation=90)
plt.xlim(0, noWords)
fig.tight_layout()
fig.savefig('img/all_words_'+name+'_'+user+'.jpg')

We store each plot as an image, so we need to create a folder img before starting the script.

It would be great if you shared your code in case you build a similar tool for a different messaging platform. You can use the second part of this post to analyze it, as long as you parse the messages into the same CSV format.

You can download all the code on my OpenSourcES GitHub repo.

If you enjoy the blog in general please consider a donation via Patreon. You can read my posts earlier than everyone else and keep this blog running.


