Profiling on social networks 🤹‍♂️

Last weekend I was coding an application for analyzing Twitter timelines, which I called Profiler (I was just bored). Some years ago I worked on probabilistic models, and one of them caught my attention: Latent Dirichlet Allocation.

This model was developed by David Blei, Andrew Ng and Michael I. Jordan and tries to find topics in document collections. In other words, it groups text documents into topics that the model itself discovers. In this post I'm not going to explain the model structure and its inference; for that, nothing beats reading the paper itself.

This model is based on estimating Dirichlet distributions, which model the probability of membership in a set of classes. Specifically, the model builds (through an iterative procedure) Dirichlet distributions that capture both the probability of a word referring to a concrete topic and the probability of membership of each document in each of the topics. Parameter estimation for these distributions can be done with different kinds of Bayesian inference, such as Variational Inference or sampling methods like Markov Chain Monte Carlo.
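As a minimal illustration (a sketch using NumPy, which is not part of Profiler's stack), each draw from a Dirichlet distribution is itself a probability vector, here over 5 hypothetical topics:

```python
import numpy as np

rng = np.random.default_rng(42)

# Concentration below 1 pushes mass toward a few topics, matching the
# intuition that a single document only covers a handful of topics.
alpha = np.full(5, 0.1)

theta = rng.dirichlet(alpha)  # one document's topic proportions
print(theta, theta.sum())     # non-negative weights summing to 1.0
```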

Profiler uses this probabilistic model to identify, given a Twitter timeline (the set of a user's tweets), the topics that user writes about. Starting from this idea, I coded a Python application that downloads all the tweets of a user, stores them, preprocesses them and finally searches for the main topics.

Technology stack

  • Twitter data is downloaded using Tweepy (sketch below).
  • Timelines are stored in a MongoDB database, accessed through PyMongo (sketch below).
  • Textual data preprocessing was done with libraries such as Pandas and NLTK. It consists of cleaning the tweets to avoid noise: removing emoticons, capital letters, symbols, digits, ... (sketch below).
  • The model is inferred with the Gensim library, which ships a concurrent LDA implementation (sketch below).
  • Results are plotted with the pyLDAvis library, which generates an interactive HTML page that lets you explore the results and check which words are the most important in each topic (covered in the Gensim sketch below).
  • I also coded a command interface to make the library easier to use, built with the Google Fire library (sketch below).
  • I configured the development environment using Docker, Docker Compose and Travis to run the automatic tests.
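A minimal download sketch, assuming Tweepy's v3 cursor API and placeholder credentials (none of the names below come from Profiler's code):

```python
import tweepy

# Hypothetical credentials; replace with your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Page through a user's timeline; the API limits how far back you can go.
tweets = [
    status.full_text
    for status in tweepy.Cursor(
        api.user_timeline, screen_name="sanchezcastejon", tweet_mode="extended"
    ).items(1000)
]
```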
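Storing them could be as simple as this (a sketch assuming a local MongoDB instance; the database and collection names are made up):

```python
from pymongo import MongoClient

tweets = ["example tweet one", "example tweet two"]  # e.g. from the Tweepy sketch

client = MongoClient("mongodb://localhost:27017")
collection = client["profiler"]["timelines"]  # hypothetical names

# One document per tweet, keyed by user, so a timeline can be reloaded later.
collection.insert_many([{"user": "sanchezcastejon", "text": t} for t in tweets])
```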
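A cleaning pass in the spirit of the one described above (a sketch; Profiler's actual rules are configurable): lowercase everything, strip URLs, mentions, digits and symbols, then drop Spanish stopwords with NLTK.

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("spanish"))

def clean(tweet: str) -> list[str]:
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet)  # URLs and mentions
    tweet = re.sub(r"[^a-záéíóúüñ ]", " ", tweet)     # digits, symbols, emoticons
    return [tok for tok in tweet.split() if tok not in STOPWORDS]

print(clean("RT @user: ¡Gran mitin en Madrid! https://t.co/xyz 2019"))
```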
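Fitting the topics and exporting the interactive view then takes little code (a sketch with a toy corpus; recent pyLDAvis versions expose the Gensim bridge as `pyLDAvis.gensim_models`, older ones as `pyLDAvis.gensim`):

```python
import pyLDAvis
import pyLDAvis.gensim_models  # `pyLDAvis.gensim` in older releases
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Toy corpus: each document is a list of preprocessed tokens.
docs = [
    ["campaña", "votos", "elecciones"],
    ["violencia", "machista", "víctimas"],
    ["campaña", "elecciones", "debate"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

# Concurrent LDA; `workers` defaults to using all cores minus one.
lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, passes=10)

# Export the interactive topic explorer to a standalone HTML file.
panel = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(panel, "topics.html")
```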
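Finally, Google Fire turns a plain class into a command interface; the `Profiler` class and its `analyze` method below are hypothetical stand-ins for the real entry point:

```python
import fire

class Profiler:
    """Hypothetical facade over the download / clean / model steps."""

    def analyze(self, timelines: str, topics: int = 5):
        users = timelines.split(",")
        print(f"Fitting {topics} topics on {len(users)} timelines: {users}")

if __name__ == "__main__":
    fire.Fire(Profiler)  # e.g. python profiler.py analyze --timelines=a,b --topics=5
```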

Command interface

Profiler has a command interface. With the following command you can analyze the timelines of some Spanish politicians:

make run timelines=Albert_Rivera,sanchezcastejon,Pablo_Iglesias_,pablocasado_ topics=5

You can find Profiler's installation steps in the repository. There you will also find a configuration file where you can tune the application's behaviour, both for the data preprocessing and for the model.

Example of the results obtained

[Figure: pyLDAvis results, group 1]
[Figure: pyLDAvis results, group 4]

Results obtained from tweets are not as good as those obtained from longer texts such as blog posts or news articles. Tweets are short documents, and that reduced length means a much smaller vocabulary, which makes it harder for the model to identify well-differentiated topics. Even so, this example shows some interesting groups. These results come from the timeline of Pedro Sánchez (President of Spain): group 1 gathers tweets about the last electoral campaign, and group 4 refers to the issue of sexist violence.

Write a comment! 😀
