Dismantling Political Echo Chambers Using Natural Language Processing and Binary Classifiers

Nilai Vemula

Echo chambers, or places where citizens are exclusively exposed to information that reinforces their prior beliefs, are becoming more prevalent due to the spread of social media (see "Why Echo Chambers Are Becoming Louder") and more relevant due to increased political engagement in the United States. These echo chambers have been criticized for accelerating polarization (see "Echo Chambers Are Getting Worse") and are hotspots for spreading misinformation (see this study modeling fake news in echo chambers). The goal of this project is to create a tool that recommends news articles about a given topic from diverse perspectives, encouraging users to break out of their bubbles and consider other viewpoints.

The hardest part of any machine learning project is finding a good source of annotated data to train a model on. For this, I turned to popular Reddit groups (subreddits) sorted by political leaning. From these groups, I collected the most popular news articles shared in each one. The text of these articles was then transformed into a vector space so that a binary classifier could digest the data and predict the political lean of the subreddit each article was shared in. Finally, I put this ML model to use in a web application: the user searches for a topic, and the application scrapes articles from Google News, runs them through the model to classify them, and recommends a selection to the user. All code for this project can be found on GitHub.

Collecting Data

The source of my annotated data was Reddit posts. Forums on Reddit, called subreddits, cover topics ranging from science to relationships. They are an excellent data source for this project because subreddits are essentially echo chambers: nearly all members of a group join to read and engage with material that supports the group's overall viewpoint. I chose 8 of these subreddits with an explicit political lean:

r/Conservative
r/Republican
r/donaldtrump
r/Anarcho_Capitalism
r/Liberal
r/Democrats
r/JoeBiden
r/LateStageCapitalism

The first four of these subreddits I classified as having a "right-leaning" bias, whereas the last four have a "left-leaning" bias. These communities were chosen because they each have a relatively large reach (>100k members) and together capture the diversity within each of the two categories. Within each group of four, one is based on a general ideology (conservative or liberal), one on a political party (Republican or Democrat), one on a political leader (Donald Trump or Joe Biden), and one on a more fringe ideology (late-stage capitalism or anarcho-capitalism). Very popular subreddits such as r/politics and r/libertarian were excluded because they do not have an explicit political lean in the "right"-"left" binary.

My hypothesis is that the news articles that become popular in these subreddits are representative of the media diet of people with that group's corresponding political lean. In some cases, this theory breaks down. For example, the most popular post in r/Conservative is a news article proclaiming Joe Biden's 2020 Presidential election victory. This article is not particularly informative for characterizing the population of r/Conservative and was likely popular because the broader Reddit audience "upvoted" the post. For this reason, I selected more than just the top few most popular posts, and I did not weight the subsequent analysis by how popular each post was.

From each of these subreddits, I collected the 1,000 most popular posts ("top" posts in Reddit lingo) using a Python library called praw, a wrapper around the Reddit API. I then checked whether each post contained a link to a news article and used the newspaper3k library to scrape and parse those articles. The final output of the data collection phase was a spreadsheet of over 2,000 links to news articles, a cleaned version of their content, their corresponding Reddit posts, and some metadata about each post.
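In code, the collection step looks roughly like the sketch below. The credentials are placeholders, and the exact fields and filtering differ from my actual script, but the praw and newspaper3k calls are the core of it:

```python
import praw
from newspaper import Article

# Placeholder credentials -- you need your own Reddit API keys for praw.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="political-lean-study",
)

rows = []
for name, label in [("Conservative", 1), ("Liberal", 0)]:  # extend to all 8 subreddits
    for post in reddit.subreddit(name).top(limit=1000):    # 1,000 most popular posts
        if post.is_self:
            continue  # skip text-only posts; we only want link posts
        try:
            article = Article(post.url)   # newspaper3k scrapes and parses the page
            article.download()
            article.parse()
        except Exception:
            continue  # some links are not parseable news articles
        rows.append({
            "subreddit": name,
            "label": label,            # 1 = right-leaning, 0 = left-leaning
            "url": post.url,
            "title": article.title,
            "text": article.text,
            "score": post.score,
        })
```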

Overall, 2,321 articles were collected from the 8,000 Reddit posts analyzed, but these articles were not evenly distributed across the subreddits above. After filtering out articles with fewer than 50 characters of content and posts with missing data, we were left with 2,037 articles from the following subreddits:

| Subreddit | Number of News Articles |
| --- | --- |
| r/Liberal | 850 |
| r/Democrats | 386 |
| r/Republican | 349 |
| r/Conservative | 325 |
| r/JoeBiden | 103 |
| r/donaldtrump | 14 |
| r/LateStageCapitalism | 9 |
| r/Anarcho_Capitalism | 1 |

We have more articles from left-leaning subreddits and more articles from the broader groups. The subreddits focused on candidates and fringe ideologies had few news articles that became popular in the subreddit when shared.

Preparing Our Data

In order to use the content of these news articles in a machine learning model, we must first transform the text into some numeric format. I made heavy use of several Python libraries for this purpose: nltk is a general-purpose Natural Language Processing (NLP) package that can organize text data, remove stop words, perform tagging, and so on, while gensim combines vector-based modeling with topic modeling and document indexing.

First, I used nltk to tokenize (or separate) each article into individual words; remove stop words such as "a" and "the" as well as punctuation; and compute the frequencies of the most-used words in each article. For example, this article about President Trump's infamous $750 income tax payment uses the word "president" 12 times, "donald" 3 times, and "trump" 20 times.
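A minimal sketch of that preprocessing step (the function name is mine, not from the project code):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stop word lists

def word_frequencies(text):
    """Tokenize an article, drop stop words and punctuation, count the rest."""
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    words = [t for t in tokens if t.isalpha() and t not in stop_words]
    return Counter(words)

# e.g. word_frequencies(article_text).most_common(10)
```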

Then, I loaded a word2vec model pre-trained on a Google News dataset of about 100 billion words, which fits well with our goal of vectorizing news articles. A word2vec model works by creating a 300-dimensional vector space. Each word is a point in that space, defined as a linear combination of 300 basis vectors. The similarity between two words can be computed from the distance between them in the vector space; technically, the cosine similarity between two word vectors represents the semantic similarity between those words in the context of the corpus the model was trained on.

This model has a few shortcomings, though. From a performance perspective, it is over 6 GB and takes several minutes to load into RAM on my computer. Additionally, the model is from 2013, so many recent developments such as COVID-19 and the Black Lives Matter movement are not part of its vocabulary.
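Loading the pre-trained vectors with gensim looks something like this (the file path is whatever you saved the Google News binary as):

```python
from gensim.models import KeyedVectors

# Path to the downloaded Google News word2vec binary (300-dimensional vectors).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(w2v["president"].shape)                  # (300,) -- one vector per word
print(w2v.similarity("president", "senator"))  # cosine similarity between two words
```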

In order to represent a news article as a set of numbers, I decided to create a "document vector" for each article. I approximated a doc2vec approach by using the word2vec model above to generate a 300-dimensional vector for each word in each article and then averaging these vectors across all words in the article, weighted by each word's frequency. In the end, each article is represented by a single 300-dimensional document vector.
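Under these assumptions, the document vector computation is just a frequency-weighted average, something like:

```python
import numpy as np

def document_vector(freqs, w2v, dim=300):
    """Average the word2vec vectors of an article's words, weighted by frequency."""
    vec = np.zeros(dim)
    total = 0
    for word, count in freqs.items():
        if word in w2v:              # skip words missing from the 2013 vocabulary
            vec += count * w2v[word]
            total += count
    return vec / total if total else vec

# doc_vec = document_vector(word_frequencies(article_text), w2v)  # shape (300,)
```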

It is important to note that these models are "bag-of-words" approaches. This means that they treat each news article as an unordered collection of terms; there is no analysis of the relationships between words or of the context in which the words are used. While this is different from how you and I might assess the bias of an article, the bag-of-words approach is common in NLP and has the added bonus of being computationally efficient.

Repeating the steps above for all 2,000+ articles, we can fill a 300-dimensional vector space with 2,000+ points. This space is impossible to visualize directly because it has far more than three dimensions; however, we can use principal component analysis (PCA) as a dimensionality reduction technique. PCA extracts the two orthogonal basis vectors (principal components) that explain the most variance in the dataset.

We can plot this data in terms of these two orthogonal basis vectors and color each point by the political lean of the article. Red points represent articles collected from right-leaning subreddits, and blue points represent articles collected from left-leaning subreddits.
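A sketch of how such a plot can be produced with scikit-learn and matplotlib, assuming X is the (n_articles x 300) array of document vectors and y a NumPy array of labels (1 = right-leaning, 0 = left-leaning):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)       # keep the two components that explain the most variance
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[y == 1, 0], X_2d[y == 1, 1], c="red", s=8, label="right-leaning")
plt.scatter(X_2d[y == 0, 0], X_2d[y == 0, 1], c="blue", s=8, label="left-leaning")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.legend()
plt.show()
```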

Principal Components 1 and 2

It is disappointing that there is not a clear separation between the blue and red points. This indicates that our classification model will have a difficult time distinguishing between the two types of articles. However, the plot below shows that the first two principal components explain only a very small percentage of the overall variance in the dataset, so our classification model will likely rely on many more than 2 of the 300 features. If we could visualize this dataset in 100 or 150 dimensions, it is possible that we would see two distinct groupings of blue dots and red dots.

Explained Cumulative Variance by the Number of Principal Components

Building a Binary Classifier

Now that our data is in a numeric format, we can build a machine learning model to serve as a binary classifier. The model has two possible outputs (right-leaning or left-leaning), and its input is the document vector of a news article. We face a few limitations because our dataset is imbalanced: as shown earlier, we have more left-leaning news articles than right-leaning ones. For encoding purposes, right-leaning articles are the positive class (1) and left-leaning articles are the negative class (0).

First, we split our dataset using a 75-25 split: 75% of the data (1,527 articles) is used to train the models, and the remaining 25% (510 articles) is held out until the end to test/validate the final model.
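In scikit-learn this is a one-liner; stratifying by the label (so both splits keep the same left/right balance) is my own addition here, not necessarily what the original split did:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```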

Each model was trained using 10-fold cross validation, and the accuracy was computed. I used a combination of simple models and ensemble models. The results are below:

| Model | Accuracy (mean +/- standard deviation) |
| --- | --- |
| Logistic Regression | 0.7341 +/- 0.0248 |
| K-Nearest Neighbors | 0.7498 +/- 0.0362 |
| Naive Bayes | 0.7099 +/- 0.0428 |
| Support Vector Machine | 0.7623 +/- 0.0351 |
| AdaBoost | 0.7328 +/- 0.0332 |
| Bagging | 0.7256 +/- 0.0430 |
| Gradient Boosted | 0.7636 +/- 0.0253 |
| Random Forest | 0.7570 +/- 0.0370 |
| XGBoost | 0.7727 +/- 0.0535 |

We can see that the ensemble models (last five) did better on average than the simple models (first four). XGBoost was the model with the highest accuracy of about 77%. For the purposes of a binary classifier, accuracy is the sum of true positives and true negatives divided by the total number of samples.
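The comparison can be reproduced roughly as below, shown here for three of the models with default hyper-parameters (the original runs may have used different settings):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "XGBoost": XGBClassifier(),
}

for name, model in models.items():
    # 10-fold cross-validation accuracy on the training portion only
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```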

To further evaluate these models, I chose a few of them and generated AUC-ROC (Area Under the Receiver Operating Characteristic Curve) plots. The higher the AUC, the better the model is at distinguishing between the two categories across classification thresholds.
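A sketch of how such curves can be generated for models that expose class probabilities; here the curves are computed on the held-out test set, whereas the originals may have come from the training folds:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier

for name, model in {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
}.items():
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]   # probability of the right-leaning class
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, probs):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```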

Area Under the Curve of the Receiver Operating Characteristic Curve

All of the models selected here have good AUC scores, but again, XGBoost is the best model. Next, I validated the XGBoost model on the remaining 25% of the dataset.

Confusion Matrix of the XGBoost Model on the Test Data

In the confusion matrix above, we can see that the majority of the test data were true negatives or true positives. There are more false negatives than false positives, likely because we had more left-leaning articles to begin with; the model therefore skews slightly toward classifying articles as left-leaning. The testing accuracy is 78%, which is comparable to the training accuracy, suggesting that our model is not over- or under-fitted to our data.
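The final evaluation step amounts to something like:

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from xgboost import XGBClassifier

clf = XGBClassifier()
clf.fit(X_train, y_train)           # train on the 75% split

y_pred = clf.predict(X_test)        # predict on the held-out 25%
print(confusion_matrix(y_test, y_pred))   # rows = true class, columns = predicted class
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.2f}")
```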

Finally, this model could likely be improved through hyper-parameter tuning. We could also experiment with pre-processing techniques such as normalization or re-sampling to remedy the issues caused by the imbalanced dataset. However, for the purposes of this project, 78% accuracy is adequate.

Web App

This analysis and the machine learning model we built are only useful if they can be applied to news articles outside of our initial dataset. The purpose of the binary classifier is to expand our scope beyond the 2,000 news articles we started with: with this model, we can now theoretically determine whether any news article on the Internet is more likely to be popular in a left- or right-leaning community.

For the last part of this project, I built a web application that makes news article recommendations. It begins by prompting the user for a topic. I then use the GoogleNews Python library to locate 25 news articles about that topic. Using the same procedure as before, I parse the articles, clean their contents, generate a document vector for each one, and use the XGBoost classifier to compute the probability that each article would be popular in a left- or right-leaning subreddit. Finally, I take the two most extreme probabilities and recommend one left-leaning and one right-leaning news article to the user.
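A rough sketch of that recommendation logic, reusing the word_frequencies and document_vector helpers sketched earlier and assuming the GoogleNews package's search()/result() interface (the 'link' key and the recommend function name are my own placeholders):

```python
from GoogleNews import GoogleNews
from newspaper import Article

def recommend(topic, clf, w2v, n_articles=25):
    googlenews = GoogleNews()
    googlenews.search(topic)
    results = googlenews.result()[:n_articles]

    scored = []
    for item in results:
        try:
            article = Article(item["link"])   # assumption: 'link' holds the article URL
            article.download()
            article.parse()
        except Exception:
            continue
        vec = document_vector(word_frequencies(article.text), w2v)
        # probability the article would be popular in a right-leaning subreddit
        p_right = clf.predict_proba(vec.reshape(1, -1))[0, 1]
        scored.append((p_right, article.title, item["link"]))

    scored.sort()                     # lowest p_right first
    return scored[0], scored[-1]      # most left-leaning and most right-leaning picks
```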

There are limitations to this approach. Scraping the news articles and using word2vec to generate the document vectors takes a significant amount of time and processing power. For this reason, I was not able to deploy the web application publicly, but I got a version running on my local computer. Here is a demo video (note: I made an error and accidentally mixed up the headings in the video/screenshot):

And this is a screenshot of the results:

Screenshot of the Web App

In this example, I searched for articles using the keyword "president." The recommended articles were:

From reading these articles, it is apparent that I would classify them in a similar fashion: the first one lists all the norms that President Trump broke and how President-Elect Biden can fix them, whereas the second one is a positive depiction of President Trump.

In the future, this web app can be customized to recommend more articles. I can also experiment with techniques to optimize the performance of the application.

Conclusion

In this project, I accomplished all the major goals of gathering data, using natural language processing to describe the data, and building a binary classifier to categorize the data. I even made a limited web application using Streamlit to simulate a user interface for my tool. With further refinement, this tool can be ready for a user audience, and I hope that by recommending a diverse media diet, we can start dismantling the echo chambers that so many of us are trapped in.