Analyzing political bias on a news website using AWS and R

Yuri Almeida Cunha
Dec 4, 2020

Introduction

The purpose of this article is to give some insights and statistical background for evaluating political bias on a news website. The idea is to arrive at conclusions that support the claim that a website tends to follow certain political views in the news it publishes.

The Website

For this experiment, the website chosen was News in Levels. The site's main aim is to help students all over the world improve their English skills by reading world news, which the website publishes at several difficulty levels. The news can be browsed by category or level, or searched by specific keywords.

The Tools

Once the website was chosen, the tools picked for the analysis were R (to extract the data and generate the graphs) and Amazon Comprehend (an Amazon Web Service that uses machine learning to find insights and relationships in text). The good part about this Amazon service, given that it runs entirely in the cloud, is that no machine learning skills are needed. That leaves the analysis of the scores generated by the Amazon algorithm as the only tough part of the job.

Connect R to your AWS

The next step of the experiment is to connect R to Amazon Web Services. You can do that by generating and downloading an access key directly in the AWS console through the menu: IAM > Access Management > Users > [select your user, if you have more than one on your account] > Security Credentials > Create Access Key.

Generating AWS Access Key

Once you are done, save your access key into a .csv file, and run the following code in R:

AWS and R Connection
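As a minimal sketch of this connection step (not necessarily the code originally embedded here), the function below reads the downloaded .csv and exposes the key to R through the standard AWS environment variables, which the cloudyr AWS packages look up. The function name load_aws_keys, the default region, and the column names "Access key ID" and "Secret access key" are assumptions; check the header of your own file.

```r
# Sketch: load AWS credentials from the downloaded .csv file.
# Assumption: the header is "Access key ID,Secret access key", which
# read.csv() converts to Access.key.ID and Secret.access.key.
load_aws_keys <- function(path, region = "us-east-1") {
  # "UTF-8-BOM" drops a byte order mark if the file happens to have one
  keys <- read.csv(path, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")

  if (!all(c("Access.key.ID", "Secret.access.key") %in% names(keys))) {
    stop("Unexpected column names: ", paste(names(keys), collapse = ", "))
  }

  # Standard AWS environment variables, valid for the current R session
  Sys.setenv(
    AWS_ACCESS_KEY_ID     = keys$Access.key.ID[1],
    AWS_SECRET_ACCESS_KEY = keys$Secret.access.key[1],
    AWS_DEFAULT_REGION    = region
  )
  invisible(TRUE)
}

# Usage (the file name is a placeholder):
# load_aws_keys("accessKeys.csv", region = "us-east-1")
```

Keeping the key in a file outside the project folder, instead of typing it into the script, avoids publishing it by accident together with the code.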

Data Extraction

Once the connection is set, the data extraction process can begin. Below is the code used to scrape and analyze all the necessary data from the website (the body content of each news item), given the following variables:

searchterm: the search criterion, i.e. the topic to be analyzed.

number_pages: the number of search result pages the code should look up on the website.
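The extraction step could be sketched roughly as follows, assuming the rvest package (version 1.0 or later) for scraping and the aws.comprehend package for scoring; neither is named in the text above. The URL pattern, both CSS selectors, and all function names are hypothetical and have to be adapted after inspecting the site's HTML. Amazon Comprehend limits the size of the text it accepts per sentiment request (documented as 5,000 bytes), so the sketch truncates long texts.

```r
# Hypothetical URL pattern for paginated search results (an assumption).
build_search_urls <- function(searchterm, number_pages) {
  sprintf("https://www.newsinlevels.com/page/%d/?s=%s",
          seq_len(number_pages),
          utils::URLencode(searchterm, reserved = TRUE))
}

# Collect the article links found on one result page.
# The CSS selector is a guess, not taken from the site.
get_article_links <- function(page_url, link_css = "div.title a") {
  page  <- rvest::read_html(page_url)
  links <- rvest::html_attr(rvest::html_elements(page, link_css), "href")
  unique(links[!is.na(links)])
}

# Extract the body text of one article.
# The CSS selector is a guess, not taken from the site.
get_article_text <- function(article_url, body_css = "div#nContent p") {
  page <- rvest::read_html(article_url)
  paste(rvest::html_text2(rvest::html_elements(page, body_css)), collapse = " ")
}

# Score one text with Amazon Comprehend. The text is cut to 4,500
# characters to stay under the request size limit for mostly ASCII text.
score_text <- function(text) {
  aws.comprehend::detect_sentiment(substr(text, 1, 4500), language = "en")
}

# Full pipeline: search pages -> article links -> texts -> sentiment scores.
scrape_and_score <- function(searchterm, number_pages) {
  urls   <- build_search_urls(searchterm, number_pages)
  links  <- unique(unlist(lapply(urls, get_article_links)))
  texts  <- vapply(links, get_article_text, character(1))
  texts  <- texts[nzchar(texts)]
  scores <- do.call(rbind, lapply(texts, score_text))
  cbind(url = names(texts), scores, stringsAsFactors = FALSE)
}

# Usage (needs network access and AWS credentials):
# news_scores <- scrape_and_score(searchterm = "trump", number_pages = 24)
```

Each call to the sentiment API is billed by AWS, so it may be worth saving the returned data frame to disk and reusing it for the plots and tests.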

Analysis

Once all the data is prepared, it's time to start the analysis. This experiment used searchterm = trump and number_pages = 24, which retrieved the website's entire set of news on that topic. The goal is to determine how the news about the president of the USA is scored by the Amazon algorithm, which recognizes patterns and relationships between words in text.

Word Map

First, based on the most important keywords of each news item related to the topic, a Word Map was created that uses word frequency to highlight the most important ones. The image below shows the words most associated with the term “trump”.

term trump — Word Map
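A word map of this kind could be built along the following lines. This is only a sketch: it assumes the key phrases are already available as a character vector, uses the wordcloud package for drawing, and the short stop word list, the function names, and the example phrases are made up for illustration.

```r
# Count how often each word appears across a set of key phrases.
word_frequencies <- function(phrases,
                             stop_words = c("the", "a", "an", "of", "in", "to", "and")) {
  words <- unlist(strsplit(tolower(phrases), "[^a-z']+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  sort(table(words), decreasing = TRUE)
}

# Draw the word map; more frequent words are printed larger.
# Requires the wordcloud package to be installed.
plot_word_map <- function(freq, max_words = 100) {
  wordcloud::wordcloud(words = names(freq), freq = as.integer(freq),
                       min.freq = 1, max.words = max_words,
                       random.order = FALSE)
}

# Made-up phrases for illustration only, not data from the experiment
phrases <- c("the president Trump", "Trump supporters", "the election",
             "the White House", "election results")
freq <- word_frequencies(phrases)

# Usage:
# plot_word_map(freq)
```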

Scores Analysis

The second stage was to check the score outcomes given by the AWS tool.

Overall Sentiment

The first analysis was based on the overall_sentiment of each news item's content. In this experiment, the vast majority of the news items were classified as NEUTRAL or NEGATIVE, with a very low percentage of MIXED and POSITIVE ones.

term trump — news overall classification
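The share of each class can be computed with a few lines of base R. The sketch below assumes the labels are stored as a character vector with the four values Amazon Comprehend returns (POSITIVE, NEGATIVE, NEUTRAL, MIXED); the function name and the example labels are placeholders, not the results of the experiment.

```r
# Percentage of news items per sentiment class.
sentiment_share <- function(sentiment) {
  classes <- c("NEGATIVE", "NEUTRAL", "MIXED", "POSITIVE")
  # factor() keeps classes with zero items in the table
  counts <- table(factor(sentiment, levels = classes))
  round(100 * prop.table(counts), 1)
}

# Placeholder labels for illustration only
example_labels <- c(rep("NEUTRAL", 5), rep("NEGATIVE", 4), "POSITIVE")
share <- sentiment_share(example_labels)
print(share)

# Usage:
# barplot(share, ylab = "% of news items", main = "Overall sentiment")
```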

Positive and Negative Scores Distribution

After that, the individual distributions of the positive and negative scores were checked.

term trump — Positive Scores
term trump — Negative Scores

As shown above, there is a clear difference between the positive and negative score distributions. The positive one is clearly right-skewed, with most of the positive scores concentrated at very low values. The negative one, on the other hand, shows a pattern much closer to a normal distribution.

When those graphs are overlaid as density lines, the difference stands out even more. This suggests that there may be a statistically significant difference between the two scores.

term Trump — Density Graphs
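The comparison of the two distributions can be sketched in base R as below. The skewness helper puts a number on the "right-skewed" observation (positive values mean a longer right tail), and the plotting function overlays the two density lines. Both function names are illustrative, and the arguments positive and negative are assumed to be numeric vectors of Comprehend scores between 0 and 1.

```r
# Sample skewness: positive for right-skewed data, about 0 for symmetric data.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Overlay the density lines of the positive and negative scores.
plot_score_densities <- function(positive, negative) {
  d_pos <- density(positive, from = 0, to = 1)
  d_neg <- density(negative, from = 0, to = 1)
  plot(d_pos, col = "darkgreen", xlim = c(0, 1),
       ylim = c(0, max(d_pos$y, d_neg$y)),
       main = "Positive vs. negative scores", xlab = "Score")
  lines(d_neg, col = "red")
  legend("topright", legend = c("Positive", "Negative"),
         col = c("darkgreen", "red"), lty = 1)
  invisible(list(positive = d_pos, negative = d_neg))
}

# Usage, assuming a data frame news_scores with Positive and Negative columns:
# hist(news_scores$Positive, main = "Positive scores", xlab = "Score")
# sample_skewness(news_scores$Positive)
# plot_score_densities(news_scores$Positive, news_scores$Negative)
```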

T-tests

Since the graphs suggested a possible statistical difference between those distributions, two t-tests were used to gather stronger evidence about the scores.

The first one compared the means of the two scores (confidence level = 0.99):

term trump — T-test comparing the means
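In R, a comparison of two means can be run with t.test(). The sketch below uses the default Welch two-sample test, which does not assume equal variances; whether the original analysis used this variant or a paired one is not stated, so treat the choice as an assumption. The data are synthetic placeholders generated only to make the example runnable, not the scores from the experiment.

```r
# Synthetic placeholder scores (NOT the experiment's data):
# low, right-skewed "positive" scores and higher "negative" scores.
set.seed(42)
positive <- rbeta(200, shape1 = 1, shape2 = 20)
negative <- rbeta(200, shape1 = 4, shape2 = 6)

# Welch two-sample t-test on the difference of means, 99% confidence level.
compare_means <- function(x, y, conf_level = 0.99) {
  t.test(x, y, alternative = "two.sided", conf.level = conf_level)
}

result_means <- compare_means(negative, positive)
print(result_means)
```

A p-value below 0.01, together with a 99% confidence interval for the difference that excludes 0, is what supports saying the two means differ.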

Analyzing the result above, it's possible to conclude, at a high confidence level, that the means of the positive and negative scores are statistically different from each other.

The other t-test checked whether the mean positive score could be greater than 0.1, on a scale from 0 to 1. The results can be seen below:

term trump — T-test positive score > 0.1
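One way to set up such a test in R is a one-sample, one-sided t.test(). In the sketch below the claim "the mean is at least 0.1" is treated as the null hypothesis, so the alternative is "less" and a small p-value rejects the claim. This is one possible reading of the test described here, not necessarily the exact call that was used, and the data are again synthetic placeholders.

```r
# Synthetic placeholder scores (NOT the experiment's data)
set.seed(42)
positive <- rbeta(200, shape1 = 1, shape2 = 20)

# H0: mean(positive) >= threshold   vs.   H1: mean(positive) < threshold
test_mean_below <- function(x, threshold = 0.1, conf_level = 0.99) {
  t.test(x, mu = threshold, alternative = "less", conf.level = conf_level)
}

result_low <- test_mean_below(positive)
print(result_low)
```

With alternative = "less", the reported confidence interval is one-sided: its upper bound is the largest mean still compatible with the data at the chosen confidence level.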

As shown above, the hypothesis that the mean positive score is greater than 0.1 can be rejected, also at the 0.99 confidence level. This means that the positive scores for the term trump are indeed quite low.

Conclusion

The idea of the experiment was to show, through some statistical analysis, that a website may follow certain tendencies when publishing its news. Using Amazon Comprehend and R, it's possible to visualize some patterns for the search term trump on the News in Levels website.

Although it's not possible to conclude that there is a negative bias for this term, it's definitely possible to reject the idea of a positive (favorable) one.

It's also important to mention that all the results are based on the scores given by the AWS tool and don't represent any personal opinion or political position of the author of this publication.

Finally, the experiment can be reproduced by following the steps in this post and using the published code, for the term trump or any other term that is relevant to the analysis. The results can then be interpreted in the same objective way.
