November Updates Part 1

I believe I promised meme comics at some point, so here is my favorite Polandball comic at the moment:

Upon learning from a centurion that Rome is closed for the season, a citizen points out that it's only the season of Fall. The centurion replies "exactly" with a single teardrop in his eye.
Don’t go to Rome during the Fall.

We’ve come to a decent pause/stopping point in the Polandball project. We’ve finished most of our data collection and assessment. Now we need to decide how we want to make this data accessible as well as what analysis, if any, we’d like to do over the data. Before I dive into an intensive project update, I’d like to give myself a kind of “pep talk” by covering all of the new tools and techniques I’ve learned so far. I often get so caught up in constantly working on new stuff that I forget to step back and appreciate how far I’ve come.

For context, while I did have an okay amount of previous experience with programming in R, the most I had accomplished with Python before starting this research was the “Hello World!” tutorial.

Since my Digital Scholarship Fellowship started in August, I’ve learned how to:

  1. Create loops and functions in Python
  2. Web scrape using a platform-specific API
  3. Organize retrieved data into a CSV file
  4. Apply basic data analysis (frequency, mean, median, basic plots)
  5. Sync and retrieve code from GitHub
  6. Compare different types of data visualizations with Tableau
  7. Create interactive online plots using Pandas and Bokeh
  8. Find patterns in text with Natural Language Processing
  9. Generate sentiment scores for social media text with VADER
  10. Transcribe text found in images using the Tesseract OCR
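
As a rough illustration of items 1, 3, and 4, here is a minimal sketch using only Python's standard library. The comment records, the field names (`comment_id`, `author`, `upvotes`), and the function names are all made up for the example; they are not the project's actual data or code.

```python
import csv
import statistics
from collections import Counter

# Hypothetical records standing in for scraped comments (not real project data)
SAMPLE_COMMENTS = [
    {"comment_id": "a1", "author": "user_one", "upvotes": 12},
    {"comment_id": "a2", "author": "user_two", "upvotes": 3},
    {"comment_id": "a3", "author": "user_one", "upvotes": 7},
    {"comment_id": "a4", "author": "user_three", "upvotes": 3},
]


def write_comments_csv(comments, path):
    """Write a list of comment dicts to a CSV file with a header row."""
    fieldnames = ["comment_id", "author", "upvotes"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for comment in comments:
            writer.writerow(comment)


def summarize_upvotes(path):
    """Read the CSV back and return mean, median, and per-author comment counts."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    upvotes = [int(row["upvotes"]) for row in rows]  # CSV values come back as strings
    return {
        "mean": statistics.mean(upvotes),
        "median": statistics.median(upvotes),
        "comments_per_author": Counter(row["author"] for row in rows),
    }
```

Calling `write_comments_csv(SAMPLE_COMMENTS, "sample_comments.csv")` followed by `summarize_upvotes("sample_comments.csv")` runs the full round trip; the file name is only a placeholder.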

Here’s a sample of some of the results. Most of it’s nothing special, but it IS a start:

Plot of the relationship between the number of responses to a comment and its number of upvotes.
Wordcloud of the most frequent words in the comments of a tournament winner announcement post on r/polandball.
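
The counting step behind a word cloud like this can be sketched with the standard library alone. This is not the code that produced the image above; the stop-word list is a tiny placeholder and the function name `top_words` is hypothetical.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "for"}


def top_words(comments, n=10):
    """Return the n most common non-stop-words across a list of comment strings."""
    counts = Counter()
    for comment in comments:
        # lowercase, then keep runs of letters and apostrophes as words
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)
```

The resulting (word, count) pairs are the kind of input a word cloud tool scales its font sizes by.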

Here is part of my practice code for applying sentiment analysis to Reddit comments:

# Source:

# first, we import the relevant modules from the NLTK library
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# next, we initialize VADER so we can use it within our Python script
## basically, create a SentimentIntensityAnalyzer object that we can call later
## if this raises a LookupError, run nltk.download('vader_lexicon') once first
sid = SentimentIntensityAnalyzer()

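import pandas as pd

# load the scraped Reddit comments
comment_df = pd.read_csv("commentList.csv")

# pick a single comment to score
## assumption: the comment text lives in a column named "body"; change this to match the CSV
comment_3 = comment_df["body"].iloc[2]

# polarity_scores returns a dict with the keys 'neg', 'neu', 'pos', and 'compound'
scores = sid.polarity_scores(comment_3)
for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')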
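
To go from one comment to a whole list, a loop like the following could work. It is only a sketch: the function names are hypothetical, and the ±0.05 cutoffs on the compound score are the commonly cited VADER convention rather than anything tuned for r/polandball.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score (-1 to 1) to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_comments(comments, analyzer):
    """Score each comment string and attach a label.

    `analyzer` is any object with a polarity_scores(text) method that returns
    a dict containing a 'compound' key, such as the `sid` object above.
    """
    results = []
    for text in comments:
        scores = analyzer.polarity_scores(text)
        results.append({
            "text": text,
            "compound": scores["compound"],
            "label": label_from_compound(scores["compound"]),
        })
    return results
```

With the objects defined earlier, a call might look like `score_comments(comment_df["body"].tolist(), sid)`, where the `"body"` column name is an assumption about the CSV layout.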

And here is an excerpt of the code for image transcription:

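import cv2
import pytesseract

# point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'c:\program files\Tesseract-OCR\tesseract.exe'

# load the comic image (cv2.imread returns None if the file can't be read)
img = cv2.imread('j3eltq-Copy1.png')
# OpenCV loads images in BGR order, while pytesseract assumes RGB
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# run OCR; the transcription comes back as a single string
text = pytesseract.image_to_string(img)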
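
Raw OCR output from comics tends to be messy: stray blank lines, uneven spacing, and sometimes a trailing form-feed character. One possible cleanup step, sketched here with a hypothetical helper named `clean_ocr_text`, is:

```python
import re


def clean_ocr_text(raw_text):
    """Tidy a raw OCR string: drop form feeds, collapse spaces, remove blank lines."""
    text = raw_text.replace("\x0c", "")  # form-feed character, if present
    cleaned_lines = []
    for line in text.splitlines():
        # squeeze runs of spaces/tabs into one space and trim the ends
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
```

Passing the `text` variable from the excerpt above through `clean_ocr_text(text)` would give one tidy line per detected line of dialogue, though it does nothing about misread characters.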

The full code for the r/polandball project will be made available on GitHub at some point.

In summary: I have seen at least a 100% increase in my Python abilities.
