Lab Session 1: Introduction to NLP for Social Science#

0. Installation#

Before we begin, let’s install the packages required for this lab by running the following cell:

%pip install nlp4ss
!python -m spacy download en_core_web_sm

1. Introduction#

In this lab, we’ll apply the fundamental NLP techniques and preprocessing methods we learned in Session 1 to analyze Sierra Club press releases. We’ll focus on text cleaning, normalization, and basic NLP tasks.

2. Setup and Data Loading#

First, let’s set up our environment and load the data using the provided code.

from hyfi import HyFI

if HyFI.is_colab():
    HyFI.mount_google_drive()
    project_root="/content/drive/MyDrive/courses/nlp4ss"
else:
    project_root="$HOME/workspace/courses/nlp4ss"

h = HyFI.initialize(
    project_name="nlp4ss",
    project_root=project_root,
    logging_level="INFO",
    verbose=True,
)

print("project_dir:", h.project.root_dir)
print("project_workspace_dir:", h.project.workspace_dir)

raw_data_file = h.project.workspace_dir / "data/raw/articles.jsonl"
rdata = h.load_dataset("json", data_files=raw_data_file.as_posix())
rdata_df = rdata["train"].to_pandas()

# Display basic information about the dataset
rdata_df.info()
print("\nSample of the data:")
print(rdata_df.head())
INFO:hyfi.utils.notebooks:Google Colab not detected.
INFO:hyfi.utils.notebooks:Extension autotime not found. Install it first.
INFO:hyfi.joblib.joblib:initialized batcher with <hyfi.joblib.batcher.batcher.Batcher object at 0x16fea3020>
INFO:hyfi.main.config:HyFi project [nlp4ss] initialized
project_dir: /Users/yj.lee/workspace/courses/nlp4ss
project_workspace_dir: /Users/yj.lee/workspace/courses/nlp4ss/workspace
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6354 entries, 0 to 6353
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      6354 non-null   object
 1   timestamp  6354 non-null   object
 2   url        6354 non-null   object
 3   page_url   6354 non-null   object
 4   page       6354 non-null   int64 
 5   content    6354 non-null   object
 6   uuid       6354 non-null   object
dtypes: int64(1), object(6)
memory usage: 347.6+ KB

Sample of the data:
                                               title       timestamp  \
0  Sierra Club Urges Commerce Department to Hold ...  April 15, 2024   
1  Sierra Club Statement on BOEM Financial Assura...  April 15, 2024   
2  We Energies Files Third Rate Increase in Three...  April 15, 2024   
3  MEDIA ADVISORY: Oregon Regulators to Hear Conc...  April 12, 2024   
4  Advisory: Special Meeting for County Commissio...  April 12, 2024   

                                                 url  \
0  https://www.sierraclub.org/press-releases/2024...   
1  https://www.sierraclub.org/press-releases/2024...   
2  https://www.sierraclub.org/press-releases/2024...   
3  https://www.sierraclub.org/press-releases/2024...   
4  https://www.sierraclub.org/press-releases/2024...   

                                            page_url  page  \
0  https://www.sierraclub.org/press-releases?_wra...     1   
1  https://www.sierraclub.org/press-releases?_wra...     1   
2  https://www.sierraclub.org/press-releases?_wra...     1   
3  https://www.sierraclub.org/press-releases?_wra...     1   
4  https://www.sierraclub.org/press-releases?_wra...     1   

                                             content  \
0  April 15, 2024\n\n\nContact\nAda Recinos, Depu...   
1  April 15, 2024\n\n\nContact\nIan Brickey, ian....   
2  April 15, 2024\n\n\nContact\nMegan Wittman, me...   
3  April 12, 2024\n\n\nContact\nKim Petty, Sierra...   
4  April 12, 2024\n\n\nContact\nLee Ziesche, lee....   

                                   uuid  
0  9ee69b6c-cd7a-4617-94d2-18951266c182  
1  deed5268-32a0-4607-9c0e-e4f565b189f7  
2  c0486029-319a-46c8-9622-56bdd61fd9e5  
3  3ebc6ab6-15a5-4ccc-953b-593965ae0ad8  
4  924abe2c-586d-4087-b4f2-208b731ebb2b  
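Before moving on, it is worth getting a feel for how long the articles are. A quick sketch using pandas string methods:

# Distribution of article lengths in characters
print(rdata_df["content"].str.len().describe())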

3. Required Libraries#

Now that we have our data loaded, let’s import the additional libraries we’ll need for our analysis:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import spacy
from textblob import TextBlob

# Download necessary NLTK data
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
[nltk_data] Downloading package punkt to /Users/yj.lee/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yj.lee/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /Users/yj.lee/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True
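Note: on newer NLTK releases (3.9 and later), word_tokenize relies on the punkt_tab resource rather than punkt. If you hit a LookupError later, download it as well:

nltk.download("punkt_tab")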

4. Text Preprocessing#

Let’s define a function for each preprocessing step (cleaning, tokenization, stopword removal, stemming, and lemmatization) and chain them into a single preprocess_text pipeline.

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

def tokenize_text(text):
    return word_tokenize(text)

def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]

def stem_tokens(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

def preprocess_text(text):
    text = clean_text(text)
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize_tokens(tokens)
    return ' '.join(tokens)

# Apply preprocessing to the 'content' column
rdata_df['processed_content'] = rdata_df['content'].apply(preprocess_text)
print(rdata_df['processed_content'].head())
0    april contact ada recinos deputy press secreta...
1    april contact ian brickey ianbrickeysierraclub...
2    april contact megan wittman meganwittmansierra...
3    april contact kim petty sierra club kimpettysi...
4    april contact lee ziesche leezieschesierraclub...
Name: processed_content, dtype: object
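Looking at the output, note that because clean_text strips digits and punctuation before tokenization, dates and e-mail addresses collapse into single tokens (e.g. ianbrickeysierraclub). Also note that preprocess_text applies lemmatize_tokens but not stem_tokens; a quick comparison of the two functions defined above shows how they differ:

sample_tokens = ["studies", "urges", "running"]
print("Stemmed:   ", stem_tokens(sample_tokens))
print("Lemmatized:", lemmatize_tokens(sample_tokens))
# The stemmer chops suffixes ("studies" -> "studi"), while the lemmatizer
# maps tokens to dictionary forms ("studies" -> "study"). Note that
# WordNetLemmatizer treats tokens as nouns by default, so verb forms like
# "running" pass through unchanged unless you supply pos="v".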

5. Basic NLP Tasks#

Now, let’s perform some basic NLP tasks on our preprocessed data.

5.1 Word Frequency Analysis#

def get_word_freq(text):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform([text])
    words = vectorizer.get_feature_names_out()
    freqs = X.toarray()[0]
    return dict(zip(words, freqs))


# Get word frequencies for all processed content
all_text = " ".join(rdata_df["processed_content"])
word_freq = get_word_freq(all_text)

# Plot top 20 most frequent words
plt.figure(figsize=(12, 6))
plt.bar(*zip(*sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:20]))
plt.xticks(rotation=45, ha="right")
plt.title("Top 20 Most Frequent Words in Sierra Club Press Releases")
plt.tight_layout()
plt.show()
[Figure: bar chart of the top 20 most frequent words in Sierra Club press releases]
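If you want the underlying counts rather than the chart, the same ranking can be printed directly:

top20 = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:20]
for word, freq in top20:
    print(f"{word}: {freq}")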

5.2 Named Entity Recognition#

import spacy

nlp = spacy.load("en_core_web_sm")


def extract_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Extract entities from the first article
sample_text = rdata_df["content"].iloc[0]
entities = extract_entities(sample_text)
print("Named Entities in the first article:")
for entity, label in entities:
    print(f"{entity} - {label}")
Named Entities in the first article:
April 15, 2024 - DATE
Ada Recinos - PERSON
Press - ORG
Federal Communications - ORG
Washington, DC - GPE
Sierra Club - ORG
Communities United - ORG
Samsung - ORG
$6.4 billion - MONEY
the U.S. Department of Commerce - ORG
Samsung - ORG
the CHIPS Act - LAW
Samsung - ORG
Austin - PERSON
Taylor - PERSON
Texas - GPE
Samsung - ORG
Samsung - ORG
Austin - GPE
Taylor - PERSON
the US Department of Commerce - ORG
PhD - WORK_OF_ART
Sierra Club’s - ORG
Lone Star Chapter - ORG
Samsung - ORG
Taylor - PERSON
Texas - GPE
multi-billion dollar - MONEY
the U.S. Department of Commerce - ORG
Samsung - ORG
Texas - GPE
100% - PERCENT
Biden - PERSON
the CHIPS Law - LAW
Samsung - ORG
more than $40 billion - MONEY
Texas - GPE
Harry Manin - PERSON
Industrial Policy & Trade - ORG
Sierra Club - ORG
Sierra Club - ORG
Raimondo - PERSON
Samsung - ORG
100% - PERCENT
BackgroundDespite Samsung's - ORG
Taylor - PERSON
Austin - PERSON
Vietnam - GPE
Samsung - ORG
Hanoi - GPE
years - DATE
Hundreds - CARDINAL
Samsung - ORG
South Korea - GPE
Samsung - ORG
2023 - DATE
Samsung - ORG
Texas - GPE
the Sierra Club - ORG
The Sierra Club is - ORG
America - GPE
millions - CARDINAL
Sierra Club - ORG
Ada Recinos - PERSON
Related Press Releases - ORG
Industrial Transformation - ORG
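Calling nlp() on each of the 6,354 articles individually would be slow. spaCy’s nlp.pipe streams documents in batches and is much faster. As a minimal sketch (on a 100-article sample; adjust the sample size to your hardware), the following counts the most frequent ORG entities, which also previews exercise 3 at the end of this lab:

from collections import Counter

# Batch-process a 100-article sample with nlp.pipe
sample_docs = rdata_df["content"].head(100).tolist()
org_counts = Counter()
for doc in nlp.pipe(sample_docs, batch_size=20):
    org_counts.update(ent.text for ent in doc.ents if ent.label_ == "ORG")

print("Most common ORG entities in the sample:")
for text, count in org_counts.most_common(10):
    print(f"{text}: {count}")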

5.3 Sentiment Analysis#

from textblob import TextBlob


def get_sentiment(text):
    return TextBlob(text).sentiment.polarity


# Apply sentiment analysis to all articles
rdata_df["sentiment"] = rdata_df["content"].apply(get_sentiment)

# Plot sentiment distribution
plt.figure(figsize=(10, 6))
plt.hist(rdata_df["sentiment"], bins=20)
plt.title("Sentiment Distribution of Sierra Club Press Releases")
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.show()

print(f"Average sentiment: {rdata_df['sentiment'].mean()}")
[Figure: histogram of sentiment scores across the press releases]
Average sentiment: 0.12952396179718417
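TextBlob’s .sentiment actually returns two scores: polarity (from -1, negative, to +1, positive; this is what we plotted) and subjectivity (from 0, objective, to 1, subjective). A quick sketch to inspect both for the first article:

blob = TextBlob(rdata_df["content"].iloc[0])
print(f"Polarity:     {blob.sentiment.polarity:.3f}")
print(f"Subjectivity: {blob.sentiment.subjectivity:.3f}")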

6. Text Representation#

Let’s create vector representations of our texts using TF-IDF.

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vectorizer.fit_transform(rdata_df["processed_content"])

# Get feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the top 10 terms for the first document
first_doc_vector = tfidf_matrix[0]
top_terms = sorted(
    zip(feature_names, first_doc_vector.toarray()[0]), key=lambda x: x[1], reverse=True
)[:10]
print("Top 10 terms in the first document:")
for term, score in top_terms:
    print(f"{term}: {score}")
Top 10 terms in the first document:
texas: 0.3101051886326014
worker: 0.27622023052939987
agreement: 0.2288625126017953
facility: 0.21455232095761398
safety: 0.2009832958865476
must: 0.1752966664790094
energy: 0.1538819754578188
community: 0.15154887003653952
trade: 0.14915865499620834
commitment: 0.13363707315372284
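One immediate use of these TF-IDF vectors is measuring how similar documents are to each other. As a minimal sketch, scikit-learn’s cosine_similarity finds the press releases closest to the first one:

from sklearn.metrics.pairwise import cosine_similarity

# Similarity of the first document to all documents (including itself)
sims = cosine_similarity(tfidf_matrix[0], tfidf_matrix).flatten()

# The highest score is the document itself, so skip the first hit
most_similar = sims.argsort()[::-1][1:4]
print("Most similar press releases to the first one:")
print(rdata_df["title"].iloc[most_similar])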

7. Conclusion#

In this lab, we applied a range of NLP techniques to Sierra Club press releases: text preprocessing, word frequency analysis, named entity recognition, sentiment analysis, and TF-IDF text representation. These techniques provide a foundation for the more advanced analyses in future sessions.

8. Exercise#

As an exercise, try to answer the following questions:

  1. What are the most common themes in the Sierra Club press releases based on the word frequency analysis?

  2. How does the sentiment of the press releases change over time? (Hint: you may need to parse the ‘timestamp’ column; see the starter sketch below)

  3. What are the most common named entities mentioned in the press releases?

Submit your findings and code for these exercises.
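For question 2, a possible starting point (assuming all timestamps follow the “April 15, 2024” pattern seen above) is to parse the timestamp column into dates and average sentiment by month:

import pandas as pd

# Parse "Month DD, YYYY" timestamps; errors="coerce" turns any
# non-matching values into NaT instead of raising
rdata_df["date"] = pd.to_datetime(rdata_df["timestamp"], format="%B %d, %Y", errors="coerce")
monthly_sentiment = rdata_df.groupby(rdata_df["date"].dt.to_period("M"))["sentiment"].mean()
print(monthly_sentiment.tail())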