Environmental Data Sources

Used in This Course

| Source | Description | Access | Used In |
| --- | --- | --- | --- |
| New York Times API | Full-text article search across the NYT archive. Returns metadata, snippets, and article URLs via JSON. Free with registration; rate-limited. | API key via NYT Developer Portal | Lab 1 |
| Nexis Uni | Licensed news and publication database with full-text exports in .docx format. Covers thousands of outlets globally. | UCSB Library login | Labs 2 & 4 |
| Bluesky / AT Protocol | Decentralized social media platform. The {atrrr} package provides a tidy R interface for searching posts by keyword, hashtag, or user. | Free account + app password | Lab 3 |
| Climate Security Dialogues on Twitter (CGIAR) | Annotated Twitter dataset covering climate security discourse. Useful for sentiment analysis, classification, and topic modeling in a climate-conflict context. | Open access via CGIAR CGSpace | None |
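
As a rough illustration of the JSON workflow described for the New York Times API, the sketch below builds an Article Search request and parses the response with {jsonlite}. The helper name `nyt_search_url()` and the `NYT_API_KEY` environment variable are assumptions made for this example, not part of any package or of the course labs. The request is only sent when a key is set.

```r
library(jsonlite)

# Hypothetical helper: builds an Article Search request URL.
nyt_search_url <- function(query, api_key, page = 0) {
  paste0(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    "?q=", utils::URLencode(query, reserved = TRUE),
    "&page=", page,
    "&api-key=", api_key
  )
}

# Assumed environment variable holding the key from the NYT Developer Portal.
api_key <- Sys.getenv("NYT_API_KEY")
url <- nyt_search_url("sea level rise", api_key)

# Only call the API when a key is available.
if (nzchar(api_key)) {
  res <- fromJSON(url, flatten = TRUE)
  articles <- res$response$docs

  # Keep whichever of the commonly returned columns are present.
  keep <- intersect(
    c("pub_date", "headline.main", "snippet", "web_url"),
    names(articles)
  )
  print(head(articles[, keep, drop = FALSE]))
}
```

Because the API is rate-limited, a loop over several pages would likely need a pause (for example `Sys.sleep()`) between requests.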

Key Readings

Studies applying text and sentiment analysis to environmental research questions, organized thematically.

Foundational NLP & Sentiment Methods

| Citation | Year | Key Contribution |
| --- | --- | --- |
| Silge & Robinson, Text Mining with R (O’Reilly) | 2017 | Canonical tidy text analysis reference; covers sentiment, tf-idf, topic models in R |
| Hvitfeldt & Silge, Supervised ML for Text Analysis in R | 2022 | ML-based text classification, embeddings, and modeling workflows in R |
| Blei, Ng & Jordan, JMLR | 2003 | Original LDA paper; the generative model underlying most topic modeling workflows |
| Mikolov et al., arXiv | 2013 | Word2Vec; introduced dense word embeddings that underpin modern NLP |

Climate & Weather

| Citation | Year | Key Contribution |
| --- | --- | --- |
| Lüdecke et al., Nature Climate Change | 2021 | Sentiment analysis of climate change Twitter discourse across countries; identifies negativity bias and polarization |
| Cody et al., Environmental Research Letters | 2015 | Sentiment in climate-related tweets tracks real-world climate events; validates social media as a climate signal |
| Tvinnereim et al., Global Environmental Change | 2020 | Open-ended survey responses on climate change coded with topic models; reveals public framing diverges from expert framing |

Sentiment Analysis in Environmental Science

| Citation | Year | Key Contribution |
| --- | --- | --- |
| Alvarez-Lacalle et al., Communications Earth & Environment | 2024 | Sixteen years of Reddit climate change discourse analyzed for shifts in language and sentiment; identifies growing negativity and polarization over time using longitudinal NLP |
| Shaeri et al., arXiv | 2025 | Survey of sentiment analysis methods applied to social media during climate disasters and extreme weather events; taxonomizes approaches from lexicon-based tools through LLMs and identifies open challenges |
| Amangeldi et al., arXiv | 2023 | Applies PMI-based sentiment and NRC emotion analysis to a decade of climate and environmental posts across Twitter, Reddit, and YouTube; finds negative sentiment dominates, with fear and anticipation as leading emotions |
| Feldman et al., Weather, Climate, and Society | 2023 | Applies lexicon-based emotion and sentiment analysis to a large Twitter corpus on climate change; identifies how emotional valence varies by topic framing, user type, and seasonal climate events |
| Anonymous, Springer | 2025 | Systematic literature review of NLP and ML methods applied to climate change discourse on social media; maps the state of the field across sentiment analysis, topic modeling, and classification approaches |
| van der Veen & Bleich, PLOS ONE | 2025 | Introduces MultiLexScaled and demonstrates that lexicon-based sentiment methods remain competitive with ML and LLM approaches for media and text corpora; directly relevant to lexicon vs. ML tradeoffs discussed in this course |

Biodiversity & Conservation

| Citation | Year | Key Contribution |
| --- | --- | --- |
| Westgate et al., Conservation Biology | 2015 | Topic modeling of conservation literature to identify research gaps and emerging themes |
| Nolan et al., Methods in Ecology & Evolution | 2021 | pyResearchInsights: automated topic modeling pipeline for ecology and conservation abstracts |
| Valle et al., Ecology Letters | 2014 | LDA applied to species assemblage data; bridges ecological community analysis and NLP methods |
| Edelmann et al., Conservation Biology | 2025 | Review of LLMs and NLP for evidence synthesis in conservation social science |

Pollution & Environmental Health

| Citation | Year | Key Contribution |
| --- | --- | --- |
| Feng et al., PLOS ONE | 2015 | LDA on Weibo posts to track public awareness of PM2.5 air quality in China |
| Wang & Jia, Journal of Cleaner Production | 2021 | Social media discourse on air pollution linked to bottom-up environmental governance pressure |
| Chang et al., Journal of Environmental Management | 2023 | Twitter-based civil complaints about urban pollution spatially matched to monitoring data in Taipei |
| Lin et al., IJERPH | 2021 | LDA + PLS-SEM on social media mining to model air pollution adaptation behavior |

Corporate ESG & Environmental Reporting

| Citation | Year | Key Contribution |
| --- | --- | --- |
| Székely & vom Brocke, PLOS ONE | 2017 | LDA on GRI sustainability reports to track longitudinal shifts in corporate environmental framing |
| Kriebel & Foege, Decision Support Systems | 2024 | Benchmarks NLP methods (LDA, BERT, ChatGPT) for sustainability disclosure analysis in 10-K filings |
| Gorovaia et al., Sustainability | 2025 | LDA + TF-IDF pipeline for greenwashing detection using a Greenwashing Severity Index |

Text Classification in Environmental Science

| Citation | Year | Binary Task |
| --- | --- | --- |
| Coan et al., Nature Climate Change | 2021 | Classifies climate contrarian claims vs. legitimate climate discourse; applies ML to large-scale denial detection |
| Kulkarni et al., Methods in Ecology & Evolution | 2021 | Classifies news articles as relevant vs. irrelevant to CITES-listed threatened species; shows ML outperforms keyword search |
| Shyrokykh et al., PLOS ONE | 2023 | Compares ML classifiers for short text; identifies climate-relevant tweets with Naive Bayes, SVM, and BERT |
| Webersinke et al., arXiv | 2021 | ClimateBERT: domain-adapted language model for detecting climate-relevant text; strong baseline for downstream classification |
| Patel et al., PLOS ONE | 2017 | Text mining to classify PubMed abstracts as relevant vs. irrelevant for chemical exposure assessment; demonstrates systematic review automation |
| Grasso et al., arXiv | 2024 | EcoVerse: annotated Twitter dataset for eco-relevance binary classification; benchmark for environmental social media filtering |
| Anonymous, Environmental Data Science | 2025 | Reviews language models for climate change document analysis including binary classification of policy commitment paragraphs |

R Packages for Text Analysis

Core Text Analysis

| Package | Purpose | Install |
| --- | --- | --- |
| tidytext | Tidy-format text mining: tokenization, tf-idf, sentiment joins, topic model tidying | install.packages("tidytext") |
| quanteda | High-performance corpus and document-feature matrix (DFM) construction | install.packages("quanteda") |
| quanteda.textstats | Keyness, readability, lexical diversity statistics on DFMs | install.packages("quanteda.textstats") |
| quanteda.textplots | Wordclouds, keyness plots, and other quanteda visualizations | install.packages("quanteda.textplots") |
| stringr | Consistent string manipulation functions (part of {tidyverse}) | install.packages("stringr") |
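
To make the tidytext workflow concrete, here is a minimal sketch of tokenization followed by tf-idf weighting. The two example documents are made up for illustration, and the column names `doc_id` and `text` are assumptions rather than a required schema.

```r
library(dplyr)
library(tidytext)

# Made-up example documents.
docs <- tibble(
  doc_id = c("a", "b"),
  text = c(
    "Wildfire smoke worsens air quality across the valley.",
    "Coral reefs face warming oceans and acidification."
  )
)

# One row per word per document, with common stop words removed.
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(doc_id, word, sort = TRUE)

# Adds tf, idf, and tf_idf columns; high tf_idf marks words that
# are distinctive for one document relative to the others.
tf_idf <- word_counts %>%
  bind_tf_idf(word, doc_id, n) %>%
  arrange(desc(tf_idf))

print(tf_idf)
```

`unnest_tokens()` lowercases text and strips punctuation by default, so the same pipeline can usually be applied directly to raw article text.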

Preprocessing

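| Package | Purpose | Install |
| --- | --- | --- |
| SnowballC | Porter stemmer for 15+ languages | install.packages("SnowballC") |
| textstem | Dictionary-based lemmatization (default lemma table from {lexicon}; hunspell and TreeTagger engines also supported) | install.packages("textstem") |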
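
The difference between stemming and lemmatization is easiest to see side by side. The sketch below runs both on a small, arbitrary word list: stems are truncated forms that need not be real words, while lemmas are dictionary forms.

```r
library(SnowballC)
library(textstem)

# Arbitrary example words.
words <- c("forests", "burning", "studies", "emissions")

# Rule-based suffix stripping (Snowball English stemmer).
stems <- wordStem(words, language = "english")

# Dictionary lookup using textstem's default lemma table.
lemmas <- lemmatize_words(words)

# Compare the two normalizations for each word.
print(data.frame(word = words, stem = stems, lemma = lemmas))
```

Stemming is fast and language-flexible; lemmatization tends to give more readable tokens for plots and topic labels.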

Sentiment Analysis

| Package | Purpose | Install |
| --- | --- | --- |
| sentimentr | Sentence-level sentiment with valence shifters (negation, amplifiers, de-amplifiers) | install.packages("sentimentr") |
| tidyvader | Tidy interface to VADER; optimized for social media, handles slang and punctuation | remotes::install_github("chris31415926535/tidyvader") |
| vader | Direct VADER implementation; returns compound, positive, negative, and neutral scores | install.packages("vader") |
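
As a small illustration of sentence-level scoring with valence shifters, the sketch below scores two made-up posts that differ only by a negator. Comparing the two `ave_sentiment` values shows how {sentimentr} treats negation, which a plain word-count lexicon join would miss.

```r
library(sentimentr)

# Made-up posts that differ only in the negator "not".
posts <- c(
  "The river cleanup was a success.",
  "The river cleanup was not a success."
)

# Split into sentences, then average sentence scores per post.
scores <- sentiment_by(get_sentences(posts))

# One row per post: element_id, word_count, sd, ave_sentiment.
print(scores)
```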

Topic Modeling

| Package | Purpose | Install |
| --- | --- | --- |
| topicmodels | LDA and CTM topic models; interfaces with {tidytext} via tidy() | install.packages("topicmodels") |
| ldatuning | Metrics (CaoJuan2009, Deveaud2014, etc.) for selecting the number of topics k | install.packages("ldatuning") |
| LDAvis | Interactive visualization of topic-word distributions and intertopic distances | install.packages("LDAvis") |
| stm | Structural Topic Model; allows topic prevalence and content to vary with document covariates | install.packages("stm") |
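
The hand-off between {tidytext} and {topicmodels} can be sketched as follows: cast tidy word counts to a document-term matrix, fit LDA, then tidy the per-topic word probabilities. The four-document corpus is made up and far too small for meaningful topics; `k = 2` and the seed are arbitrary choices for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Made-up toy corpus.
docs <- tibble(
  doc_id = c("d1", "d2", "d3", "d4"),
  text = c(
    "Wildfire smoke and air pollution harm public health.",
    "Air quality alerts follow wildfire smoke events.",
    "Coral reefs decline as oceans warm and acidify.",
    "Warming oceans threaten reef fish and coral."
  )
)

# Tidy word counts cast to a DocumentTermMatrix.
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

# Fit a two-topic LDA model with a fixed seed for reproducibility.
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# beta = probability of each term within each topic.
topic_terms <- tidy(lda_fit, matrix = "beta")

# Five highest-probability terms per topic.
top_terms <- topic_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup()

print(top_terms)
```

On a real corpus, the number of topics would typically be chosen by comparing several values of k, for example with the metrics in {ldatuning}.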

Data Access

| Package | Purpose | Install |
| --- | --- | --- |
| jsonlite | Parse JSON responses from APIs (NYT, etc.) into R data frames | install.packages("jsonlite") |
| LexisNexisTools | Parse Nexis Uni .docx exports into tidy data frames | install.packages("LexisNexisTools") |
| atrrr | Bluesky (AT Protocol) API client: search posts, retrieve feeds, authenticate | install.packages("atrrr") |