Using supervised machine learning to classify text documents — predicting outcomes from language with Naive Bayes and regularized regression.


Learning Objectives

By the end of this week, you will be able to:


Lecture Materials

Slides

🖥 Slide deck from this week’s lecture


Code & Data

💻 R scripts used during class:

Lab5


📂 Assignment

Lab 5: Text Classification (Graded, 11 pts)

Due: May 19 @ 11:59 pm

In this lab you will build a supervised text classifier to predict whether a climbing incident report describes a fatal or non-fatal accident. You will fit a Naive Bayes baseline, then compare its performance to other models on held-out test data.
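
The workflow described above (fit a Naive Bayes baseline on training reports, then score it on held-out reports) can be sketched in base R. This is a minimal illustration under stated assumptions, not the lab's actual code: the ten incident reports are invented toy data, the helper names `tokenize`, `train_nb`, and `predict_nb` are hypothetical, and the lab itself may rely on a text-analysis package instead of a hand-written model.

```r
# Toy incident reports (invented for illustration; NOT the lab data).
reports <- data.frame(
  text = c(
    "climber fell on lead and died from head injuries",
    "rappel anchor failed fatal fall to the ground",
    "avalanche buried the party one climber died",
    "unroped fall on ice fatal injuries",
    "leader fell and broke an ankle rescued by helicopter",
    "rockfall hit belayer minor injuries walked out",
    "stranded overnight on ledge rescued unharmed",
    "sprained wrist after short fall on lead",
    "rappel error fatal fall from anchor",
    "short fall broke ankle walked out with help"
  ),
  label = c("fatal", "fatal", "fatal", "fatal",
            "nonfatal", "nonfatal", "nonfatal", "nonfatal",
            "fatal", "nonfatal"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase, split on non-letters, drop empty tokens.
tokenize <- function(x) {
  lapply(strsplit(tolower(x), "[^a-z]+"), function(w) w[nzchar(w)])
}

# Hold out the last two reports as a test set (a real split would be random).
train <- reports[1:8, ]
test  <- reports[9:10, ]

# Multinomial Naive Bayes with add-alpha (Laplace) smoothing.
train_nb <- function(texts, labels, alpha = 1) {
  tokens  <- tokenize(texts)
  vocab   <- sort(unique(unlist(tokens)))
  classes <- sort(unique(labels))
  # Rows are vocabulary words, columns are classes.
  counts <- vapply(classes, function(cl) {
    as.numeric(table(factor(unlist(tokens[labels == cl]), levels = vocab)))
  }, numeric(length(vocab)))
  rownames(counts) <- vocab
  # Smoothed log P(word | class).
  log_lik <- log(sweep(counts + alpha, 2,
                       colSums(counts) + alpha * length(vocab), "/"))
  # Log P(class), estimated from class frequencies in the training set.
  log_prior <- log(as.numeric(table(factor(labels, levels = classes))) /
                     length(labels))
  names(log_prior) <- classes
  list(vocab = vocab, classes = classes,
       log_prior = log_prior, log_lik = log_lik)
}

predict_nb <- function(model, texts) {
  vapply(tokenize(texts), function(w) {
    w <- w[w %in% model$vocab]  # ignore words never seen in training
    scores <- model$log_prior + colSums(model$log_lik[w, , drop = FALSE])
    model$classes[which.max(scores)]
  }, character(1))
}

model <- train_nb(train$text, train$label)
pred  <- predict_nb(model, test$text)

# Evaluate on the held-out reports only.
confusion <- table(predicted = pred, actual = test$label)
accuracy  <- mean(pred == test$label)
print(confusion)
print(accuracy)
```

Scoring only the held-out rows is the point of the split: accuracy computed on the training reports would overstate how well the model generalizes. The same `pred` versus `test$label` comparison can be reused for any other model, which keeps the comparison against the baseline like for like.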




📚 Text Classification in Environmental Science — Key Citations

Curated readings organized by thematic area. Useful background for situating the classification methods from this week in applied environmental research contexts.

Climate Change & Misinformation

| Citation | Year | Topic | Keywords |
|---|---|---|---|
| Coan et al., Nature Climate Change | 2021 | Climate Denial Detection | contrarian claims, binary classification, SVM, misinformation, media framing |
| Shyrokykh et al., PLOS ONE | 2023 | Climate Tweet Classification | short text, Naive Bayes, SVM, BERT, Twitter, climate relevance |
| Webersinke et al., arXiv | 2021 | ClimateBERT | domain-adapted language model, climate-relevant text detection, transfer learning |

Biodiversity & Conservation

| Citation | Year | Topic | Keywords |
|---|---|---|---|
| Kulkarni et al., Methods in Ecology & Evolution | 2021 | Threatened Species News Filtering | binary classification, CITES, news relevance, random forest, information retrieval |
| Grasso et al., arXiv | 2024 | EcoVerse Dataset | eco-relevance, annotated Twitter dataset, binary classification benchmark |

Environmental Health & Pollution

| Citation | Year | Topic | Keywords |
|---|---|---|---|
| Patel et al., PLOS ONE | 2017 | Exposure Assessment Literature Screening | text mining, PubMed, binary relevance classification, systematic review automation |
| Bellinger et al., BMC Public Health | 2017 | Air Pollution Epidemiology Screening | ML, data mining, systematic review, air pollution, study relevance classification |
