By the end of this week, you will be able to:
- Clean raw corpus text — removing URLs, encoding artifacts, and source-specific boilerplate
- Build domain-specific stopword lists — identifying and removing corpus-specific uninformative terms beyond standard lexicons
- Apply stemming and lemmatization — reducing tokens to base forms using the Porter stemmer and dictionary-based lemmatization, and evaluating the tradeoffs between them
- Conduct frequency analysis — counting and visualizing term frequency across the full corpus and by source
- Perform keyness analysis — comparing word use across sub-corpora using log-likelihood (G²) with quanteda to identify characteristic terms
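The first four objectives form a single preprocessing pipeline in quanteda. A minimal sketch is below; the toy documents, the copyright regex, and the extra stopwords ("said", "news") are illustrative stand-ins, not the actual Nexis Uni boilerplate, and the lemmatization step assumes the `lexicon` package is installed for its `hash_lemmas` lookup table.

```r
library(quanteda)

# Toy documents standing in for Nexis Uni articles (hypothetical text)
docs <- c(art1 = "Copyright 2024 Example News. Markets said rally continues. http://ex.com/a",
          art2 = "Copyright 2024 Example News. Analysts said markets were rallying again.")

# 1. Clean raw text: strip source-specific boilerplate with a regex,
#    then drop URLs, punctuation, and numbers at tokenization
docs <- gsub("Copyright \\d{4} [A-Za-z ]+\\.", "", docs)
toks <- tokens(docs, remove_punct = TRUE, remove_url = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# 2. Stopwords: standard English lexicon plus corpus-specific terms
custom_stops <- c(stopwords("en"), "said", "news")  # domain terms are examples
toks <- tokens_remove(toks, custom_stops)

# 3a. Porter stemming (tokens_wordstem uses SnowballC's Porter-family stemmers)
toks_stemmed <- tokens_wordstem(toks, language = "en")

# 3b. Alternatively, dictionary-based lemmatization via a lemma lookup table
#     (assumes the lexicon package; unlike stemming, output stays a real word)
# toks_lemma <- tokens_replace(toks,
#                              pattern = lexicon::hash_lemmas$token,
#                              replacement = lexicon::hash_lemmas$lemma)

# 4. Frequency analysis on the document-feature matrix
dfmat <- dfm(toks_stemmed)
topfeatures(dfmat, 10)
```

Note the tradeoff the objectives ask you to evaluate: stemming is fast but can produce non-words ("ralli"), while lemmatization maps to dictionary forms ("rally") at the cost of an extra lookup table.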
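For the keyness objective, quanteda's companion package `quanteda.textstats` provides `textstat_keyness()`, where `measure = "lr"` gives the log-likelihood ratio (G²). The sketch below uses the built-in inaugural-address corpus as a stand-in for your Nexis Uni sub-corpora; the choice of target document is arbitrary here.

```r
library(quanteda)
library(quanteda.textstats)

# Build a dfm from a built-in corpus (stand-in for your own sub-corpora)
dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))

# Compare one sub-corpus (target) against the rest (reference)
# using the log-likelihood ratio, i.e. G²
key <- textstat_keyness(dfmat,
                        target  = docnames(dfmat) == "2017-Trump",
                        measure = "lr")
head(key)  # terms most characteristic of the target sub-corpus
```

In your lab, the target would instead be defined by a source docvar (e.g., comparing articles from one outlet against the others), so positive G² values flag terms over-represented in that source.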
Lab 2 — Text Analysis of Nexis Uni Data (Graded · 11 pts)
Due: April 21 @ 11:59 pm
Complete the assignment laid out at the end of the Lab 2 in-class document (parts 7 and 8). Submit both your .Rmd file and the knitted document (HTML or PDF) to Canvas.