Text Classification

Introduction to Natural Language Processing

Text Mining

Overview

  • Natural language processing (NLP) denotes the retrieval and systematization of information contained in written or spoken language
  • Text classification deals with assigning observations that contain textual information to predefined groups

Text Documents

Code
set.seed(7)
imdb <- read_parquet("data/imdb.parquet")
imdb <- sample_n(imdb, 2000)

imdb |>
  filter(nchar(review) < 800 & nchar(review) > 200) |>
  filter(row_number() < 6) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
review sentiment
this show is the best it is full of laughs and Kevin James is the best so if you want a good show i recommend the king of queens and its a letdown that they canceled it so in the end this show will make you forget your worries and troubles cause if you have a cast with Kevin James and jerry stiller you cant go wrong. so i don't know why the canceled the show if any one knows please tell me.now a days you cant find a lot of shows that fulfill your needs as an audience.after Seinfeld and king of queens the only show worth watching is prison break and if that stops i don't know what to do. in the end if i had to recommend a show it will be king of queens. positive
"Undercurrent" features a top-notch cast of wonderful actors who might've been assembled for the perfect drawing-room comedy. Alas, they are pretty much wasted on a 'woman's view' potboiler--and a paper-thin one at that. Katharine Hepburn is indeed radiant as a tomboy/old maid who finally marries, but her husband is deeply disturbed and harboring dark family secrets. Director Vincente Minnelli has absolutely no idea how to mount this outlandish plot, concocted by Edward Chodorov from a story by Thelma Strabel, and the friendly, first-rate cast (including Robert Taylor, Robert Mitchum and Edmund Gwenn) is left treading in murky waters. ** from **** negative
For those out there that like to think of themselves as reasonably intelligent human beings, who love film, have good attention to detail and enjoy indie movies with funny, smartly written dialogue then this is a film for you.<br /><br />For those with a poor attention span, high expectations and no brains.. well.. um.. you may get bored and find things dragging at times.<br /><br />This is a charming, modest and well paced movie with the actors bestowing a real sense of depth and warmth to their roles. I chuckled to myself pretty much the whole way through..<br /><br />This film is a little gem. positive
I am surprised than many viewers hold more respect for the sequel to this brilliant movie... I have seen all the guinea pigs and this one is easily the best.<br /><br />Even though ive seen the "making of", i still have doubts when watching those 35mins of pure torture : its that powerful.<br /><br />A 10 out of 10 because this movie achieved perfectly what it set out to do : be the best fake snuff film ever made. positive
The whole movie was done half-assed. It could have been a much better movie but, that would have required a re-write and different actors. Compared to "Traffic" this was a wreck. I am just glad I didn't have to pay for it.<br /><br />Spoiler:<br /><br />What was the point of having crooked cops getting arrested? To share the guilt of drug dealers and make them feel better? Pu-leaze! The parents were scum, driven by greed, and didn't even consider the harm they were doing; as pointed out by Ice-T. <br /><br />2 out of 10 negative

Tokens

  • Tokens are substrings, words, phrases, or sentences that can be extracted from a document
  • When tokenizing a document, you might want to get rid of:
    • Punctuation (. ? ! …)
    • Stop words (and or because …)
    • Hyphens (-)
    • Any kind of special characters
    • Numbers or dates
    • Short strings

Combining and Compounding of Tokens

  • Dictionaries (stemming) and thesauruses (synonym finding)
  • String similarity analysis
  • Co-occurrence analysis
Code
toks <- imdb$review |> 
  tokens(
    remove_punct = TRUE, 
    remove_separators = TRUE, 
    split_hyphens = FALSE, 
    what="word"
    ) |> 
  tokens_remove(pattern = stopwords('english')) |>
  tokens_remove(pattern = c("br", "<", ">"))
  #tokens_split(separator = "-", remove_separator = TRUE) |>
  #tokens_wordstem()

n-grams

  • Words or characters are compounded in groups of n
  • Each subsequent n-gram drops the first word or character and appends the next one
  • Example: “I smiled through the whole film”
    • 3-grams (words): ["I smiled through", "smiled through the", "through the whole", "the whole film"]
    • 3-grams (characters): ["I s", " sm", "smi", "mil", "ile", "led", ...]
  • Can help to identify meaningful combinations without much manual work
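In quanteda, n-grams can be built from a tokens object with `tokens_ngrams()`; a minimal sketch of the example above (word and character variants):

```r
library(quanteda)

txt <- "I smiled through the whole film"

# Word 3-grams: slide a window of three words across the sentence
tokens(txt) |>
  tokens_ngrams(n = 3, concatenator = " ")

# Character 3-grams: tokenize by character, keeping spaces as characters
tokens(txt, what = "character", remove_separators = FALSE) |>
  tokens_ngrams(n = 3, concatenator = "")
```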

Document Feature Matrix

  • Data frame that uses the tokens as variables (“document features”)
  • Contains counts or dummy variables indicating the appearance of tokens for each observation
dfm <- toks |>  
  dfm() |> # stem = T
  dfm_select(min_nchar = 2) |>
  dfm_trim(min_docfreq = 20) # min_termfreq = 0

dfm_data_frame <- convert(dfm, to = "data.frame")

dfm_data_frame |>
  filter(row_number() < 10) |>
  dplyr::select(1:10) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
doc_id get feeling lot people liked movie want admit understand
text1 1 1 2 3 1 4 2 1 1
text2 1 0 0 0 0 0 0 0 0
text3 0 0 0 1 0 2 0 0 0
text4 1 0 0 0 0 2 0 1 0
text5 0 0 1 0 0 0 1 0 0
text6 0 0 0 0 0 0 0 0 0
text7 1 0 0 0 0 1 0 0 0
text8 2 0 2 2 1 7 0 0 1
text9 0 2 0 0 0 1 1 0 0

Document Feature Matrix

  • After creating tokens, you might want to remove:
    • Features with low occurrence across documents
    • Redundant features
    • “Neutral” features
    • Any other information that could blur predictive performance or increase computational costs
  • Meta-information (e.g., author, date) and other variables can be used as additional features

Descriptives

textstat_frequency(dfm) |>
  filter(row_number() < 17) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
feature frequency rank docfreq group
movie 3576 1 1229 all
film 3052 2 1086 all
one 2045 3 1122 all
like 1631 4 931 all
just 1444 5 862 all
good 1208 6 757 all
even 1016 7 671 all
really 967 8 633 all
see 930 9 647 all
time 926 10 662 all
can 883 11 634 all
story 871 12 559 all
well 825 13 590 all
bad 797 14 490 all
get 760 15 555 all
also 741 16 514 all

Textplots

textplot_wordcloud(
  dfm,
  color = "#9C6B91",
  background = "transparent",
  fixed_aspect = TRUE
  )

Models for Text Classification

Models for Text Classification

  • In principle, any model could be used to predict classes from a document feature matrix
  • The two most common (traditional) methods for text classification:
    • Naive Bayes
    • Support Vector Machine (SVM)
  • Nowadays it is often more convenient to use a neural-network-based LLM or transformer model

Naive Bayes for Text Features

  • For text classification, document features are discrete:
    • Binary indicators (word present/absent) → Bernoulli Naive Bayes
    • Word counts or term frequencies → Multinomial Naive Bayes
  • We want to calculate the probability that an observation falls into category \(C_k\) given a set of observed document features \(f_1, f_2, ..., f_p\):

    \[P(C_k \mid f_1, f_2, ..., f_p) = \frac{P(f_1, f_2, ..., f_p \mid C_k) \times P(C_k)}{P(f_1, f_2, ..., f_p)}\]

  • The “naive” assumption is that the occurrences of features are independent:

    \[P(f_1, f_2, ..., f_p \mid C_k) = P(f_1 \mid C_k) \times P(f_2 \mid C_k) \times ... \times P(f_p \mid C_k) = \prod_{j=1}^{p} P(f_j \mid C_k)\]
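A minimal worked example with hypothetical numbers: suppose a review contains only the features "great" and "plot", with \(P(\text{great} \mid \text{pos}) = 0.8\), \(P(\text{plot} \mid \text{pos}) = 0.5\), \(P(\text{great} \mid \text{neg}) = 0.1\), \(P(\text{plot} \mid \text{neg}) = 0.5\), and equal priors \(P(\text{pos}) = P(\text{neg}) = 0.5\). Then:

    \[P(\text{pos} \mid \text{great}, \text{plot}) \propto 0.8 \times 0.5 \times 0.5 = 0.2\]

    \[P(\text{neg} \mid \text{great}, \text{plot}) \propto 0.1 \times 0.5 \times 0.5 = 0.025\]

After normalizing, \(P(\text{pos} \mid \text{great}, \text{plot}) = 0.2 / (0.2 + 0.025) \approx 0.89\), so the review is classified as positive.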

Naive Bayes: Gaussian Kernel

  • When features are continuous (e.g., numerical measurements), Naive Bayes assumes they follow a normal (Gaussian) distribution within each class
  • The likelihood of a feature \(f_j\) given class \(C_k\) is modeled as:

    \[ P(f_j \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{kj}^2}} \exp\left( -\frac{(f_j - \mu_{kj})^2}{2\sigma_{kj}^2} \right) \]

  • This approach is commonly referred to as Gaussian Naive Bayes

  • Less common in text classification, but useful for other domains with real-valued inputs (e.g., sensor data, medical features)
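A sketch of Gaussian Naive Bayes on continuous inputs, using the e1071 package and the built-in iris data rather than the IMDB example (package choice is an assumption):

```r
library(e1071)

# Each feature is modeled as Normal(mu_kj, sigma_kj) within each species
mod_gnb <- naiveBayes(Species ~ ., data = iris)

# Estimated mean and sd of Sepal.Length per class
mod_gnb$tables$Sepal.Length

predict(mod_gnb, newdata = iris[1:3, ])
```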

Support Space with Gaussian Kernel

Naive Bayes Classifier

  • Although the “naive” assumption is unlikely to hold in reality, it often doesn’t significantly impair predictive power
  • To predict the class of a given set of document features, we calculate the probability for each category \(C_k\) and choose the one with the highest probability
  • For text features, the distribution of conditional probability is often Bernoulli or multinomial, while for continuous features, a normal distribution is usually assumed

Properties of Naive Bayes

  • Naive Bayes is not necessarily a Bayesian method
  • The method is computationally cheap
  • Often performs well even with small training samples
  • Coefficients are easy to interpret and can be used for feature selection
  • Redundant and highly correlated features can worsen performance by overstating their importance
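The estimated class-conditional word probabilities \(P(f_j \mid C_k)\) can be inspected directly; a sketch assuming a model fitted with quanteda.textmodels, which stores them in `$param` with one row per class (the `mod_nb` object is fitted in the application below):

```r
# Ratio of P(word | positive) to P(word | negative)
ratio <- mod_nb$param["positive", ] / mod_nb$param["negative", ]

# Words whose conditional probability most favors the positive class
head(sort(ratio, decreasing = TRUE))
```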

Application in R

Code
mod_nb <- textmodel_nb(
  x = dfm, 
  y = imdb$sentiment, 
  prior = "docfreq"
  )

imdb$predicted<- predict(
  mod_nb, 
  newdata = dfm, 
  type = "class"
  )

imdb |>
  filter(nchar(review) < 800 & nchar(review) > 200) |>
  filter(row_number() < 6) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
review sentiment predicted
this show is the best it is full of laughs and Kevin James is the best so if you want a good show i recommend the king of queens and its a letdown that they canceled it so in the end this show will make you forget your worries and troubles cause if you have a cast with Kevin James and jerry stiller you cant go wrong. so i don't know why the canceled the show if any one knows please tell me.now a days you cant find a lot of shows that fulfill your needs as an audience.after Seinfeld and king of queens the only show worth watching is prison break and if that stops i don't know what to do. in the end if i had to recommend a show it will be king of queens. positive positive
"Undercurrent" features a top-notch cast of wonderful actors who might've been assembled for the perfect drawing-room comedy. Alas, they are pretty much wasted on a 'woman's view' potboiler--and a paper-thin one at that. Katharine Hepburn is indeed radiant as a tomboy/old maid who finally marries, but her husband is deeply disturbed and harboring dark family secrets. Director Vincente Minnelli has absolutely no idea how to mount this outlandish plot, concocted by Edward Chodorov from a story by Thelma Strabel, and the friendly, first-rate cast (including Robert Taylor, Robert Mitchum and Edmund Gwenn) is left treading in murky waters. ** from **** negative positive
For those out there that like to think of themselves as reasonably intelligent human beings, who love film, have good attention to detail and enjoy indie movies with funny, smartly written dialogue then this is a film for you.<br /><br />For those with a poor attention span, high expectations and no brains.. well.. um.. you may get bored and find things dragging at times.<br /><br />This is a charming, modest and well paced movie with the actors bestowing a real sense of depth and warmth to their roles. I chuckled to myself pretty much the whole way through..<br /><br />This film is a little gem. positive positive
I am surprised than many viewers hold more respect for the sequel to this brilliant movie... I have seen all the guinea pigs and this one is easily the best.<br /><br />Even though ive seen the "making of", i still have doubts when watching those 35mins of pure torture : its that powerful.<br /><br />A 10 out of 10 because this movie achieved perfectly what it set out to do : be the best fake snuff film ever made. positive positive
The whole movie was done half-assed. It could have been a much better movie but, that would have required a re-write and different actors. Compared to "Traffic" this was a wreck. I am just glad I didn't have to pay for it.<br /><br />Spoiler:<br /><br />What was the point of having crooked cops getting arrested? To share the guilt of drug dealers and make them feel better? Pu-leaze! The parents were scum, driven by greed, and didn't even consider the harm they were doing; as pointed out by Ice-T. <br /><br />2 out of 10 negative negative
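In-sample performance of the Naive Bayes model can be summarized with base R (keep in mind that evaluating on the training data overstates out-of-sample accuracy):

```r
# Confusion matrix: predicted vs. actual sentiment
table(predicted = imdb$predicted, actual = imdb$sentiment)

# Overall share of correctly classified reviews
mean(imdb$predicted == imdb$sentiment)
```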

Support Vector Machine

SVM with Hard Margin

  • Two perfectly linearly separable classes \(\{-1, 1\}\) within a support space of multiple (standardized) predictors
  • Fit a hyperplane through the space that separates the two groups, maximizing the margin \(M\) between the plane and the observations:

\[\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M\]

Subject to:

\[\sum_{j = 1}^p \beta_j^2 = 1\]

\[y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M\]

SVM with Hard Margin

SVM with Soft Margin

  • Perfect linear separation is rarely possible or desirable
  • Introduce a penalty term for observations on the wrong side of the boundary:

\[\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M\]

Subject to:

\[\sum_{j = 1}^p \beta_j^2 = 1\]

\[y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M(1 - \xi_i)\]

\[\xi_i \ge 0, \quad \sum_{i = 1}^n \xi_i \le C\]
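A soft-margin linear SVM can be sketched with the e1071 package (a package choice assumed here; quanteda.textmodels also offers `textmodel_svm()`). Note that e1071's `cost` parameter penalizes the slack variables \(\xi_i\), so it acts as the inverse of the budget \(C\) above:

```r
library(e1071)

# Soft-margin linear SVM on the document feature matrix from above
mod_svm <- svm(
  x = as.matrix(dfm),          # dense matrix from the (sparse) dfm
  y = factor(imdb$sentiment),
  kernel = "linear",
  cost = 1                     # larger cost = less slack = harder margin
)
```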

SVM with Soft Margin

Dimensions and Linear Separability

  • In high dimensions, observations tend to approach linear separability.
    • In text classification, we often have thousands of features
  • The “kernel trick”:
    • Generate additional features by applying transformations to the existing ones
    • Attempt linear separation in now higher-dimensional space
    • Projected back into the original space, the separating plane is non-linear
    • Distinct observations can always be linearly separated in sufficiently complex and high-dimensional spaces

Dimensions and Linear Separability

Kernels

  • Commonly utilized kernel functions are:
    • Polynomial: \(K\left(x, x'\right) = \left(1 + \gamma\langle x, x' \rangle\right) ^ d\)
    • (Gaussian) Radial: \(K\left(x, x'\right) = \exp\left(-\frac{1}{2\sigma^2} \lVert x - x'\rVert ^ 2\right)\)
  • This is usually done in combination with introducing a penalty term (i.e., a soft margin)
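In e1071, the radial kernel is parameterized as \(K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)\), so \(\gamma\) corresponds to \(1/(2\sigma^2)\) in the formula above; a sketch (the `gamma` value is hypothetical):

```r
# Radial-basis SVM; gamma plays the role of 1/(2 sigma^2)
mod_rbf <- svm(
  x = as.matrix(dfm),
  y = factor(imdb$sentiment),
  kernel = "radial",
  gamma = 0.01,   # hypothetical value; should be tuned
  cost = 1
)
```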

Radial Kernel

Tuning \(\sigma\)

Properties of SVM

  • What if we have more than two classes?
    • One-versus-all comparison
    • One-versus-one (pairwise) comparison
  • Method is a “black box” and difficult to comprehend
    • Packages such as vip can translate the importance of a feature into simple numeric values
  • SVM is a versatile method also used for a variety of other prediction problems
    • Can be sensitive to the kernel chosen
    • Non-linear kernels tend to overfit (use CV!)
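Cross-validated tuning of `cost` and `gamma` can be sketched with e1071's `tune()` (the grid values are hypothetical):

```r
# 10-fold CV over a grid of cost and gamma values
tuned <- tune(
  svm,
  train.x = as.matrix(dfm),
  train.y = factor(imdb$sentiment),
  kernel = "radial",
  ranges = list(cost = c(0.1, 1, 10), gamma = c(0.001, 0.01, 0.1)),
  tunecontrol = tune.control(cross = 10)
)
tuned$best.parameters
```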

Text Classification with LLMs

  • Nowadays it is often more convenient to run an LLM or transformer model than to train your own text classification model
    • either via a third-party API (e.g., the OpenAI API)
    • or locally (e.g., via Ollama)
  • Many models can be fine-tuned for specific tasks

OpenAI API for Text Classification

key <- read_csv("keys/key_schmoigl.txt") |> pull()

prompts <- imdb |>
  select(review) |>
  distinct() |>
  filter(row_number() <= 5)

system_role <- glue("
  Classify movie reviews from imdb as positive or negative. 
  Only return the string 'positive' or 'negative'. 
  Do not respond with any explanations."
  )

answers <- data.frame(review = NULL, predicted_chatGPT = NULL)
# Start after the last stored answer so the loop can resume if interrupted
for (i in (nrow(answers)+1):nrow(prompts)) { 
  response_raw <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", key)),
    content_type_json(),
    encode = "json",
    body = list(
      model = "gpt-4o-2024-08-06",
      messages = list(
        list(role = "system", content = system_role),
        list(role = "user", content = prompts$review[i])
        ),
      temperature = 0
    )
  )
  answers[i, 1] <- prompts$review[i]
  response <- content(response_raw)$choices[[1]]
  answers[i, 2] <- response$message$content
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(paste0(progress, "%\n"))}
}

Ollama API for Text Classification

After installing Ollama on your device run

ollama pull llama3.2
ollama list
export OLLAMA_HOST=localhost:11434

Check that Ollama is running

answers <- data.frame()

for (i in (nrow(answers)+1):nrow(prompts)) { 
  body <- list(
    prompt = paste(
      system_role,
      "The review is:",
      prompts$review[i]
    ),
    max_tokens = 200,
    temperature = 0,
    stream = FALSE,
    model = "llama3.2"
  )
  ollama_response <- POST(
    url = "http://localhost:11434/api/generate", 
    body = body, 
    encode = "json"
  )
  answers[i, 1] <- prompts$review[i]
  answers[i, 2] <- fromJSON(content(ollama_response, as = "text", encoding = "UTF-8"))$response
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(paste0(progress, "%\n"))}
}

names(answers) <- c("review", "predicted_ollama")