Text Classification

Introduction to Natural Language Processing

Text Mining

Overview

  • Natural language processing (NLP) denotes the retrieval and systematization of information contained in written or spoken language
  • Text classification deals with assigning observations that contain textual information to predefined groups

Text Documents

Code
set.seed(7)
imdb <- read_parquet("data/imdb.parquet")
imdb <- sample_n(imdb, 2000)

imdb |>
  filter(nchar(review) < 800 & nchar(review) > 200) |>
  filter(row_number() < 6) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
review sentiment
this show is the best it is full of laughs and Kevin James is the best so if you want a good show i recommend the king of queens and its a letdown that they canceled it so in the end this show will make you forget your worries and troubles cause if you have a cast with Kevin James and jerry stiller you cant go wrong. so i don't know why the canceled the show if any one knows please tell me.now a days you cant find a lot of shows that fulfill your needs as an audience.after Seinfeld and king of queens the only show worth watching is prison break and if that stops i don't know what to do. in the end if i had to recommend a show it will be king of queens. positive
"Undercurrent" features a top-notch cast of wonderful actors who might've been assembled for the perfect drawing-room comedy. Alas, they are pretty much wasted on a 'woman's view' potboiler--and a paper-thin one at that. Katharine Hepburn is indeed radiant as a tomboy/old maid who finally marries, but her husband is deeply disturbed and harboring dark family secrets. Director Vincente Minnelli has absolutely no idea how to mount this outlandish plot, concocted by Edward Chodorov from a story by Thelma Strabel, and the friendly, first-rate cast (including Robert Taylor, Robert Mitchum and Edmund Gwenn) is left treading in murky waters. ** from **** negative
For those out there that like to think of themselves as reasonably intelligent human beings, who love film, have good attention to detail and enjoy indie movies with funny, smartly written dialogue then this is a film for you.<br /><br />For those with a poor attention span, high expectations and no brains.. well.. um.. you may get bored and find things dragging at times.<br /><br />This is a charming, modest and well paced movie with the actors bestowing a real sense of depth and warmth to their roles. I chuckled to myself pretty much the whole way through..<br /><br />This film is a little gem. positive
I am surprised than many viewers hold more respect for the sequel to this brilliant movie... I have seen all the guinea pigs and this one is easily the best.<br /><br />Even though ive seen the "making of", i still have doubts when watching those 35mins of pure torture : its that powerful.<br /><br />A 10 out of 10 because this movie achieved perfectly what it set out to do : be the best fake snuff film ever made. positive
The whole movie was done half-assed. It could have been a much better movie but, that would have required a re-write and different actors. Compared to "Traffic" this was a wreck. I am just glad I didn't have to pay for it.<br /><br />Spoiler:<br /><br />What was the point of having crooked cops getting arrested? To share the guilt of drug dealers and make them feel better? Pu-leaze! The parents were scum, driven by greed, and didn't even consider the harm they were doing; as pointed out by Ice-T. <br /><br />2 out of 10 negative

Tokens

  • Tokens are substrings, words, phrases, or sentences that can be extracted from a document
  • When tokenizing a document, you might want to get rid of:
    • Punctuation (. ? ! …)
    • Stop words (and or because …)
    • Hyphens (-)
    • Any kind of special characters
    • Numbers or dates
    • Short strings

Combining and Compounding of Tokens

  • Dictionaries (stemming) and thesauruses (synonym finding)
  • String similarity analysis
  • Co-occurrence analysis
Code
toks <- imdb$review |> 
  tokens(
    remove_punct = TRUE, 
    remove_separators = TRUE, 
    split_hyphens = FALSE, 
    what="word"
    ) |> 
  tokens_remove(pattern = stopwords('english')) |>
  tokens_remove(pattern = c("br", "<", ">"))
  #tokens_split(separator = "-", remove_separator = TRUE) |>
  #tokens_wordstem()

n-grams

  • Words or characters are compounded in groups of n
  • Each subsequent n-gram drops the first word or character and appends the next one
  • Example: “I smiled through the whole film”
    • 3-grams (words): ["I smiled through", "smiled through the", "through the whole", "the whole film"]
    • 3-grams (characters): ["I s", " sm", "smi", "mil", "ile", "led", ...]
  • Can help to identify meaningful combinations without much manual work
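In quanteda, n-grams can be built from a tokens object with `tokens_ngrams()`; a minimal sketch of the example above (word and character variants):

```r
library(quanteda)

txt <- "I smiled through the whole film"

# Word 3-grams: slide a window of three words across the sentence
tokens(txt) |>
  tokens_ngrams(n = 3, concatenator = " ")

# Character 3-grams: tokenize by character, keeping spaces as characters
tokens(txt, what = "character", remove_separators = FALSE) |>
  tokens_ngrams(n = 3, concatenator = "")
```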

Document Feature Matrix

  • Data frame that uses the tokens as variables (“document features”)
  • Contains counts or dummy variables indicating the appearance of tokens for each observation
dfm <- toks |>  
  dfm() |> # stem = T
  dfm_select(min_nchar = 2) |>
  dfm_trim(min_docfreq = 20) # min_termfreq = 0

dfm_data_frame <- convert(dfm, to = "data.frame")

dfm_data_frame |>
  filter(row_number() < 10) |>
  dplyr::select(1:10) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
doc_id get feeling lot people liked movie want admit understand
text1 1 1 2 3 1 4 2 1 1
text2 1 0 0 0 0 0 0 0 0
text3 0 0 0 1 0 2 0 0 0
text4 1 0 0 0 0 2 0 1 0
text5 0 0 1 0 0 0 1 0 0
text6 0 0 0 0 0 0 0 0 0
text7 1 0 0 0 0 1 0 0 0
text8 2 0 2 2 1 7 0 0 1
text9 0 2 0 0 0 1 1 0 0

Document Feature Matrix

  • After creating tokens, you might want to remove:
    • Features with low occurrence across documents
    • Redundant features
    • “Neutral” features
    • Any other information that could blur predictive performance or increase computational costs
  • Meta-information (e.g., author, date) and other variables can be used as additional features

Descriptives

textstat_frequency(dfm) |>
  filter(row_number() < 17) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
feature frequency rank docfreq group
movie 3576 1 1229 all
film 3052 2 1086 all
one 2045 3 1122 all
like 1631 4 931 all
just 1444 5 862 all
good 1208 6 757 all
even 1016 7 671 all
really 967 8 633 all
see 930 9 647 all
time 926 10 662 all
can 883 11 634 all
story 871 12 559 all
well 825 13 590 all
bad 797 14 490 all
get 760 15 555 all
also 741 16 514 all

Textplots

textplot_wordcloud(
  dfm,
  color = "#9C6B91",
  background = "transparent",
  fixed_aspect = TRUE
  )

Models for Text Classification

Models for Text Classification

  • In principle, any model could be used to predict classes from a document feature matrix
  • The two most common (traditional) methods for text classification:
    • Naive Bayes
    • Support Vector Machine (SVM)
  • Nowadays it is often more convenient to use a neural-network-based LLM or transformer model

Naive Bayes for Text Features

  • For text classification, document features are discrete:
    • Binary indicators (word present/absent) → Bernoulli Naive Bayes
    • Word counts or term frequencies → Multinomial Naive Bayes
  • We want to calculate the probability that an observation falls into category \(C_k\) given a set of observed document features \(f_1, f_2, ..., f_p\):

    \[P(C_k \mid f_1, f_2, ..., f_p) = \frac{P(f_1, f_2, ..., f_p \mid C_k) \times P(C_k)}{P(f_1, f_2, ..., f_p)}\]

  • The “naive” assumption is that the occurrences of features are independent:

    \[P(f_1, f_2, ..., f_p \mid C_k) = P(f_1 \mid C_k) \times P(f_2 \mid C_k) \times ... \times P(f_p \mid C_k) = \prod_{j=1}^{p} P(f_j \mid C_k)\]
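A minimal worked example with hypothetical numbers: suppose a review contains only the features "great" and "plot", with \(P(\text{great} \mid \text{pos}) = 0.8\), \(P(\text{plot} \mid \text{pos}) = 0.5\), \(P(\text{great} \mid \text{neg}) = 0.1\), \(P(\text{plot} \mid \text{neg}) = 0.5\), and equal priors \(P(\text{pos}) = P(\text{neg}) = 0.5\). Then:

    \[P(\text{pos} \mid \text{great}, \text{plot}) \propto 0.8 \times 0.5 \times 0.5 = 0.2\]

    \[P(\text{neg} \mid \text{great}, \text{plot}) \propto 0.1 \times 0.5 \times 0.5 = 0.025\]

After normalizing, \(P(\text{pos} \mid \text{great}, \text{plot}) = 0.2 / (0.2 + 0.025) \approx 0.89\), so the review is classified as positive.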

Naive Bayes: Gaussian Kernel

  • When features are continuous (e.g., numerical measurements), Naive Bayes assumes they follow a normal (Gaussian) distribution within each class
  • The likelihood of a feature \(f_j\) given class \(C_k\) is modeled as:

    \[ P(f_j \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{kj}^2}} \exp\left( -\frac{(f_j - \mu_{kj})^2}{2\sigma_{kj}^2} \right) \]

  • This approach is commonly referred to as Gaussian Naive Bayes

  • Less common in text classification, but useful for other domains with real-valued inputs (e.g., sensor data, medical features)
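A sketch of Gaussian Naive Bayes on continuous inputs, using the e1071 package and the built-in iris data rather than the IMDB example (package choice is an assumption):

```r
library(e1071)

# Each feature is modeled as Normal(mu_kj, sigma_kj) within each species
mod_gnb <- naiveBayes(Species ~ ., data = iris)

# Estimated mean and sd of Sepal.Length per class
mod_gnb$tables$Sepal.Length

predict(mod_gnb, newdata = iris[1:3, ])
```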

Support Space with Gaussian Kernel

Naive Bayes Classifier

  • Although the “naive” assumption is unlikely to hold in reality, it often doesn’t significantly impair predictive power
  • To predict the class of a given set of document features, we calculate the probability for each category \(C_k\) and choose the one with the highest probability
  • For text features, the distribution of conditional probability is often Bernoulli or multinomial, while for continuous features, a normal distribution is usually assumed

Properties of Naive Bayes

  • Naive Bayes is not necessarily a Bayesian method
  • The method is computationally cheap
  • Often performs well even with small training samples
  • Coefficients are easy to interpret and can be used for feature selection
  • Redundant and highly correlated features can worsen performance by overstating their importance
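The estimated class-conditional word probabilities \(P(f_j \mid C_k)\) can be inspected directly; a sketch assuming a model fitted with quanteda.textmodels, which stores them in `$param` with one row per class (the `mod_nb` object is fitted in the application below):

```r
# Ratio of P(word | positive) to P(word | negative)
ratio <- mod_nb$param["positive", ] / mod_nb$param["negative", ]

# Words whose conditional probability most favors the positive class
head(sort(ratio, decreasing = TRUE))
```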

Application in R

Code
mod_nb <- textmodel_nb(
  x = dfm, 
  y = imdb$sentiment, 
  prior = "docfreq"
  )

imdb$predicted<- predict(
  mod_nb, 
  newdata = dfm, 
  type = "class"
  )

imdb |>
  filter(nchar(review) < 800 & nchar(review) > 200) |>
  filter(row_number() < 6) |>
  kbl() |> 
  kable_styling(
    html_font = "Century Gothic", 
    font_size = 14
    )
review sentiment predicted
this show is the best it is full of laughs and Kevin James is the best so if you want a good show i recommend the king of queens and its a letdown that they canceled it so in the end this show will make you forget your worries and troubles cause if you have a cast with Kevin James and jerry stiller you cant go wrong. so i don't know why the canceled the show if any one knows please tell me.now a days you cant find a lot of shows that fulfill your needs as an audience.after Seinfeld and king of queens the only show worth watching is prison break and if that stops i don't know what to do. in the end if i had to recommend a show it will be king of queens. positive positive
"Undercurrent" features a top-notch cast of wonderful actors who might've been assembled for the perfect drawing-room comedy. Alas, they are pretty much wasted on a 'woman's view' potboiler--and a paper-thin one at that. Katharine Hepburn is indeed radiant as a tomboy/old maid who finally marries, but her husband is deeply disturbed and harboring dark family secrets. Director Vincente Minnelli has absolutely no idea how to mount this outlandish plot, concocted by Edward Chodorov from a story by Thelma Strabel, and the friendly, first-rate cast (including Robert Taylor, Robert Mitchum and Edmund Gwenn) is left treading in murky waters. ** from **** negative positive
For those out there that like to think of themselves as reasonably intelligent human beings, who love film, have good attention to detail and enjoy indie movies with funny, smartly written dialogue then this is a film for you.<br /><br />For those with a poor attention span, high expectations and no brains.. well.. um.. you may get bored and find things dragging at times.<br /><br />This is a charming, modest and well paced movie with the actors bestowing a real sense of depth and warmth to their roles. I chuckled to myself pretty much the whole way through..<br /><br />This film is a little gem. positive positive
I am surprised than many viewers hold more respect for the sequel to this brilliant movie... I have seen all the guinea pigs and this one is easily the best.<br /><br />Even though ive seen the "making of", i still have doubts when watching those 35mins of pure torture : its that powerful.<br /><br />A 10 out of 10 because this movie achieved perfectly what it set out to do : be the best fake snuff film ever made. positive positive
The whole movie was done half-assed. It could have been a much better movie but, that would have required a re-write and different actors. Compared to "Traffic" this was a wreck. I am just glad I didn't have to pay for it.<br /><br />Spoiler:<br /><br />What was the point of having crooked cops getting arrested? To share the guilt of drug dealers and make them feel better? Pu-leaze! The parents were scum, driven by greed, and didn't even consider the harm they were doing; as pointed out by Ice-T. <br /><br />2 out of 10 negative negative
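In-sample performance of the Naive Bayes model can be summarized with base R (keep in mind that evaluating on the training data overstates out-of-sample accuracy):

```r
# Confusion matrix: predicted vs. actual sentiment
table(predicted = imdb$predicted, actual = imdb$sentiment)

# Overall share of correctly classified reviews
mean(imdb$predicted == imdb$sentiment)
```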

Support Vector Machine

SVM with Hard Margin

  • Two perfectly linearly separable classes \(\{-1, 1\}\) within a support space of multiple (standardized) predictors
  • Fit a hyperplane through the space that separates the two groups, maximizing the margin \(M\) between the plane and the observations:

\[\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M\]

Subject to:

\[\sum_{j = 1}^p \beta_j^2 = 1\]

\[y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M\]

SVM with Hard Margin

SVM with Soft Margin

  • Perfect linear separation is rarely possible or desirable
  • Introduce a penalty term for observations on the wrong side of the boundary:

\[\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M\]

Subject to:

\[\sum_{j = 1}^p \beta_j^2 = 1\]

\[y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M(1 - \xi_i)\]

\[\xi_i \ge 0, \quad \sum_{i = 1}^n \xi_i \le C\]
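A soft-margin linear SVM can be sketched with the e1071 package (a package choice assumed here; quanteda.textmodels also offers `textmodel_svm()`). Note that e1071's `cost` parameter penalizes the slack variables \(\xi_i\), so it acts as the inverse of the budget \(C\) above:

```r
library(e1071)

# Soft-margin linear SVM on the document feature matrix from above
mod_svm <- svm(
  x = as.matrix(dfm),          # dense matrix from the (sparse) dfm
  y = factor(imdb$sentiment),
  kernel = "linear",
  cost = 1                     # larger cost = less slack = harder margin
)
```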

SVM with Soft Margin

Dimensions and Linear Separability

  • In high dimensions, observations tend to approach linear separability.
    • In text classification, we often have thousands of features
  • The “kernel trick”:
    • Generate additional features by applying transformations to the existing ones
    • Attempt linear separation in now higher-dimensional space
    • Projected back into the original space, the separating plane is non-linear
    • Distinct observations can always be linearly separated in sufficiently complex and high-dimensional spaces

Dimensions and Linear Separability

Kernels

  • Commonly utilized kernel functions are:
    • Polynomial: \(K\left(x, x'\right) = \left(1 + \gamma\langle x, x' \rangle\right) ^ d\)
    • (Gaussian) Radial: \(K\left(x, x'\right) = \exp\left(-\frac{1}{2\sigma^2} \lVert x - x'\rVert ^ 2\right)\)
  • This is usually done in combination with introducing a penalty term (i.e., a soft margin)
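In e1071, the radial kernel is parameterized as \(K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)\), so \(\gamma\) corresponds to \(1/(2\sigma^2)\) in the formula above; a sketch (the `gamma` value is hypothetical):

```r
# Radial-basis SVM; gamma plays the role of 1/(2 sigma^2)
mod_rbf <- svm(
  x = as.matrix(dfm),
  y = factor(imdb$sentiment),
  kernel = "radial",
  gamma = 0.01,   # hypothetical value; should be tuned
  cost = 1
)
```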

Radial Kernel

Tuning \(\sigma\)

Properties of SVM

  • What if we have more than two classes?
    • One-versus-all comparison
    • One-versus-one (pairwise) comparison
  • Method is a “black box” and difficult to comprehend
    • Packages such as vip can translate the importance of a feature into simple numeric values
  • SVM is a versatile method also used for a variety of other prediction problems
    • Can be sensitive to the kernel chosen
    • Non-linear kernels tend to overfit (use CV!)
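Cross-validated tuning of `cost` and `gamma` can be sketched with e1071's `tune()` (the grid values are hypothetical):

```r
# 10-fold CV over a grid of cost and gamma values
tuned <- tune(
  svm,
  train.x = as.matrix(dfm),
  train.y = factor(imdb$sentiment),
  kernel = "radial",
  ranges = list(cost = c(0.1, 1, 10), gamma = c(0.001, 0.01, 0.1)),
  tunecontrol = tune.control(cross = 10)
)
tuned$best.parameters
```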

Text Classification with LLMs

  • Nowadays it is often more convenient to run an LLM or transformer model than to train your own text classification model
    • either via a third-party API (e.g., the OpenAI API)
    • or locally (e.g., via Ollama)
  • Many models can be fine-tuned for specific tasks

OpenAI API for Text Classification

key <- read_csv("keys/key_schmoigl.txt") |> pull()

prompts <- imdb |>
  select(review) |>
  distinct() |>
  filter(row_number() <= 5)

system_role <- glue("
  Classify movie reviews from imdb as positive or negative. 
  Only return the string 'positive' or 'negative'. 
  Do not respond with any explanations."
  )

answers <- data.frame(review = NULL, predicted_chatGPT = NULL)
# Start after the last stored answer so the loop can resume if interrupted
for (i in (nrow(answers)+1):nrow(prompts)) { 
  response_raw <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", key)),
    content_type_json(),
    encode = "json",
    body = list(
      model = "gpt-4o-2024-08-06",
      messages = list(
        list(role = "system", content = system_role),
        list(role = "user", content = prompts$review[i])
        ),
      temperature = 0
    )
  )
  answers[i, 1] <- prompts$review[i]
  response <- content(response_raw)$choices[[1]]
  answers[i, 2] <- response$message$content
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(paste0(progress, "%\n"))}
}

Ollama API for Text Classification

After installing Ollama on your device run

ollama pull llama3.2
ollama list
export OLLAMA_HOST=localhost:11434

Check that Ollama is running

answers <- data.frame()

for (i in (nrow(answers)+1):nrow(prompts)) { 
  body <- list(
    prompt = paste(
      system_role,
      "The review is:",
      prompts$review[i]
    ),
    max_tokens = 200,
    temperature = 0,
    stream = FALSE,
    model = "llama3.2"
  )
  ollama_response <- POST(
    url = "http://localhost:11434/api/generate", 
    body = body, 
    encode = "json"
  )
  answers[i, 1] <- prompts$review[i]
  answers[i, 2] <- fromJSON(content(ollama_response, as = "text", encoding = "UTF-8"))$response
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(paste0(progress, "%\n"))}
}

names(answers) <- c("review", "predicted_ollama")