Databases and APIs

DuckDB, SQL, Tidyverse, REST APIs … handling your data very efficiently

Databases

Storing and extracting data efficiently

Parquet

Going beyond csv

  • Columnar, compressed, typed
  • Usually ~10 times smaller than csv
  • Interoperable across tools
  • Very performant and efficient (parquet + DuckDB)
  • Supports complex data structures
  • Metadata inclusion
library(tidyverse)
library(arrow)

size_csv <-
  file.size("data/synthetic_unemployment_data_large.csv")
start_time <- Sys.time()
data <- read_csv("data/synthetic_unemployment_data_large.csv")
end_time <- Sys.time()
time_csv <- end_time - start_time

write_parquet(
  data, 
  "data/synthetic_unemployment_data_large.parquet",
  compression = "gzip",
  compression_level = 9
  )
size_parquet <- 
  file.size("data/synthetic_unemployment_data_large.parquet")

The csv file is 2847 MB large while the the parquet is 227 MB

DuckDB

Efficient data queries

  • “Stand-alone” database system
  • Execute SQL queries
    • RAM efficient
    • “In-file” data wrangling
  • Vectorized execution
    • very performant
library(duckdb)
library(glue)

con <- dbConnect(duckdb())

cols <- c("target_high", "benefits_amount", "unemployment_1m")

start_time <- Sys.time()
data <- dbGetQuery(con, glue("
  SELECT {paste(cols, collapse = ', ')}
  FROM 'data/synthetic_unemployment_data_large.parquet'
  WHERE promised_employment == 0
"))
end_time <- Sys.time()
time_parquet <- end_time - start_time

dbDisconnect(con, shutdown = TRUE)

It took us 18.6 seconds to read in the csv and only 0.6 seconds to make the DuckDB query!

Data Wrangling

Manipulating data efficiently

Tidyverse

  • Tidyverse - consistent grammar and readable syntax
    • dplyr - data manipulation (filter, select, mutate, summarize)
    • tidyr - data tidying (pivot, separate, unite)
    • magrittr - %>% operator
    • readr - data import/export
    • stringr - string manipulation
    • lubridate - date/time handling
library(tidyverse)

data <- read_csv("data/cases.csv")

plot_data <- data %>%
  separate(case_nr, c("court", "case_nr"), ", ") %>%
  group_by(court, month) %>%
  count(name = "cases") %>%
  group_by(court) %>%
  mutate(sum_cases = sum(cases, na.rm = TRUE)) %>%
  filter(sum_cases > 100) %>%
  ungroup()

Alternatives

  • data.table - extremely fast and memory efficient
    • Concise [i, j, by] syntax
    • In-place modifications
    • Optimized for large datasets
  • Pandas - comprehensive data analysis library in Python
  • polars - next-generation, performant library (Python and R)
  • Arquero - tidyverse-like JavaScript implmentation
library(data.table)

dt <- fread("data/cases.csv")

plot_data <- 
  dt[, c("court", "case_nr") := tstrsplit(case_nr, ", ")
    ][, .(cases = .N), by = .(court, month)
    ][, sum_cases := sum(cases), by = court
    ][sum_cases > 100]

(REST) APIs

Automating data workflows

Applications

  • Fetch data in real time for display on a website (e.g. FG4 Webgrafik)
  • Host local service and use via API (e.g. Ollama)
  • Add or change files on a server (e.g. update database, add address to newsletter)
  • GET
  • GET
  • POST
  • POST
  • POST/PUT/DELETE

Workflow Automation

API Request and Response

  • Request
    • Request line: HTTP method (GET/POST/PUT/DELETE) and URL
    • Header: Metadata (e.g. authentication)
    • Body: Info about request (e.g. JSON with instructions)
  • Response
    • Status code (e.g. 200 OK, 400 Bad Request, 429 Too Many Requests)
    • Content as JSON, HTML or XML

GET — Eurostat Example

  • GET request asks server for information
  • Fixed part:

https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/

{host_url}/{service}/{version}/

  • Dynamic part:

data/nama_10_gdp?format=JSON&time=2023&unit=CP_MEUR&na_item=B1GQ

{response_type}/{datasetCode}?{format}&{filters}

GET from R — Eurostat Example

  • Specify query from R
library(tidyverse)
library(httr)

url <- paste0(
  "https://ec.europa.eu/eurostat/api/",
  "dissemination/statistics/1.0/data/nama_10_gdp?"
)

params <- list(
  time = "2023",
  unit = "CP_MEUR",
  na_item = "B1GQ",
  format = "JSON"
)

response <- GET(url = url, query = params)

data <- content(response)
  • Unnesting JSON response
indices <- data$dimension$geo$category$index

gdp_data <- data.frame(
    geo = names(indices),
    index = as.character(unlist(indices))
  ) %>%
  left_join(
    enframe(unlist(data$value), name = "index", value = "values"),
    by = "index"
  ) %>%
  select(geo, values)
geo values
EU27_2020 17256923.3
EA 14663455.4
EA21 14757980.5
EA20 14663455.4
EA19 14584269.0
EA12 14191081.8
BE 601866.5

Using R Packages — Eurostat Example

  • Packages in R like eurostat simplify the process
library(eurostat)

gdp_data <- get_eurostat(
    id = "nama_10_gdp",
    filters = list(
      time = "2023",
      unit = "CP_MEUR",
      na_item = "B1GQ"
    )
  ) %>%
  select(geo, values) 
  • We get the same data faster and cleaner
geo values
EU27_2020 17256923.3
EA 14663455.4
EA21 14757980.5
EA20 14663455.4
EA19 14584269.0
EA12 14191081.8
BE 601866.5
  • However we are dependant on the package

Automated Data Update

  • Create script(s) with all necessary processes
    • Data download via API
    • Processing
    • Data upload
  • Plan execution of script(s)
    • Unix: cron job
    • Windows: task scheduler

POST from R – OpenAI Example

  • POST creates resource on server
  • Send instructions via JSON in body
  • But might also return data
  • Example: Sentiment analysis with OpenAI
review
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with....
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes d....
I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted come....
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This mo....
Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is....
key <- read_csv("keys/key_schmoigl.txt") %>% pull()

prompts <- read_csv("data/imdb_small.csv") %>%
  select(review) %>%
  distinct() %>%
  filter(row_number() <= 5)

system_role <- glue("
  Classify movie reviews as positive or negative. 
  Only return the string 'positive' or 'negative'. 
  Do not respond with any explanations.
  ")

answers <- data.frame(
  review = character(), 
  predicted = character()
  )

POST from R – OpenAI Example

  • Looping through reviews
for (i in 1:nrow(prompts)) { 
  response_raw <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", key)),
    encode = "json",
    body = list(
      model = "gpt-4o-2024-08-06",
      messages = list(
        list(role = "system", content = system_role),
        list(role = "user", content = prompts$review[i])
        ),
      temperature = 0
    )
  )
  answers[i, 1] <- prompts$review[i]
  answers[i, 2] <- content(response_raw)$choices[[1]]$message$content
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(glue(progress, "% \n"))}
}
  • Classification via ChatGPT4
review predicted
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The.... positive
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-B.... positive
I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air con.... positive
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his par.... negative
Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers .... positive

POST from R – Ollama API

  • Host our own service and use via API
  • Example: Run Ollama on your device
ollama pull llama3.2
ollama list
export OLLAMA_HOST=localhost:11434
  • Classification via llama3.2
review predicted
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The.... positive
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-B.... positive
I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air con.... positive
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his par.... negative
Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers .... positive
for (i in 1:nrow(prompts)) { 
  body <- list(
    prompt = paste(
      system_role,
      "The review is:",
      prompts$review[i]
    ),
    max_tokens = 200,
    temperature = 0,
    stream = FALSE,
    model = "llama3.2"
  )
  ollama_response <- POST(
    url = "http://localhost:11434/api/generate", 
    body = body, 
    encode = "json"
  )
  answers[i, 1] <- prompts$review[i]
  answers[i, 2] <- content(ollama_response)$response
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(glue(progress, "% \n"))}
}