Databases and APIs

DuckDB, SQL, Tidyverse, REST APIs … handling your data very efficiently

Lukas Schmoigl

lukas.schmoigl@wifo.ac.at

Data Scientist at WIFO

Databases

Storing and extracting data efficiently

Parquet

Going beyond csv

Columnar, compressed, typed
Usually ~10 times smaller than csv
Interoperable across tools
Very performant and efficient (parquet + DuckDB)
Supports complex data structures
Metadata inclusion

library(tidyverse)
library(arrow)

size_csv <-
  file.size("data/synthetic_unemployment_data_large.csv")
start_time <- Sys.time()
data <- read_csv("data/synthetic_unemployment_data_large.csv")
end_time <- Sys.time()
time_csv <- end_time - start_time

write_parquet(
  data, 
  "data/synthetic_unemployment_data_large.parquet",
  compression = "gzip",
  compression_level = 9
  )
size_parquet <- 
  file.size("data/synthetic_unemployment_data_large.parquet")

The csv file is 2847 MB large while the the parquet is 227 MB

DuckDB

Efficient data queries

“Stand-alone” database system
Execute SQL queries
- RAM efficient
- “In-file” data wrangling
Vectorized execution
- very performant

library(duckdb)
library(glue)

con <- dbConnect(duckdb())

cols <- c("target_high", "benefits_amount", "unemployment_1m")

start_time <- Sys.time()
data <- dbGetQuery(con, glue("
  SELECT {paste(cols, collapse = ', ')}
  FROM 'data/synthetic_unemployment_data_large.parquet'
  WHERE promised_employment == 0
"))
end_time <- Sys.time()
time_parquet <- end_time - start_time

dbDisconnect(con, shutdown = TRUE)

It took us 23.9 seconds to read in the csv and only 0.4 seconds to make the DuckDB query!

Data Wrangling

Manipulating data efficiently

Tidyverse

Tidyverse - consistent grammar and readable syntax
- dplyr - data manipulation (filter, select, mutate, summarize)
- tidyr - data tidying (pivot, separate, unite)
- magrittr - %>% operator
- readr - data import/export
- stringr - string manipulation
- lubridate - date/time handling

library(tidyverse)

data <- read_csv("data/cases.csv")

plot_data <- data %>%
  separate(case_nr, c("court", "case_nr"), ", ") %>%
  group_by(court, month) %>%
  count(name = "cases") %>%
  group_by(court) %>%
  mutate(sum_cases = sum(cases, na.rm = TRUE)) %>%
  filter(sum_cases > 100) %>%
  ungroup()

Alternatives

data.table - extremely fast and memory efficient
- Concise [i, j, by] syntax
- In-place modifications
- Optimized for large datasets

Pandas - comprehensive data analysis library in Python

polars - next-generation, performant library (Python and R)

Arquero - tidyverse-like JavaScript implmentation

library(data.table)

dt <- fread("data/cases.csv")

plot_data <- 
  dt[, c("court", "case_nr") := tstrsplit(case_nr, ", ")
    ][, .(cases = .N), by = .(court, month)
    ][, sum_cases := sum(cases), by = court
    ][sum_cases > 100]

(REST) APIs

Automating data workflows

Applications

Automated data updates (e.g. FIW Data Dashboard, Konjunkturampel)

Fetch data in real time for display on a website (e.g. FG4 Webgrafik)

Using remote services (e.g. Google APIs, OpenAI API)

Host local service and use via API (e.g. Ollama)

Add or change files on a server (e.g. update database, add address to newsletter)

POST

POST

POST/PUT/DELETE

Workflow Automation

viewof workflowType = Inputs.radio(
  ["Manual Workflow", "Local Automation", "Client Request"],
  {label: null, value: "Manual Workflow"}
)

workflowType == "Manual Workflow" ?
mermaid`
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'Century Gothic', 'fontSize': '13px', 'clusterBkg': 'transparent', 'clusterBorder': '#000000'}, 'flowchart': {'nodeSpacing': 38, 'rankSpacing': 35, 'padding': 20, 'diagramPadding': 30}}}%%
flowchart LR
    subgraph Web["Source"]
        A[("Database")]
    end

    subgraph Local["Local"]
        B[["csv"]]
        C["Software"]
        E[["svg"]]
    end

    subgraph Server["Server"]
        G{{"Backend"}}
    end

    subgraph Client["Client"]
        H["Browser"]
        I[/"Viz"/]
    end

    A -.->|"🪚<br>Download"| B
    B -.->|"🪚<br>Import"| C
    C -.->|"🪚<br>Processing"| E
    E -.->|"🪚<br>Upload"| G
    G <--> H
    H --> I

    style Web fill:#336699,color:#fff
    style Local fill:#e52320,color:#fff
    style Server fill:#9C6B91,color:#fff
    style Client fill:#2d8659,color:#fff
    style blank1 fill:transparent,stroke:transparent,color:transparent
    style A fill:#f5f5f5,stroke:#000,color:#000
    style B fill:#f5f5f5,stroke:#000,color:#000
    style C fill:#f5f5f5,stroke:#000,color:#000
    style E fill:#f5f5f5,stroke:#000,color:#000
    style G fill:#f5f5f5,stroke:#000,color:#000
    style H fill:#f5f5f5,stroke:#000,color:#000
    style I fill:#f5f5f5,stroke:#000,color:#000
` :
workflowType == "Local Automation" ?
mermaid`
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'Century Gothic', 'fontSize': '13px', 'clusterBkg': 'transparent', 'clusterBorder': '#000000'}, 'flowchart': {'nodeSpacing': 15, 'rankSpacing': 35, 'padding': 20, 'diagramPadding': 30}}}%%
flowchart LR

    subgraph Local["Local"]
        B["Automated Script"]
    end
    
    subgraph Web["Source"]
        A[("Database")]
    end

    subgraph Server["Server"]
        G{{"Backend"}}
    end

    subgraph Client["Client"]
        H["Browser"]
        I[/"Viz"/]
    end

    B -->|"🤖<br>API Request"| A
    A -->|"🤖<br>API Response"| B
    B -->|"🤖<br>Upload"| G
    G <--> H
    H --> I

    style Web fill:#336699,color:#fff
    style Local fill:#e52320,color:#fff
    style Server fill:#9C6B91,color:#fff
    style Client fill:#2d8659,color:#fff
    style A fill:#f5f5f5,stroke:#000,color:#000
    style B fill:#f5f5f5,stroke:#000,color:#000
    style G fill:#f5f5f5,stroke:#000,color:#000
    style H fill:#f5f5f5,stroke:#000,color:#000
    style I fill:#f5f5f5,stroke:#000,color:#000
` :
mermaid`
%%{init: {'theme': 'base', 'themeVariables': {'fontFamily': 'Century Gothic', 'fontSize': '13px', 'clusterBkg': 'transparent', 'clusterBorder': '#000000'}, 'flowchart': {'nodeSpacing': 15, 'rankSpacing': 35, 'padding': 20, 'diagramPadding': 30}}}%%
flowchart LR

    subgraph Server["Server"]
        G{{"Backend"}}
    end

    subgraph Client["Client"]
        B["Browser"]
        C[/"Viz"/]
    end

    subgraph Web["Source"]
        A[("Database")]
    end

    G <--> B
    B -->|"🤖<br>API Request"| A
    A -->|"🤖<br>API Response"| B
    B --> C

    style Web fill:#336699,color:#fff
    style Client fill:#2d8659,color:#fff
    style Server fill:#9C6B91,color:#fff
    style blank1 fill:transparent,stroke:transparent,color:transparent
    style A fill:#f5f5f5,stroke:#000,color:#000
    style B fill:#f5f5f5,stroke:#000,color:#000
    style C fill:#f5f5f5,stroke:#000,color:#000
    style G fill:#f5f5f5,stroke:#000,color:#000
`

API Request and Response

Request
- Request line: HTTP method (GET/POST/PUT/DELETE) and URL
- Header: Metadata (e.g. authentication)
- Body: Info about request (e.g. JSON with instructions)

Response
- Status code (e.g. 200 OK, 400 Bad Request, 429 Too Many Requests)
- Content as JSON, HTML or XML

GET — Eurostat Example

GET request asks server for information

Eurostat API documentation

GET request via Browser

Fixed part:

https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/

{host_url}/{service}/{version}/

Dynamic part:

data/nama_10_gdp?format=JSON&time=2023&unit=CP_MEUR&na_item=B1GQ

{response_type}/{datasetCode}?{format}&{filters}

GET from R — Eurostat Example

Specify query from R

library(tidyverse)
library(httr)

url <- paste0(
  "https://ec.europa.eu/eurostat/api/",
  "dissemination/statistics/1.0/data/nama_10_gdp?"
)

params <- list(
  time = "2023",
  unit = "CP_MEUR",
  na_item = "B1GQ",
  format = "JSON"
)

response <- GET(url = url, query = params)

data <- content(response)

Unnesting JSON response

indices <- data$dimension$geo$category$index

gdp_data <- data.frame(
    geo = names(indices),
    index = as.character(unlist(indices))
  ) %>%
  left_join(
    enframe(unlist(data$value), name = "index", value = "values"),
    by = "index"
  ) %>%
  select(geo, values)

geo	values
EU27_2020	17264212.6
EA	14670744.7
EA21	14765269.8
EA20	14670744.7
EA19	14591558.4
EA12	14198366.7
BE	601866.5

Using R Packages — Eurostat Example

Packages in R like eurostat simplify the process

library(eurostat)

gdp_data <- get_eurostat(
    id = "nama_10_gdp",
    filters = list(
      time = "2023",
      unit = "CP_MEUR",
      na_item = "B1GQ"
    )
  ) %>%
  select(geo, values)

We get the same data faster and cleaner

geo	values
EU27_2020	17264212.6
EA	14670744.7
EA21	14765269.8
EA20	14670744.7
EA19	14591558.4
EA12	14198366.7
BE	601866.5

However we are dependant on the package

Automated Data Update

Create script(s) with all necessary processes
- Data download via API
- Processing
- Data upload

Plan execution of script(s)
- Unix: cron job
- Windows: task scheduler

POST from R – OpenAI Example

POST creates resource on server

Send instructions via JSON in body

But might also return data

Example: Sentiment analysis with OpenAI

review
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with....
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes d....
I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted come....
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This mo....
Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is....

key <- read_csv("keys/key_schmoigl.txt") %>% pull()

prompts <- read_csv("data/imdb_small.csv") %>%
  select(review) %>%
  distinct() %>%
  filter(row_number() <= 5)

system_role <- glue("
  Classify movie reviews as positive or negative. 
  Only return the string 'positive' or 'negative'. 
  Do not respond with any explanations.
  ")

answers <- data.frame(
  review = character(), 
  predicted = character()
  )

POST from R – OpenAI Example

Looping through reviews

for (i in 1:nrow(prompts)) { 
  response_raw <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", key)),
    encode = "json",
    body = list(
      model = "gpt-4o-2024-08-06",
      messages = list(
        list(role = "system", content = system_role),
        list(role = "user", content = prompts$review[i])
        ),
      temperature = 0
    )
  )
  answers[i, 1] <- prompts$review[i]
  answers[i, 2] <- content(response_raw)$choices[[1]]$message$content
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(glue(progress, "% \n"))}
}

Classification via ChatGPT4

review	predicted
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The....	positive
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-B....	positive
I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air con....	positive
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his par....	negative
Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers ....	positive

POST from R – Ollama API

Host our own service and use via API

Example: Run Ollama on your device

ollama pull llama3.2
ollama list
export OLLAMA_HOST=localhost:11434

Classification via llama3.2

review	predicted
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The....	positive
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-B....	positive
I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air con....	positive
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his par....	negative
Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers ....	positive

for (i in 1:nrow(prompts)) { 
  body <- list(
    prompt = paste(
      system_role,
      "The review is:",
      prompts$review[i]
    ),
    max_tokens = 200,
    temperature = 0,
    stream = FALSE,
    model = "llama3.2"
  )
  ollama_response <- POST(
    url = "http://localhost:11434/api/generate", 
    body = body, 
    encode = "json"
  )
  answers[i, 1] <- prompts$review[i]
  answers[i, 2] <- content(ollama_response)$response
  progress <- round(i/nrow(prompts)*100)
  if (i == 1) {cat(" Progress:\n")}
  if (i%%10 == 0) {cat(glue(progress, "% \n"))}
}