Webscraping is the process of automatically extracting information from the web and transforming it into a structured dataset
Everything you see is (somehow) accessible
Requires some knowledge about HTML, CSS and JS
And some help from your best friend: Ctrl+Shift+C (opens the browser's element inspector)
Web Technologies
HTML and the DOM
HyperText Markup Language (HTML) defines the skeleton of webpages and gives elements attributes
Webpages are hierarchically structured in what is called the Document Object Model (DOM)
webscraping: HTML elements usually contain the information we want to scrape
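To make the link between the DOM hierarchy and scraped values concrete, here is a minimal sketch using rvest. The HTML fragment and all variable names are invented for illustration and are not part of the original material.

```r
library(rvest)

# Hypothetical HTML fragment, made up for this example
page <- minimal_html('
  <div id="people">
    <p class="name">Ada</p>
    <p class="name">Grace</p>
    <a href="/about">About</a>
  </div>
')

# The <p> elements are children of the <div>, so we walk down the hierarchy
person_names <- page |>
  html_element("#people") |>
  html_elements("p") |>
  html_text2()

# Attributes (here "href") are read with html_attr() instead of html_text2()
link <- page |>
  html_element("a") |>
  html_attr("href")
```

Both the visible text and the attribute values live on the elements themselves, which is why locating the right element is most of the work.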
CSS
Cascading Style Sheets (CSS) style the appearance of HTML elements
webscraping: CSS selectors help you to locate the HTML elements in the DOM
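The most common selector types can be sketched as follows, again on a small made-up fragment (the markup, ids and class names are assumptions chosen for illustration): select by tag, by class with `.`, by id with `#`, and combine them to descend through the DOM.

```r
library(rvest)

# Hypothetical markup, made up for this example
page <- minimal_html('
  <ul id="menu">
    <li class="item active"><a href="/home">Home</a></li>
    <li class="item"><a href="/data">Data</a></li>
  </ul>
  <p class="item">Not in the menu</p>
')

by_tag   <- page |> html_elements("li")     # all <li> elements
by_class <- page |> html_elements(".item")  # every element with class "item"
by_id    <- page |> html_elements("#menu")  # the element with id "menu"

# Descendant combinator: <a> inside .item inside #menu
menu_links <- page |>
  html_elements("#menu .item a") |>
  html_text2()

# Several conditions on the same element: an <li> with both classes
active_item <- page |>
  html_elements("li.item.active") |>
  html_text2()
```

Note that `.item` alone also matches the `<p>` outside the menu, so narrowing a selector with an id or a parent element is often needed to avoid scraping unrelated elements.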
JS
JavaScript is used to manipulate and interact with content in the DOM
webscraping: JS helps understand how data is loaded and processed
A Simple Web App
HTML
<h4>The resulting website: </h4>
<form method="GET">
  What is your name:
  <input type="text" id="myname"/>
  <button id="mybutton">Submit</button>
</form>
<div id="hellomessage"></div>
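As a rough sketch of what a scraper sees in this form, the snippet below parses the same markup with rvest and looks up the elements by their ids. It assumes that the `hellomessage` div is only filled in later inside the browser (for example by JavaScript), which the HTML alone does not show. rvest reads the markup as delivered and does not run scripts.

```r
library(rvest)

# The form from above, parsed as static HTML
app <- minimal_html('
  <form method="GET"> What is your name:
    <input type="text" id="myname"/>
    <button id="mybutton">Submit</button>
  </form>
  <div id="hellomessage"></div>
')

# Ids make the elements easy to locate
input_type   <- app |> html_element("#myname") |> html_attr("type")
button_label <- app |> html_element("#mybutton") |> html_text()

# The message container exists but holds no text in the static markup
message_text <- app |> html_element("#hellomessage") |> html_text()
```

If the information you want only appears after a script has run, the static HTML will not contain it, which is where understanding the page's JS becomes useful.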
example_links <- page_example |>
  html_elements(".rowlink") |>
  html_elements("a") |>
  html_attr("href") %>% # instead of the text, here we extract the attribute "href"
  as.data.frame() |>
  rename("example_links" = 1)
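Since `page_example` is defined elsewhere, here is a self-contained sketch of the same pipeline on a stand-in page. The markup and the link targets are invented for illustration; only the `.rowlink` class and the pipeline steps come from the code above.

```r
library(rvest)
library(dplyr)

# Hypothetical stand-in for page_example, made up for this example
page_example <- minimal_html('
  <div class="rowlink"><a href="/item/1">First</a></div>
  <div class="rowlink"><a href="/item/2">Second</a></div>
  <div class="other"><a href="/ignored">Ignored</a></div>
')

example_links <- page_example |>
  html_elements(".rowlink") |>   # keep only elements with class "rowlink"
  html_elements("a") |>          # then the links inside them
  html_attr("href") |>           # extract the attribute, not the text
  as.data.frame() |>
  rename("example_links" = 1)    # name the first (only) column
```

The result is a one-column data frame with one row per matching link. The link inside the `.other` element is skipped because the first selector filters it out.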