--- title: "How caching works in tidywikidatar" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How caching works in tidywikidatar} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(tidywikidatar) ``` In order to reduce load on Wikidata's server and to speed up the processing of data, `tidywikidatar` makes extensive use of local caching. ## What data are cached locally There are a few types of data that are cached locally: - searches run with `tw_search()` - data about an item, typically retrieved with `tw_get()` or `tw_get_property()` - labels or description of properties, typically retrieved with `tw_get_property_label()` and `tw_get_property_description()` - qualifiers of properties, typically retrieved with `tw_get_qualifiers()` - data retrieved from (or about) Wikipedia pages, with `tw_get_wikipedia()`, and `tw_get_wikipedia_page_links()` To reduce space used for local caching and speed up processing time, it is possible to store only labels and information available in a given language when relevant. ## Caching with SQLite In `tidywikidatar`, it is possible to enable caching with: ```{r} tw_enable_cache() ``` If you do not include further parameters, by default `tidywikidatar` will use a local SQLite database for caching. You can choose in which folder the SQLite database will be stored with `tw_set_cache_folder()`; if not already existing, you can create that folder with `tw_create_cache_folder()`. ```{r eval=FALSE} tw_set_cache_folder(path = fs::path( fs::path_home_r(), "R", "tw_data" )) tw_create_cache_folder() ``` ## Caching with other database backends Support for other database backends is now available. They can be accessed most easily using the following approach, having ensured that the relevant driver (and `odbc` package) have previously been installed: ```{r eval = FALSE} tw_enable_cache(SQLite = FALSE) tw_set_cache_db( driver = "MySQL", host = "localhost", port = 3306, database = "tidywikidatar", user = "secret_username", pwd = "secret_password" ) # for testing, consider running a local database e.g. with: # docker run --name tidywikidatar_db -p 3306:3306 -e MYSQL_ROOT_PASSWORD=secret_root_password -e MYSQL_USER=secret_username -e MYSQL_PASSWORD=secret_password -e MYSQL_DATABASE=tidywikidatar mysql:latest ``` It is also technically possible to pass directly a connection generated with `DBI::dbConnect()` to each function. ## Name of tables in cached databases Each database has a table for each language and type of content. For example, item information retrieved with `tw_get(id = "Q180099", language = "en")` will be stored in a table called `tw_item_en`. The name of the table is unique and is generated by `tw_get_cache_table_name()`. For example: ```{r} tw_get_cache_table_name(type = "item", language = "en") ``` ## Column types and indexing Due to limited familiarity with different database backends and limited time for testing, the creation of database tables is left to the default values of `DBI::dbWriteTable()`. For occasional use, this should not be an issue. However, when the local cache reaches millions rather than only thousands of rows, response time from a MySql database can take a few seconds, rather than a fraction of a second as would be expected. To deal with this, new functions for adding indexing to cache tables have been introduced, `tw_check_cache_index()`, `tw_index_cache_item()`, and `tw_index_cache_search()`. It is possible to apply this to all existing functions of a given type as outlined below. This speeds up retrieval time dramatically on MySql databases; impact on other types of databases has not been thoroughly tested. ```{r eval=FALSE} db <- tw_connect_to_cache() tables_v <- DBI::dbListTables(conn = db) # for search cache tables purrr::walk( .x = tables_v[stringr::str_starts(string = tables_v, "tw_search_item")], .f = function(x) { tw_index_cache_search(table_name = x) } ) # for item cache tables purrr::walk( .x = tables_v[stringr::str_starts(string = tables_v, "tw_item")], .f = function(x) { tw_index_cache_item(table_name = x) } ) ```