---
title: "Getting started"
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

## What does this package do?

The osdc (Open Source Diabetes Classifier) package classifies
individuals as having type 1 diabetes (T1D) or type 2 diabetes (T2D)
based on Danish register data. You provide the register data, and the
package returns a data table of individuals identified as having
diabetes, along with their diabetes type (T1D or T2D) and the date of
the classification.

This package serves two overarching purposes:

1. To provide an open-source, code-based algorithm to classify type 1
   and type 2 diabetes using Danish registers as data sources.
2. To inspire discussions within the Danish register-based research
   space on the openness and ease of use of the existing tooling and
   registers, and on the need for an official process for updating or
   contributing to existing data sources.

For a detailed description of the algorithm, see
`vignette("algorithm")`. For the motivations and rationale behind this
package, see `vignette("rationale")`. For a full list of required
register variables, see `vignette("data-sources")`.

## Step-by-step usage

This section walks through the full workflow: loading the package,
preparing data, running the classification, as well as understanding and
saving the output. We use simulated data here, but the same or similar
steps apply to real register data. For more information on using osdc on
real register data, see @sec-working-with-real-register-data.

### Step 1: Install and load the package

First, you need to install and load the package:

```{r setup}
#| message: false
# install.packages("osdc")
library(osdc)
```

### Step 2: Check which registers are needed

The algorithm requires data from multiple Danish registers. The names of
these registers can be seen below:

```{r}
#| echo: false
registers() |>
  purrr::map_chr("name") |>
  knitr::kable(col.names = c("Register abbreviation", "Register name"))
```

For a table showing the specific variables needed from each register,
see `vignette("data-sources")`.

::: callout-important
To use the osdc algorithm, you need to have all the variables described
in `vignette("data-sources")`, so please ensure that you have the
required variables before continuing.
:::

### Step 3: Prepare the data

The package requires data to be in [DuckDB](https://duckdb.org/) format,
a high-performance database format that can handle large volumes of
data without loading everything into memory at once. See
`vignette("design")` for more on why we use DuckDB.

For this example, we generate simulated register data using
`simulate_registers()` and then convert it to DuckDB format:

```{r}
register_data <- registers() |>
  names() |>
  simulate_registers() |>
  purrr::map(duckplyr::as_duckdb_tibble) |>
  # Convert to a DuckDB connection, as duckplyr is still
  # in early development, while the DBI-DuckDB connection
  # is more stable.
  purrr::map(duckplyr::as_tbl)
```

The result is a named list where each element is one register as a
DuckDB table. The LPR data spans multiple versions (LPR2 and LPR3) that
need to be prepared and joined into a single table before being used as
input for `classify_diabetes()`. The same goes for the health service
registers (SSSY and SYSI). osdc provides helper functions for this:

```{r}
lpr <- list(
  prepare_lpr2(register_data$lpr_adm, register_data$lpr_diag),
  prepare_lpr3f(register_data$lpr3f_kontakter, register_data$lpr3f_diagnoser),
  prepare_lpr3a(register_data$lpr3a_kontakt, register_data$lpr3a_diagnose)
) |>
  join_registers()

hsr <- list(register_data$sssy, register_data$sysi) |> join_registers()
```

The remaining registers are extracted from the `register_data` list, so
they are ready to be passed as arguments to `classify_diabetes()`:

```{r}
bef <- register_data$bef
lmdb <- register_data$lmdb
lab_forsker <- register_data$lab_forsker
```

### Step 4: Run the classification

Now, we're ready to run the classification algorithm. Pass each register
to `classify_diabetes()`:

```{r}
classified_diabetes <- classify_diabetes(
  lpr = lpr,
  hsr = hsr,
  lab_forsker = lab_forsker,
  bef = bef,
  lmdb = lmdb
)

classified_diabetes
```

As seen above, this returns a DuckDB table with the individuals
classified as having either T1D or T2D along with the date of the
classification. Each of the columns in the output is explained in
@sec-understanding-the-output below.

### Step 5 (optional): Collect the results into R

Because the data is stored in DuckDB, the result above is a *lazy*
reference to a database query, i.e., the data has not been loaded into
R's memory. To bring the results into R as a regular data frame, use
`dplyr::collect()`:

```{r}
classified_diabetes <- classified_diabetes |>
  dplyr::collect()

classified_diabetes
```

Now, we can see that with the simulated data,
`r nrow(classified_diabetes)` individuals are classified as having
diabetes.
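
Because the result is lazy until it is collected, you can also narrow it
down inside DuckDB first and collect only the rows you need. The sketch
below illustrates this on a small, made-up table rather than on real
`classify_diabetes()` output. The table name `classified_example` and its
values are hypothetical, while `has_t1d` and `has_t2d` are output columns
described in @sec-understanding-the-output.

```r
# Hypothetical stand-in for the classification output.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(
  con,
  "classified_example",
  data.frame(
    pnr = c("001", "002", "003"),
    has_t1d = c(TRUE, FALSE, FALSE),
    has_t2d = c(FALSE, TRUE, TRUE)
  )
)

# Still lazy: the filter is translated to SQL and run by DuckDB.
t2d_lazy <- dplyr::tbl(con, "classified_example") |>
  dplyr::filter(has_t2d)

# Only the filtered rows are loaded into R's memory.
t2d_collected <- dplyr::collect(t2d_lazy)

# Close the connection and shut down the in-memory database.
DBI::dbDisconnect(con, shutdown = TRUE)
```

Filtering before collecting keeps memory use low, which matters more on
real register data than on this toy table.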

#### Understanding the output {#sec-understanding-the-output}

The output is a table with one row per classified individual and five
columns:

| Column                  | Description                                                                                                                   |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| `pnr`                   | The pseudonymised personal identification number.                                                                             |
| `stable_inclusion_date` | The date of the second inclusion event, if on or after `stable_inclusion_start_date` (default: 1998). `NA` for earlier dates. |
| `raw_inclusion_date`    | The date of the second inclusion event without setting `NA` for dates earlier than the `stable_inclusion_start_date`.         |
| `has_t1d`               | `TRUE` if classified as type 1 diabetes, `FALSE` otherwise.                                                                   |
| `has_t2d`               | `TRUE` if classified as type 2 diabetes, `FALSE` otherwise.                                                                   |

::: callout-note
#### About `stable_inclusion_date` vs `raw_inclusion_date`

The `raw_inclusion_date` is simply the date of the second qualifying
event. However, for events before 1998, the register data may not have
sufficient coverage to reliably distinguish new (incident) cases from
existing (prevalent) ones. The `stable_inclusion_date` column is set to
`NA` for these earlier dates to flag this uncertainty.

The `classify_diabetes()` function includes a
`stable_inclusion_start_date` parameter that is `01-01-1998` by default.
You can change this parameter to set a different date from which the
classification is considered stable.
:::
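
To make the relationship between the two date columns concrete, the
following base R sketch derives a stable date from a raw date using a
cutoff. It only illustrates the rule described in the note above and is
not the package's implementation. The object names, the dates, and the
cutoff value are chosen for illustration.

```r
# Hypothetical raw inclusion dates for three individuals.
raw_inclusion_date <- as.Date(c("1996-07-01", "1998-01-01", "2012-11-02"))

# Illustrative cutoff from which inclusion dates are treated as stable.
stable_start <- as.Date("1998-01-01")

# Keep dates on or after the cutoff; set earlier dates to NA.
stable_inclusion_date <- raw_inclusion_date
stable_inclusion_date[raw_inclusion_date < stable_start] <- NA

data.frame(raw_inclusion_date, stable_inclusion_date)
```

In practice, rows where `stable_inclusion_date` is `NA` are the ones to
treat with caution when you need incident rather than prevalent cases.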

For more information about the output, see the Interface section under
`vignette("design")`.

### Step 6: Saving the results

Once you have the classification results, you can save them as a Parquet
file for yourself or your collaborators on your DST project:

``` r
classified_diabetes |>
  duckplyr::as_duckdb_tibble() |>
  duckplyr::compute_parquet(
    "classified_diabetes.parquet"
  )
```
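
To check that the file was written as expected, or to pick up the
results in a later session, you can read it back. The sketch below does a
round trip on a small, made-up data frame in a temporary file, using
`arrow::write_parquet()` and `arrow::read_parquet()` rather than
duckplyr. The object names and values are hypothetical.

```r
# Hypothetical results written to a temporary Parquet file.
example_results <- data.frame(
  pnr = c("001", "002"),
  has_t1d = c(TRUE, FALSE),
  has_t2d = c(FALSE, TRUE)
)
parquet_path <- tempfile(fileext = ".parquet")
arrow::write_parquet(example_results, parquet_path)

# Read the file back into R as a data frame.
results_from_disk <- arrow::read_parquet(parquet_path)
results_from_disk
```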

## Working with real register data {#sec-working-with-real-register-data}

In a real-world scenario, the register data is too large to read into
memory all at once. We recommend converting your register files into
[Parquet](https://parquet.apache.org/) format on disk, with each
register in its own folder (e.g., all `lmdb` files in one folder, all
`lab_forsker` files in another, etc.).

::: callout-tip
To convert SAS (`.sas7bdat`) files to Parquet, you can use the
[`fastreg`](https://dp-next.github.io/fastreg/) package.
:::

If you're working on Statistics Denmark's (DST) server, be aware that
the pre-installed R packages can be quite old (for example, at the time
of writing, `library(duckplyr)` loads version 0.4 of the duckplyr
package from 2024, whereas the latest version on CRAN is \>1.1 from
2026). These old versions don't support the dplyr operations necessary
for osdc to work.

Installing osdc with `install.packages("osdc")` should force the
necessary updates of package dependencies. Otherwise, the latest version
of any given package can be installed from DST's local CRAN mirror by
using `install.packages()`. The downside of this approach is that you
have to repeat the package installations frequently (whenever you log on
to a new virtual machine, or after the servers' weekly reset).
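
One way to check whether an installed package is recent enough before
running osdc is `utils::packageVersion()`. In this sketch, the helper
`is_recent_enough()` is hypothetical, and the minimum version `"1.1.0"`
is an assumption based on the version numbers mentioned above, not an
official requirement.

```r
# Hypothetical helper: TRUE if the installed version of `package`
# is at least `minimum`.
is_recent_enough <- function(package, minimum) {
  utils::packageVersion(package) >= package_version(minimum)
}

# Check duckplyr only if it is installed, to avoid an error.
if (requireNamespace("duckplyr", quietly = TRUE)) {
  is_recent_enough("duckplyr", "1.1.0")
}
```

If the check returns `FALSE`, reinstalling the package with
`install.packages()` is the next step.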

After making sure that you have the newest version of osdc installed
(which should install/update any necessary dependencies), you can load
each register directly from its Parquet folder and convert it to DuckDB,
as shown below.

Currently, the conversion to DuckDB requires that the user specifies
some connection parameters first:

- the maximum amount of memory to be allocated for DuckDB data, e.g.,
  128GB, which is stable and should be more than enough to hold the
  inputs to osdc in DuckDB
- where to temporarily store any DuckDB data that exceeds this
  threshold.

``` r
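ddb_driver <- duckdb::duckdb(config = list(
  # Maximum amount of memory DuckDB may use.
  memory_limit = "128GB",
  # Where DuckDB spills data that exceeds the memory limit.
  temp_directory = "path/to/temp/dir"
))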

ddb_con <- DBI::dbConnect(ddb_driver)
```
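
To confirm which values are in effect, you can ask DuckDB directly. The
sketch below uses DuckDB's setting names `memory_limit` and
`temp_directory` together with the SQL function `current_setting()`. The
object names and the small values are chosen for illustration so that
the example can run on any machine.

```r
# Small illustrative values; on a real server you would use larger ones.
example_driver <- duckdb::duckdb(config = list(
  memory_limit = "1GB",
  temp_directory = tempdir()
))
example_con <- DBI::dbConnect(example_driver)

# Ask DuckDB which values are in effect for this connection.
settings <- DBI::dbGetQuery(
  example_con,
  "SELECT
     current_setting('memory_limit') AS memory_limit,
     current_setting('temp_directory') AS temp_directory"
)
settings

# Close the connection and shut down the example database.
DBI::dbDisconnect(example_con, shutdown = TRUE)
```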

If you're working with very large data sets, it's good practice to
filter the data on the Arrow side before converting to DuckDB to reduce
computation and memory use. Conveniently, most of dplyr's filtering and
selection operations are supported by the arrow package. If you need to
rename variables or convert their types to match the inputs expected by
osdc, you can do that too:

``` r
lpr3f_diagnoser <- "path/to/lpr3f_diagnoser_parquet_folder" |>
  arrow::open_dataset(unify_schemas = TRUE) |>
  # Insert dplyr filtering and variable selection/renaming steps here, e.g.:
  # dplyr::select(...) |>
  # dplyr::mutate(...) |>
  # dplyr::filter(...) |>
  arrow::to_duckdb(con = ddb_con) |>
  dplyr::compute()
```

If your data (or parts of it) is already in R (e.g., as a `data.frame`),
you can convert it to a DuckDB table with:

``` r
lpr3f_diagnoser <- lpr3f_diagnoser |>
  arrow::to_duckdb(con = ddb_con) |>
  dplyr::compute()
```

::: callout-important
Important notes on using `lpr3a` data!

A: `lpr3a` contains duplicates of contacts in 2017 & 2018 that are also
contained in `lpr2`. These rows must be filtered out before being input
to osdc's `prepare_lpr3a()` function.

B: The `kont_starttidspunkt` variable in `lpr3a` is a `datetime` type,
and must be converted to a `date` before being input to
`prepare_lpr3a()`.
:::

The duplicate rows in `lpr3a_kontakt` can be removed during processing
by filtering to `indberetningssystem == "LPR3"`. Similarly, the
`kont_starttidspunkt` variable in `lpr3a` can be converted to a `date`
in a single call to `dplyr::mutate()` during pre-processing in Arrow,
e.g.:

``` r
lpr3a_kontakt <- "path/to/lpr3a_kontakt_parquet_folder" |>
  arrow::open_dataset(unify_schemas = TRUE) |>
  # dplyr::select(...) |>
  dplyr::mutate(kont_starttidspunkt = as.Date(kont_starttidspunkt)) |>
  dplyr::filter(indberetningssystem == "LPR3") |>
  arrow::to_duckdb(con = ddb_con) |>
  dplyr::compute()
```

This can be used as input to `prepare_lpr3a()`:

``` r
lpr <- list(
  prepare_lpr2(lpr_adm, lpr_diag),
  prepare_lpr3f(lpr3f_kontakter, lpr3f_diagnoser),
  prepare_lpr3a(lpr3a_kontakt, lpr3a_diagnose)
) |>
  join_registers()
```

Most inputs will work in the Arrow/DuckDB equivalent of their type in
the original SAS file. Unless the SAS-to-Arrow conversion introduced
unexpected types, you should not need to do any further type
conversion.
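
To see which types the conversion produced before passing data on, you
can inspect the schema of the Arrow dataset. The sketch below writes a
tiny, made-up dataset to a temporary folder so that it can run anywhere.
The `id` column and all values are hypothetical; only the column name
`kont_starttidspunkt` comes from the text above.

```r
# Hypothetical data with a datetime column, written as a Parquet dataset.
example_kontakt <- data.frame(
  id = c("a", "b"),
  kont_starttidspunkt = as.POSIXct(
    c("2019-03-01 08:30:00", "2020-06-15 14:00:00"),
    tz = "UTC"
  )
)
dataset_dir <- tempfile()
arrow::write_dataset(example_kontakt, dataset_dir, format = "parquet")

example_dataset <- arrow::open_dataset(dataset_dir)

# The schema lists each column together with its Arrow type.
example_dataset$schema

# Convert the datetime to a date and check the resulting R class.
converted <- example_dataset |>
  dplyr::mutate(kont_starttidspunkt = as.Date(kont_starttidspunkt)) |>
  dplyr::collect()
class(converted$kont_starttidspunkt)
```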

Once you're done using DuckDB, you should close the connection and then
shut down the driver, like so:

``` r
# Close the connection first, then shut down the database driver.
DBI::dbDisconnect(ddb_con)
duckdb::duckdb_shutdown(ddb_driver)
```

## Getting help

If you encounter a bug or have a question about how to use osdc, please
open an issue on the
[GitHub repository](https://github.com/steno-aarhus/osdc/issues). When
reporting a bug, include a minimal example that reproduces the problem
to help us understand and investigate it.
