---
title: "Design"
vignette: >
  %\VignetteIndexEntry{Design}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup}
#| include: false
library(dplyr)
library(osdc)
```

## Principles

These are the guiding principles for this package:

1. Because of the amount of data in the registers and the extensive
   processing that osdc does to classify diabetes status, the data must
   be in the [DuckDB](https://duckdb.org/) format. DuckDB is a fast
   analytical database engine, which osdc relies on to keep performance
   high.
2. Functions have consistent inputs and outputs (e.g. inputs and outputs
   are the same, regardless of specific conditions).
3. Functions have predictable outputs based on inputs (e.g. if an input
   is a `data.frame`, the output is a `data.frame`).
4. Functions have consistent naming based on their action.
5. Functions have limited additional arguments.
6. Functions are agnostic to the casing of input variable names (upper
   or lower case), but all internal and output variables are lower
   case.
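
As a minimal sketch of principle 6, the helper below lower-cases all column names before any further processing. The name `lowercase_names()` and the toy column names are hypothetical and not part of osdc; the example only shows one way of making input casing irrelevant.

```r
library(dplyr)

# Hypothetical helper (not an osdc function): convert all column names
# to lower case so later steps can rely on a single casing.
lowercase_names <- function(data) {
  rename_with(data, tolower)
}

# Toy register with mixed-case column names (invented for illustration).
register <- tibble(
  PNR = c("001", "002"),
  D_InDDto = as.Date(c("2001-05-01", "2003-02-11"))
)

names(lowercase_names(register))
```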

## Use cases

Based on our experience, we expect the package will be:

- Used entirely within the servers of Statistics Denmark or the Danish
  Health Data Authority, since that is where the data are kept.
- Used by researchers within or affiliated with Danish research
  institutions.
- Used specifically within a Danish register-based context.

Below is a set of "narratives" or "personas" with associated needs that
this package aims to fulfill.

"As a researcher, ..."

- "... I want to easily get an overview of which Danish registers and
  variables I need to request from Statistics Denmark and the Danish
  Health Data Authority, so that I am able to classify the diabetes
  status of individuals in the registers using the osdc package."
- "... I want to easily and simply create a dataset that contains data
  on diabetes status in my population, so that I can begin conducting my
  research that involves persons with diabetes without having to tinker
  with coding the correct algorithm to classify them."
- "... I want to be informed early and in a clear way whether my data
  fits with the required data types, so that I can fix and correct these
  issues without having to do extensive debugging of the code and/or
  data."

## Core functionality

This is the list of the core functionality of the osdc package:

1. Classifies individuals' diabetes type (type 1 or 2)
2. Outputs a single data frame-type object (as a DuckDB object)
   including individuals with diabetes, their type (type 1 or 2), and
   date of onset as classified by the algorithm.
3. Internally checks individual registers for the variables required by
   the algorithm.
4. Provides a list of required variables and registers in order to
   calculate diabetes status.
5. Provides internal checks of whether variables match the expected data
   types.
6. Provides a common and easily accessible standard for determining
   diabetes status within the context of research using Danish
   registers.
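
The checks in points 3 and 5 could look roughly like the sketch below. This is not the osdc implementation: `check_column_types()` is a hypothetical helper, and it assumes an in-memory data frame rather than a DuckDB object. It fails early with a readable message when a required variable is missing or has an unexpected type.

```r
# Hypothetical helper (not the osdc implementation): compare the first
# class of each required column against a named vector of expected classes.
check_column_types <- function(data, expected) {
  missing_columns <- setdiff(names(expected), names(data))
  if (length(missing_columns) > 0) {
    stop("Missing required variables: ", paste(missing_columns, collapse = ", "))
  }
  actual <- vapply(
    names(expected),
    function(column) class(data[[column]])[1],
    character(1)
  )
  wrong <- names(expected)[actual != expected]
  if (length(wrong) > 0) {
    stop("Unexpected data type in: ", paste(wrong, collapse = ", "))
  }
  # Return the data invisibly so the check can sit inside a pipeline.
  invisible(data)
}

# Toy register (invented for illustration).
register <- data.frame(
  pnr = c("001", "002"),
  date = as.Date(c("2001-05-01", "2003-02-11"))
)
check_column_types(register, c(pnr = "character", date = "Date"))
```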

## Function conventions

To effectively develop both the user-facing and internal functions, we
follow some conventions and design patterns for building these
functions. There are a few conventions we describe here: naming patterns
for functions and arguments, their argument input requirements, and
their output data structure.

The conventions below are *ideals* only, to be used as guidelines to
help with development and understanding of the code; they are not hard
rules.

### Naming

- First word is an action verb, later words are objects or conditions.
- Functions that filter by dropping rows based on specific criteria are
  prefixed with `drop_`.
- Functions that filter by keeping rows based on specific criteria are
  prefixed with `keep_`.
- Helpers that add columns needed for classification are prefixed with
  `add_`.
- Helpers that join the output of other functions are prefixed with
  `join_`.
- Functions that prepare and process register data are prefixed with
  `prepare_`.

### Input

- We assume the register data is *not* taken directly from the original
  SAS files, but has undergone prior pre-processing and cleaning. These
  assumptions are checked, and the user is informed if they are not met.
- We also assume that the original source files have been loaded and
  joined into a single dataset object per register. Although Statistics
  Denmark stores data by year, all years for a register must be merged
  into one dataset.
- Functions take as few arguments as possible, with as few required
  arguments as possible (ideally one or two).
- `keep_` functions take a register as the first argument.
  - One input register database at a time.
- `drop_` functions can take a register as the first argument or take
  the output from a `keep_` function.
- Function arguments take a single DuckDB type object as register input
  (e.g. `duckplyr_df`), consistent with the assumption that each
  register is provided as a single, unified data frame.
- The first argument will always take a data frame type object.
- The second argument could be an output data frame object from another
  function.

### Output

- All functions output the same type of object as the input object (a
  `duckplyr_df` type object).
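
The naming, input, and output conventions can be sketched together with two toy functions. Both functions are hypothetical and not part of the osdc API, and a plain tibble stands in for a `duckplyr_df` to keep the example self-contained.

```r
library(dplyr)

# `keep_`: takes one register as its first argument and keeps matching rows.
keep_diabetes_diagnoses <- function(data) {
  filter(data, is_diabetes_code)
}

# `drop_`: takes a register, or the output of a `keep_` function, and
# drops matching rows.
drop_pregnancy_diagnoses <- function(data) {
  filter(data, !is_pregnancy_code)
}

# Toy data standing in for a prepared register.
diagnoses <- tibble(
  pnr = c("001", "002", "003"),
  is_diabetes_code = c(TRUE, TRUE, FALSE),
  is_pregnancy_code = c(FALSE, TRUE, FALSE)
)

kept <- diagnoses |>
  keep_diabetes_diagnoses() |>
  drop_pregnancy_diagnoses()

# The output has the same class as the input.
identical(class(kept), class(diagnoses))
```

Because every function takes a data frame first and returns the same type, the functions chain with the pipe without any conversion steps in between.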

## Interface

The osdc package contains one main function, which classifies
individuals as having either type 1 or type 2 diabetes using the Danish
registers, and a few helper pre-processing functions.

### `prepare_lpr*()`

In order to classify diabetes status, we need the patient registers with
diagnosis information (known collectively as Landspatientregisteret,
LPR). There isn't just one LPR but several different LPRs that have
evolved over time. Statistics Denmark (DST) in fact relatively recently
created a new LPR (LPR3A) that resolves some issues with the previous
LPR registers. Each version of LPR contains different tables and
variables, though osdc only needs specific variables from two tables.

We originally required each LPR register as a separate argument
in `classify_diabetes()`, but this became an issue after the new LPR3A
was created. So, we re-designed `classify_diabetes()` to take only one
`lpr` argument and instead require the different LPRs be pre-processed
and joined before entering `classify_diabetes()`. This way, we can add
new pre-processing functions for any future changes to LPR without
changing the interface of `classify_diabetes()`.

To help with this pre-processing, we designed several helper functions
that follow the pattern `prepare_lpr*()`, e.g. for LPR2 it is
`prepare_lpr2()`. This way, if DST updates the LPR again, we can add
another `prepare_lpr*()` function to prepare the new LPR format for
classification.

Unfortunately, the data covered by different revisions of the same
registers are not cleanly separated. E.g. data from the year 2005
overlaps between `sysi` (years 1990 through 2005) and `sssy` (2005
onward), and data from 2017 and 2018 are contained in both `lpr2` (1977
through 2018) and `lpr3a` (2017 onward). **This means that the user must
be careful to pre-process these data to avoid duplicated rows!**
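
One possible way to handle such an overlap is to split the two revisions at a cut-off date before combining them. The sketch below uses toy tables and an arbitrary cut-off chosen for illustration; which revision to prefer in the overlapping period, and the exact date, depend on the data extract and are assumptions here.

```r
library(dplyr)

# Toy stand-ins for two revisions of a register that overlap in time.
lpr2 <- tibble(
  pnr = c("001", "002"),
  date = as.Date(c("2016-03-01", "2017-06-15"))
)
lpr3a <- tibble(
  pnr = c("002", "003"),
  date = as.Date(c("2017-06-15", "2019-01-20"))
)

# Assumed cut-off for illustration only.
cutoff <- as.Date("2017-01-01")

# Take rows before the cut-off from the older revision and rows on or
# after the cut-off from the newer one.
combined <- bind_rows(
  filter(lpr2, date < cutoff),
  filter(lpr3a, date >= cutoff)
)

# Confirm that no row appears twice after combining.
anyDuplicated(combined) == 0
```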

Each `prepare_lpr*()` outputs a DuckDB object with the following
variables: `pnr`, `date`, `is_primary_diagnosis`, `is_diabetes_code`,
`is_t1d_code`, `is_t2d_code`, `is_endocrinology_dept`,
`is_medical_dept`, and `is_pregnancy_code`. A final `join_registers()`
helper function then combines the outputs of each `prepare_lpr*()` into
a single data object. See the help docs for
`prepare_lpr()` for more details on these variables. See the diagram
below for the general flow of data sources and the different functions
that prepare them for the `classify_diabetes()` function.

```{mermaid}
%%| label: fig-prepare-lpr-flow
%%| fig-cap: "Flow diagram showing the different data sources needed for the `prepare_lpr*()` functions and how they are processed and joined together before entering into `classify_diabetes()`."
flowchart TB
  subgraph data_sources["Data sources"]
    lpr2_diag[("lpr2_diag")]
    lpr2_adm[("lpr2_adm")]
    lpr3a_kontakt[("lpr3a_kontakt")]
    lpr3a_diagnose[("lpr3a_diagnose")]
    lpr3f_kontakter[("lpr3f_kontakter")]
    lpr3f_diagnoser[("lpr3f_diagnoser")]
  end

  lpr2_diag & lpr2_adm --> prepare_lpr2["prepare_lpr2()"]
  lpr3f_kontakter & lpr3f_diagnoser --> prepare_lpr3f["prepare_lpr3f()"]
  lpr3a_kontakt & lpr3a_diagnose --> prepare_lpr3a["prepare_lpr3a()"]

  prepare_lpr2 & prepare_lpr3f & prepare_lpr3a --> join_registers["join_registers()"]
  join_registers --> lpr[(lpr)]

  %% Styling
  classDef default fill:#EEEEEE, color:#000000, stroke:#000000
  style data_sources fill:#FFFFFF, color:#000000, stroke-width:0px
```
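
The idea behind the flow in the diagram can be sketched with toy stand-ins. The functions and raw column names below are invented for illustration and do not reproduce `prepare_lpr*()` or `join_registers()`; the diagnosis code prefix is likewise only an example. The point is that once each version is mapped to the same columns, combining them is simple and the downstream function only ever sees one `lpr` object.

```r
library(dplyr)

# Toy raw tables with different (invented) column names per version.
old_format <- tibble(
  id = "001",
  admission_date = as.Date("2010-01-05"),
  code = "DE10"
)
new_format <- tibble(
  person = "002",
  contact_start = as.Date("2020-03-09"),
  diagnosis = "DE11"
)

# Hypothetical preparers: each maps its own format to the same columns.
prepare_old <- function(data) {
  transmute(
    data,
    pnr = id,
    date = admission_date,
    is_t1d_code = startsWith(code, "DE10")
  )
}
prepare_new <- function(data) {
  transmute(
    data,
    pnr = person,
    date = contact_start,
    is_t1d_code = startsWith(diagnosis, "DE10")
  )
}

# Because the outputs share a schema, joining them is a row bind.
lpr <- bind_rows(prepare_old(old_format), prepare_new(new_format))
```

Supporting a future register version in this toy setup would mean adding one more preparer, with no change to the code that consumes `lpr`.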

### `classify_diabetes()`

This function classifies those with diabetes (type 1 or 2) based on the
Danish registers described in this vignette and
`vignette("data-sources")`. All data sources needed by osdc are used as
input for this function. The specific details of the classification
algorithm are described in the `vignette("algorithm")`.

There is one argument in `classify_diabetes()` for each required data
source. The names and descriptions of these arguments are as follows:

```{r}
#| output: asis
#| echo: false
registers() |>
  purrr::imap_chr(~ glue::glue("- `{.y}`: The register or set of registers called '{.x$name}' in Danish.")) |>
  unname() |>
  cat(sep = "\n")
```

The output is a DuckDB object with five columns:

- **pnr**: The pseudonymised social security number of individuals in
  the diabetes population (one row per individual).
- **stable_inclusion_date**: The *stable* inclusion date (i.e., the raw
  inclusion date, kept only for individuals included in a time period
  where data coverage is sufficient to make incident cases reliable, and
  `NA` otherwise)[^1].
- **raw_inclusion_date**: The *raw* inclusion date (i.e., the date of
  the second inclusion event as described in the
  `vignette("algorithm")`).
- **has_t1d**: A logical column indicating whether the individual has
  type 1 diabetes.
- **has_t2d**: A logical column indicating whether the individual has
  type 2 diabetes.

[^1]: For more information on the "raw" versus "stable" inclusion date,
  see `vignette("algorithm")`.

For an example, see below.

| pnr | stable_inclusion_date | raw_inclusion_date | has_t1d | has_t2d |
|-----|-----------------------|--------------------|---------|---------|
| 1   | 2020-01-01            | 2020-01-01         | TRUE    | FALSE   |
| 4   | NA                    | 1995-04-19         | FALSE   | TRUE    |

: Example rows of the output of `classify_diabetes()`.

Individuals `1` and `4` have been classified as having diabetes. `1` is
classified as having type 1 diabetes (T1D) with an inclusion date of
`2020-01-01`. Since this date is within a time period of sufficient data
coverage, the column `stable_inclusion_date` is populated with the same
date as `raw_inclusion_date`.

The individual in the second row, `4`, is classified as having type 2
diabetes (T2D) with an inclusion date of `1995-04-19`. Since 1995 is
within a time period of insufficient data coverage, the validity of this
inclusion date is uncertain and `stable_inclusion_date` is `NA`.
However, `raw_inclusion_date` still contains the inclusion date of this
individual.

In the context of generating a diabetes population with valid inclusion
dates (e.g. true incident cases), three aspects of the register records
were considered when determining which periods of time had sufficient
data available:

- **Sufficient data on inclusion events:** While HbA1c test results are
  the diagnostic standard, these records are the newest addition to the
  register data ecosystem and have limited historical coverage
  nationwide. According to supplementary analyses by Isaksen et
  al.[@Isaksen2023sup], this data has complete nationwide coverage from
  Q4 2015 onward
  ([direct link to supplementary file S9](https://doi.org/10.1371/journal.pgph.0001277.s009)).
  However, as the vast majority of diabetes patients are treated with
  glucose-lowering drugs at some point, we made the pragmatic assessment
  that prescription drug purchase data are sufficient to identify
  incident cases. These are available from 1995 onward.
- **Sufficient data on exclusion events:** In order to correctly
  identify pregnancies and discard inclusion events that may occur due
  to gestational diabetes rather than T1D or T2D, register information
  on pregnancy occurrences is necessary. In the patient register, this
  information is available from 1994 onward, but coverage is
  insufficient until 1997, according to supplementary analyses by
  Isaksen[@isaksen2023thesis]
  ([direct link to analysis](https://aastedet.github.io/dissertation/5-discussion-methods.html#fig-births)).
- **Sufficient wash-out period:** In order to "wash out" prevalent cases
  from true incident cases, a period of time with valid data is
  necessary to capture prevalent cases, before new inclusions can be
  considered true incident cases and the incidence stabilizes. We
  considered a full year to be enough.

Given the above requirements of complete nationwide data on inclusion
and exclusion events, as well as a sufficient wash-out period to
establish valid incident cases, the algorithm was designed to restrict
valid inclusion dates to periods where all criteria are met.
Consequently, only inclusion dates occurring from 1998 onward are
considered true incident cases and assigned a `stable_inclusion_date`
value.
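
The relationship between the two date columns can be sketched as follows, using the two example rows from the table above. This is an illustration rather than the osdc implementation, and the exact boundary date (`1998-01-01`) is an assumption derived from the "1998 onward" rule.

```r
library(dplyr)

# Assumed boundary: inclusions from 1998 onward count as stable.
stable_from <- as.Date("1998-01-01")

inclusions <- tibble(
  pnr = c("1", "4"),
  raw_inclusion_date = as.Date(c("2020-01-01", "1995-04-19"))
)

# Keep the raw date when it falls in the stable period, otherwise NA.
inclusions <- inclusions |>
  mutate(
    stable_inclusion_date = if_else(
      raw_inclusion_date >= stable_from,
      raw_inclusion_date,
      as.Date(NA)
    )
  )
```

Keeping `raw_inclusion_date` alongside the stable date means that prevalent cases remain in the population, while analyses of incidence can restrict to rows where `stable_inclusion_date` is not missing.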