---
title: "Estimation Workflows"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Estimation Workflows}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

## What is a Workflow?

After transforming survey data with steps and recipes, the next task is
**estimation**: computing means, totals, ratios, and their standard errors
while accounting for the complex survey design.

The `workflow()` function wraps the estimators from the `survey` package
(`svymean`, `svytotal`, `svyratio`, `svyby`) and returns tidy results as a
`data.table` that includes:

- Point estimates and standard errors
- Coefficients of variation (CV)
- Confidence intervals
- Metadata for reproducibility

## Initial Setup

We use the Academic Performance Index (API) dataset from the `survey` package.
Its `apistrat` table is a stratified sample of California schools, with the
sampling weight stored in the `pw` column.

```{r setup}
library(metasurvey)
library(survey)
library(data.table)

data(api, package = "survey")
dt <- data.table(apistrat)

svy <- Survey$new(
  data    = dt,
  edition = "2000",
  type    = "api",
  psu     = NULL,
  engine  = "data.table",
  weight  = add_weight(annual = "pw")
)
```

## Basic Estimation

### Mean

We estimate the population mean of the API score in the year 2000:

```{r mean}
result <- workflow(
  list(svy),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "annual"
)

result
```

### Total

We estimate total enrollment across all schools:

```{r total}
result_total <- workflow(
  list(svy),
  survey::svytotal(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)

result_total
```
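
As a sanity check, the two point estimates above can be reproduced by hand
from the sampling weights. The sketch below bypasses `workflow()` and calls
the `survey` package directly on a weights-only design, which is an assumption
made for illustration: the design that metasurvey builds internally may differ.
It covers point estimates only, since standard errors depend on the full design.

```r
library(survey)

data(api, package = "survey")

# Weighted mean by hand: sum(w * y) / sum(w)
mean_by_hand <- weighted.mean(apistrat$api00, w = apistrat$pw)

# Weighted total by hand: sum(w * y)
total_by_hand <- sum(apistrat$pw * apistrat$enroll)

# The same point estimates from the survey package.
# Assumption: a weights-only design, used here only for illustration.
des <- svydesign(ids = ~1, weights = ~pw, data = apistrat)
mean_survey  <- unname(coef(svymean(~api00, des)))
total_survey <- unname(coef(svytotal(~enroll, des)))

c(by_hand = mean_by_hand, survey = mean_survey)
c(by_hand = total_by_hand, survey = total_survey)
```

In both cases the hand-computed value and the `survey` value coincide, because
the point estimate depends only on the weights.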

### Multiple Estimates at Once

You can pass multiple estimation calls to `workflow()` to compute them in a
single step:

```{r multiple}
results <- workflow(
  list(svy),
  survey::svymean(~api00, na.rm = TRUE),
  survey::svytotal(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)

results
```
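
The precision measures reported alongside each estimate are related in a
simple way: the CV is the standard error divided by the estimate, and a 95%
confidence interval under the normal approximation is the estimate plus or
minus about 1.96 standard errors. The following sketch checks both relations
with the `survey` package alone. The stratified design is the one used in the
`survey` documentation and is an assumption here, not necessarily the design
that `workflow()` constructs.

```r
library(survey)

data(api, package = "survey")

# Assumption: stratified design as in the survey package examples
des <- svydesign(
  ids = ~1, strata = ~stype, weights = ~pw, fpc = ~fpc, data = apistrat
)

est   <- svymean(~api00, des)
point <- unname(coef(est))
se    <- unname(SE(est))

# CV: standard error relative to the point estimate (as a proportion)
cv_by_hand <- se / point

# 95% confidence interval under the normal approximation
ci_by_hand <- point + c(-1, 1) * qnorm(0.975) * se

# Compare with the helpers provided by the survey package
cv_survey <- as.numeric(cv(est))
ci_survey <- as.numeric(confint(est))

all.equal(cv_by_hand, cv_survey)
all.equal(ci_by_hand, ci_survey)
```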

## Domain Estimation

We use `survey::svyby()` to compute estimates by subpopulations (domains):

```{r domain}
# Mean API score by school type
api_by_type <- workflow(
  list(svy),
  survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

api_by_type
```

```{r domain-award}
# Mean enrollment by awards status
enroll_by_award <- workflow(
  list(svy),
  survey::svyby(~enroll, ~awards, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

enroll_by_award
```
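
A domain estimate is computed on a subset of the *design*, not on a filtered
copy of the raw data, so that the design information needed for the standard
errors is preserved. The sketch below illustrates this with the `survey`
package alone: one row of a `svyby()` result is compared with `svymean()` on
`subset()` of the design object. The stratified design is again an assumption
taken from the `survey` documentation.

```r
library(survey)

data(api, package = "survey")

# Assumption: stratified design as in the survey package examples
des <- svydesign(
  ids = ~1, strata = ~stype, weights = ~pw, fpc = ~fpc, data = apistrat
)

# All domains in one call
by_type <- svyby(~api00, ~stype, des, svymean)

# A single domain: subset the design object, then estimate
elem <- svymean(~api00, subset(des, stype == "E"))

from_svyby  <- by_type["E", "api00"]
from_subset <- unname(coef(elem))

c(svyby = from_svyby, subset = from_subset)
```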

## Quality Assessment

The **coefficient of variation (CV)**, the standard error divided by the point
estimate, measures the relative precision of an estimate. Use `evaluate_cv()`
to classify quality according to the following thresholds:

| CV Range | Quality | Recommendation |
|----------|---------|----------------|
| < 5%     | Excellent | Use without restrictions |
| 5-10%    | Very good | Use with confidence |
| 10-15%   | Good | Use for most purposes |
| 15-25%   | Acceptable | Use with caution |
| 25-35%   | Poor | Only for general trends |
| >= 35%   | Unreliable | Do not publish |

```{r cv}
# Evaluate quality of the API score estimate
cv_pct <- results$cv[1] * 100
quality <- evaluate_cv(cv_pct)

cat("CV:", round(cv_pct, 2), "%\n")
cat("Quality:", quality, "\n")
```
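
The thresholds in the table can be written down with base R's `cut()`. The
helper `classify_cv()` below is hypothetical and is not part of metasurvey. It
assumes each range includes its lower bound and excludes its upper bound,
which matches the `>= 35%` row but is otherwise a guess; `evaluate_cv()` may
treat boundary values differently.

```r
# Hypothetical helper, not part of metasurvey: classify a CV given in percent
classify_cv <- function(cv_pct) {
  breaks <- c(0, 5, 10, 15, 25, 35, Inf)
  labels <- c(
    "Excellent", "Very good", "Good",
    "Acceptable", "Poor", "Unreliable"
  )
  # right = FALSE gives intervals closed on the left: [0, 5), [5, 10), ...
  as.character(cut(cv_pct, breaks = breaks, labels = labels, right = FALSE))
}

classify_cv(c(1.2, 7.5, 35))
# "Excellent" "Very good" "Unreliable"
```

Negative inputs fall outside every interval and return `NA` in this sketch.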

## RecipeWorkflow: Publishable Estimates

A `RecipeWorkflow` bundles estimation calls with metadata, making the
analysis reproducible and shareable. It records:

- Which recipes were used for data preparation
- Which estimation calls were performed
- Authorship and versioning information

### Creating a RecipeWorkflow

```{r create-wf}
wf <- RecipeWorkflow$new(
  name = "API Score Analysis 2000",
  description = "Mean API score estimation by school type",
  user = "Research Team",
  survey_type = "api",
  edition = "2000",
  estimation_type = "annual",
  recipe_ids = character(0),
  calls = list(
    "survey::svymean(~api00, na.rm = TRUE)",
    "survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE)"
  )
)

wf
```
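
The estimation calls are stored as plain strings, which keeps the workflow
easy to serialize and share. To give an intuition for how such a string can
be turned back into an estimate, the sketch below parses it with base R and
evaluates it against a design. The helper `run_stored_call()` is hypothetical
and only illustrates the idea; it does not describe how metasurvey evaluates
stored calls, and the weights-only design is likewise an assumption.

```r
library(survey)

data(api, package = "survey")

# Assumption: a weights-only design, used here only for illustration
des <- svydesign(ids = ~1, weights = ~pw, data = apistrat)

# Hypothetical helper, not part of metasurvey
run_stored_call <- function(call_string, design) {
  cl <- str2lang(call_string)  # text -> unevaluated call
  # Append `design = design` so the estimator knows which design to use
  cl <- as.call(c(as.list(cl), list(design = quote(design))))
  eval(cl)                     # evaluated where `design` is visible
}

stored <- "survey::svymean(~api00, na.rm = TRUE)"
res <- run_stored_call(stored, des)
res
```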

### Publishing to the Registry

We publish the workflow so that others can discover and reuse it:

```{r wf-registry}
# Configure a local backend
wf_path <- tempfile(fileext = ".json")
set_workflow_backend("local", path = wf_path)

# Publish
publish_workflow(wf)

# Discover workflows
all_wf <- list_workflows()
length(all_wf)

# Search by text (matches the workflow published above)
found <- search_workflows("API")
length(found)

# Filter by survey type
api_wf <- filter_workflows(survey_type = "api")
length(api_wf)
```

### Finding Workflows Associated with a Recipe

If you have a recipe and want to know which estimates have been published
for it, you can use `find_workflows_for_recipe()`:

```{r find-for-recipe}
# Create a workflow that references a recipe
wf2 <- RecipeWorkflow$new(
  name            = "Labor Market Estimates",
  user            = "Team",
  survey_type     = "ech",
  edition         = "2023",
  estimation_type = "annual",
  recipe_ids      = c("labor_force_recipe_001"),
  calls           = list("survey::svymean(~employed, na.rm = TRUE)")
)

publish_workflow(wf2)

# Find all workflows that use this recipe
related <- find_workflows_for_recipe("labor_force_recipe_001")
length(related)
if (length(related) > 0) cat("Found:", related[[1]]$name, "\n")
```

## Sharing via the Remote API

For broader dissemination, you can publish workflows to the metasurvey API:

```r
# Requires authentication
api_login("you@example.com", "password")

# Publish
api_publish_workflow(wf)

# Browse (avoid naming the result `all`, which masks base::all)
remote_wf <- api_list_workflows(survey_type = "ech")
specific <- api_get_workflow("workflow_id_here")
```

## Full Pipeline

Below is a complete pipeline, from raw data to publishable estimates,
using the API dataset:

```{r full-pipeline}
# 1. Create survey from real data
dt_full <- data.table(apistrat)

svy_full <- Survey$new(
  data    = dt_full,
  edition = "2000",
  type    = "api",
  psu     = NULL,
  engine  = "data.table",
  weight  = add_weight(annual = "pw")
)

# 2. Apply steps: compute derived variables
svy_full <- step_compute(svy_full,
  api_growth = api00 - api99,
  high_growth = ifelse(api00 - api99 > 50, 1L, 0L),
  comment = "API score growth indicators"
)

svy_full <- step_recode(svy_full, school_level,
  stype == "E" ~ "Elementary",
  stype == "M" ~ "Middle",
  stype == "H" ~ "High",
  .default = "Other",
  comment = "School level classification"
)

# 3. Estimate means
estimates <- workflow(
  list(svy_full),
  survey::svymean(~api_growth, na.rm = TRUE),
  survey::svymean(~high_growth, na.rm = TRUE),
  estimation_type = "annual"
)

estimates
```

```{r full-pipeline-domain}
# 4. Domain estimation (by school type)
by_school <- workflow(
  list(svy_full),
  survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

by_school
```

```{r full-pipeline-cv}
# 5. Assess quality
for (i in seq_len(nrow(estimates))) {
  cv_val <- estimates$cv[i] * 100
  cat(
    estimates$stat[i], ":",
    round(cv_val, 1), "% CV -",
    evaluate_cv(cv_val), "\n"
  )
}
```

## Provenance: Data Lineage

Every `Survey` object records **provenance** metadata: where the data
came from, which steps were applied, how many rows survived each step,
and which versions of R and metasurvey were used. This makes it possible
to trace any estimate back to the raw data.

```{r provenance}
# Provenance is populated automatically after bake_steps()
prov <- provenance(svy_full)
prov
```

Provenance is also attached to `workflow()` results, so you can always
inspect the full lineage of an estimate:

```{r provenance-workflow}
prov_wf <- provenance(estimates)
cat("metasurvey version:", prov_wf$environment$metasurvey_version, "\n")
cat("Steps applied:", length(prov_wf$steps), "\n")
```

For audit trails, export provenance to JSON:

```{r provenance-json, eval = FALSE}
provenance_to_json(prov, "audit_trail.json")
```

To compare two runs (e.g., different editions), use `provenance_diff()`:

```{r provenance-diff, eval = FALSE}
# Named `prov_changes` rather than `diff` to avoid masking base::diff
prov_changes <- provenance_diff(prov_2022, prov_2023)
prov_changes$steps_changed
prov_changes$n_final_changed
```

## Publication-Quality Tables

`workflow_table()` formats estimation results as publication-ready tables
using the `gt` package. It adds confidence intervals, CV quality
classification with color coding, and provenance-based source notes.

```{r workflow-table, eval = requireNamespace("gt", quietly = TRUE)}
workflow_table(estimates)
```

You can customize the output:

```{r workflow-table-opts, eval = requireNamespace("gt", quietly = TRUE)}
# Spanish locale, hide SE, custom title
workflow_table(
  estimates,
  locale = "es",
  show_se = FALSE,
  title = "API Growth Indicators",
  subtitle = "California Schools, 2000"
)
```

For domain estimates, the table detects group columns automatically:

```{r workflow-table-domain, eval = requireNamespace("gt", quietly = TRUE)}
workflow_table(by_school)
```

Export to any format supported by `gt::gtsave()`:

```{r workflow-table-export, eval = FALSE}
tbl <- workflow_table(estimates)
gt::gtsave(tbl, "estimates.html")
gt::gtsave(tbl, "estimates.docx")
gt::gtsave(tbl, "estimates.png")
```

## Next Steps

- **[Creating and Publishing Recipes](recipes.html)** -- Build reproducible transformation pipelines
- **[Survey Designs and Validation](complex-designs.html)** -- Stratification, clustering, replicate weights
- **[Case Study: ECH](ech-case-study.html)** -- Complete labor market analysis with estimation
