Data Pipeline

Pipeline Overview

Form: NG-HTS 001 Client Intake | Version: v1.6 | Country: Nigeria

This page describes how raw ScanForm scan data is transformed into analysis-ready tables through a series of dbt model layers. No dbt SQL models were provided for this form at time of documentation generation — the pipeline architecture below reflects the standard ScanForm fast analytics pipeline structure that applies to all forms on the platform.

flowchart LR
  subgraph source["📷 Source"]
    S1["ScanForm\nExport API"]
  end

  subgraph base["📥 Base"]
    B1["base\n(raw export columns)"]
  end

  subgraph pre_clean["🔧 Pre-Clean"]
    P1["pre_clean\n(link pages, filter)"]
  end

  subgraph clean["✅ Clean"]
    C1["clean\n(cast, impute, reshape)"]
  end

  subgraph checks["🔍 Checks"]
    CK1["checks\n(validation rules)"]
  end

  subgraph refinery["💎 Refinery"]
    R1["refinery_clean\n(mart-ready data)"]
    R2["refinery_checks\n(mart-ready DQA)"]
  end

  source --> base
  base --> pre_clean
  pre_clean --> clean
  pre_clean --> checks
  clean --> refinery
  checks --> refinery

No dbt models were provided for this form. The sections below describe the expected pipeline structure and transformation logic inferred from the form design (XLSForm, banding, and export configuration). This documentation will be updated automatically when dbt model files are added to the repository.

Pipeline Layers

The HTS 001 Client Intake data pipeline follows the standard QED fast analytics architecture — a sequence of dbt model layers, each with a well-defined responsibility. Data flows from the ScanForm export API through five transformation stages before reaching Metabase dashboards.

📥 Base Layer

The base model is a thin wrapper over the externally materialised ScanForm export. It selects all columns produced by the export configuration and applies minimal renaming to ensure consistent snake_case column names throughout the pipeline.

What happens here:

All columns arrive as raw text strings — no casting, no cleaning
Underscores from incomplete or uncertain OCR are preserved (e.g. _12_ rather than 12)
Both Page 1 and Page 2 columns are present in the same export row for this photopack form
Administrative columns included: page_id, entry_url, entry_submission_date, entry_submitted_by, row_id

Key columns arriving from the ScanForm export:

Column group	Example raw values	Notes
Date of Visit	`21/05/24`, `__/05/24`	Underscores indicate OCR uncertainty
Client Code	`123-456-789-AB`	Pre-printed; validated against expected format
All oval fields	`0`, `1`, `_`	0 = No, 1 = Yes, underscore = uncertain
Score fields	`3`, `_`, `6`	Integer sum of section responses
Text box fields	`JOHN`, `JO_N`	Underscores mark uncertain characters
Handwriting fields	(absent)	Handwriting zones are not OCR-processed

🔧 Pre-Clean Layer

The pre-clean model is the staging and filtering layer. It applies record-level inclusion rules before any analytical transformations occur, ensuring that downstream clean and checks models operate on a consistent, well-scoped dataset.

What happens here:

Duplicate exclusion: Only entries with newest_entry_status IN ('NEWEST', 'DISABLED') are retained, removing superseded re-scans of the same page
Issue filtering: Entries with scan processing failures are excluded
Discarded record exclusion: Rows where discard_page = 1 (i.e. the clinician has crossed the discard oval on the physical form) are dropped
Empty record exclusion: Rows where a minimum proportion of key fields are blank (indicating a page was scanned but no client data was entered) may be filtered
Column pruning: Administrative columns not needed downstream (e.g. photo_taken) are removed

This is the last point at which record inclusion logic is applied. Both the clean model and the checks model are built from the same pre-clean base, ensuring totals are consistent between the analytical dataset and the data quality report.

✅ Clean Layer

The clean model produces the analysis-ready dataset — one row per client intake encounter, with all fields correctly typed, reshaped, and labelled.

Key transformations applied:

Date parsing

Raw column	Transformation	Output column
`date_visit`	Strip underscores, parse `dd/mm/yyyy`, cast to `DATE`	`date_visit_clean`
`end_date`	Same as above	`end_date_clean`
Submission date	Used as fallback if `date_visit` is invalid or missing	`date_fallback`

Numeric casting

All digit-box fields arrive as strings with possible leading/trailing underscores (e.g. _2_). The clean layer applies numberFullyStripped-style logic:

Remove surrounding underscores
Cast to integer or float
Retain NULL for values that cannot be parsed (do not impute)

Affected fields: age, num_wives, num_children, recency, phone_number, knowledge_assessment, risk_assessment, tb_screening, sti_screening, sex_partner_risk_assessment, condom_count, lubricant_count, cd4_flow_cytometry

Oval decoding

All select_one_or_zero fields arrive as "0", "1", or "_". The clean layer casts these to:

0 → integer 0 (No)
1 → integer 1 (Yes)
"_" → NULL (uncertain / not verified)

Derived / business logic fields

Derived field	Logic	Purpose
`is_index_client`	`index_client = 1`	Flag for index testing sub-population
`is_hiv_positive`	`hiv_test_result = 1`	Gate for recency, CD4, and hepatitis follow-up
`is_first_visit`	`first_time_visit = 1`	Distinguishes new vs. returning clients
`knowledge_score_complete`	`knowledge_assessment IS NOT NULL`	Whether the 6-item score was fully recorded
`risk_score_complete`	`risk_assessment IS NOT NULL`	Whether the 9-item score was fully recorded
`refer_for_prep`	`prep_referred = 1`	PrEP linkage indicator
`referred_for_tb`	`tb_screening >= 1`	TB referral flag per clinical rule
`referred_for_sti`	`sti_screening >= 1`	STI referral flag per clinical rule
`recommend_retest`	`hiv_test_result = 0 AND (risk_assessment >= 1 OR sti_screening >= 1)`	3-month retest recommendation flag

Conditional field handling

Fields that are conditionally printed on the paper form are retained as-is in the clean model but annotated with a companion validity flag:

Field	Condition	Validity flag
`relation_with_index_client`	`is_index_client = TRUE`	`relation_valid`
`recency_test`	`is_hiv_positive = TRUE`	`recency_valid`
`cd4_semi_quantitative`	`is_hiv_positive = TRUE`	`cd4_valid`
`cd4_flow_cytometry`	`is_hiv_positive = TRUE`	`cd4_valid`

🔍 Checks Layer

The checks model is a data quality audit table — a union of many individual check result rows, each describing one data quality issue found in one record. It is built from the pre-clean model (not the clean model) so that checks operate on the least-transformed data possible.

Check categories expected for this form:

Check type	Fields checked	Description
Date validity	`date_visit`, `end_date`	Parsed date is within plausible range (e.g. not in the future, not before 2010)
Date completeness	`date_visit`	Day, month, or year component is missing (underscore)
Score range	`knowledge_assessment`, `risk_assessment`, `tb_screening`, `sti_screening`, `sex_partner_risk_assessment`	Score exceeds the maximum possible sum for its section
Score consistency	All score fields	Score does not match the count of `1` responses in constituent items
Select-one violation	All `select_one_or_zero` fields	More than one oval marked in a single-select group (caught by OCR model but checked in pipeline too)
Conditional completeness	`relation_with_index_client`, `index_client_code_new`, `index_client_code_old`	Field is blank when `index_client = 1`
Conditional completeness	`recency_test`, `cd4_semi_quantitative`	Field is blank when `hiv_test_result = 1`
Cross-field logic	`breastfeeding_under_6` + `breastfeeding_over_6`	Both marked Yes simultaneously
Cross-field logic	`hiv_test_result` + `recency_test`	Recency result recorded but HIV result is Negative
Client code format	`client_code`	Does not match expected pattern `###-###-###-XX`
Phone number length	`phone_number`	Fewer than 11 digits recorded

Check output structure:

Each row in the checks table contains:

Metadata: page_id, entry_url, entry_submission_date, row_id
Check description: plain-language description of the issue found
Check tags: field name(s) involved, check category
Severity: (where applicable)

💎 Refinery Layer

Refinery models are mart-ready tables placed in the marts/ directory. They are the only models directly accessible in Metabase. They consume both the clean model and the checks model, enriching them with any aggregation, date grouping, or location metadata needed for dashboard construction.

refinery_clean — analytical dataset

Adds to the clean model:

Stable date grouping columns for Metabase filtering: visit_month, visit_year, visit_quarter (derived from date_visit_clean; falls back to entry_submission_date when visit date is invalid)
Modality label decoded from the testing_setting code using the legend printed on the physical form (1 = Emergency, 2 = Index, … 14 = PMTCT Post ANC1)
Location metadata joins (facility name, LGA, state) where a master facility list is available
All derived flags from the clean layer are forwarded unchanged

refinery_checks — DQA dashboard dataset

Adds to the checks model:

Same stable date grouping columns for trend analysis of data quality over time
Check counts aggregated by page_id to enable record-level DQA scores
Location metadata for geographic breakdowns of data quality

Data Flow Summary

Table 1

Layer	Input	Key Operations	Output Consumers	Row grain
Pipeline Layer Summary — HTS 001 Client Intake
📥 Base	ScanForm externally materialised export	Column selection, snake_case rename	Pre-clean layer	1 row per ScanForm entry
🔧 Pre-Clean	Base model	Deduplication, discard filtering, empty record removal	Clean layer, Checks layer	1 row per valid entry
✅ Clean	Pre-clean model	Type casting, date parsing, oval decoding, derived flags	Refinery clean	1 row per client intake
🔍 Checks	Pre-clean model	Rule-based validation, cross-field consistency checks	Refinery checks	1 row per data quality issue
💎 Refinery	Clean model + Checks model	Date grouping, label decoding, location joins, aggregation	Metabase dashboards	1 row per client intake (clean) / 1 row per issue (checks)

Notes for Data Managers

Interpreting underscore values

In all base and pre-clean outputs, underscores (_) represent OCR uncertainty — a character that ScanForm could not read with confidence and that has not yet been verified by a human operator. Never treat an underscore as a zero or a space. In the clean layer these become NULL.

Externally materialised exports cannot be updated in place

The ScanForm export feeding this pipeline is externally materialised. Any change to the export configuration requires creating a new versioned export (e.g. hts001_v2), backfilling all historical data, updating dashboard references, and only then retiring the previous version. This process can take several days for high-volume forms.

Score field consistency

The five assessment scores (knowledge_assessment, risk_assessment, tb_screening, sti_screening, sex_partner_risk_assessment) are handwritten sums entered by the clinician. The checks layer validates these against the individual item responses. Discrepancies are common and should be reviewed as part of routine data quality assurance — the pipeline retains the clinician-recorded score as the primary value and flags inconsistencies rather than overwriting them.