flowchart LR
subgraph source["📷 Source"]
S1["ScanForm\nExport API"]
end
subgraph base["📥 Base"]
B1["base\n(raw export columns)"]
end
subgraph pre_clean["🔧 Pre-Clean"]
P1["pre_clean\n(link pages, filter)"]
end
subgraph clean["✅ Clean"]
C1["clean\n(cast, impute, reshape)"]
end
subgraph checks["🔍 Checks"]
CK1["checks\n(validation rules)"]
end
subgraph refinery["💎 Refinery"]
R1["refinery_clean\n(mart-ready data)"]
R2["refinery_checks\n(mart-ready DQA)"]
end
source --> base
base --> pre_clean
pre_clean --> clean
pre_clean --> checks
clean --> refinery
checks --> refinery
Data Pipeline
Pipeline Overview
Pipeline Layers
The HTS 001 Client Intake data pipeline follows the standard QED fast analytics architecture — a sequence of dbt model layers, each with a well-defined responsibility. Data flows from the ScanForm export API through five transformation stages before reaching Metabase dashboards.
📥 Base Layer
The base model is a thin wrapper over the externally materialised ScanForm export. It selects all columns produced by the export configuration and applies minimal renaming to ensure consistent snake_case column names throughout the pipeline.
What happens here:
- All columns arrive as raw text strings — no casting, no cleaning
- Underscores from incomplete or uncertain OCR are preserved (e.g.
_12_rather than12) - Both Page 1 and Page 2 columns are present in the same export row for this photopack form
- Administrative columns included:
page_id,entry_url,entry_submission_date,entry_submitted_by,row_id
Key columns arriving from the ScanForm export:
| Column group | Example raw values | Notes |
|---|---|---|
| Date of Visit | 21/05/24, __/05/24 |
Underscores indicate OCR uncertainty |
| Client Code | 123-456-789-AB |
Pre-printed; validated against expected format |
| All oval fields | 0, 1, _ |
0 = No, 1 = Yes, underscore = uncertain |
| Score fields | 3, _, 6 |
Integer sum of section responses |
| Text box fields | JOHN, JO_N |
Underscores mark uncertain characters |
| Handwriting fields | (absent) | Handwriting zones are not OCR-processed |
🔧 Pre-Clean Layer
The pre-clean model is the staging and filtering layer. It applies record-level inclusion rules before any analytical transformations occur, ensuring that downstream clean and checks models operate on a consistent, well-scoped dataset.
What happens here:
- Duplicate exclusion: Only entries with
newest_entry_status IN ('NEWEST', 'DISABLED')are retained, removing superseded re-scans of the same page - Issue filtering: Entries with scan processing failures are excluded
- Discarded record exclusion: Rows where
discard_page = 1(i.e. the clinician has crossed the discard oval on the physical form) are dropped - Empty record exclusion: Rows where a minimum proportion of key fields are blank (indicating a page was scanned but no client data was entered) may be filtered
- Column pruning: Administrative columns not needed downstream (e.g.
photo_taken) are removed
✅ Clean Layer
The clean model produces the analysis-ready dataset — one row per client intake encounter, with all fields correctly typed, reshaped, and labelled.
Key transformations applied:
Date parsing
| Raw column | Transformation | Output column |
|---|---|---|
date_visit |
Strip underscores, parse dd/mm/yyyy, cast to DATE |
date_visit_clean |
end_date |
Same as above | end_date_clean |
| Submission date | Used as fallback if date_visit is invalid or missing |
date_fallback |
Numeric casting
All digit-box fields arrive as strings with possible leading/trailing underscores (e.g. _2_). The clean layer applies numberFullyStripped-style logic:
- Remove surrounding underscores
- Cast to integer or float
- Retain
NULLfor values that cannot be parsed (do not impute)
Affected fields: age, num_wives, num_children, recency, phone_number, knowledge_assessment, risk_assessment, tb_screening, sti_screening, sex_partner_risk_assessment, condom_count, lubricant_count, cd4_flow_cytometry
Oval decoding
All select_one_or_zero fields arrive as "0", "1", or "_". The clean layer casts these to:
0→ integer0(No)1→ integer1(Yes)"_"→NULL(uncertain / not verified)
Derived / business logic fields
| Derived field | Logic | Purpose |
|---|---|---|
is_index_client |
index_client = 1 |
Flag for index testing sub-population |
is_hiv_positive |
hiv_test_result = 1 |
Gate for recency, CD4, and hepatitis follow-up |
is_first_visit |
first_time_visit = 1 |
Distinguishes new vs. returning clients |
knowledge_score_complete |
knowledge_assessment IS NOT NULL |
Whether the 6-item score was fully recorded |
risk_score_complete |
risk_assessment IS NOT NULL |
Whether the 9-item score was fully recorded |
refer_for_prep |
prep_referred = 1 |
PrEP linkage indicator |
referred_for_tb |
tb_screening >= 1 |
TB referral flag per clinical rule |
referred_for_sti |
sti_screening >= 1 |
STI referral flag per clinical rule |
recommend_retest |
hiv_test_result = 0 AND (risk_assessment >= 1 OR sti_screening >= 1) |
3-month retest recommendation flag |
Conditional field handling
Fields that are conditionally printed on the paper form are retained as-is in the clean model but annotated with a companion validity flag:
| Field | Condition | Validity flag |
|---|---|---|
relation_with_index_client |
is_index_client = TRUE |
relation_valid |
recency_test |
is_hiv_positive = TRUE |
recency_valid |
cd4_semi_quantitative |
is_hiv_positive = TRUE |
cd4_valid |
cd4_flow_cytometry |
is_hiv_positive = TRUE |
cd4_valid |
🔍 Checks Layer
The checks model is a data quality audit table — a union of many individual check result rows, each describing one data quality issue found in one record. It is built from the pre-clean model (not the clean model) so that checks operate on the least-transformed data possible.
Check categories expected for this form:
| Check type | Fields checked | Description |
|---|---|---|
| Date validity | date_visit, end_date |
Parsed date is within plausible range (e.g. not in the future, not before 2010) |
| Date completeness | date_visit |
Day, month, or year component is missing (underscore) |
| Score range | knowledge_assessment, risk_assessment, tb_screening, sti_screening, sex_partner_risk_assessment |
Score exceeds the maximum possible sum for its section |
| Score consistency | All score fields | Score does not match the count of 1 responses in constituent items |
| Select-one violation | All select_one_or_zero fields |
More than one oval marked in a single-select group (caught by OCR model but checked in pipeline too) |
| Conditional completeness | relation_with_index_client, index_client_code_new, index_client_code_old |
Field is blank when index_client = 1 |
| Conditional completeness | recency_test, cd4_semi_quantitative |
Field is blank when hiv_test_result = 1 |
| Cross-field logic | breastfeeding_under_6 + breastfeeding_over_6 |
Both marked Yes simultaneously |
| Cross-field logic | hiv_test_result + recency_test |
Recency result recorded but HIV result is Negative |
| Client code format | client_code |
Does not match expected pattern ###-###-###-XX |
| Phone number length | phone_number |
Fewer than 11 digits recorded |
Check output structure:
Each row in the checks table contains:
- Metadata:
page_id,entry_url,entry_submission_date,row_id - Check description: plain-language description of the issue found
- Check tags: field name(s) involved, check category
- Severity: (where applicable)
💎 Refinery Layer
Refinery models are mart-ready tables placed in the marts/ directory. They are the only models directly accessible in Metabase. They consume both the clean model and the checks model, enriching them with any aggregation, date grouping, or location metadata needed for dashboard construction.
refinery_clean — analytical dataset
Adds to the clean model:
- Stable date grouping columns for Metabase filtering:
visit_month,visit_year,visit_quarter(derived fromdate_visit_clean; falls back toentry_submission_datewhen visit date is invalid) - Modality label decoded from the
testing_settingcode using the legend printed on the physical form (1 = Emergency, 2 = Index, … 14 = PMTCT Post ANC1) - Location metadata joins (facility name, LGA, state) where a master facility list is available
- All derived flags from the clean layer are forwarded unchanged
refinery_checks — DQA dashboard dataset
Adds to the checks model:
- Same stable date grouping columns for trend analysis of data quality over time
- Check counts aggregated by
page_idto enable record-level DQA scores - Location metadata for geographic breakdowns of data quality
Data Flow Summary
| Pipeline Layer Summary — HTS 001 Client Intake | ||||
| Layer | Input | Key Operations | Output Consumers | Row grain |
|---|---|---|---|---|
| 📥 Base | ScanForm externally materialised export | Column selection, snake_case rename | Pre-clean layer | 1 row per ScanForm entry |
| 🔧 Pre-Clean | Base model | Deduplication, discard filtering, empty record removal | Clean layer, Checks layer | 1 row per valid entry |
| ✅ Clean | Pre-clean model | Type casting, date parsing, oval decoding, derived flags | Refinery clean | 1 row per client intake |
| 🔍 Checks | Pre-clean model | Rule-based validation, cross-field consistency checks | Refinery checks | 1 row per data quality issue |
| 💎 Refinery | Clean model + Checks model | Date grouping, label decoding, location joins, aggregation | Metabase dashboards | 1 row per client intake (clean) / 1 row per issue (checks) |