Data Pipeline

Pipeline Overview

Form: NG-HTS 001 Client Intake  |  Version: v1.6  |  Country: Nigeria

This page describes how raw ScanForm scan data is transformed into analysis-ready tables through a series of dbt model layers. No dbt SQL models were provided for this form at time of documentation generation — the pipeline architecture below reflects the standard ScanForm fast analytics pipeline structure that applies to all forms on the platform.

flowchart LR
  subgraph source["📷 Source"]
    S1["ScanForm\nExport API"]
  end

  subgraph base["📥 Base"]
    B1["base\n(raw export columns)"]
  end

  subgraph pre_clean["🔧 Pre-Clean"]
    P1["pre_clean\n(link pages, filter)"]
  end

  subgraph clean["✅ Clean"]
    C1["clean\n(cast, impute, reshape)"]
  end

  subgraph checks["🔍 Checks"]
    CK1["checks\n(validation rules)"]
  end

  subgraph refinery["💎 Refinery"]
    R1["refinery_clean\n(mart-ready data)"]
    R2["refinery_checks\n(mart-ready DQA)"]
  end

  source --> base
  base --> pre_clean
  pre_clean --> clean
  pre_clean --> checks
  clean --> refinery
  checks --> refinery

No dbt models were provided for this form. The sections below describe the expected pipeline structure and transformation logic inferred from the form design (XLSForm, banding, and export configuration). This documentation will be updated automatically when dbt model files are added to the repository.


Pipeline Layers

The HTS 001 Client Intake data pipeline follows the standard QED fast analytics architecture — a sequence of dbt model layers, each with a well-defined responsibility. Data flows from the ScanForm export API through five transformation stages before reaching Metabase dashboards.

📥 Base Layer

The base model is a thin wrapper over the externally materialised ScanForm export. It selects all columns produced by the export configuration and applies minimal renaming to ensure consistent snake_case column names throughout the pipeline.

What happens here:

  • All columns arrive as raw text strings — no casting, no cleaning
  • Underscores from incomplete or uncertain OCR are preserved (e.g. _12_ rather than 12)
  • Both Page 1 and Page 2 columns are present in the same export row for this photopack form
  • Administrative columns included: page_id, entry_url, entry_submission_date, entry_submitted_by, row_id

Key columns arriving from the ScanForm export:

Column group Example raw values Notes
Date of Visit 21/05/24, __/05/24 Underscores indicate OCR uncertainty
Client Code 123-456-789-AB Pre-printed; validated against expected format
All oval fields 0, 1, _ 0 = No, 1 = Yes, underscore = uncertain
Score fields 3, _, 6 Integer sum of section responses
Text box fields JOHN, JO_N Underscores mark uncertain characters
Handwriting fields (absent) Handwriting zones are not OCR-processed

🔧 Pre-Clean Layer

The pre-clean model is the staging and filtering layer. It applies record-level inclusion rules before any analytical transformations occur, ensuring that downstream clean and checks models operate on a consistent, well-scoped dataset.

What happens here:

  • Duplicate exclusion: Only entries with newest_entry_status IN ('NEWEST', 'DISABLED') are retained, removing superseded re-scans of the same page
  • Issue filtering: Entries with scan processing failures are excluded
  • Discarded record exclusion: Rows where discard_page = 1 (i.e. the clinician has crossed the discard oval on the physical form) are dropped
  • Empty record exclusion: Rows where a minimum proportion of key fields are blank (indicating a page was scanned but no client data was entered) may be filtered
  • Column pruning: Administrative columns not needed downstream (e.g. photo_taken) are removed

This is the last point at which record inclusion logic is applied. Both the clean model and the checks model are built from the same pre-clean base, ensuring totals are consistent between the analytical dataset and the data quality report.


✅ Clean Layer

The clean model produces the analysis-ready dataset — one row per client intake encounter, with all fields correctly typed, reshaped, and labelled.

Key transformations applied:

Date parsing

Raw column Transformation Output column
date_visit Strip underscores, parse dd/mm/yyyy, cast to DATE date_visit_clean
end_date Same as above end_date_clean
Submission date Used as fallback if date_visit is invalid or missing date_fallback

Numeric casting

All digit-box fields arrive as strings with possible leading/trailing underscores (e.g. _2_). The clean layer applies numberFullyStripped-style logic:

  • Remove surrounding underscores
  • Cast to integer or float
  • Retain NULL for values that cannot be parsed (do not impute)

Affected fields: age, num_wives, num_children, recency, phone_number, knowledge_assessment, risk_assessment, tb_screening, sti_screening, sex_partner_risk_assessment, condom_count, lubricant_count, cd4_flow_cytometry

Oval decoding

All select_one_or_zero fields arrive as "0", "1", or "_". The clean layer casts these to:

  • 0 → integer 0 (No)
  • 1 → integer 1 (Yes)
  • "_"NULL (uncertain / not verified)

Derived / business logic fields

Derived field Logic Purpose
is_index_client index_client = 1 Flag for index testing sub-population
is_hiv_positive hiv_test_result = 1 Gate for recency, CD4, and hepatitis follow-up
is_first_visit first_time_visit = 1 Distinguishes new vs. returning clients
knowledge_score_complete knowledge_assessment IS NOT NULL Whether the 6-item score was fully recorded
risk_score_complete risk_assessment IS NOT NULL Whether the 9-item score was fully recorded
refer_for_prep prep_referred = 1 PrEP linkage indicator
referred_for_tb tb_screening >= 1 TB referral flag per clinical rule
referred_for_sti sti_screening >= 1 STI referral flag per clinical rule
recommend_retest hiv_test_result = 0 AND (risk_assessment >= 1 OR sti_screening >= 1) 3-month retest recommendation flag

Conditional field handling

Fields that are conditionally printed on the paper form are retained as-is in the clean model but annotated with a companion validity flag:

Field Condition Validity flag
relation_with_index_client is_index_client = TRUE relation_valid
recency_test is_hiv_positive = TRUE recency_valid
cd4_semi_quantitative is_hiv_positive = TRUE cd4_valid
cd4_flow_cytometry is_hiv_positive = TRUE cd4_valid

🔍 Checks Layer

The checks model is a data quality audit table — a union of many individual check result rows, each describing one data quality issue found in one record. It is built from the pre-clean model (not the clean model) so that checks operate on the least-transformed data possible.

Check categories expected for this form:

Check type Fields checked Description
Date validity date_visit, end_date Parsed date is within plausible range (e.g. not in the future, not before 2010)
Date completeness date_visit Day, month, or year component is missing (underscore)
Score range knowledge_assessment, risk_assessment, tb_screening, sti_screening, sex_partner_risk_assessment Score exceeds the maximum possible sum for its section
Score consistency All score fields Score does not match the count of 1 responses in constituent items
Select-one violation All select_one_or_zero fields More than one oval marked in a single-select group (caught by OCR model but checked in pipeline too)
Conditional completeness relation_with_index_client, index_client_code_new, index_client_code_old Field is blank when index_client = 1
Conditional completeness recency_test, cd4_semi_quantitative Field is blank when hiv_test_result = 1
Cross-field logic breastfeeding_under_6 + breastfeeding_over_6 Both marked Yes simultaneously
Cross-field logic hiv_test_result + recency_test Recency result recorded but HIV result is Negative
Client code format client_code Does not match expected pattern ###-###-###-XX
Phone number length phone_number Fewer than 11 digits recorded

Check output structure:

Each row in the checks table contains:

  • Metadata: page_id, entry_url, entry_submission_date, row_id
  • Check description: plain-language description of the issue found
  • Check tags: field name(s) involved, check category
  • Severity: (where applicable)

💎 Refinery Layer

Refinery models are mart-ready tables placed in the marts/ directory. They are the only models directly accessible in Metabase. They consume both the clean model and the checks model, enriching them with any aggregation, date grouping, or location metadata needed for dashboard construction.

refinery_clean — analytical dataset

Adds to the clean model:

  • Stable date grouping columns for Metabase filtering: visit_month, visit_year, visit_quarter (derived from date_visit_clean; falls back to entry_submission_date when visit date is invalid)
  • Modality label decoded from the testing_setting code using the legend printed on the physical form (1 = Emergency, 2 = Index, … 14 = PMTCT Post ANC1)
  • Location metadata joins (facility name, LGA, state) where a master facility list is available
  • All derived flags from the clean layer are forwarded unchanged

refinery_checks — DQA dashboard dataset

Adds to the checks model:

  • Same stable date grouping columns for trend analysis of data quality over time
  • Check counts aggregated by page_id to enable record-level DQA scores
  • Location metadata for geographic breakdowns of data quality

Data Flow Summary

Table 1
Pipeline Layer Summary — HTS 001 Client Intake
Layer Input Key Operations Output Consumers Row grain
📥 Base ScanForm externally materialised export Column selection, snake_case rename Pre-clean layer 1 row per ScanForm entry
🔧 Pre-Clean Base model Deduplication, discard filtering, empty record removal Clean layer, Checks layer 1 row per valid entry
✅ Clean Pre-clean model Type casting, date parsing, oval decoding, derived flags Refinery clean 1 row per client intake
🔍 Checks Pre-clean model Rule-based validation, cross-field consistency checks Refinery checks 1 row per data quality issue
💎 Refinery Clean model + Checks model Date grouping, label decoding, location joins, aggregation Metabase dashboards 1 row per client intake (clean) / 1 row per issue (checks)

Notes for Data Managers

TipInterpreting underscore values

In all base and pre-clean outputs, underscores (_) represent OCR uncertainty — a character that ScanForm could not read with confidence and that has not yet been verified by a human operator. Never treat an underscore as a zero or a space. In the clean layer these become NULL.

WarningExternally materialised exports cannot be updated in place

The ScanForm export feeding this pipeline is externally materialised. Any change to the export configuration requires creating a new versioned export (e.g. hts001_v2), backfilling all historical data, updating dashboard references, and only then retiring the previous version. This process can take several days for high-volume forms.

ImportantScore field consistency

The five assessment scores (knowledge_assessment, risk_assessment, tb_screening, sti_screening, sex_partner_risk_assessment) are handwritten sums entered by the clinician. The checks layer validates these against the individual item responses. Discrepancies are common and should be reviewed as part of routine data quality assurance — the pipeline retains the clinician-recorded score as the primary value and flags inconsistencies rather than overwriting them.