Data Quality Checks

How Pipeline DQA Works

Form: NG-HTS 001 Client Intake  |  Version: v1.6  |  Country: Nigeria

This page documents the server-side data quality checks applied after OCR data has been submitted and processed through the dbt pipeline. These checks are complementary to in-phone validation: they operate on the full processed dataset, apply more sophisticated cross-field logic, and produce a structured audit trail used for data quality reporting and review.

No dbt SQL check models were provided for this form. The content below describes the standard check architecture and the checks that would be expected for this form based on its field definitions, OCR models, and clinical logic. This page will be updated automatically when SQL model files are added to the repository.

flowchart LR
  A["📷 Scanned &\nsubmitted data"]
  B["📥 Base layer\n(raw text)"] 
  C["🔧 Pre-clean\n(filter, deduplicate)"]
  D["✅ Clean layer\n(cast, derive)"]
  E["🔍 Checks layer\n(one row per failure)"]
  F["💎 Refinery checks\n(mart-ready DQA)"]
  G["📊 Metabase\nDQA dashboard"]
  H["📋 Data manager\nreview"]

  A --> B --> C
  C --> D
  C --> E
  E --> F --> G --> H

TipIn-phone vs pipeline checks — what is the difference?
Dimension 📱 In-phone validation 🔍 Pipeline DQA checks
When At moment of photo capture After server-side OCR processing
What OCR sees Mark / no mark only (lightweight) Full character-level OCR values
Scope Single record, single photo Entire dataset, all submissions
Effect on worker Immediate alert; must correct before submitting No immediate effect; flagged for data manager review
Effect on data Prevents bad data entering the system Flags or excludes records from clean dataset
Logic available Field presence only Value ranges, cross-field rules, date validity, score consistency

Check Severity Levels

Table 1
Check Severity Levels
Severity Meaning Typical use cases
⚠️ Warning Record is flagged and appears in the DQA report, but is retained in the clean analytical dataset. The value is preserved as-is. A data manager may choose to investigate or correct it. OCR uncertainty on non-critical fields; scores that are plausible but inconsistent with individual items; optional fields with unexpected values.
🔴 Error Record is flagged AND excluded from the clean analytical dataset. It will not appear in counts, indicators, or dashboard aggregations until the issue is resolved. Unparseable dates on mandatory date fields; critical identifiers that cannot be decoded; records where core mandatory fields are entirely missing.

Check Catalogue

The checks below are organised by check macro type. Each section explains what the check does, which fields it applies to, what failure looks like, and what the recommended action is.

🔢 Integer Conversion Checks (batch_check_convert_to_int)

These checks verify that OCR text values in digit-box fields can be parsed as integers. A failure means the OCR returned a value containing non-digit characters — typically an underscore (uncertain OCR), a letter misread as a digit, or a blank — that cannot be converted to a number for analysis.

Severity: ⚠️ Warning — record retained in clean dataset; field value stored as NULL.

Table 2
Integer Conversion Check Fields
Variable Field Label Boxes Expected range Failure example
age Age (years) 3 1–120 "_2_" (OCR uncertain)
numWives No. of Wives/Co-wives 2 0–99 "O2" (letter O misread)
numChildren No. of own children <5 years 2 0–99 "-1" (negative value)
recency Recency Number 4 Any 4-digit integer "123_" (partial read)
phoneNumber Client Telephone Number 11 11-digit Nigerian number "0801234567_" (11th digit uncertain)
postalCode Postal Code 7 Any integer "AB123" (letters present)
knowledgeAssessment Knowledge Assessment Score 1 0–6 "_" (entirely blank)
riskAssessment Personal HIV Risk Assessment Score 1 0–9 "_" (entirely blank)
tbScreening TB Screening Score 1 0–5 "6" (exceeds max of 5)
stiScreening STI Screening Score 1 0–5 "6" (exceeds max of 5)
sexPartnerRiskAssessment Sex Partner Risk Assessment Score 1 0–7 "8" (exceeds max of 7)
condomProvidedToClientCount How many condoms provided 2 0–99 "__" (both boxes uncertain)
lubricantProvidedToClientCount How many lubricants provided 2 0–99 "__" (both boxes uncertain)
cd4FlowCytometry CD4 Flow Cytometry (cells/m3) 4 0–9999 "____" (all boxes uncertain)

📅 Date Conversion Checks (batch_check_convert_to_date)

These checks verify that date fields can be parsed into valid calendar dates in dd/mm/yyyy format. Failures include OCR uncertainty in any part of the date, invalid day/month combinations (e.g. 31 February), and dates outside a plausible range for this programme.

Severity: ⚠️ Warning for partial reads; 🔴 Error for dateVisit when the date is entirely unreadable.

Table 3
Date Conversion Check Fields
Variable Field Label Format OCR models Pre-printed Check severity Failure examples
dateVisit Date of Visit dd/mm/yyyy date_dd, date_mm, two_digits_20, date_nearby_yy Year century '20' 🔴 Error if fully blank; ⚠️ Warning if partial "__/05/2024" (day uncertain); "31/02/2024" (invalid date); date in the future
endDate Date (completion) dd/mm/yyyy date_dd, date_mm, two_digits_20, date_nearby_yy Year century '20' ⚠️ Warning "21/__/2024" (month uncertain); date before 2015
TipHow the two_digits_20 model interacts with date checks

Both date fields have the year century pre-printed as 20 on the physical form. The two_digits_20 OCR model always returns 20 for those two boxes regardless of what is written, so the century component is never uncertain. Only the decade and unit digits of the year (date_nearby_yy boxes) can produce uncertainty.


☑️ Multiple-Selection Checks (batch_check_melt_columns)

For select-one questions, ScanForm exports each answer option as a separate boolean column. The melt check detects records where more than one option column is set to 1 for the same question — meaning the clinician accidentally marked multiple bubbles, or OCR misread shading as a filled bubble.

Severity: ⚠️ Warning — the record is retained but the field value is set to NULL in the clean dataset, since the intended answer is ambiguous.

Note that select_one_or_zero OCR model already enforces a maximum of one filled bubble during the OCR recognition phase. The melt check provides a second layer of defence at the pipeline level, catching any cases that passed through OCR recognition.

Table 4
Multiple-Selection Check Fields
Page 1 fields
Variable Field Label Options (exported columns) Page1
referredFrom Referred From self, tb, sti, fp, opd, ward, blood_bank, other P1
testingSetting Setting ct, tb, sti, fp, opd, ward, outreach, standalone_hts, other P1
modality Modality 1–14 (numeric code) P1
sex Sex male, female P1
firstTimeVisit First Time Visit yes, no P1
maritalStatus Marital Status married, divorced, widowed, separated, single P1
educationId Education Level none, primary_school, junior_secondary, higher_secondary, post_secondary, quranic P1
employmentStatusId Employment Status employed, unemployed, retired, freelance, student P1
typeConsueling Type of Session individual, couple, group, previously_self_tested P1
previouslyTested Previously tested within last 3 months yes, no P1
indexClient Index Testing: Is client from an index client? yes, no P1
relationWithIndexClient Relation with index client biological, sexual, social P1
pregnant Client is Pregnant yes, no P1
breastFeedingUnder6 Client breastfeeding < 6 months yes, no P1
breastFeedingOver6 Client breastfeeding > 6 months yes, no P1
previousTestedHIVNegative Previously tested HIV negative 0, 1 P1
timeLastHIVNegativeTestResult Time of last HIV Negative test Results 1mth, 13mth, 46mths, 6mths P1
everHadSexualIntercourse Ever had sexual intercourse 0, 1 P1
hivTestResult HIV Test Result negative, positive P2
hivTestBefore Have you been tested for HIV before this year? not_previously_tested, previously_negative, previously_positive_hiv_care, previously_positive_not_hiv_care P2
recencyTest Recency test with RTRI recent, long_term, negative, invalid P2
cd4SemiQuantitative CD4 Test Result — Semi-Quantitative lessThan200, over200 P2
syphilisTestResult Syphilis Test Result non_reactive, reactive P2
hepatitisBTest Hepatitis B Virus Test Result negative, positive P2
hepatitisCTest Hepatitis C Virus Test Result negative, positive P2
1 P1 = Page 1 (left page); P2 = Page 2 (right page) of the two-page photopack.

Expected Additional Checks

Beyond the three standard macro-based check types above, the following custom cross-field and range checks are expected for this form based on its clinical logic. These would be implemented as individual CTEs in the checks SQL model.

Table 5
Expected Custom Cross-Field and Range Checks
Check name Fields involved Rule Severity
knowledge_score_range knowledgeAssessment Score must be between 0 and 6 ⚠️ Warning
risk_score_range riskAssessment Score must be between 0 and 9 ⚠️ Warning
tb_score_range tbScreening Score must be between 0 and 5 ⚠️ Warning
sti_score_range stiScreening Score must be between 0 and 5 ⚠️ Warning
sex_partner_score_range sexPartnerRiskAssessment Score must be between 0 and 7 ⚠️ Warning
knowledge_score_consistency knowledgeAssessment + 6 item fields Recorded score must equal the count of Yes (1) responses across the 6 knowledge items ⚠️ Warning
risk_score_consistency riskAssessment + 9 item fields Recorded score must equal the count of Yes (1) responses across the 9 risk items ⚠️ Warning
tb_score_consistency tbScreening + 5 item fields Recorded score must equal the count of Yes (1) responses across the 5 TB items ⚠️ Warning
sti_score_consistency stiScreening + 5 item fields Recorded score must equal the count of Yes (1) responses across the 5 STI items ⚠️ Warning
sex_partner_score_consistency sexPartnerRiskAssessment + 7 item fields Recorded score must equal the count of Yes (1) responses across the 7 sex partner risk items ⚠️ Warning
recency_without_positive_hiv recencyTest + hivTestResult recencyTest must be blank if hivTestResult = Negative ⚠️ Warning
cd4_without_positive_hiv cd4SemiQuantitative / cd4FlowCytometry + hivTestResult CD4 fields must be blank if hivTestResult = Negative ⚠️ Warning
breastfeeding_contradiction breastFeedingUnder6 + breastFeedingOver6 Both breastfeeding fields cannot simultaneously be Yes ⚠️ Warning
index_code_both_filled indexClientCodeNew + indexClientCodeOld indexClientCodeNew and indexClientCodeOld cannot both be filled — only one should be used ⚠️ Warning
date_visit_in_future dateVisit Parsed date of visit must not be after today's date 🔴 Error
date_visit_before_programme dateVisit Parsed date of visit must not be before 2015-01-01 ⚠️ Warning
phone_number_length phoneNumber Phone number must be exactly 11 digits when all boxes are filled ⚠️ Warning
age_range age Age must be between 1 and 120 when parseable ⚠️ Warning

Check Output Structure

Every failing check produces one row in the checks table. Passing records produce no rows. The checks table is a union of all individual CTE results.

Table 6
Checks Table — Output Column Structure
Column Type Description
page_id text Unique page identifier from the ScanForm Data Matrix — used to locate the physical form
entry_url text Direct link to the ScanForm verification interface entry for this record
entry_submission_date timestamp Timestamp when the photo was submitted to the server
row_id text Row identifier within the export (for line-listing forms with multiple records per page)
check_name text Machine-readable check identifier (e.g. tb_score_range)
check_description text Plain-language description of the issue, suitable for display to a data manager
severity text Warning or Error — determines whether the record is excluded from the clean dataset
field_name text The specific variable that triggered this check
field_value text The raw OCR value (before casting) that failed the check
check_tags text[] Array of tags for filtering in the DQA dashboard (e.g. ['date', 'page1'], ['score', 'tb'])

Summary

Figure 1: Expected pipeline check coverage by category
Table 7
Pipeline DQA Check Summary — HTS 001 Client Intake
Metric Value
Macro-based integer conversion checks (fields) 14
Macro-based date conversion checks (fields) 2
Macro-based multiple-selection checks (questions) 25
Expected custom range checks 5
Expected custom score consistency checks 5
Expected custom cross-field / conditional checks 8
Total expected checks 59
Fields with Error-level severity 1 (dateVisit — fully unreadable)
Fields with Warning-level severity 58
SQL models provided 0 — not yet implemented

ImportantScore consistency checks require caution

The five assessment scores (knowledgeAssessment, riskAssessment, tbScreening, stiScreening, sexPartnerRiskAssessment) are handwritten sums entered by the clinician at the time of the consultation. It is common for these to differ from the automated sum of individual item responses, either due to arithmetic errors or because the clinician counted items differently. The pipeline retains the clinician-recorded score as the primary value and raises a Warning — it does not overwrite the recorded score with the computed sum.

TipHow to act on DQA flags
  • 🔴 Error records are excluded from indicator calculations automatically. A data manager should review the physical form (use the entry_url link to open the scan in ScanForm) and either correct the OCR value through manual verification or mark the record as permanently invalid.
  • ⚠️ Warning records remain in the clean dataset. Review is recommended for high-volume Warning categories (e.g. score inconsistencies) to assess whether a systematic training issue exists at a facility.