flowchart LR A["📷 Scanned &\nsubmitted data"] B["📥 Base layer\n(raw text)"] C["🔧 Pre-clean\n(filter, deduplicate)"] D["✅ Clean layer\n(cast, derive)"] E["🔍 Checks layer\n(one row per failure)"] F["💎 Refinery checks\n(mart-ready DQA)"] G["📊 Metabase\nDQA dashboard"] H["📋 Data manager\nreview"] A --> B --> C C --> D C --> E E --> F --> G --> H
Data Quality Checks
How Pipeline DQA Works
Check Severity Levels
| Check Severity Levels | ||
| Severity | Meaning | Typical use cases |
|---|---|---|
| ⚠️ Warning | Record is flagged and appears in the DQA report, but is retained in the clean analytical dataset. The value is preserved as-is. A data manager may choose to investigate or correct it. | OCR uncertainty on non-critical fields; scores that are plausible but inconsistent with individual items; optional fields with unexpected values. |
| 🔴 Error | Record is flagged AND excluded from the clean analytical dataset. It will not appear in counts, indicators, or dashboard aggregations until the issue is resolved. | Unparseable dates on mandatory date fields; critical identifiers that cannot be decoded; records where core mandatory fields are entirely missing. |
Check Catalogue
The checks below are organised by check macro type. Each section explains what the check does, which fields it applies to, what failure looks like, and what the recommended action is.
🔢 Integer Conversion Checks (batch_check_convert_to_int)
These checks verify that OCR text values in digit-box fields can be parsed as integers. A failure means the OCR returned a value containing non-digit characters — typically an underscore (uncertain OCR), a letter misread as a digit, or a blank — that cannot be converted to a number for analysis.
Severity: ⚠️ Warning — record retained in clean dataset; field value stored as NULL.
| Integer Conversion Check Fields | ||||
| Variable | Field Label | Boxes | Expected range | Failure example |
|---|---|---|---|---|
| age | Age (years) | 3 | 1–120 | "_2_" (OCR uncertain) |
| numWives | No. of Wives/Co-wives | 2 | 0–99 | "O2" (letter O misread) |
| numChildren | No. of own children <5 years | 2 | 0–99 | "-1" (negative value) |
| recency | Recency Number | 4 | Any 4-digit integer | "123_" (partial read) |
| phoneNumber | Client Telephone Number | 11 | 11-digit Nigerian number | "0801234567_" (11th digit uncertain) |
| postalCode | Postal Code | 7 | Any integer | "AB123" (letters present) |
| knowledgeAssessment | Knowledge Assessment Score | 1 | 0–6 | "_" (entirely blank) |
| riskAssessment | Personal HIV Risk Assessment Score | 1 | 0–9 | "_" (entirely blank) |
| tbScreening | TB Screening Score | 1 | 0–5 | "6" (exceeds max of 5) |
| stiScreening | STI Screening Score | 1 | 0–5 | "6" (exceeds max of 5) |
| sexPartnerRiskAssessment | Sex Partner Risk Assessment Score | 1 | 0–7 | "8" (exceeds max of 7) |
| condomProvidedToClientCount | How many condoms provided | 2 | 0–99 | "__" (both boxes uncertain) |
| lubricantProvidedToClientCount | How many lubricants provided | 2 | 0–99 | "__" (both boxes uncertain) |
| cd4FlowCytometry | CD4 Flow Cytometry (cells/m3) | 4 | 0–9999 | "____" (all boxes uncertain) |
📅 Date Conversion Checks (batch_check_convert_to_date)
These checks verify that date fields can be parsed into valid calendar dates in dd/mm/yyyy format. Failures include OCR uncertainty in any part of the date, invalid day/month combinations (e.g. 31 February), and dates outside a plausible range for this programme.
Severity: ⚠️ Warning for partial reads; 🔴 Error for dateVisit when the date is entirely unreadable.
| Date Conversion Check Fields | ||||||
| Variable | Field Label | Format | OCR models | Pre-printed | Check severity | Failure examples |
|---|---|---|---|---|---|---|
| dateVisit | Date of Visit | dd/mm/yyyy | date_dd, date_mm, two_digits_20, date_nearby_yy | Year century '20' | 🔴 Error if fully blank; ⚠️ Warning if partial | "__/05/2024" (day uncertain); "31/02/2024" (invalid date); date in the future |
| endDate | Date (completion) | dd/mm/yyyy | date_dd, date_mm, two_digits_20, date_nearby_yy | Year century '20' | ⚠️ Warning | "21/__/2024" (month uncertain); date before 2015 |
☑️ Multiple-Selection Checks (batch_check_melt_columns)
For select-one questions, ScanForm exports each answer option as a separate boolean column. The melt check detects records where more than one option column is set to 1 for the same question — meaning the clinician accidentally marked multiple bubbles, or OCR misread shading as a filled bubble.
Severity: ⚠️ Warning — the record is retained but the field value is set to NULL in the clean dataset, since the intended answer is ambiguous.
| Multiple-Selection Check Fields | |||
Page 1 fields
|
|||
|---|---|---|---|
| Variable | Field Label | Options (exported columns) | Page1 |
| referredFrom | Referred From | self, tb, sti, fp, opd, ward, blood_bank, other | P1 |
| testingSetting | Setting | ct, tb, sti, fp, opd, ward, outreach, standalone_hts, other | P1 |
| modality | Modality | 1–14 (numeric code) | P1 |
| sex | Sex | male, female | P1 |
| firstTimeVisit | First Time Visit | yes, no | P1 |
| maritalStatus | Marital Status | married, divorced, widowed, separated, single | P1 |
| educationId | Education Level | none, primary_school, junior_secondary, higher_secondary, post_secondary, quranic | P1 |
| employmentStatusId | Employment Status | employed, unemployed, retired, freelance, student | P1 |
| typeConsueling | Type of Session | individual, couple, group, previously_self_tested | P1 |
| previouslyTested | Previously tested within last 3 months | yes, no | P1 |
| indexClient | Index Testing: Is client from an index client? | yes, no | P1 |
| relationWithIndexClient | Relation with index client | biological, sexual, social | P1 |
| pregnant | Client is Pregnant | yes, no | P1 |
| breastFeedingUnder6 | Client breastfeeding < 6 months | yes, no | P1 |
| breastFeedingOver6 | Client breastfeeding > 6 months | yes, no | P1 |
| previousTestedHIVNegative | Previously tested HIV negative | 0, 1 | P1 |
| timeLastHIVNegativeTestResult | Time of last HIV Negative test Results | 1mth, 13mth, 46mths, 6mths | P1 |
| everHadSexualIntercourse | Ever had sexual intercourse | 0, 1 | P1 |
| hivTestResult | HIV Test Result | negative, positive | P2 |
| hivTestBefore | Have you been tested for HIV before this year? | not_previously_tested, previously_negative, previously_positive_hiv_care, previously_positive_not_hiv_care | P2 |
| recencyTest | Recency test with RTRI | recent, long_term, negative, invalid | P2 |
| cd4SemiQuantitative | CD4 Test Result — Semi-Quantitative | lessThan200, over200 | P2 |
| syphilisTestResult | Syphilis Test Result | non_reactive, reactive | P2 |
| hepatitisBTest | Hepatitis B Virus Test Result | negative, positive | P2 |
| hepatitisCTest | Hepatitis C Virus Test Result | negative, positive | P2 |
| 1 P1 = Page 1 (left page); P2 = Page 2 (right page) of the two-page photopack. | |||
Expected Additional Checks
Beyond the three standard macro-based check types above, the following custom cross-field and range checks are expected for this form based on its clinical logic. These would be implemented as individual CTEs in the checks SQL model.
| Expected Custom Cross-Field and Range Checks | |||
| Check name | Fields involved | Rule | Severity |
|---|---|---|---|
| knowledge_score_range | knowledgeAssessment | Score must be between 0 and 6 | ⚠️ Warning |
| risk_score_range | riskAssessment | Score must be between 0 and 9 | ⚠️ Warning |
| tb_score_range | tbScreening | Score must be between 0 and 5 | ⚠️ Warning |
| sti_score_range | stiScreening | Score must be between 0 and 5 | ⚠️ Warning |
| sex_partner_score_range | sexPartnerRiskAssessment | Score must be between 0 and 7 | ⚠️ Warning |
| knowledge_score_consistency | knowledgeAssessment + 6 item fields | Recorded score must equal the count of Yes (1) responses across the 6 knowledge items | ⚠️ Warning |
| risk_score_consistency | riskAssessment + 9 item fields | Recorded score must equal the count of Yes (1) responses across the 9 risk items | ⚠️ Warning |
| tb_score_consistency | tbScreening + 5 item fields | Recorded score must equal the count of Yes (1) responses across the 5 TB items | ⚠️ Warning |
| sti_score_consistency | stiScreening + 5 item fields | Recorded score must equal the count of Yes (1) responses across the 5 STI items | ⚠️ Warning |
| sex_partner_score_consistency | sexPartnerRiskAssessment + 7 item fields | Recorded score must equal the count of Yes (1) responses across the 7 sex partner risk items | ⚠️ Warning |
| recency_without_positive_hiv | recencyTest + hivTestResult | recencyTest must be blank if hivTestResult = Negative | ⚠️ Warning |
| cd4_without_positive_hiv | cd4SemiQuantitative / cd4FlowCytometry + hivTestResult | CD4 fields must be blank if hivTestResult = Negative | ⚠️ Warning |
| breastfeeding_contradiction | breastFeedingUnder6 + breastFeedingOver6 | Both breastfeeding fields cannot simultaneously be Yes | ⚠️ Warning |
| index_code_both_filled | indexClientCodeNew + indexClientCodeOld | indexClientCodeNew and indexClientCodeOld cannot both be filled — only one should be used | ⚠️ Warning |
| date_visit_in_future | dateVisit | Parsed date of visit must not be after today's date | 🔴 Error |
| date_visit_before_programme | dateVisit | Parsed date of visit must not be before 2015-01-01 | ⚠️ Warning |
| phone_number_length | phoneNumber | Phone number must be exactly 11 digits when all boxes are filled | ⚠️ Warning |
| age_range | age | Age must be between 1 and 120 when parseable | ⚠️ Warning |
Check Output Structure
Every failing check produces one row in the checks table. Passing records produce no rows. The checks table is a union of all individual CTE results.
| Checks Table — Output Column Structure | ||
| Column | Type | Description |
|---|---|---|
| page_id | text | Unique page identifier from the ScanForm Data Matrix — used to locate the physical form |
| entry_url | text | Direct link to the ScanForm verification interface entry for this record |
| entry_submission_date | timestamp | Timestamp when the photo was submitted to the server |
| row_id | text | Row identifier within the export (for line-listing forms with multiple records per page) |
| check_name | text | Machine-readable check identifier (e.g. tb_score_range) |
| check_description | text | Plain-language description of the issue, suitable for display to a data manager |
| severity | text | Warning or Error — determines whether the record is excluded from the clean dataset |
| field_name | text | The specific variable that triggered this check |
| field_value | text | The raw OCR value (before casting) that failed the check |
| check_tags | text[] | Array of tags for filtering in the DQA dashboard (e.g. ['date', 'page1'], ['score', 'tb']) |
Summary
| Pipeline DQA Check Summary — HTS 001 Client Intake | |
| Metric | Value |
|---|---|
| Macro-based integer conversion checks (fields) | 14 |
| Macro-based date conversion checks (fields) | 2 |
| Macro-based multiple-selection checks (questions) | 25 |
| Expected custom range checks | 5 |
| Expected custom score consistency checks | 5 |
| Expected custom cross-field / conditional checks | 8 |
| Total expected checks | 59 |
| Fields with Error-level severity | 1 (dateVisit — fully unreadable) |
| Fields with Warning-level severity | 58 |
| SQL models provided | 0 — not yet implemented |