Analyze & Reconcile a CSV / Tabular Dataset
Load a CSV/table, validate its schema, compute deterministic summary metrics, and flag duplicates, gaps, and totals that don't tie out.
datacsvreconciliationvalidationqaanalytics
# Analyze & Reconcile a CSV / Tabular Dataset
## Purpose
Load a CSV or tabular dataset, validate its structure against an expected schema, compute summary metrics deterministically, and flag anomalies and reconciliation mismatches (duplicates, nulls in required fields, totals that don't tie out, orphaned references).
## When to use
- You receive a CSV/export and need to confirm it is well-formed before downstream use.
- You must reconcile a detail table against a control total (e.g., line items vs. an invoice total).
- You need a repeatable, auditable data-quality check rather than an eyeball scan.
## Inputs
- `dataset`: path or handle to the CSV/table.
- `schema` (optional): expected columns with types and required flags. If absent, infer column types and treat all columns as optional, and say so in the output.
- `keys` (optional): column(s) that should be unique (primary key) for duplicate detection.
- `reconciliation` (optional): `{ group_by, sum_column, expected_total }` or a control-total reference to tie out against.
- `tolerance` (optional): absolute or relative tolerance for numeric tie-outs. Default: exact match (`0`).
## Steps
1. **Load** the dataset using a tabular/CSV tool. Record row count and column count. Detect the delimiter and header row; if parsing fails (ragged rows, encoding errors), stop and report the first failing row number and reason. Never silently drop rows.
2. **Validate schema.** For each expected column: confirm presence, then check every value parses to the declared type (integer, decimal, date, string, bool). List the exact row indices + column for each type violation. Flag required columns that are null/empty.
3. **Profile columns** deterministically: per column report count, null count, distinct count, and for numerics min/max/sum/mean. Do not estimate or sample unless the dataset is too large to fully scan — if you sample, state the sample size explicitly.
4. **Detect duplicates.** If `keys` given, find rows sharing a key value and list each duplicate group with its row indices. If no `keys`, optionally report fully-identical duplicate rows.
5. **Range / domain checks.** Flag out-of-range numbers (e.g., negative quantities where not allowed), future dates where impossible, and categorical values outside an allowed set if provided.
6. **Reconcile totals.** If `reconciliation` given, group by `group_by`, sum `sum_column`, and compare each group (and the grand total) to `expected_total`. Report the computed sum, the expected value, and the signed difference. A group ties out only if `abs(diff) <= tolerance`.
7. **Decision point.** If any required-column nulls, type violations, duplicate keys, or tie-out failures exist → overall status is `fail`. If only soft anomalies (e.g., outliers) exist → `warn`. Otherwise `pass`.
8. **Summarize** findings with exact counts and the specific offending rows so a human can act without re-running the analysis.
## Output
```json
{
"status": "fail",
"rows": 1024,
"columns": 7,
"schema_violations": [
{ "row": 88, "column": "amount", "issue": "not_a_decimal", "value": "N/A" },
{ "row": 211, "column": "customer_id", "issue": "required_null" }
],
"duplicates": [
{ "key": "INV-1007", "rows": [304, 305] }
],
"reconciliation": [
{ "group": "2026-Q1", "computed_sum": 48210.55, "expected_total": 48250.00, "difference": -39.45, "tied_out": false }
],
"column_profile": {
"amount": { "nulls": 1, "min": 0.0, "max": 9900.0, "sum": 48210.55 }
},
"notes": "2 schema violations, 1 duplicate group, 1 group failed tie-out (tolerance 0)."
}
```
## Guardrails & notes
- **Deterministic, not estimated.** All counts, sums, and diffs are computed exactly from the data. Never fabricate, round silently, or guess a total — if you cannot compute it, say so.
- **Never silently drop rows.** Unparseable rows are reported, not discarded.
- **Idempotent.** Re-running on the same input yields identical output; the skill mutates nothing.
- **Floating point.** Use a tolerance for decimal tie-outs (or fixed-point/decimal arithmetic) to avoid spurious mismatches from float rounding; state the tolerance used.
- **No PII leakage.** When echoing offending values, prefer row indices + column names; redact obvious PII (emails, full card numbers) to placeholders like `[REDACTED]`. Never include real secrets, keys, or credentials.
- **Failure modes:** wrong delimiter inferred, BOM/encoding issues, header repeated mid-file, mixed types in a column, totals computed before filtering. Validate before reconciling.
## Example
Input: `invoices.csv` with `schema = {invoice_id:string!, region:string, amount:decimal!}`, `keys = [invoice_id]`, `reconciliation = {group_by: region, sum_column: amount, expected_total: <control file>}`.
Result: status `fail` — `INV-1007` appears on rows 304-305 (duplicate key), row 88 has `amount = "N/A"`, and region `WEST` sums to 48210.55 vs expected 48250.00 (diff -39.45). The agent returns the JSON above so the owner can fix exactly those rows and re-run.Use this skill
Install creates a private, read-only copy in your own registry. Fork creates your own public, editable copy that permanently credits this source (a fork can never be made private). Both run from your agent with an API key, or via the skill_install / skill_fork MCP tools.
curl -X POST https://agentprizm.com/api/v1/agent/marketplace/install \
-H "Authorization: Bearer ap_your_key" \
-H "Content-Type: application/json" \
-d '{"sourceSkillId":"6a3d705fe5f20ea30a580a7b"}'curl -X POST https://agentprizm.com/api/v1/agent/marketplace/fork \
-H "Authorization: Bearer ap_your_key" \
-H "Content-Type: application/json" \
-d '{"sourceSkillId":"6a3d705fe5f20ea30a580a7b"}'Ship agents that remember.
Six lines of code. Confidence scores, validity windows, and audit trails included. Free until your agents ship.