Video Data Extraction Examples: Five Real Schemas You Can Copy
Five copy-and-run schemas that turn spoken video into validated rows in a table.

Why schemas beat prompts
The traditional way to get structured data out of a video is to transcribe it, paste the transcript into an LLM, and prompt for JSON. It is also the brittle way: you spend a day iterating on "please return valid JSON" until it mostly works, and then the model drifts on the 400th run and your ingestion pipeline silently corrupts a day of data.
Schema-driven extraction removes the prompt engineering step. You describe the shape you want — types, enums, required fields, per-field descriptions — and the API returns JSON that is validated against that shape. Fields cannot drift. Enum values cannot become free text. Downstream code stays simple.
Below are five schemas that cover categories people actually build on top of video data. Each is a full round-trip: schema, API call, validated output, takeaway. Copy any of them, change the URL, and the rest is the same contract you would get from any typed API.
Example 1 — Product reviews into a price-and-verdict database
Source: Creator review videos (YouTube, TikTok)
You want to index hundreds of product-review videos as rows in a structured table — product name, price, verdict, pros, cons — so users can filter and compare without watching every review end-to-end.
Schema
{
"type": "object",
"required": ["product_name", "verdict"],
"properties": {
"product_name": { "type": "string" },
"price_usd": { "type": "number", "description": "MSRP the reviewer cites, if any" },
"verdict": { "type": "string", "enum": ["buy", "skip", "situational"] },
"pros": { "type": "array", "items": { "type": "string" }, "maxItems": 5 },
"cons": { "type": "array", "items": { "type": "string" }, "maxItems": 5 },
"timestamp_of_verdict": { "type": "number", "description": "seconds into the video where the verdict is stated" }
}
}
API call
curl -X POST \
https://api.vidnavigator.com/v1/extract/video \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"video_url": "https://www.youtube.com/watch?v=REVIEW_ID",
"schema": { /* schema above */ }
}'
Validated output
{
"status": "success",
"data": {
"product_name": "Sony WH-1000XM5",
"price_usd": 399,
"verdict": "buy",
"pros": [
"best-in-class noise cancelling",
"30-hour battery life",
"lightweight case"
],
"cons": [
"plastic build feels cheaper than XM4",
"no aptX Lossless support"
],
"timestamp_of_verdict": 742
}
}
Takeaway: Every row in the table is validated against your schema, so the verdict column is guaranteed to be one of buy / skip / situational — downstream filters and comparisons never break on a free-text variant.
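Because the verdict enum is guaranteed, downstream filters reduce to exact string comparisons. A minimal Python sketch; the sample rows are invented, but each dict has the same shape as the "data" object returned above:

```python
# Invented sample rows; in practice each dict is the "data" object
# returned by one extraction call.
rows = [
    {"product_name": "Sony WH-1000XM5", "price_usd": 399, "verdict": "buy"},
    {"product_name": "Budget Buds X", "price_usd": 129, "verdict": "skip"},
    {"product_name": "Studio Cans Y", "price_usd": 249, "verdict": "situational"},
]

def buys_under(rows, max_price):
    """Products the reviewer recommended at or below max_price."""
    return [
        r["product_name"]
        for r in rows
        # Exact match is safe: the schema guarantees the enum value.
        if r["verdict"] == "buy" and r.get("price_usd", float("inf")) <= max_price
    ]
```

No normalization pass, no fuzzy matching on "Buy!" vs "strong buy" — the enum constraint did that work at extraction time.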
Example 2 — Earnings calls into a structured highlights record
Source: Publicly posted earnings-call recordings (YouTube, company IR pages)
Equity-research teams re-listen to earnings calls for forward guidance, reaffirmed or revised numbers, and named product initiatives. The same 45-minute call summarized by five different analysts produces five inconsistent memos. A schema fixes that.
Schema
{
"type": "object",
"required": ["ticker", "fiscal_period"],
"properties": {
"ticker": { "type": "string" },
"fiscal_period": { "type": "string", "description": "e.g. Q1-2026" },
"revenue_guidance_usd_m": { "type": "number", "description": "Midpoint of forward revenue guidance in millions" },
"guidance_direction": { "type": "string", "enum": ["raised", "reaffirmed", "cut", "not_given"] },
"named_initiatives": {
"type": "array",
"items": {
"type": "object",
"required": ["name"],
"properties": {
"name": { "type": "string" },
"description": { "type": "string" },
"timestamp": { "type": "number" }
}
}
},
"risk_flags": { "type": "array", "items": { "type": "string" } }
}
}
API call
curl -X POST \
https://api.vidnavigator.com/v1/extract/video \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"video_url": "https://www.youtube.com/watch?v=CALL_ID",
"schema": { /* schema above */ }
}'
Validated output
{
"status": "success",
"data": {
"ticker": "ACME",
"fiscal_period": "Q1-2026",
"revenue_guidance_usd_m": 1250,
"guidance_direction": "raised",
"named_initiatives": [
{
"name": "Enterprise AI Platform",
"description": "Launching GA next quarter; $50m committed pipeline",
"timestamp": 1284
},
{
"name": "APAC expansion",
"description": "Two new country launches, sales team hired",
"timestamp": 1910
}
],
"risk_flags": [
"FX headwind on EU revenue",
"GPU capacity constraint flagged for H2"
]
}
}
Takeaway: The same schema applied to 40 calls a quarter gives you a consistent longitudinal dataset — compare guidance_direction across your portfolio with one SQL query instead of re-reading memos.
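The "one SQL query" claim is worth making concrete. A sketch using Python's built-in sqlite3, assuming each quarter's extracted record was inserted as one row; the sample data is invented:

```python
import sqlite3

# In-memory table standing in for your warehouse; one row per extracted call.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (ticker TEXT, fiscal_period TEXT, guidance_direction TEXT)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("ACME", "Q1-2026", "raised"),
        ("ACME", "Q4-2025", "reaffirmed"),
        ("GLOBX", "Q1-2026", "cut"),
    ],
)

# Who raised guidance this quarter? One query, because the enum is consistent.
raised = [
    row[0]
    for row in conn.execute(
        "SELECT ticker FROM calls WHERE fiscal_period = 'Q1-2026' "
        "AND guidance_direction = 'raised'"
    )
]
```

The query only works because guidance_direction can never be "hiked" or "up slightly" — the schema pinned it to four values before the row ever reached the table.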
Example 3 — Conference panels into topic-tagged moment cards
Source: Podcast interviews, conference panels, fireside chats
Panels and interviews are still valuable extraction sources even without diarization. Instead of trying to guarantee who said what, extract the highest-signal moments with a quote, a topic tag, and timestamps so editors and researchers can jump straight to the evidence and add attribution manually when needed.
Schema
{
"type": "object",
"required": ["panel_topic", "moments"],
"properties": {
"panel_topic": { "type": "string" },
"moments": {
"type": "array",
"items": {
"type": "object",
"required": ["quote", "topic_tag", "start_sec"],
"properties": {
"quote": { "type": "string", "description": "High-signal quote or moment from the panel" },
"topic_tag": { "type": "string", "description": "Short label for the topic discussed" },
"summary": { "type": "string", "description": "One-sentence explanation of why this moment matters" },
"start_sec": { "type": "number" },
"end_sec": { "type": "number" }
}
}
}
}
}
API call
curl -X POST \
https://api.vidnavigator.com/v1/extract/video \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"video_url": "https://www.youtube.com/watch?v=PANEL_ID",
"schema": { /* schema above */ }
}'
Validated output
{
"status": "success",
"data": {
"panel_topic": "Building agentic AI products in 2026",
"moments": [
{
"quote": "Everyone says agents, nobody defines them. What counts as an agent on your team?",
"topic_tag": "definitions",
"summary": "The panel opens by framing the disagreement around what should count as an agent.",
"start_sec": 62,
"end_sec": 71
},
{
"quote": "An agent is a loop with memory and the ability to call tools — that's the minimum bar.",
"topic_tag": "definitions",
"summary": "A concise working definition of an agent for product teams.",
"start_sec": 78,
"end_sec": 89
},
{
"quote": "We stopped calling them agents and started calling them workflows with conditional branches. Investors hated it but our customers got it.",
"topic_tag": "positioning",
"summary": "A practical product-marketing lesson about language and buyer understanding.",
"start_sec": 105,
"end_sec": 120
}
]
}
}
Takeaway: This pattern keeps the extraction grounded in what the transcript can support reliably today: the quote itself, the topic, and the exact moment in the video. It is ideal for newsletters, event recaps, research notes, and editorial workflows where a human can add speaker attribution later if needed.
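Those start_sec values turn directly into deep links for a recap. A sketch using YouTube's standard t= seek parameter; the sample moments mirror the validated output above:

```python
def deep_link(video_url, start_sec):
    """Append a seek offset so the link opens at the quoted moment."""
    sep = "&" if "?" in video_url else "?"
    return f"{video_url}{sep}t={int(start_sec)}s"

moments = [
    {"topic_tag": "definitions", "start_sec": 62},
    {"topic_tag": "positioning", "start_sec": 105},
]
links = [
    deep_link("https://www.youtube.com/watch?v=PANEL_ID", m["start_sec"])
    for m in moments
]
```

Each moment card in a newsletter or research note then links to the exact second of the evidence, not just the video.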
Example 4 — Cooking videos into step-by-step JSON recipes
Source: Food YouTube, TikTok recipe creators, Instagram cooking reels
Recipe aggregators, meal-planning apps, and grocery-list integrations all need a normalized recipe shape — title, ingredients with quantities, ordered steps, timings. A 6-minute video scraped into an unstructured transcript is useless; a validated recipe record is an asset.
Schema
{
"type": "object",
"required": ["title", "ingredients", "steps"],
"properties": {
"title": { "type": "string" },
"yield_servings": { "type": "number" },
"total_time_min": { "type": "number" },
"ingredients": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "quantity"],
"properties": {
"name": { "type": "string" },
"quantity": { "type": "string" },
"notes": { "type": "string" }
}
}
},
"steps": {
"type": "array",
"items": {
"type": "object",
"required": ["order", "instruction"],
"properties": {
"order": { "type": "integer" },
"instruction": { "type": "string" },
"duration_min": { "type": "number" },
"timestamp": { "type": "number" }
}
}
}
}
}
API call
curl -X POST \
https://api.vidnavigator.com/v1/extract/video \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"video_url": "https://www.youtube.com/watch?v=RECIPE_ID",
"schema": { /* schema above */ }
}'
Validated output
{
"status": "success",
"data": {
"title": "Weeknight Miso Pasta",
"yield_servings": 2,
"total_time_min": 25,
"ingredients": [
{ "name": "dried spaghetti", "quantity": "200 g" },
{ "name": "white miso paste", "quantity": "2 tbsp" },
{ "name": "garlic", "quantity": "3 cloves", "notes": "minced" },
{ "name": "unsalted butter", "quantity": "30 g" },
{ "name": "pasta water", "quantity": "1/2 cup", "notes": "reserved" }
],
"steps": [
{ "order": 1, "instruction": "Boil pasta in salted water until al dente", "duration_min": 9, "timestamp": 40 },
{ "order": 2, "instruction": "Foam butter in pan and bloom the minced garlic", "duration_min": 2, "timestamp": 165 },
{ "order": 3, "instruction": "Whisk miso with a ladle of pasta water to loosen", "duration_min": 1, "timestamp": 220 },
{ "order": 4, "instruction": "Add drained pasta, toss with miso butter, emulsify with water", "duration_min": 3, "timestamp": 260 }
]
}
}
Takeaway: Drop the extracted record straight into a recipe CMS or an import API — the schema matches schema.org's Recipe shape closely enough that you can also emit rich-result JSON-LD from the same payload.
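The JSON-LD emission is a straight field mapping. A sketch of that mapping in Python — the schema.org property names (name, recipeYield, totalTime, recipeIngredient, recipeInstructions, HowToStep) are real, but the mapping itself is an assumption based on the schema above, not a complete rich-results implementation:

```python
def to_json_ld(record):
    """Map an extracted recipe record onto a schema.org Recipe object."""
    return {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": record["title"],
        "recipeYield": f'{record.get("yield_servings", "")} servings',
        # schema.org expects an ISO 8601 duration, e.g. PT25M.
        "totalTime": f'PT{record.get("total_time_min", 0)}M',
        "recipeIngredient": [
            f'{i["quantity"]} {i["name"]}' for i in record["ingredients"]
        ],
        "recipeInstructions": [
            {"@type": "HowToStep", "text": s["instruction"]}
            for s in sorted(record["steps"], key=lambda s: s["order"])
        ],
    }

# Trimmed version of the validated output above.
record = {
    "title": "Weeknight Miso Pasta",
    "yield_servings": 2,
    "total_time_min": 25,
    "ingredients": [{"name": "dried spaghetti", "quantity": "200 g"}],
    "steps": [{"order": 1, "instruction": "Boil pasta in salted water until al dente"}],
}
json_ld = to_json_ld(record)
```

Serialize json_ld into a script tag of type application/ld+json and the same extraction powers both your app and your search snippets.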
Example 5 — Recorded customer calls flagged for compliance keywords
Source: Sales-call recordings, customer-support sessions, fintech onboarding videos
Regulated industries (fintech, healthcare, insurance) require either a human listener or a keyword spotter on every recorded conversation. A schema that returns risk-flag categories with timestamps replaces that manual review step with a reviewable audit trail, even when speaker attribution is not available.
Schema
{
"type": "object",
"required": ["call_id", "flags"],
"properties": {
"call_id": { "type": "string" },
"flags": {
"type": "array",
"items": {
"type": "object",
"required": ["category", "quote", "start_sec"],
"properties": {
"category": { "type": "string", "enum": ["guarantee", "personal_data", "complaint", "competitor_mention", "pricing_deviation"] },
"quote": { "type": "string" },
"start_sec": { "type": "number" },
"end_sec": { "type": "number" }
}
}
},
"overall_risk": { "type": "string", "enum": ["low", "medium", "high"] }
}
}
API call
curl -X POST \
https://api.vidnavigator.com/v1/extract/video \
-H 'Content-Type: application/json' \
-H 'X-API-Key: YOUR_API_KEY' \
-d '{
"video_url": "https://storage.example.com/calls/2026-04-17/call-8812.mp4",
"schema": { /* schema above */ }
}'
Validated output
{
"status": "success",
"data": {
"call_id": "call-8812",
"flags": [
{
"category": "guarantee",
"quote": "I can promise you'll see returns of at least 8% in the first year",
"start_sec": 412,
"end_sec": 421
},
{
"category": "pricing_deviation",
"quote": "We can waive the setup fee if you sign today",
"start_sec": 560,
"end_sec": 568
}
],
"overall_risk": "high"
}
}
Takeaway: Route records with overall_risk = "high" to a human reviewer automatically. Each flag carries its own timestamp so the reviewer jumps straight to the 9-second window in question — audit review time drops from minutes per call to seconds, without pretending the system can always identify the speaker correctly.
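The routing rule itself is a few lines once the record is validated. A sketch, assuming the low/medium/high enum above; the queue names and the "guarantees always escalate" policy are hypothetical:

```python
def route(record):
    """Return the review queue for a validated compliance record."""
    if record["overall_risk"] == "high":
        return "human_review"
    # Hypothetical policy: guarantee flags escalate regardless of risk tier.
    if any(f["category"] == "guarantee" for f in record.get("flags", [])):
        return "human_review"
    return "auto_archive"

record = {
    "call_id": "call-8812",
    "overall_risk": "high",
    "flags": [{"category": "guarantee", "start_sec": 412}],
}
```

Because both overall_risk and category are schema-enforced enums, this routing logic never needs a fallback branch for unexpected strings.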
Patterns you will keep reusing
- Enums for categorical fields. Verdicts, risk tiers, guidance directions — if a field only has a handful of valid values, make it an enum. That single constraint prevents 90% of the "my downstream query broke" bugs.
- Timestamps on every evidence field. Add start_sec/timestamp even when you think you will not use it. Your future self will want to render deep-links and run retrieval evaluations — the moment you need it and do not have it, re-indexing is painful.
- Required vs optional, stated clearly. Required fields fail loudly when evidence is missing. That is what you want — silent nulls at ingestion time corrupt downstream joins and dashboards.
- Human-readable descriptions on each field. The extractor reads them as hints. A clear description improves recall more than any amount of prompt engineering on the caller side.
- Nested objects, not stringified JSON. If you need a structured sub-object (like the quotes array or the flags array above), express it as a real object in the schema. Do not fake structure with a string that later needs to be parsed again.
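The first pattern — fail loudly on anything outside the enum — can be sketched as a minimal ingestion guard, assuming the verdict enum from Example 1:

```python
VERDICTS = {"buy", "skip", "situational"}

def ingest(row):
    """Reject any row whose verdict is not one of the schema's enum values."""
    if row.get("verdict") not in VERDICTS:
        raise ValueError(f"invalid verdict: {row.get('verdict')!r}")
    return row
```

With schema-validated extraction the guard should never fire; keeping it anyway means that if an upstream contract ever changes, the pipeline stops with a clear error instead of quietly storing "Strong Buy" next to "buy".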
Recap
- Schema-driven extraction replaces prompt engineering for structured video data.
- Five starter schemas above cover reviews, earnings calls, panel moments, recipes, and compliance.
- Pattern: enums, timestamps, clear required/optional split, nested objects, field descriptions.
- Input is a public URL or an uploaded file — the call shape is the same either way.
- Costs scale per video, not per field; concurrency is handled server-side.
Read more about the underlying endpoint in the Video Data Extraction solution or in the Video Data Extraction API deep-dive.