Welcome to the Schematron series, Inference.net's long‑context extraction models specialized in converting noisy HTML into clean, typed JSON that conforms to your custom schema. The Schematron series was purpose‑trained for web scraping, data ingestion, and transforming arbitrary pages into structured records.
We're releasing these models in two sizes: Schematron-8B and Schematron-3B.
[!NOTE] This model card is dedicated to the smaller `Schematron-3B` model. Check out `Schematron-8B` for the larger model.
[!NOTE] The JSON Schema passed as input needs to conform to the schema.org schema.
We evaluated extraction quality using Gemini 2.5 Pro as a judge, scoring extractions from 1-5 where 5 represents perfect extraction.
| Model | LLM-as-Judge Score |
|---|---|
| GPT-4.1 | 4.74 |
| Schematron-8B | 4.64 |
| Schematron-3B | 4.41 |
| Gemini-3B-Base | 2.24 |
We evaluated Schematron's real-world impact on LLM factuality using SimpleQA.
Test Pipeline:

Key findings:
Use these local snippets to prepare HTML and compose a schema‑guided prompt. The model returns strictly valid JSON; validate it against your schema downstream.
```python
# On lxml 5.2+, this import needs the separate lxml_html_clean package installed.
from lxml.html.clean import Cleaner
import lxml.html as LH

# Drop scripts, JavaScript, and styling, but keep all other attributes.
HTML_CLEANER = Cleaner(
    scripts=True,
    javascript=True,
    style=True,
    inline_style=True,
    safe_attrs_only=False,
)


def strip_noise(html: str) -> str:
    """Remove scripts, styles, and JavaScript from HTML using lxml."""
    if not html or not html.strip():
        return ""
    try:
        doc = LH.fromstring(html)
        cleaned = HTML_CLEANER.clean_html(doc)
        return LH.tostring(cleaned, encoding="unicode")
    except Exception:
        # Best effort: treat unparseable input as empty.
        return ""
```
Compose messages with your schema and cleaned HTML:
```python
def construct_messages(schema: str, html: str):
    """Construct messages for a schema‑guided extraction request."""
    response_prompt = {
        "prompt_part_one": (
            "You are going to be given a JSON schema following the standardized JSON "
            "Schema format. You are going to be given a HTML page and you are going "
            "to apply the schema to the HTML page however you see it as applicable "
            "and return the results in a JSON object. The schema is as follows:"
        ),
        "prompt_part_two": "Here is the HTML page:",
        "prompt_part_three": "MAKE SURE ITS VALID JSON.",
    }
    user_prompt = (
        response_prompt['prompt_part_one']
        + "\n\n" + schema + "\n\n"
        + response_prompt['prompt_part_two']
        + "\n\n" + html + "\n\n"
        + response_prompt['prompt_part_three']
    )
    return [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": user_prompt},
    ]
```
[!NOTE] In the serverless API there's no need to pass anything but the HTML. We handle the prompt formatting for you.
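As a rough illustration of the "validate it against your schema downstream" step, the sketch below serializes a schema to the string that `construct_messages` expects and then checks a model response using only the standard library. The `PRODUCT_SCHEMA` fields and the `check_extraction` helper are hypothetical names introduced for this example, and the check covers only top-level `required` keys and primitive `type` values. A full JSON Schema validator would be the more thorough choice in production.

```python
import json

# Hypothetical example schema; the field names are illustrative only.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "tags": {"type": "array"},
    },
    "required": ["name", "price"],
}

# Schema serialized to a string, the form taken by the `schema` argument.
SCHEMA_STR = json.dumps(PRODUCT_SCHEMA, indent=2)

# Maps JSON Schema primitive type names to Python types.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_extraction(raw: str, schema: dict) -> dict:
    """Parse model output and check top-level required keys and types.

    Assumption: only a small subset of JSON Schema is handled here
    (top-level `required` and primitive `type`); nested schemas are ignored.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue  # optional field that the model left out
        type_name = spec.get("type")
        expected = _JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        if expected is None:
            continue  # unknown or composite type: skipped in this sketch
        value = data[key]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and type_name != "boolean":
            raise ValueError(f"field {key!r} should be of type {type_name!r}")
        if not isinstance(value, expected):
            raise ValueError(f"field {key!r} should be of type {type_name!r}")
    return data


if __name__ == "__main__":
    record = check_extraction('{"name": "Widget", "price": 9.5}', PRODUCT_SCHEMA)
    print(record["name"], record["price"])
```

Raising `ValueError` for both malformed JSON and schema mismatches keeps the caller's error handling to a single `except` clause, which is convenient when retrying failed extractions.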
See license in the metadata above.