Metadata Extraction
Extract comprehensive metadata from any webpage including Open Graph tags, Twitter Cards, Schema.org structured data, icons, images, links, and advanced intelligence like technology stack detection and social media handles.
Endpoint
GET /metadata
Authentication
This endpoint requires API key authentication. Include your API key in the X-API-Key header:
X-API-Key: YOUR_API_KEY
Request Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | The public URL to extract metadata from. Must be a valid HTTP/HTTPS URL. Cannot be localhost, private IPs, or internal domains (SSRF protection). |
Optional Parameters
| Parameter | Type | Default | Constraints | Description |
|---|---|---|---|---|
| fields | string | all | Comma-separated | Field groups to include in response. See Field Groups section for available options. |
| device | string | desktop | mobile, desktop | Device type for user agent. Affects how content is rendered/extracted. |
| max_images | number | 25 | Min: 1, Max: 100 | Maximum number of images to return in the response. |
| max_links | number | 100 | Min: 1, Max: 500 | Maximum number of links to return in the response. |
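The constraints in the tables above can also be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper name `build_metadata_url` is hypothetical, and the base URL is taken from the examples elsewhere in this document.

```python
from urllib.parse import urlencode

# Host taken from the example requests in this document.
BASE_URL = "https://api.webpeek.dev/metadata"
ALLOWED_DEVICES = {"mobile", "desktop"}


def build_metadata_url(url, fields=None, device=None, max_images=None, max_links=None):
    """Build a request URL, rejecting values outside the documented constraints.

    Hypothetical helper for illustration; the server performs its own validation.
    """
    params = {"url": url}
    if fields is not None:
        params["fields"] = ",".join(fields)
    if device is not None:
        if device not in ALLOWED_DEVICES:
            raise ValueError(f"device must be one of {sorted(ALLOWED_DEVICES)}")
        params["device"] = device
    if max_images is not None:
        if not 1 <= max_images <= 100:
            raise ValueError("max_images must be between 1 and 100")
        params["max_images"] = max_images
    if max_links is not None:
        if not 1 <= max_links <= 500:
            raise ValueError("max_links must be between 1 and 500")
        params["max_links"] = max_links
    # urlencode percent-encodes the target URL so it survives as a single query value.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_metadata_url("https://example.com", device="mobile", max_images=10))
```

Omitted optional parameters are left out of the query string, so the server-side defaults listed in the table still apply.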
Field Groups
Field groups let you request only specific types of metadata, which reduces payload size and improves performance when you do not need all available data.
Available Field Groups
| Field Group | Description | Payload Size |
|---|---|---|
| basic | Core metadata: title, description, charset, language, canonical, keywords, author, publisher | ~10% |
| og | Open Graph metadata for social sharing (og:title, og:image, og:description, etc.) | ~5% |
| twitter | Twitter Card metadata (twitter:card, twitter:image, twitter:title, etc.) | ~5% |
| schema | Schema.org structured data (JSON-LD) with validation summary | ~15% |
| media | Images extracted from the page with metadata (url, alt, dimensions) | ~20% |
| icons | Favicons and app icons (favicon.ico, apple-touch-icon, manifest icons) | ~5% |
| links | Links extracted from the page (internal and external with metadata) | ~15% |
| http | HTTP-level information (status, headers, redirects, cache control, ETag) | ~8% |
| indexability | Indexability hints (robots meta, X-Robots-Tag, canonical validation) | ~5% |
| counters | Content counters (heading counts, word count, reading time) | ~3% |
| tier1 | Essential web intelligence (includes: basic, og, twitter, viewport, theme_color, generator, robots, manifest_url) | ~15% |
| tier2 | Smart meta context (includes: content analysis, images, links, schema, counters) | ~35% |
| tier3 | Advanced intelligence (dominant_color, social_handles, tech_stack) | ~10% |
| all | Include all available fields (default if no fields parameter provided) | 100% |
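Because an invalid `fields` value is rejected by the API, it can help to check group names locally first. This is a small sketch under the assumption that the group names in the table above are the complete set; the helper `fields_param` is hypothetical and not part of any official client.

```python
# Group names as listed in the table above (assumed to be the complete set).
KNOWN_FIELD_GROUPS = {
    "basic", "og", "twitter", "schema", "media", "icons", "links",
    "http", "indexability", "counters", "tier1", "tier2", "tier3", "all",
}


def fields_param(groups):
    """Return a comma-separated fields value, rejecting unknown group names."""
    cleaned = [g.strip() for g in groups]
    unknown = [g for g in cleaned if g not in KNOWN_FIELD_GROUPS]
    if unknown:
        raise ValueError(f"unknown field groups: {unknown}")
    # dict.fromkeys removes duplicates while preserving the caller's order.
    return ",".join(dict.fromkeys(cleaned))


print(fields_param(["tier1", "schema"]))
```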
Field Group Combinations
You can combine multiple field groups by separating them with commas:
# Only basic metadata and social tags
?fields=basic,og,twitter
# Essential web intelligence + structured data
?fields=tier1,schema
# Core metadata with media and links
?fields=basic,media,links
Performance Optimization
Field filtering + compression can reduce payload size by 90-97%:
- Using fields=basic,og,twitter reduces payload to ~20% of full size
- Automatic gzip/brotli compression (for responses > 1KB) provides additional 70-80% reduction
- Combined effect: ~3-10% of original uncompressed full payload size
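The combined figure follows from multiplying the two effects. As a worked check using only the numbers quoted above (the function name is illustrative), the `basic,og,twitter` case comes out at roughly 4-6% of the full uncompressed payload, which sits inside the stated 3-10% range:

```python
def combined_fraction(field_fraction, compression_reduction):
    """Fraction of the full uncompressed payload left after filtering and compression."""
    return field_fraction * (1 - compression_reduction)


# fields=basic,og,twitter keeps ~20% of the payload; compression removes 70-80% of that.
best_case = combined_fraction(0.20, 0.80)
worst_case = combined_fraction(0.20, 0.70)

print(f"{best_case:.0%} to {worst_case:.0%} of the full uncompressed size")
```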
Example Request
curl -H "X-API-Key: YOUR_API_KEY" "https://api.webpeek.dev/metadata?url=https://stripe.com&fields=basic,og,twitter"
Example Response
{
  "url": "https://github.com",
  "final_url": "https://github.com/",
  "fetched_at": "2025-11-06T19:27:12.487Z",
  "title": "GitHub · Change is constant. GitHub keeps you ahead. · GitHub",
  "description": "Join the world's most widely adopted, AI-powered developer platform where millions of developers, businesses, and the largest open source community build software that advances humanity.",
  "site_name": "GitHub",
  "language": "en",
  "charset": "utf-8",
  "canonical": {
    "href": "https://github.com/",
    "is_self_canonical": true,
    "resolves": true,
    "status": 200
  },
  "viewport": "width=device-width",
  "theme_color": "#1e2327",
  "og": {
    "title": "GitHub · Change is constant. GitHub keeps you ahead.",
    "description": "Join the world's most widely adopted, AI-powered developer platform...",
    "image": "https://images.ctfassets.net/8aevphvgewt8/4pe4eOtUJ0ARpZRE4fNekf/f52b1f9c52f059a33170229883731ed0/GH-Homepage-Universe-img.png",
    "url": "https://github.com/",
    "type": "object",
    "site_name": "GitHub"
  },
  "twitter": {
    "card": "summary_large_image",
    "title": "GitHub · Change is constant. GitHub keeps you ahead.",
    "description": "Join the world's most widely adopted, AI-powered developer platform...",
    "image": "https://images.ctfassets.net/8aevphvgewt8/4pe4eOtUJ0ARpZRE4fNekf/f52b1f9c52f059a33170229883731ed0/GH-Homepage-Universe-img.png",
    "site": "github"
  },
  "images": [...],
  "links": [...],
  "dominant_color": "#1e2327",
  "social_handles": {
    "twitter": "github",
    "instagram": "github",
    "linkedin": "github",
    "github": "features"
  },
  "tech_stack": [
    "React"
  ],
  "http": {
    "final_url": "https://github.com/",
    "status": 200,
    "content_type": "text/html; charset=utf-8",
    "response_size_bytes": 565552,
    "etag": "W/\"e9244edc36b5ba0c811237312ee5d19a\"",
    "cache_control": "max-age=0, private, must-revalidate",
    "redirect_chain": []
  },
  "extraction": {
    "method": "raw",
    "ua_family": "desktop"
  },
  "icons": [...],
  "indexability": {
    "robots_meta": null,
    "x_robots_tag": null,
    "robots_txt_allowed": true,
    "canonical_valid": true,
    "robots_effective": "index, follow"
  },
  "manifest_url": "https://github.com/manifest.json",
  "content": {
    "content_type": "documentation",
    "word_count": 2421,
    "read_time_min": 13,
    "heading_counts": {
      "h1": 4,
      "h2": 10,
      "h3": 16,
      "h4": 0,
      "h5": 0,
      "h6": 0
    }
  },
  "schema_org_summary": {
    "types": [],
    "valid": false,
    "errors_count": 0,
    "warnings_count": 0
  },
  "truncated": {
    "images": false,
    "links": true
  }
}
Detailed Field Group Examples
Here are detailed examples of what each field group returns:
Images Field
Returns images with metadata including URL, alt text, and optional dimensions.
{
  "images": [
    {
      "url": "https://images.ctfassets.net/8aevphvgewt8/4IfncsgGkGPFESlWXlAWfU/6f671f0ff761cb276c2effced9dca773/eyebrow-banner-duck.png",
      "alt": "Duck mascot"
    },
    {
      "url": "https://github.githubassets.com/assets/particles-de1dd20f3008.png",
      "alt": ""
    },
    {
      "url": "https://images.ctfassets.net/8aevphvgewt8/g1XhuDG7foMyNWIEIpGJj/b40a790b854d0803704deb7905b1bd82/logo-duolingo-14477f9e54a6.svg",
      "alt": "Duolingo",
      "height": 32
    }
  ],
  "truncated": {
    "images": false,
    "links": false
  }
}
Icons Field
Returns all favicon and app icon variations found on the page.
{
  "icons": [
    {
      "rel": "fluid-icon",
      "url": "https://github.com/fluidicon.png"
    },
    {
      "rel": "mask-icon",
      "url": "https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg"
    },
    {
      "rel": "alternate icon",
      "type": "image/png",
      "url": "https://github.githubassets.com/favicons/favicon.png"
    },
    {
      "rel": "icon",
      "type": "image/svg+xml",
      "url": "https://github.githubassets.com/favicons/favicon.svg"
    }
  ]
}
Links Field
Returns links with href, rel, type, and external flag indicating whether the link points to an external domain.
{
  "links": [
    {
      "href": "https://github.com/#start-of-content",
      "rel": null,
      "type": "text/html",
      "external": false
    },
    {
      "href": "https://github.com/",
      "rel": null,
      "type": "text/html",
      "external": false
    },
    {
      "href": "https://docs.github.com/",
      "rel": "noreferrer",
      "type": "text/html",
      "external": true
    },
    {
      "href": "https://github.blog/",
      "rel": "noreferrer",
      "type": "text/html",
      "external": true
    }
  ],
  "truncated": {
    "images": false,
    "links": true
  }
}
HTTP Field
Returns HTTP response information including status, content type, response size, caching headers, and redirect chain.
{
  "http": {
    "final_url": "https://github.com/",
    "status": 200,
    "content_type": "text/html; charset=utf-8",
    "response_size_bytes": 565552,
    "etag": "W/\"e9244edc36b5ba0c811237312ee5d19a\"",
    "cache_control": "max-age=0, private, must-revalidate",
    "redirect_chain": []
  }
}
Indexability Field
Returns indexability information including robots meta tags, canonical URL validation, and effective indexability status.
{
  "indexability": {
    "robots_meta": null,
    "x_robots_tag": null,
    "robots_txt_allowed": true,
    "canonical_valid": true,
    "robots_effective": "index, follow"
  }
}
Content Field
Returns content analysis including content type, word count, reading time estimate, and heading counts.
{
  "content": {
    "content_type": "documentation",
    "word_count": 2421,
    "read_time_min": 13,
    "heading_counts": {
      "h1": 4,
      "h2": 10,
      "h3": 16,
      "h4": 0,
      "h5": 0,
      "h6": 0
    }
  }
}
Social Handles & Tech Stack
Returns extracted social media handles, detected technologies, and dominant color from the page.
{
  "dominant_color": "#1e2327",
  "social_handles": {
    "twitter": "github",
    "instagram": "github",
    "linkedin": "github",
    "github": "features"
  },
  "tech_stack": [
    "React"
  ]
}
Schema.org Data
Returns summary of Schema.org structured data found on the page.
{
  "schema_org_summary": {
    "types": [],
    "valid": false,
    "errors_count": 0,
    "warnings_count": 0
  }
}
Use Cases
- Link preview generation
- Social media sharing optimization
- Content aggregation
- Website monitoring
- SEO analysis tools
Code Examples
Here's how to use the metadata endpoint in different languages:
JavaScript / Node.js
async function getMetadata(url) {
  // Request only the basic, Open Graph, and Twitter field groups
  const response = await fetch(
    `https://api.webpeek.dev/metadata?url=${encodeURIComponent(url)}&fields=basic,og,twitter`,
    { headers: { 'X-API-Key': 'YOUR_API_KEY' } } // API key header required by this endpoint
  );
  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }
  const data = await response.json();
  return data;
}
// Example usage
const metadata = await getMetadata('https://github.com');
console.log(metadata.title);
console.log(metadata.og.image);
Python
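import requests

def get_metadata(url):
    # Request only the basic, Open Graph, and Twitter field groups
    response = requests.get(
        'https://api.webpeek.dev/metadata',
        params={
            'url': url,
            'fields': 'basic,og,twitter'
        },
        headers={'X-API-Key': 'YOUR_API_KEY'},  # API key header required by this endpoint
    )
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.json()

# Example usage
metadata = get_metadata('https://github.com')
print(metadata['title'])
print(metadata['og']['image'])
Error Responses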
400 Bad Request
Returned when the request parameters are invalid.
{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "url: Invalid URL format"
}
Common causes:
- Invalid URL format
- URL violates SSRF protection (localhost, private IPs)
- Invalid fields parameter value
- Invalid device parameter value
- max_images or max_links out of range
401 Unauthorized
Returned when authentication fails.
{
  "statusCode": 401,
  "error": "Unauthorized",
  "message": "Invalid or missing authentication token"
}
500 Internal Server Error
Returned when the server encounters an unexpected error.
{
  "statusCode": 500,
  "error": "Internal Server Error",
  "message": "Failed to fetch URL: Request timeout after 10000ms"
}
Common causes:
- Target URL timeout (>10 seconds)
- Target URL returns non-HTML content
- Target URL is unreachable
- Network errors
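The three error bodies above share the same shape (`statusCode`, `error`, `message`), so a client can parse them uniformly. The sketch below is illustrative only: `describe_error` is a hypothetical helper, and treating 500 responses as worth retrying is an assumption (a timeout may be transient, but a target that serves non-HTML content will fail again).

```python
import json

# Assumption: only 500 responses are worth retrying; 400 and 401 need a changed request.
RETRYABLE_STATUS = {500}


def describe_error(body_text):
    """Parse an error body of the documented shape into a small summary dict."""
    err = json.loads(body_text)
    status = err.get("statusCode")
    return {
        "status": status,
        "message": err.get("message", ""),
        "retry": status in RETRYABLE_STATUS,
    }


sample = '{"statusCode": 500, "error": "Internal Server Error", "message": "Failed to fetch URL: Request timeout after 10000ms"}'
print(describe_error(sample))
```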
Best Practices
1. Use Field Filtering for Performance
Always request only the fields you need to minimize payload size and improve response times:
# Good: Request only what you need
?url=https://example.com&fields=basic,og,twitter
# Less optimal: Request everything (default)
?url=https://example.com
2. Set Appropriate Limits
Use max_images and max_links parameters to control response size:
# Limit images and links for faster responses
?url=https://example.com&max_images=10&max_links=50
3. Handle Redirects
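The API automatically follows redirects. Compare final_url with url to see whether the request was redirected. This is a plain string comparison, so a normalized trailing slash also counts as a difference (in the example response, url is https://github.com and final_url is https://github.com/):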
if (response.final_url !== response.url) {
  console.log(`Redirected to: ${response.final_url}`);
}
4. Check Truncation Flags
Always check the truncated object to know if arrays were limited:
if (response.truncated?.images) {
  console.log('Note: Image list was truncated');
}
5. Validate Canonical URLs
Use the canonical metadata to check for canonical issues:
if (!response.canonical?.is_self_canonical) {
  console.warn('Page has non-self-referencing canonical URL');
}
if (!response.canonical?.resolves) {
  console.error('Canonical URL does not resolve');
}
6. Monitor Indexability
Check indexability hints to understand SEO implications:
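// Test for the negative directives: 'noindex' contains 'index' and 'nofollow' contains 'follow',
// so a plain includes('index') check would also match pages that are not indexable.
const robots = response.indexability?.robots_effective ?? '';
const indexable = !robots.includes('noindex');
const followable = !robots.includes('nofollow');
7. Error Handling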
Always implement proper error handling:
try {
  const response = await fetch('https://api.webpeek.dev/metadata?url=https://example.com', {
    headers: {
      'X-API-Key': apiKey
    }
  });
  if (!response.ok) {
    const error = await response.json();
    console.error(`Error ${error.statusCode}: ${error.message}`);
    return;
  }
  const metadata = await response.json();
  // Process metadata...
} catch (error) {
  console.error('Network error:', error);
}
Rate Limiting
The metadata endpoint is subject to rate limiting based on your subscription plan:
- Free tier: 100 requests/day
- Pro tier: 10,000 requests/day
- Enterprise tier: Custom limits
Rate limit information is included in response headers:
X-RateLimit-Limit: 10000
X-RateLimit-Remaining: 9543
X-RateLimit-Reset: 1704537600
Caching
The API implements intelligent caching:
- Results are cached based on URL and field parameters
- Cache duration: 1 hour (configurable per plan)
- Cached responses include the X-Cache: HIT header
- Use the ETag and Cache-Control headers from the response for client-side caching
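The rate-limit headers and the X-Cache header described in these two sections can be read from any response. The following is a rough sketch: `summarize_headers` is a hypothetical helper, and it assumes X-RateLimit-Reset is a Unix timestamp in seconds, which the sample value suggests but this document does not state.

```python
from datetime import datetime, timezone


def summarize_headers(headers):
    """Extract rate-limit and cache information from a dict of response headers."""
    # Header names are case-insensitive, so normalize the keys first.
    h = {k.lower(): v for k, v in headers.items()}
    reset = h.get("x-ratelimit-reset")
    return {
        "limit": int(h["x-ratelimit-limit"]) if "x-ratelimit-limit" in h else None,
        "remaining": int(h["x-ratelimit-remaining"]) if "x-ratelimit-remaining" in h else None,
        # Assumption: the reset value is Unix epoch seconds.
        "reset_at": datetime.fromtimestamp(int(reset), tz=timezone.utc) if reset else None,
        "cache_hit": h.get("x-cache", "").upper() == "HIT",
    }


info = summarize_headers({
    "X-RateLimit-Limit": "10000",
    "X-RateLimit-Remaining": "9543",
    "X-RateLimit-Reset": "1704537600",
    "X-Cache": "HIT",
})
print(info["remaining"], info["reset_at"].isoformat(), info["cache_hit"])
```

A client could pause until `reset_at` once `remaining` reaches zero instead of sending requests that are likely to be rejected.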
Security
SSRF Protection
The API implements Server-Side Request Forgery (SSRF) protection:
- Blocks localhost addresses (127.0.0.1, ::1, localhost)
- Blocks private IP ranges (10.x.x.x, 192.168.x.x, 172.16-31.x.x)
- Blocks internal metadata services (metadata.google.internal)
- Only allows HTTP and HTTPS protocols
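A client can apply a similar pre-check to avoid sending URLs that will be rejected with a 400. This sketch is not the API's actual implementation: `looks_blocked` is a hypothetical helper, it does not resolve DNS, and Python's `is_private` covers more reserved ranges than the three listed above.

```python
import ipaddress
from urllib.parse import urlsplit

# Hostnames named in the list above.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def looks_blocked(url):
    """Return True if the URL would likely be rejected by the SSRF rules above."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return True
    host = (parts.hostname or "").lower()
    if not host or host in BLOCKED_HOSTNAMES:
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a regular hostname, which this sketch does not resolve.
        return False
    return ip.is_loopback or ip.is_private


for candidate in ("https://example.com", "http://192.168.1.10/admin", "http://[::1]/"):
    print(candidate, looks_blocked(candidate))
```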
Content Security
- Maximum response size: 50MB
- Request timeout: 10 seconds
- Content-Type validation (must be text/html)
- Automatic sanitization of extracted data
FAQ
What happens if the URL requires authentication?
The metadata endpoint can only extract metadata from publicly accessible URLs. If a page requires authentication, the API will receive the login page HTML instead of the actual content.
Can I extract metadata from JavaScript-rendered pages?
By default, the API uses raw HTML extraction (method: 'raw'). For JavaScript-rendered content, contact support about enabling rendered extraction for your account.
Why is my tech_stack array empty?
Technology detection is heuristic-based and may not detect all technologies, especially if:
- The technology leaves no detectable footprint in HTML
- Custom or proprietary technologies are used
- Technology indicators are obfuscated
How are social handles normalized?
Social handles are extracted from meta tags and page links, then normalized:
- @ symbols are removed
- Converted to lowercase
- Only the username is kept (not full URLs)
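The three rules above can be approximated as follows. This is a sketch of the described behavior, not the API's actual code: `normalize_handle` is a hypothetical name, and taking the last path segment of a profile URL as the username is an assumption.

```python
from urllib.parse import urlsplit


def normalize_handle(raw):
    """Apply the three normalization rules to a raw handle or profile URL."""
    value = raw.strip()
    if "://" in value:
        # Assumption: the username is the last non-empty path segment of the URL.
        segments = [s for s in urlsplit(value).path.split("/") if s]
        value = segments[-1] if segments else ""
    # Drop leading @ symbols, then lowercase.
    return value.lstrip("@").lower()


print(normalize_handle("@GitHub"))
print(normalize_handle("https://twitter.com/GitHub"))
```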
What's the difference between title and og.title?
- title: Extracted from the <title> tag (what appears in the browser tab)
- og.title: Open Graph title for social sharing (may differ for better social media display)
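For link previews, one reasonable approach is to prefer og.title and fall back to title when no Open Graph title is present. A minimal sketch, assuming the response has been parsed into a dict shaped like the example response; `preview_title` is a hypothetical helper:

```python
def preview_title(metadata):
    """Pick a display title: the Open Graph title if present, else the <title> text."""
    og = metadata.get("og") or {}
    return og.get("title") or metadata.get("title") or ""


page = {
    "title": "GitHub · Change is constant. GitHub keeps you ahead. · GitHub",
    "og": {"title": "GitHub · Change is constant. GitHub keeps you ahead."},
}
print(preview_title(page))
```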
Can I extract metadata from PDF files?
No, the endpoint only supports HTML pages. PDFs and other document formats are not supported.
How accurate is the read_time_min calculation?
Reading time is estimated at 200 words per minute, a commonly used average reading speed. Actual reading time varies with content complexity and reader proficiency.
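As a worked example, the sample response reports a word_count of 2421 and a read_time_min of 13, which matches 2421 / 200 rounded up. The sketch below assumes that rounding rule and a one-minute minimum; neither is confirmed by this document beyond the single sample.

```python
import math


def read_time_min(word_count, words_per_minute=200):
    """Estimate reading time in whole minutes, rounding up (assumed rounding rule)."""
    return max(1, math.ceil(word_count / words_per_minute))


print(read_time_min(2421))
```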