Metadata Extraction

Extract comprehensive metadata from any webpage including Open Graph tags, Twitter Cards, Schema.org structured data, icons, images, links, and advanced intelligence like technology stack detection and social media handles.

Endpoint

GET /metadata

Authentication

This endpoint requires API key authentication. Include your API key in the X-API-Key header:

X-API-Key: YOUR_API_KEY

Request Parameters

Required Parameters

Parameter	Type	Description
url	string	The public URL to extract metadata from. Must be a valid HTTP/HTTPS URL. Cannot be localhost, private IPs, or internal domains (SSRF protection).

Optional Parameters

Parameter	Type	Default	Constraints	Description
fields	string	all	Comma-separated	Field groups to include in response. See Field Groups section for available options.
device	string	desktop	mobile, desktop	Device type for user agent. Affects how content is rendered/extracted.
max_images	number	25	Min: 1, Max: 100	Maximum number of images to return in the response.
max_links	number	100	Min: 1, Max: 500	Maximum number of links to return in the response.

Field Groups

Field groups let you request only the types of metadata you need, which reduces payload size and improves response time.

Available Field Groups

Field Group	Description	Payload Size
basic	Core metadata: title, description, charset, language, canonical, keywords, author, publisher	~10%
og	Open Graph metadata for social sharing (og:title, og:image, og:description, etc.)	~5%
twitter	Twitter Card metadata (twitter:card, twitter:image, twitter:title, etc.)	~5%
schema	Schema.org structured data (JSON-LD) with validation summary	~15%
media	Images extracted from the page with metadata (url, alt, dimensions)	~20%
icons	Favicons and app icons (favicon.ico, apple-touch-icon, manifest icons)	~5%
links	Links extracted from the page (internal and external with metadata)	~15%
http	HTTP-level information (status, headers, redirects, cache control, ETag)	~8%
indexability	Indexability hints (robots meta, X-Robots-Tag, canonical validation)	~5%
counters	Content counters (heading counts, word count, reading time)	~3%
tier1	Essential web intelligence (includes: basic, og, twitter, viewport, theme_color, generator, robots, manifest_url)	~15%
tier2	Smart meta context (includes: content analysis, images, links, schema, counters)	~35%
tier3	Advanced intelligence (dominant_color, social_handles, tech_stack)	~10%
all	Include all available fields (default if no fields parameter provided)	100%

Field Group Combinations

You can combine multiple field groups by separating them with commas:

# Only basic metadata and social tags
?fields=basic,og,twitter

# Essential web intelligence + structured data
?fields=tier1,schema

# Core metadata with media and links
?fields=basic,media,links
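
As a minimal sketch of how such a query string could be assembled in client code: the helper name build_metadata_url is hypothetical, and the base URL is taken from the example request in this document. Only the standard library is used, and the target URL is percent-encoded so that its own characters cannot be confused with the API's parameters.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.webpeek.dev/metadata"  # host taken from the example request


def build_metadata_url(target_url, field_groups=None):
    """Return the request URL for target_url, optionally limited to field_groups."""
    params = {"url": target_url}
    if field_groups:
        # The fields parameter is a single comma-separated string, e.g. "basic,og,twitter"
        params["fields"] = ",".join(field_groups)
    # urlencode percent-encodes both the target URL and the comma separators
    return f"{BASE_URL}?{urlencode(params)}"


request_url = build_metadata_url("https://github.com", ["tier1", "schema"])
print(request_url)
```

Omitting field_groups leaves the fields parameter out entirely, which corresponds to the default of all.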

Performance Optimization

Combining field filtering with compression can reduce payload size by 90-97%:

• Using fields=basic,og,twitter reduces payload to ~20% of full size
• Automatic gzip/brotli compression (for responses > 1KB) provides additional 70-80% reduction
• Combined effect: ~3-10% of original uncompressed full payload size
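
The combined figure follows from multiplying the two effects. The sketch below works through the arithmetic for the fields=basic,og,twitter case using only the percentages quoted above; the function name is illustrative, and real compression ratios depend on the content, so treat the result as an estimate.

```python
def combined_fraction(filtered_fraction, compression_reduction):
    """Fraction of the full uncompressed payload left after filtering and compression."""
    return filtered_fraction * (1.0 - compression_reduction)


filtered = 0.20  # fields=basic,og,twitter: ~20% of the full payload
best = combined_fraction(filtered, 0.80)   # compression removes 80%
worst = combined_fraction(filtered, 0.70)  # compression removes 70%
print(f"{best:.0%} to {worst:.0%} of the full uncompressed payload")
```

For this field selection the estimate is about 4% to 6%, which falls inside the ~3-10% range stated above.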

Example Request

curl -H "X-API-Key: YOUR_API_KEY" "https://api.webpeek.dev/metadata?url=https://github.com"

Example Response

{
  "url": "https://github.com",
  "final_url": "https://github.com/",
  "fetched_at": "2025-11-06T19:27:12.487Z",
  "title": "GitHub · Change is constant. GitHub keeps you ahead. · GitHub",
  "description": "Join the world's most widely adopted, AI-powered developer platform where millions of developers, businesses, and the largest open source community build software that advances humanity.",
  "site_name": "GitHub",
  "language": "en",
  "charset": "utf-8",
  "canonical": {
    "href": "https://github.com/",
    "is_self_canonical": true,
    "resolves": true,
    "status": 200
  },
  "viewport": "width=device-width",
  "theme_color": "#1e2327",
  "og": {
    "title": "GitHub · Change is constant. GitHub keeps you ahead.",
    "description": "Join the world's most widely adopted, AI-powered developer platform...",
    "image": "https://images.ctfassets.net/8aevphvgewt8/4pe4eOtUJ0ARpZRE4fNekf/f52b1f9c52f059a33170229883731ed0/GH-Homepage-Universe-img.png",
    "url": "https://github.com/",
    "type": "object",
    "site_name": "GitHub"
  },
  "twitter": {
    "card": "summary_large_image",
    "title": "GitHub · Change is constant. GitHub keeps you ahead.",
    "description": "Join the world's most widely adopted, AI-powered developer platform...",
    "image": "https://images.ctfassets.net/8aevphvgewt8/4pe4eOtUJ0ARpZRE4fNekf/f52b1f9c52f059a33170229883731ed0/GH-Homepage-Universe-img.png",
    "site": "github"
  },
  "images": [...],
  "links": [...],
  "dominant_color": "#1e2327",
  "social_handles": {
    "twitter": "github",
    "instagram": "github",
    "linkedin": "github",
    "github": "features"
  },
  "tech_stack": [
    "React"
  ],
  "http": {
    "final_url": "https://github.com/",
    "status": 200,
    "content_type": "text/html; charset=utf-8",
    "response_size_bytes": 565552,
    "etag": "W/\"e9244edc36b5ba0c811237312ee5d19a\"",
    "cache_control": "max-age=0, private, must-revalidate",
    "redirect_chain": []
  },
  "extraction": {
    "method": "raw",
    "ua_family": "desktop"
  },
  "icons": [...],
  "indexability": {
    "robots_meta": null,
    "x_robots_tag": null,
    "robots_txt_allowed": true,
    "canonical_valid": true,
    "robots_effective": "index, follow"
  },
  "manifest_url": "https://github.com/manifest.json",
  "content": {
    "content_type": "documentation",
    "word_count": 2421,
    "read_time_min": 13,
    "heading_counts": {
      "h1": 4,
      "h2": 10,
      "h3": 16,
      "h4": 0,
      "h5": 0,
      "h6": 0
    }
  },
  "schema_org_summary": {
    "types": [],
    "valid": false,
    "errors_count": 0,
    "warnings_count": 0
  },
  "truncated": {
    "images": false,
    "links": true
  }
}

Detailed Field Group Examples

Here are detailed examples of what each field group returns:

Images Field

Requested with the media field group. Returns images with metadata including URL, alt text, and optional dimensions.

{
  "images": [
    {
      "url": "https://images.ctfassets.net/8aevphvgewt8/4IfncsgGkGPFESlWXlAWfU/6f671f0ff761cb276c2effced9dca773/eyebrow-banner-duck.png",
      "alt": "Duck mascot"
    },
    {
      "url": "https://github.githubassets.com/assets/particles-de1dd20f3008.png",
      "alt": ""
    },
    {
      "url": "https://images.ctfassets.net/8aevphvgewt8/g1XhuDG7foMyNWIEIpGJj/b40a790b854d0803704deb7905b1bd82/logo-duolingo-14477f9e54a6.svg",
      "alt": "Duolingo",
      "height": 32
    }
  ],
  "truncated": {
    "images": false,
    "links": false
  }
}

Icons Field

Returns all favicon and app icon variations found on the page.

{
  "icons": [
    {
      "rel": "fluid-icon",
      "url": "https://github.com/fluidicon.png"
    },
    {
      "rel": "mask-icon",
      "url": "https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg"
    },
    {
      "rel": "alternate icon",
      "type": "image/png",
      "url": "https://github.githubassets.com/favicons/favicon.png"
    },
    {
      "rel": "icon",
      "type": "image/svg+xml",
      "url": "https://github.githubassets.com/favicons/favicon.svg"
    }
  ]
}

Links Field

Returns links with href, rel, type, and external flag indicating whether the link points to an external domain.

{
  "links": [
    {
      "href": "https://github.com/#start-of-content",
      "rel": null,
      "type": "text/html",
      "external": false
    },
    {
      "href": "https://github.com/",
      "rel": null,
      "type": "text/html",
      "external": false
    },
    {
      "href": "https://docs.github.com/",
      "rel": "noreferrer",
      "type": "text/html",
      "external": true
    },
    {
      "href": "https://github.blog/",
      "rel": "noreferrer",
      "type": "text/html",
      "external": true
    }
  ],
  "truncated": {
    "images": false,
    "links": true
  }
}

HTTP Field

Returns HTTP response information including status, content type, response size, caching headers, and redirect chain.

{
  "http": {
    "final_url": "https://github.com/",
    "status": 200,
    "content_type": "text/html; charset=utf-8",
    "response_size_bytes": 565552,
    "etag": "W/\"e9244edc36b5ba0c811237312ee5d19a\"",
    "cache_control": "max-age=0, private, must-revalidate",
    "redirect_chain": []
  }
}

Indexability Field

Returns indexability information including robots meta tags, canonical URL validation, and effective indexability status.

{
  "indexability": {
    "robots_meta": null,
    "x_robots_tag": null,
    "robots_txt_allowed": true,
    "canonical_valid": true,
    "robots_effective": "index, follow"
  }
}

Content Field

Returns content analysis including content type, word count, reading time estimate, and heading counts.

{
  "content": {
    "content_type": "documentation",
    "word_count": 2421,
    "read_time_min": 13,
    "heading_counts": {
      "h1": 4,
      "h2": 10,
      "h3": 16,
      "h4": 0,
      "h5": 0,
      "h6": 0
    }
  }
}

Social Handles & Tech Stack

Requested with the tier3 field group. Returns extracted social media handles, detected technologies, and the dominant color of the page.

{
  "dominant_color": "#1e2327",
  "social_handles": {
    "twitter": "github",
    "instagram": "github",
    "linkedin": "github",
    "github": "features"
  },
  "tech_stack": [
    "React"
  ]
}

Schema.org Data

Requested with the schema field group. Returns a summary of the Schema.org structured data found on the page.

{
  "schema_org_summary": {
    "types": [],
    "valid": false,
    "errors_count": 0,
    "warnings_count": 0
  }
}

Use Cases

• Link preview generation
• Social media sharing optimization
• Content aggregation
• Website monitoring
• SEO analysis tools

Code Examples

Here's how to use the metadata endpoint in different languages:

JavaScript / Node.js

metadata.js

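// Assumption: the API key is stored in the WEBPEEK_API_KEY environment variable
const API_KEY = process.env.WEBPEEK_API_KEY;

async function getMetadata(url) {
  const response = await fetch(
    `https://api.webpeek.dev/metadata?url=${encodeURIComponent(url)}&fields=basic,og,twitter`,
    // The endpoint requires the API key in the X-API-Key header
    { headers: { 'X-API-Key': API_KEY } }
  );

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }

  return response.json();
}

// Example usage (top-level await requires an ES module)
const metadata = await getMetadata('https://github.com');
console.log(metadata.title);
console.log(metadata.og.image);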

Python

metadata.py

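import os

import requests

# Assumption: the API key is stored in the WEBPEEK_API_KEY environment variable
API_KEY = os.environ['WEBPEEK_API_KEY']

def get_metadata(url):
    response = requests.get(
        'https://api.webpeek.dev/metadata',
        params={
            'url': url,
            'fields': 'basic,og,twitter'
        },
        # The endpoint requires the API key in the X-API-Key header
        headers={'X-API-Key': API_KEY},
        # Client-side timeout so a stalled connection cannot hang the caller
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example usage
metadata = get_metadata('https://github.com')
print(metadata['title'])
print(metadata['og']['image'])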

Error Responses

400 Bad Request

Returned when the request parameters are invalid.

{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "url: Invalid URL format"
}

Common causes:

• Invalid URL format
• URL violates SSRF protection (localhost, private IPs)
• Invalid fields parameter value
• Invalid device parameter value
• max_images or max_links out of range
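
Most of these causes can be caught before a request is sent. The following is a minimal client-side sketch, assuming the parameter constraints and field group names listed earlier in this document; validate_params is a hypothetical helper, and the API remains the authority on what it accepts (SSRF rules in particular are not checked here).

```python
from urllib.parse import urlsplit

ALLOWED_DEVICES = {"mobile", "desktop"}
FIELD_GROUPS = {
    "basic", "og", "twitter", "schema", "media", "icons", "links",
    "http", "indexability", "counters", "tier1", "tier2", "tier3", "all",
}


def validate_params(url, fields=None, device="desktop", max_images=25, max_links=100):
    """Return a list of problems; an empty list means the parameters look valid."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        problems.append("url must be an absolute HTTP/HTTPS URL")
    if fields is not None:
        unknown = [f for f in fields.split(",") if f.strip() not in FIELD_GROUPS]
        if unknown:
            problems.append("unknown field groups: " + ", ".join(unknown))
    if device not in ALLOWED_DEVICES:
        problems.append("device must be 'mobile' or 'desktop'")
    if not 1 <= max_images <= 100:
        problems.append("max_images must be between 1 and 100")
    if not 1 <= max_links <= 500:
        problems.append("max_links must be between 1 and 500")
    return problems


print(validate_params("https://example.com", fields="basic,og", max_links=50))
print(validate_params("ftp://example.com", device="tablet", max_images=0))
```

Returning a list rather than raising on the first problem lets a caller report every issue at once.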

401 Unauthorized

Returned when authentication fails.

{
  "statusCode": 401,
  "error": "Unauthorized",
  "message": "Invalid or missing authentication token"
}

500 Internal Server Error

Returned when the server encounters an unexpected error.

{
  "statusCode": 500,
  "error": "Internal Server Error",
  "message": "Failed to fetch URL: Request timeout after 10000ms"
}

Common causes:

• Target URL timeout (>10 seconds)
• Target URL returns non-HTML content
• Target URL is unreachable
• Network errors

Best Practices

1. Use Field Filtering for Performance

Always request only the fields you need to minimize payload size and improve response times:

# Good: Request only what you need
?url=https://example.com&fields=basic,og,twitter

# Less optimal: Request everything (default)
?url=https://example.com

2. Set Appropriate Limits

Use max_images and max_links parameters to control response size:

# Limit images and links for faster responses
?url=https://example.com&max_images=10&max_links=50

3. Handle Redirects

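The API automatically follows redirects. In the example response above, url and final_url differ only by a trailing slash while http.redirect_chain is empty, so a strict string comparison can report a redirect that did not happen. When the http field group is included, check the redirect chain instead:

if (response.http?.redirect_chain?.length > 0) {
  console.log(`Redirected to: ${response.final_url}`);
}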

4. Check Truncation Flags

Always check the truncated object to know if arrays were limited:

if (response.truncated?.images) {
  console.log('Note: Image list was truncated');
}
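
One way to act on a truncation flag is to retry with a larger limit. The sketch below assumes the response shape shown in the examples and the documented maximums (100 images, 500 links); params_to_raise is a hypothetical helper, and doubling the limit is an arbitrary illustrative policy rather than a recommendation from the API.

```python
# Maps each truncatable list to its request parameter and documented maximum
LIMITS = {"images": ("max_images", 100), "links": ("max_links", 500)}


def params_to_raise(metadata, current):
    """Suggest larger max_* values for every list the response marks as truncated."""
    truncated = metadata.get("truncated") or {}
    suggestions = {}
    for name, (param, cap) in LIMITS.items():
        if truncated.get(name) and current[param] < cap:
            # Double the limit, but never exceed the documented maximum
            suggestions[param] = min(cap, current[param] * 2)
    return suggestions


metadata = {"truncated": {"images": False, "links": True}}
print(params_to_raise(metadata, {"max_images": 25, "max_links": 100}))
```

An empty result means either nothing was truncated or the relevant limit is already at its maximum.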

5. Validate Canonical URLs

Use the canonical metadata to check for canonical issues:

if (!response.canonical?.is_self_canonical) {
  console.warn('Page has non-self-referencing canonical URL');
}
if (!response.canonical?.resolves) {
  console.error('Canonical URL does not resolve');
}

6. Monitor Indexability

Check indexability hints to understand SEO implications:

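// Match whole directives: a substring test would find "index" inside "noindex"
const directives = (response.indexability?.robots_effective ?? '').split(',').map((d) => d.trim());
const indexable = !directives.includes('noindex');
const followable = !directives.includes('nofollow');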

7. Error Handling

Always implement proper error handling:

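async function fetchMetadata(targetUrl, apiKey) {
  try {
    const response = await fetch(
      `https://api.webpeek.dev/metadata?url=${encodeURIComponent(targetUrl)}`,
      { headers: { 'X-API-Key': apiKey } }
    );

    if (!response.ok) {
      // Error bodies carry statusCode, error, and message (see Error Responses)
      const error = await response.json();
      console.error(`Error ${error.statusCode}: ${error.message}`);
      return null;
    }

    const metadata = await response.json();
    // Process metadata...
    return metadata;
  } catch (error) {
    // Reached on network failures and on bodies that are not valid JSON
    console.error('Request failed:', error);
    return null;
  }
}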

Rate Limiting

The metadata endpoint is subject to rate limiting based on your subscription plan:

• Free tier: 100 requests/day
• Pro tier: 10,000 requests/day
• Enterprise tier: Custom limits

Rate limit information is included in response headers:

X-RateLimit-Limit: 10000
X-RateLimit-Remaining: 9543
X-RateLimit-Reset: 1704537600
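
A client can use these headers to pause before it exhausts its quota. The sketch below assumes that X-RateLimit-Reset is a Unix timestamp in seconds, which the sample value suggests but this document does not state; seconds_until_reset is a hypothetical helper that accepts any mapping of header names to values.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return 0.0 if requests remain, else the seconds to wait until the limit resets."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: the reset value is a Unix timestamp in seconds
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    now = time.time() if now is None else now
    return max(0.0, float(reset_at - now))


exhausted = {
    "X-RateLimit-Limit": "10000",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1704537600",
}
# now is passed explicitly here so the example is deterministic
print(seconds_until_reset(exhausted, now=1704537000))
```

The optional now argument exists only to make the function easy to test; in normal use it defaults to the current time.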

Caching

The API implements intelligent caching:

• Results are cached based on URL and field parameters
• Cache duration: 1 hour (configurable per plan)
• Cached responses include X-Cache: HIT header
• Use ETag and Cache-Control headers from the response for client-side caching
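
To avoid repeating identical requests, a client can keep its own short-lived cache using the same key the API describes (URL plus field parameters). This is a minimal in-memory sketch, not a description of the server's implementation: MetadataCache is a hypothetical class, the one-hour default mirrors the duration quoted above, and treating differently ordered field lists as the same key is an assumption.

```python
import time


class MetadataCache:
    """Tiny in-memory cache keyed by (url, requested field groups)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    @staticmethod
    def _key(url, fields):
        # Assumption: "og,basic" and "basic,og" describe the same request
        return url, tuple(sorted(f.strip() for f in fields.split(",")))

    def get(self, url, fields="all"):
        entry = self._entries.get(self._key(url, fields))
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            return None  # entry has expired
        return value

    def put(self, url, value, fields="all"):
        self._entries[self._key(url, fields)] = (self._clock(), value)


cache = MetadataCache()
cache.put("https://github.com", {"title": "GitHub"}, fields="basic,og")
print(cache.get("https://github.com", fields="og,basic"))
```

The clock is injectable so that expiry can be exercised in tests without waiting.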

Security

SSRF Protection

The API implements Server-Side Request Forgery (SSRF) protection:

• Blocks localhost addresses (127.0.0.1, ::1, localhost)
• Blocks private IP ranges (10.x.x.x, 192.168.x.x, 172.16-31.x.x)
• Blocks internal metadata services (metadata.google.internal)
• Only allows HTTP and HTTPS protocols
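
Clients can reject obviously blocked URLs before spending a request on them. The sketch below mirrors the rules listed above using Python's standard ipaddress module; is_obviously_blocked is a hypothetical helper. It does not resolve DNS, so a hostname that resolves to a private address is not caught here and is left to the API's own checks.

```python
import ipaddress
from urllib.parse import urlsplit

BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_obviously_blocked(url):
    """Return True if url would clearly be rejected by the rules listed above."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return True
    host = parts.hostname.lower()
    if host in BLOCKED_HOSTNAMES:
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A hostname rather than an IP literal; it is not resolved here
        return False
    return address.is_loopback or address.is_private


for candidate in ("https://github.com", "http://192.168.1.10/admin", "http://[::1]/"):
    print(candidate, is_obviously_blocked(candidate))
```

Note that is_private covers more ranges than the three listed above (link-local addresses, for example), so this check is slightly stricter than the documented list.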

Content Security

• Maximum response size: 50MB
• Request timeout: 10 seconds
• Content-Type validation (must be text/html)
• Automatic sanitization of extracted data

FAQ

What happens if the URL requires authentication?

The metadata endpoint can only extract metadata from publicly accessible URLs. If a page requires authentication, the API will receive the login page HTML instead of the actual content.

Can I extract metadata from JavaScript-rendered pages?

By default, the API uses raw HTML extraction (reported as "method": "raw" in the extraction object of the response). For JavaScript-rendered content, contact support about enabling rendered extraction for your account.

Why is my tech_stack array empty?

Technology detection is heuristic-based and may not detect all technologies, especially if:

• The technology leaves no detectable footprint in HTML
• Custom or proprietary technologies are used
• Technology indicators are obfuscated

How are social handles normalized?

Social handles are extracted from meta tags and page links, then normalized:

• @ symbols are removed
• Converted to lowercase
• Only the username is kept (not full URLs)
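
The three rules can be illustrated with a short function. This is a sketch of equivalent client-side normalization, not the API's actual implementation: normalize_handle is a hypothetical name, and taking the last path segment of a profile URL as the username is an assumption.

```python
from urllib.parse import urlsplit


def normalize_handle(raw):
    """Normalize a handle or profile URL to a lowercase username."""
    value = raw.strip()
    if "://" in value:
        # Assumption: the username is the last non-empty path segment of the URL
        segments = [s for s in urlsplit(value).path.split("/") if s]
        value = segments[-1] if segments else ""
    # Drop a leading @ and lowercase the result
    return value.lstrip("@").lower()


for raw in ("@GitHub", "https://twitter.com/GitHub", "https://www.linkedin.com/company/github/"):
    print(raw, "->", normalize_handle(raw))
```

All three inputs above normalize to github.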

What's the difference between title and og.title?

• title: Extracted from the <title> tag (what appears in browser tab)
• og.title: Open Graph title for social sharing (may be different for better social media display)

Can I extract metadata from PDF files?

No, the endpoint only supports HTML pages. PDFs and other document formats are not supported.

How accurate is the read_time_min calculation?

Reading time is estimated at 200 words per minute, a commonly used approximation of average reading speed. Actual reading time varies with content complexity and reader proficiency.
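
As a worked example of the estimate: the sketch below divides the word count by 200 and rounds up to a whole minute. Rounding up is an assumption, chosen because it reproduces the example response in this document (word_count 2421, read_time_min 13); the API's exact rounding rule is not documented here.

```python
import math

WORDS_PER_MINUTE = 200  # rate stated in this FAQ entry


def read_time_min(word_count):
    """Estimate reading time in whole minutes, rounding up (assumed rounding rule)."""
    return math.ceil(word_count / WORDS_PER_MINUTE)


# 2421 / 200 = 12.105, which rounds up to 13
print(read_time_min(2421))
```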