The Content Scaling Problem
Creating high-quality SEO content at scale has always been a bottleneck. Manual content creation can't keep up with the demand for topical coverage, and traditional automation produces low-quality spam. With LLMs, we can build something different: a pipeline that generates 891 unique, well-researched articles automatically while maintaining quality comparable to human-written work.
More importantly, as AI-powered search engines (Perplexity, ChatGPT Search, Google AI Overviews) gain market share, we need to think beyond traditional SEO. Enter GEO (Generative Engine Optimization)—optimizing content to be cited by AI systems, not just ranked by traditional search algorithms.
SEO optimizes for Google's PageRank: backlinks, keywords, click-through rates.
GEO optimizes for AI citation: factual accuracy, structured data, clear definitions.
System Architecture
The content generation pipeline consists of six stages:
- Topic Research: Load target keywords with search volume data
- SERP Analysis: Fetch competitor content and "People Also Ask" boxes
- Content Extraction: Convert competitor pages to LLM-friendly markdown
- Outline Generation: Create and refine article structure using Claude
- Content Generation: Generate each section with keyword integration
- Meta & FAQ: Generate SEO metadata and FAQ schema
Stage 1: Topic Research at Scale
We start with a curated list of target keywords, each with monthly search volume (MSV) data to prioritize high-impact topics:
glossary_topics = [
{
"suggested_title": "What is an IPO? Definition & Examples",
"target_kw": "ipo",
"msv": 44000, # Monthly search volume
"secondary_kws": ["ipo meaning", "what is an ipo", "ipo definition"]
},
{
"suggested_title": "What is Generative AI (ChatGPT, Claude, Dall-E)?",
"target_kw": "what is generative ai",
"msv": 4500,
"secondary_kws": ["ai vs machine learning", "generative ai models"]
},
# ... 889 more topics loaded from Excel
]
For this project, we loaded 891 glossary topics from an Excel sheet, covering everything from "What is an IPO?" (44K monthly searches) to niche terms like "What is a reverse IPO?" (100 searches). The long tail matters for GEO: AI systems cite authoritative sources on specific topics.
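Loading and ordering this list is a small pandas step; sorting by MSV up front lets the batch work through the highest-impact topics first. A minimal sketch, assuming the spreadsheet stores secondary keywords as a comma-separated string (the exact sheet layout is an assumption):
import pandas as pd

# Load the topic sheet and sort by monthly search volume, highest first
df = pd.read_excel("glossary_topics.xlsx", sheet_name="891 Topics")
df["secondary_kws"] = df["secondary_kws"].fillna("").apply(
    lambda s: [kw.strip() for kw in str(s).split(",") if kw.strip()]
)
glossary_topics = (
    df.sort_values("msv", ascending=False)[
        ["suggested_title", "target_kw", "secondary_kws", "msv"]
    ].to_dict(orient="records")
)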
Stage 2: SERP Analysis & Competitor Research
Before generating content, we need to understand what's currently ranking. The Serper API gives us structured Google search results including the golden "People Also Ask" box:
import json
import os
import requests
def get_serp_response(serp_query, page=1):
"""Fetch Google search results via Serper API"""
url = "https://google.serper.dev/search"
payload = json.dumps({
"q": serp_query,
"autocorrect": False,
"location": "California, United States",
"num": 100,
"page": page
})
headers = {
"X-API-KEY": os.environ["SERPER_API_KEY"],
"Content-Type": "application/json",
}
response = requests.post(url, headers=headers, data=payload)
return response.json()
# Get SERP results including "People Also Ask" box
serp_response = get_serp_response("What is an IPO?", 1)
answer_box = serp_response.get('answerBox')
organic_results = serp_response['organic'][:10]
paa_box = serp_response.get("peopleAlsoAsk") # Gold for FAQ generation
The peopleAlsoAsk data is invaluable: these are the exact questions users are asking, perfect for FAQ generation and featured snippet optimization.
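If you want to inspect or log just the questions, a small helper like the one below works; it assumes each peopleAlsoAsk entry carries "question" and "snippet" fields, which is what Serper typically returns (worth verifying against your own responses). The pipeline later passes the raw paa_box straight into the FAQ prompt.
def extract_paa_questions(paa_box):
    """Collect question/snippet pairs from Serper's 'peopleAlsoAsk' list."""
    if not paa_box:
        return []
    return [
        {"question": item.get("question", ""), "snippet": item.get("snippet", "")}
        for item in paa_box
    ]

paa_questions = extract_paa_questions(paa_box)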
Prioritizing Authoritative Sources
Not all search results are equal. We prioritize high-authority financial sources so that the context we feed the model is reliable:
# Prioritize authoritative sources for the generation context
priority_domains = [
"investopedia.com",
"forbes.com/advisor/",
"nerdwallet.com",
"fool.com",
"bankrate.com",
"wallethub.com",
"valuepenguin.com",
"lendingtree.com"
]
def prioritize_urls(results, priority_domains):
"""Reorder search results to prioritize authoritative domains"""
priority_results = []
other_results = []
for result in results:
if any(domain in result['link'] for domain in priority_domains):
priority_results.append(result)
else:
other_results.append(result)
return priority_results + other_results
# Prioritize authoritative content
prioritized_results = prioritize_urls(organic_results, priority_domains)
relevant_articles = prioritized_results[:5]  # Top 5 sources
This is crucial for GEO: AI search engines weight source authority heavily. Grounding the model in Investopedia-quality content produces Investopedia-quality output.
Stage 3: Content Extraction with Jina Reader
Raw HTML is messy. Jina Reader converts any URL to clean, LLM-friendly markdown, stripping navigation, ads, and boilerplate:
def load_markdown_from_urls(urls: list, jina_prefix="https://r.jina.ai/") -> tuple:
"""
Convert web pages to clean markdown using Jina Reader.
Returns concatenated content and list of failed URLs.
"""
result = ''
urls_cannot_be_loaded = []
for url in urls:
try:
# Jina Reader converts any URL to LLM-friendly markdown
            response = requests.get(jina_prefix + url, timeout=60)
            response.raise_for_status()
            # Truncate each page to avoid context limits
            result += response.text.strip()[:40000] + '\n'
        except requests.RequestException:
            urls_cannot_be_loaded.append(url)
return result, urls_cannot_be_loaded
# Extract competitor content as training context
markdown_content, failed_urls = load_markdown_from_urls(
[x['link'] for x in relevant_articles]
)
We truncate at 40K characters per page to stay within context limits while capturing the essential content. This markdown becomes the "source material" for generation.
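If you prefer to think in tokens rather than characters, a common rough heuristic is about four characters per token for English text. The helper below is illustrative; the budget figure is an assumption, not a number from the original pipeline.
def truncate_to_token_budget(text, max_tokens=10000, chars_per_token=4):
    """Rough truncation using the ~4 characters-per-token heuristic."""
    return text[: max_tokens * chars_per_token]

page_excerpt = truncate_to_token_budget(markdown_content)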
Stage 4: Structured Outline Generation
Here's where the magic happens. Using Instructor + Pydantic, we get structured outputs from Claude instead of raw text:
from pydantic import BaseModel
from typing import List
import anthropic
import instructor
# Initialize Claude with structured outputs
client = anthropic.Anthropic()
instructor_client = instructor.from_anthropic(client)
class OutlineModel(BaseModel):
"""Structured outline for SEO article"""
overall_title: str
introduction_title: str
content_subtitles: List[str]
conclusion_title: str
def generate_outline(client, target_title, markdown_content, target_keyword, secondary_keywords):
prompt = f"""You are creating an outline for an SEO article about "{target_title}".
Content purpose: Educate readers with a clear definition, in-depth explanation,
and implications for businesses, investors, markets and the economy.
Target keyword: {target_keyword}
Secondary keywords: {secondary_keywords}
Competitor content (markdown): {markdown_content}
Create an outline with:
a. Overall Title focused on "{target_title}"
b. Introduction with definition: "{target_keyword} Definition"
c. 4+ subtitles based on keywords and competitor content
d. Conclusion with example: "{target_keyword} Example"
Use a tone mixing Investopedia & NerdWallet.
"""
outline = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
response_model=OutlineModel, # Pydantic validation!
temperature=0.5
)
return outline
The response_model=OutlineModel parameter forces Claude to return a valid Pydantic object. No parsing errors, no malformed JSON, just clean, typed data.
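In practice, the returned value behaves like any other Pydantic model: fields are typed, and instructor validates the response against the schema before handing it back. A short usage sketch (the arguments are the IPO example from earlier):
outline = generate_outline(
    instructor_client,
    "What is an IPO? Definition & Examples",
    markdown_content,
    "ipo",
    ["ipo meaning", "what is an ipo", "ipo definition"],
)
print(outline.overall_title)        # str
print(outline.content_subtitles)    # List[str], ready to iterate over
outline_as_dict = outline.dict()    # plain dict if you need to serialize it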
Two-Pass Outline Refinement
The first outline is creative; the second pass ensures accuracy. This is critical for GEO—AI search engines penalize hallucinations:
def improve_outline(client, target_title, markdown_content,
target_keyword, secondary_keywords, outline):
"""
Second pass: Verify accuracy and remove speculation.
This is crucial for GEO—AI search engines penalize hallucinations.
"""
prompt = f"""Review and improve this outline for "{target_title}".
Current outline: {outline}
CRITICAL verification requirements:
- Only include information verifiable from the markdown_content
- Double-check all facts and figures against sources
- If uncertain about information, omit it or use safer alternatives
- Avoid investment recommendations or predictions
- Cross-reference key points with multiple sources if possible
- Be especially cautious with numbers, dates, and company claims
It's better to have a shorter, fully verified outline than a longer
one with uncertain information.
"""
improved = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
response_model=OutlineModel,
temperature=0.5
)
    return improved
The improvement prompt emphasizes verification over creativity. For GEO, it's better to have a shorter, fully accurate article than a longer one with uncertain claims.
Stage 5: Section-by-Section Content Generation
Rather than generating the entire article at once, we generate each section sequentially, passing previous sections as context:
def generate_content_part(client, target_title, markdown_content,
target_keyword, secondary_keywords,
outline, outline_part, previous_content):
"""Generate one section of the article at a time"""
prompt = f"""Write the section "{outline_part}" for an article about "{target_title}".
Target keyword: {target_keyword}; secondary keywords: {secondary_keywords}
markdown_content (source articles): {markdown_content}
Full outline: {outline}
Previously written sections: {previous_content}
Guidelines:
1. Focus solely on this section—don't include the title
2. Incorporate relevant information from markdown_content
3. Stay truthful and factual—no false claims
4. Use industry terms but explain them for general audiences
5. Naturally integrate keywords for SEO
6. Aim for 300-350 words
7. Show expertise so Google ranks appropriately
8. Use Investopedia & NerdWallet tone
9. Everything must be factual—rely only on markdown_content
"""
body = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
temperature=0.5
)
return body.content[0].text
# Generate each section sequentially, building context
outline_parts = [outline.introduction_title,
*outline.content_subtitles,
outline.conclusion_title]
content_dict = {}
for outline_part in outline_parts:
content_dict[outline_part] = generate_content_part(
client, target_title, markdown_content,
target_keyword, secondary_keywords,
outline, outline_part, content_dict # Pass previous sections
    )
This approach has three benefits (a short assembly sketch follows the list):
- Coherence: Each section builds on previous content
- Control: We can adjust individual sections without regenerating everything
- Quality: Smaller generation tasks produce more focused output
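Once every section is generated, the pieces can be stitched into a single markdown draft. A minimal assembly sketch; the heading levels are an assumption, not the original publishing format:
def assemble_article(overall_title, content_dict):
    """Join the generated sections into one markdown document."""
    parts = [f"# {overall_title}"]
    for subtitle, section_text in content_dict.items():
        parts.append(f"## {subtitle}")
        parts.append(section_text.strip())
    return "\n\n".join(parts)

draft = assemble_article(outline.overall_title, content_dict)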
Stage 6: Meta Information & FAQ Generation
SEO meta tags and FAQ schema are generated with dedicated prompts:
class MetaSchema(BaseModel):
meta_title: str # "IPO: Definition & Examples"
meta_description: str # Max 160 chars for SERP snippet
meta_keywords: List[str] # 5-7 relevant keywords
def generate_meta(client, target_title, target_keyword,
secondary_keywords, json_content):
prompt = f"""Create meta information for an SEO article about "{target_title}".
Meta Title formats:
- "{target_keyword}: Definition & Examples"
- "What is {target_keyword}? Definition & Examples"
Meta Description:
- 2-3 sentences describing the content
- Maximum 160 characters including spaces
- Include main keyword, make it compelling to click
Meta Keywords:
- List 5-7 relevant keywords including main and secondary
"""
meta_info = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
response_model=MetaSchema,
temperature=0.2 # Low temp for consistency
)
    return meta_info
FAQ Generation from "People Also Ask"
The PAA box is a goldmine for FAQ content. We combine it with article content for comprehensive answers:
class FAQItem(BaseModel):
question: str
answer: str # 100-200 words each
class FAQSchema(BaseModel):
faqs: List[FAQItem]
def generate_faq(client, target_keyword, markdown_content,
paa_box, json_content):
"""
Generate FAQ section using:
1. Common questions from competitor content
2. Google's "People Also Ask" box (gold mine!)
3. Article content for comprehensive answers
"""
prompt = f"""Create FAQ section for "{target_keyword}".
markdown_content (source articles): {markdown_content}
paa_box ("People Also Ask"): {paa_box}
json_content (article sections): {json_content}
Create 4-8 FAQ items:
- 3-4 from common questions in markdown_content
- Use paa_box ("People Also Ask") for additional FAQs
- Reformulate to focus on target keyword
- Each answer: 100-200 words, comprehensive yet concise
FAQs should cover different aspects for well-rounded coverage.
"""
faq_schema = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}],
response_model=FAQSchema,
temperature=0.5
)
    return faq_schema
FAQ schema is particularly important for GEO: structured Q&A format is exactly what AI systems look for when generating answers.
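To expose those FAQs to crawlers, the generated items can be rendered as schema.org FAQPage markup. A minimal sketch of that conversion step (the publishing side is not part of the original pipeline):
import json

def faq_to_jsonld(faq_schema):
    """Render the generated FAQs as schema.org FAQPage structured data."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": item.question,
                "acceptedAnswer": {"@type": "Answer", "text": item.answer},
            }
            for item in faq_schema.faqs
        ],
    }, indent=2)

# faq_to_jsonld(generate_faq(...)) returns a JSON-LD string ready to embed in the page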
The Full Pipeline: 891 Articles
Putting it all together, we process all 891 topics in a single batch:
import pandas as pd
from tqdm import tqdm
# Load 891 topics from Excel
df = pd.read_excel("glossary_topics.xlsx", sheet_name="891 Topics")
glossary_topics = df[["suggested_title", "target_kw", "secondary_kws", "msv"]].to_dict(orient="records")
glossary_articles = []
for item in tqdm(glossary_topics):
# 1. SERP Research
serp_response = get_serp_response(item['suggested_title'], 1)
organic_results = serp_response['organic'][:10]
paa_box = serp_response.get("peopleAlsoAsk")
# 2. Prioritize authoritative sources
prioritized = prioritize_urls(organic_results, priority_domains)
relevant_articles = prioritized[:5]
# 3. Extract competitor content via Jina Reader
markdown_content, _ = load_markdown_from_urls([x['link'] for x in relevant_articles])
# 4. Generate & improve outline
outline = generate_outline(instructor_client, item['suggested_title'],
markdown_content, item['target_kw'], item['secondary_kws'])
outline = improve_outline(instructor_client, item['suggested_title'],
markdown_content, item['target_kw'], item['secondary_kws'], outline)
# 5. Generate content section by section
content_dict = {}
for part in [outline.introduction_title, *outline.content_subtitles, outline.conclusion_title]:
content_dict[part] = generate_content_part(client, item['suggested_title'],
markdown_content, item['target_kw'],
item['secondary_kws'], outline, part, content_dict)
# 6. Generate meta & FAQ
json_content = [{"title": k, "content": v.strip()} for k, v in content_dict.items()]
meta_info = generate_meta(instructor_client, item['suggested_title'],
item['target_kw'], item['secondary_kws'], json_content)
faq_content = generate_faq(instructor_client, item['target_kw'],
markdown_content, paa_box, json_content)
# 7. Package final article
glossary_articles.append({
"meta_info": meta_info.dict(),
"faq_content": [faq.dict() for faq in faq_content.faqs],
"content": json_content,
"sources": relevant_articles
})
# Save all 891 articles
with open("glossary_articles.json", "w") as f:
    json.dump(glossary_articles, f)
Each article includes meta information, 4-8 FAQ items, 6+ content sections, and source attribution. At ~1,500 words per article, that's 1.3 million words of content generated automatically.
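One caveat: the loop above only writes results at the very end, so a failure on topic 800 would lose everything before it. Periodic checkpointing is a simple way to harden a batch this size; the helper below is a sketch of a possible improvement, not something the original run included.
import json

CHECKPOINT_EVERY = 25  # arbitrary interval; tune to taste

def save_checkpoint(articles, path="glossary_articles.partial.json"):
    """Write everything generated so far, so a crashed run can resume."""
    with open(path, "w") as f:
        json.dump(articles, f)

# Inside the main loop, after appending each finished article:
#     if (i + 1) % CHECKPOINT_EVERY == 0:
#         save_checkpoint(glossary_articles)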
Trend Discovery with PyTrends
Static keyword lists aren't enough. We use Google Trends to surface rising keywords (topics gaining momentum that competitors haven't covered yet):
import time
import pandas as pd
from pytrends.request import TrendReq
from tqdm import tqdm
def run_trend_bot(keywords, timeframe, category):
"""
Surface rising keywords from Google Trends.
Essential for GEO: AI search engines favor fresh, trending content.
"""
pytrend = TrendReq(hl='en-US', tz=300, timeout=50)
df = pd.DataFrame(columns=['rising_keywords', 'percentage', 'tag'])
# Process keywords in chunks of 5 (API limit)
chunked_kws = [keywords[i:i+5] for i in range(0, len(keywords), 5)]
for chunk in tqdm(chunked_kws):
pytrend.build_payload(
kw_list=chunk,
geo='US',
timeframe=timeframe, # 'today 1-m', 'now 7-d', etc.
cat=category # 7 = Finance, 5 = Tech, etc.
)
related_queries = pytrend.related_queries()
for kw in chunk:
try:
rising = related_queries.get(kw).get("rising")
temp = rising[['query', 'value']].rename(
columns={'query': 'rising_keywords', 'value': 'percentage'}
)
temp['tag'] = kw # Track which seed keyword triggered this
df = pd.concat([df, temp])
            except (AttributeError, TypeError, KeyError):
                # No rising queries returned for this keyword
                pass
time.sleep(5) # Rate limiting
return df.dropna(subset=['rising_keywords'])
# Example: Find rising finance keywords
seed_keywords = ["ipo", "stock market", "investing", "cryptocurrency"]
rising_keywords = run_trend_bot(
keywords=seed_keywords,
timeframe='today 1-m', # Last month
category=7 # Finance
)
# Output: Keywords with 200%+ growth = content opportunities
print(rising_keywords[rising_keywords['percentage'] > 200])
Keywords with 200%+ growth represent content opportunities. For GEO, being first to publish authoritative content on emerging topics is a massive advantage: AI systems often cite the first comprehensive source they find.
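Rising keywords only pay off if they flow back into the topic list. A minimal sketch of that hand-off; the msv placeholder of 0 is an assumption, since search-volume data usually lags brand-new queries:
def rising_to_topics(rising_df, min_growth=200):
    """Convert high-growth rising queries into new glossary topic entries."""
    hot = rising_df[rising_df['percentage'] > min_growth]
    return [
        {
            "suggested_title": f"What is {row['rising_keywords']}?",
            "target_kw": row['rising_keywords'],
            "msv": 0,                        # unknown for brand-new queries
            "secondary_kws": [row['tag']],   # seed keyword as a related term
        }
        for _, row in hot.iterrows()
    ]

new_topics = rising_to_topics(rising_keywords)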
Quality Control: De-duplication
At scale, duplicate detection prevents wasted effort:
# De-duplicate articles by title
seen_titles = set()
unique_glossary = []
for article in glossary_articles:
    title = article['meta_info']['meta_title']
    if title not in seen_titles:
        seen_titles.add(title)
        unique_glossary.append(article)
print(f"Reduced from {len(glossary_articles)} to {len(unique_glossary)} unique articles")
glossary_articles = unique_glossary
GEO: The Future of Content Optimization
As AI search engines gain market share, the rules are changing:
# GEO-specific optimizations
def optimize_for_geo(content):
"""
Optimize content for AI search engines (Perplexity, ChatGPT, etc.)
"""
optimizations = {
# 1. Structured data helps AI parse content
"add_schema_markup": True,
# 2. Clear, factual statements AI can cite
"use_definitive_statements": True, # "X is Y" not "X might be Y"
# 3. Source attribution for credibility
"cite_sources_inline": True,
# 4. FAQ section (AI loves structured Q&A)
"include_faq": True,
# 5. Avoid speculation (AI penalizes uncertainty)
"remove_hedging_language": True,
# 6. Statistics and data points (AI cites these)
"include_specific_numbers": True,
}
return optimizations
# Key difference: GEO prioritizes being cited by AI, not just ranking
# Traditional SEO: Optimize for click-through rate
# GEO: Optimize for AI citation and answer extraction
| Traditional SEO | GEO (Generative Engine Optimization) |
|---|---|
| Optimize for click-through rate | Optimize for AI citation |
| Keyword density matters | Factual accuracy matters |
| Backlinks drive authority | Source reliability drives authority |
| Featured snippets win | Structured data wins |
| User engagement metrics | Citation frequency by AI |
Results & Learnings
Key metrics from the pipeline:
- 891 articles generated in a single batch
- 1.3M+ words of content
- ~$200 in Claude API costs (Sonnet 3.5)
- 5-7 sources researched per article
- 4-8 FAQs generated per article from PAA data
What Worked Well
- Pydantic + Instructor: Structured outputs eliminated parsing errors
- Two-pass outline: Creative first pass + verification second pass
- Priority domain filtering: Better source material = better output
- PAA integration: FAQ sections directly answer user questions
Challenges
- Rate limiting: Serper, Jina, and Claude all have limits (see the retry sketch after this list)
- Context windows: Some competitor pages exceed 40K chars
- Hallucination risk: Requires careful prompt engineering
- Topic overlap: De-duplication necessary at scale
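A generic retry wrapper with exponential backoff covers most of the rate-limit pain across all three APIs. A minimal sketch; the delays are arbitrary, and in practice you would narrow the exception handling to each provider's rate-limit errors:
import random
import time

def with_retries(fn, max_attempts=5, base_delay=2.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # narrow to each API's rate-limit errors in practice
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())

# Example: wrap a SERP call
serp_response = with_retries(lambda: get_serp_response("What is an IPO?", 1))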
Conclusion: The New Content Playbook
The combination of LLMs, structured outputs, and SERP research enables content creation at unprecedented scale. But the real insight is preparing for GEO:
- Prioritize accuracy over creativity—AI penalizes hallucinations
- Structure content for extraction—FAQ schema, clear definitions
- Cite authoritative sources—AI trusts content that trusts good sources
- Cover the long tail—AI needs authoritative sources on specific topics
- Stay fresh—trend monitoring catches opportunities before competitors
The future of content isn't just about ranking—it's about being the source AI systems trust and cite. Build your content pipeline with that future in mind.