The Content Scaling Problem
Creating high-quality SEO content at scale has always been a bottleneck. Manual content creation can't keep up with the demand for topical coverage, and traditional automation produces low-quality spam. With LLMs, we can build something different: a pipeline that generates 891 unique, well-researched articles automatically while maintaining quality comparable to human-written work.
More importantly, as AI-powered search engines (Perplexity, ChatGPT Search, Google AI Overviews) gain market share, we need to think beyond traditional SEO. Enter GEO (Generative Engine Optimization)—optimizing content to be cited by AI systems, not just ranked by traditional search algorithms.
SEO optimizes for Google's PageRank: backlinks, keywords, click-through rates.
GEO optimizes for AI citation: factual accuracy, structured data, clear definitions.
System Architecture
The content generation pipeline consists of six stages:
- Topic Research: Load target keywords with search volume data
- SERP Analysis: Fetch competitor content and "People Also Ask" boxes
- Content Extraction: Convert competitor pages to LLM-friendly markdown
- Outline Generation: Create and refine article structure using Claude
- Content Generation: Generate each section with keyword integration
- Meta & FAQ: Generate SEO metadata and FAQ schema
Stage 1: Topic Research at Scale
We start with a curated list of target keywords, each with monthly search volume (MSV) data to prioritize high-impact topics:
glossary_topics = [
{
"suggested_title": "What is an IPO? Definition & Examples",
"target_kw": "ipo",
"msv": 44000, # Monthly search volume
"secondary_kws": ["ipo meaning", "what is an ipo", "ipo definition"]
},
{
"suggested_title": "What is Generative AI (ChatGPT, Claude, Dall-E)?",
"target_kw": "what is generative ai",
"msv": 4500,
"secondary_kws": ["ai vs machine learning", "generative ai models"]
},
# ... 889 more topics loaded from Excel
]
For this project, we loaded 891 glossary topics from an Excel sheet, covering everything from "What is an IPO?" (44K monthly searches) to niche terms like "What is a reverse IPO?" (100 searches). The long tail matters for GEO: AI systems cite authoritative sources on specific topics.
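Loading and ordering this list is a small pandas step; sorting by MSV up front lets the batch work through the highest-impact topics first. A minimal sketch, assuming the spreadsheet stores secondary keywords as a comma-separated string (the exact sheet layout is an assumption):
import pandas as pd

# Load the topic sheet and sort by monthly search volume, highest first
df = pd.read_excel("glossary_topics.xlsx", sheet_name="891 Topics")
df["secondary_kws"] = df["secondary_kws"].fillna("").apply(
    lambda s: [kw.strip() for kw in str(s).split(",") if kw.strip()]
)
glossary_topics = (
    df.sort_values("msv", ascending=False)[
        ["suggested_title", "target_kw", "secondary_kws", "msv"]
    ].to_dict(orient="records")
)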
Stage 2: SERP Analysis & Competitor Research
Before generating content, we need to understand what's currently ranking. The Serper API gives us structured Google search results including the golden "People Also Ask" box:
import json
import os
import requests
def get_serp_response(serp_query, page=1):
"""Fetch Google search results via Serper API"""
url = "https://google.serper.dev/search"
payload = json.dumps({
"q": serp_query,
"autocorrect": False,
"location": "California, United States",
"num": 100,
"page": page
})
headers = {
"X-API-KEY": os.environ["SERPER_API_KEY"],
"Content-Type": "application/json",
}
response = requests.post(url, headers=headers, data=payload)
return response.json()
# Get SERP results including "People Also Ask" box
serp_response = get_serp_response("What is an IPO?", 1)
answer_box = serp_response.get('answerBox')
organic_results = serp_response['organic'][:10]
paa_box = serp_response.get("peopleAlsoAsk") # Gold for FAQ generation
The peopleAlsoAsk data is invaluable: these are the exact questions users are asking, perfect for FAQ generation and featured snippet optimization.
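If you want to inspect or log just the questions, a small helper like the one below works; it assumes each peopleAlsoAsk entry carries "question" and "snippet" fields, which is what Serper typically returns (worth verifying against your own responses). The pipeline later passes the raw paa_box straight into the FAQ prompt.
def extract_paa_questions(paa_box):
    """Collect question/snippet pairs from Serper's 'peopleAlsoAsk' list."""
    if not paa_box:
        return []
    return [
        {"question": item.get("question", ""), "snippet": item.get("snippet", "")}
        for item in paa_box
    ]

paa_questions = extract_paa_questions(paa_box)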
Prioritizing Authoritative Sources
Not all search results are equal. We prioritize high-authority financial sources so that the context we feed the model is reliable:
# Prioritize authoritative sources for the generation context
priority_domains = [
"investopedia.com",
"forbes.com/advisor/",
"nerdwallet.com",
"fool.com",
"bankrate.com",
"wallethub.com",
"valuepenguin.com",
"lendingtree.com"
]
def prioritize_urls(results, priority_domains):
"""Reorder search results to prioritize authoritative domains"""
priority_results = []
other_results = []
for result in results:
if any(domain in result['link'] for domain in priority_domains):
priority_results.append(result)
else:
other_results.append(result)
return priority_results + other_results
# Prioritize authoritative content
prioritized_results = prioritize_urls(organic_results, priority_domains)
relevant_articles = prioritized_results[:5]  # Top 5 sources
This is crucial for GEO: AI search engines weight source authority heavily. Grounding the model in Investopedia-quality content produces Investopedia-quality output.
Stage 3: Content Extraction with Jina Reader
Raw HTML is messy. Jina Reader converts any URL to clean, LLM-friendly markdown, stripping navigation, ads, and boilerplate:
def load_markdown_from_urls(urls: list, jina_prefix="https://r.jina.ai/") -> tuple:
"""
Convert web pages to clean markdown using Jina Reader.
Returns concatenated content and list of failed URLs.
"""
result = ''
urls_cannot_be_loaded = []
for url in urls:
try:
# Jina Reader converts any URL to LLM-friendly markdown
            response = requests.get(jina_prefix + url, timeout=60)
            response.raise_for_status()
            # Truncate each page to avoid context limits
            result += response.text.strip()[:40000] + '\n'
        except requests.RequestException:
            urls_cannot_be_loaded.append(url)
return result, urls_cannot_be_loaded
# Extract competitor content as training context
markdown_content, failed_urls = load_markdown_from_urls(
[x['link'] for x in relevant_articles]
)
We truncate at 40K characters per page to stay within context limits while capturing the essential content. This markdown becomes the "source material" for generation.
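If you prefer to think in tokens rather than characters, a common rough heuristic is about four characters per token for English text. The helper below is illustrative; the budget figure is an assumption, not a number from the original pipeline.
def truncate_to_token_budget(text, max_tokens=10000, chars_per_token=4):
    """Rough truncation using the ~4 characters-per-token heuristic."""
    return text[: max_tokens * chars_per_token]

page_excerpt = truncate_to_token_budget(markdown_content)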
Stage 4: Structured Outline Generation
Here's where the magic happens. Using Instructor + Pydantic, we get structured outputs from Claude instead of raw text:
from pydantic import BaseModel
from typing import List
import anthropic
import instructor
# Initialize Claude with structured outputs
client = anthropic.Anthropic()
instructor_client = instructor.from_anthropic(client)
class OutlineModel(BaseModel):
"""Structured outline for SEO article"""
overall_title: str
introduction_title: str
content_subtitles: List[str]
conclusion_title: str
def generate_outline(client, target_title, markdown_content, target_keyword, secondary_keywords):
prompt = f"""You are creating an outline for an SEO article about "{target_title}".
Content purpose: Educate readers with a clear definition, in-depth explanation,
and implications for businesses, investors, markets and the economy.
Target keyword: {target_keyword}
Secondary keywords: {secondary_keywords}
Competitor content (markdown): {markdown_content}
Create an outline with:
a. Overall Title focused on "{target_title}"
b. Introduction with definition: "{target_keyword} Definition"
c. 4+ subtitles based on keywords and competitor content
d. Conclusion with example: "{target_keyword} Example"
Use a tone mixing Investopedia & NerdWallet.
"""
outline = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
response_model=OutlineModel, # Pydantic validation!
temperature=0.5
)
return outline
The response_model=OutlineModel parameter forces Claude to return a valid Pydantic object. No parsing errors, no malformed JSON, just clean, typed data.
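In practice, the returned value behaves like any other Pydantic model: fields are typed, and instructor validates the response against the schema before handing it back. A short usage sketch (the arguments are the IPO example from earlier):
outline = generate_outline(
    instructor_client,
    "What is an IPO? Definition & Examples",
    markdown_content,
    "ipo",
    ["ipo meaning", "what is an ipo", "ipo definition"],
)
print(outline.overall_title)        # str
print(outline.content_subtitles)    # List[str], ready to iterate over
outline_as_dict = outline.dict()    # plain dict if you need to serialize it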
Two-Pass Outline Refinement
The first outline is creative; the second pass ensures accuracy. This is critical for GEO—AI search engines penalize hallucinations:
def improve_outline(client, target_title, markdown_content,
target_keyword, secondary_keywords, outline):
"""
Second pass: Verify accuracy and remove speculation.
This is crucial for GEO—AI search engines penalize hallucinations.
"""
prompt = f"""Review and improve this outline for "{target_title}".
Current outline: {outline}
CRITICAL verification requirements:
- Only include information verifiable from the markdown_content
- Double-check all facts and figures against sources
- If uncertain about information, omit it or use safer alternatives
- Avoid investment recommendations or predictions
- Cross-reference key points with multiple sources if possible
- Be especially cautious with numbers, dates, and company claims
It's better to have a shorter, fully verified outline than a longer
one with uncertain information.
"""
improved = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
response_model=OutlineModel,
temperature=0.5
)
    return improved
The improvement prompt emphasizes verification over creativity. For GEO, it's better to have a shorter, fully accurate article than a longer one with uncertain claims.
Stage 5: Section-by-Section Content Generation
Rather than generating the entire article at once, we generate each section sequentially, passing previous sections as context:
def generate_content_part(client, target_title, markdown_content,
target_keyword, secondary_keywords,
outline, outline_part, previous_content):
"""Generate one section of the article at a time"""
prompt = f"""Write the section "{outline_part}" for an article about "{target_title}".
Target keyword: {target_keyword}; secondary keywords: {secondary_keywords}
markdown_content (source articles): {markdown_content}
Full outline: {outline}
Previously written sections: {previous_content}
Guidelines:
1. Focus solely on this section—don't include the title
2. Incorporate relevant information from markdown_content
3. Stay truthful and factual—no false claims
4. Use industry terms but explain them for general audiences
5. Naturally integrate keywords for SEO
6. Aim for 300-350 words
7. Show expertise so Google ranks appropriately
8. Use Investopedia & NerdWallet tone
9. Everything must be factual—rely only on markdown_content
"""
body = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
temperature=0.5
)
return body.content[0].text
# Generate each section sequentially, building context
outline_parts = [outline.introduction_title,
*outline.content_subtitles,
outline.conclusion_title]
content_dict = {}
for outline_part in outline_parts:
content_dict[outline_part] = generate_content_part(
client, target_title, markdown_content,
target_keyword, secondary_keywords,
outline, outline_part, content_dict # Pass previous sections
    )
This approach has three benefits (a short assembly sketch follows the list):
- Coherence: Each section builds on previous content
- Control: We can adjust individual sections without regenerating everything
- Quality: Smaller generation tasks produce more focused output
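Once every section is generated, the pieces can be stitched into a single markdown draft. A minimal assembly sketch; the heading levels are an assumption, not the original publishing format:
def assemble_article(overall_title, content_dict):
    """Join the generated sections into one markdown document."""
    parts = [f"# {overall_title}"]
    for subtitle, section_text in content_dict.items():
        parts.append(f"## {subtitle}")
        parts.append(section_text.strip())
    return "\n\n".join(parts)

draft = assemble_article(outline.overall_title, content_dict)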
Stage 6: Meta Information & FAQ Generation
SEO meta tags and FAQ schema are generated with dedicated prompts:
class MetaSchema(BaseModel):
meta_title: str # "IPO: Definition & Examples"
meta_description: str # Max 160 chars for SERP snippet
meta_keywords: List[str] # 5-7 relevant keywords
def generate_meta(client, target_title, target_keyword,
secondary_keywords, json_content):
prompt = f"""Create meta information for an SEO article about "{target_title}".
Meta Title formats:
- "{target_keyword}: Definition & Examples"
- "What is {target_keyword}? Definition & Examples"
Meta Description:
- 2-3 sentences describing the content
- Maximum 160 characters including spaces
- Include main keyword, make it compelling to click
Meta Keywords:
- List 5-7 relevant keywords including main and secondary
"""
meta_info = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
response_model=MetaSchema,
temperature=0.2 # Low temp for consistency
)
    return meta_info
FAQ Generation from "People Also Ask"
The PAA box is a goldmine for FAQ content. We combine it with article content for comprehensive answers:
class FAQItem(BaseModel):
question: str
answer: str # 100-200 words each
class FAQSchema(BaseModel):
faqs: List[FAQItem]
def generate_faq(client, target_keyword, markdown_content,
paa_box, json_content):
"""
Generate FAQ section using:
1. Common questions from competitor content
2. Google's "People Also Ask" box (gold mine!)
3. Article content for comprehensive answers
"""
prompt = f"""Create FAQ section for "{target_keyword}".
markdown_content (source articles): {markdown_content}
paa_box ("People Also Ask"): {paa_box}
json_content (article sections): {json_content}
Create 4-8 FAQ items:
- 3-4 from common questions in markdown_content
- Use paa_box ("People Also Ask") for additional FAQs
- Reformulate to focus on target keyword
- Each answer: 100-200 words, comprehensive yet concise
FAQs should cover different aspects for well-rounded coverage.
"""
faq_schema = client.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=4096,
messages=[{"role": "user", "content": prompt}],
response_model=FAQSchema,
temperature=0.5
)
    return faq_schema
FAQ schema is particularly important for GEO: structured Q&A format is exactly what AI systems look for when generating answers.
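To expose those FAQs to crawlers, the generated items can be rendered as schema.org FAQPage markup. A minimal sketch of that conversion step (the publishing side is not part of the original pipeline):
import json

def faq_to_jsonld(faq_schema):
    """Render the generated FAQs as schema.org FAQPage structured data."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": item.question,
                "acceptedAnswer": {"@type": "Answer", "text": item.answer},
            }
            for item in faq_schema.faqs
        ],
    }, indent=2)

# faq_to_jsonld(generate_faq(...)) returns a JSON-LD string ready to embed in the page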
The Full Pipeline: 891 Articles
Putting it all together, we process all 891 topics in a single batch:
import pandas as pd
from tqdm import tqdm
# Load 891 topics from Excel
df = pd.read_excel("glossary_topics.xlsx", sheet_name="891 Topics")
glossary_topics = df[["suggested_title", "target_kw", "secondary_kws", "msv"]].to_dict(orient="records")
glossary_articles = []
for item in tqdm(glossary_topics):
# 1. SERP Research
serp_response = get_serp_response(item['suggested_title'], 1)
organic_results = serp_response['organic'][:10]
paa_box = serp_response.get("peopleAlsoAsk")
# 2. Prioritize authoritative sources
prioritized = prioritize_urls(organic_results, priority_domains)
relevant_articles = prioritized[:5]
# 3. Extract competitor content via Jina Reader
markdown_content, _ = load_markdown_from_urls([x['link'] for x in relevant_articles])
# 4. Generate & improve outline
outline = generate_outline(instructor_client, item['suggested_title'],
markdown_content, item['target_kw'], item['secondary_kws'])
outline = improve_outline(instructor_client, item['suggested_title'],
markdown_content, item['target_kw'], item['secondary_kws'], outline)
# 5. Generate content section by section
content_dict = {}
for part in [outline.introduction_title, *outline.content_subtitles, outline.conclusion_title]:
content_dict[part] = generate_content_part(client, item['suggested_title'],
markdown_content, item['target_kw'],
item['secondary_kws'], outline, part, content_dict)
# 6. Generate meta & FAQ
json_content = [{"title": k, "content": v.strip()} for k, v in content_dict.items()]
meta_info = generate_meta(instructor_client, item['suggested_title'],
item['target_kw'], item['secondary_kws'], json_content)
faq_content = generate_faq(instructor_client, item['target_kw'],
markdown_content, paa_box, json_content)
# 7. Package final article
glossary_articles.append({
"meta_info": meta_info.dict(),
"faq_content": [faq.dict() for faq in faq_content.faqs],
"content": json_content,
"sources": relevant_articles
})
# Save all 891 articles
with open("glossary_articles.json", "w") as f:
    json.dump(glossary_articles, f)
Each article includes meta information, 4-8 FAQ items, 6+ content sections, and source attribution. At ~1,500 words per article, that's 1.3 million words of content generated automatically.
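One caveat: the loop above only writes results at the very end, so a failure on topic 800 would lose everything before it. Periodic checkpointing is a simple way to harden a batch this size; the helper below is a sketch of a possible improvement, not something the original run included.
import json

CHECKPOINT_EVERY = 25  # arbitrary interval; tune to taste

def save_checkpoint(articles, path="glossary_articles.partial.json"):
    """Write everything generated so far, so a crashed run can resume."""
    with open(path, "w") as f:
        json.dump(articles, f)

# Inside the main loop, after appending each finished article:
#     if (i + 1) % CHECKPOINT_EVERY == 0:
#         save_checkpoint(glossary_articles)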
Trend Discovery with PyTrends
Static keyword lists aren't enough. We use Google Trends to surface rising keywords (topics gaining momentum that competitors haven't covered yet):
import time
import pandas as pd
from pytrends.request import TrendReq
from tqdm import tqdm
def run_trend_bot(keywords, timeframe, category):
"""
Surface rising keywords from Google Trends.
Essential for GEO: AI search engines favor fresh, trending content.
"""
pytrend = TrendReq(hl='en-US', tz=300, timeout=50)
df = pd.DataFrame(columns=['rising_keywords', 'percentage', 'tag'])
# Process keywords in chunks of 5 (API limit)
chunked_kws = [keywords[i:i+5] for i in range(0, len(keywords), 5)]
for chunk in tqdm(chunked_kws):
pytrend.build_payload(
kw_list=chunk,
geo='US',
timeframe=timeframe, # 'today 1-m', 'now 7-d', etc.
cat=category # 7 = Finance, 5 = Tech, etc.
)
related_queries = pytrend.related_queries()
for kw in chunk:
try:
rising = related_queries.get(kw).get("rising")
temp = rising[['query', 'value']].rename(
columns={'query': 'rising_keywords', 'value': 'percentage'}
)
temp['tag'] = kw # Track which seed keyword triggered this
df = pd.concat([df, temp])
            except (AttributeError, TypeError, KeyError):
                # No rising queries returned for this keyword
                pass
time.sleep(5) # Rate limiting
return df.dropna(subset=['rising_keywords'])
# Example: Find rising finance keywords
seed_keywords = ["ipo", "stock market", "investing", "cryptocurrency"]
rising_keywords = run_trend_bot(
keywords=seed_keywords,
timeframe='today 1-m', # Last month
category=7 # Finance
)
# Output: Keywords with 200%+ growth = content opportunities
print(rising_keywords[rising_keywords['percentage'] > 200])
Keywords with 200%+ growth represent content opportunities. For GEO, being first to publish authoritative content on emerging topics is a massive advantage: AI systems often cite the first comprehensive source they find.
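Rising keywords only pay off if they flow back into the topic list. A minimal sketch of that hand-off; the msv placeholder of 0 is an assumption, since search-volume data usually lags brand-new queries:
def rising_to_topics(rising_df, min_growth=200):
    """Convert high-growth rising queries into new glossary topic entries."""
    hot = rising_df[rising_df['percentage'] > min_growth]
    return [
        {
            "suggested_title": f"What is {row['rising_keywords']}?",
            "target_kw": row['rising_keywords'],
            "msv": 0,                        # unknown for brand-new queries
            "secondary_kws": [row['tag']],   # seed keyword as a related term
        }
        for _, row in hot.iterrows()
    ]

new_topics = rising_to_topics(rising_keywords)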
Quality Control: De-duplication
At scale, duplicate detection prevents wasted effort:
# De-duplicate articles by title
seen_titles = set()
unique_glossary = []
for article in glossary_articles:
    title = article['meta_info']['meta_title']
    if title not in seen_titles:
        seen_titles.add(title)
        unique_glossary.append(article)
print(f"Reduced from {len(glossary_articles)} to {len(unique_glossary)} unique articles")
glossary_articles = unique_glossary
GEO: The Future of Content Optimization
As AI search engines gain market share, the rules are changing:
# GEO-specific optimizations
def optimize_for_geo(content):
"""
Optimize content for AI search engines (Perplexity, ChatGPT, etc.)
"""
optimizations = {
# 1. Structured data helps AI parse content
"add_schema_markup": True,
# 2. Clear, factual statements AI can cite
"use_definitive_statements": True, # "X is Y" not "X might be Y"
# 3. Source attribution for credibility
"cite_sources_inline": True,
# 4. FAQ section (AI loves structured Q&A)
"include_faq": True,
# 5. Avoid speculation (AI penalizes uncertainty)
"remove_hedging_language": True,
# 6. Statistics and data points (AI cites these)
"include_specific_numbers": True,
}
return optimizations
# Key difference: GEO prioritizes being cited by AI, not just ranking
# Traditional SEO: Optimize for click-through rate
# GEO: Optimize for AI citation and answer extraction
| Traditional SEO | GEO (Generative Engine Optimization) |
|---|---|
| Optimize for click-through rate | Optimize for AI citation |
| Keyword density matters | Factual accuracy matters |
| Backlinks drive authority | Source reliability drives authority |
| Featured snippets win | Structured data wins |
| User engagement metrics | Citation frequency by AI |
Results & Learnings
Key metrics from the pipeline:
- 891 articles generated in a single batch
- 1.3M+ words of content
- ~$200 in Claude API costs (Sonnet 3.5)
- 5-7 sources researched per article
- 4-8 FAQs generated per article from PAA data
What Worked Well
- Pydantic + Instructor: Structured outputs eliminated parsing errors
- Two-pass outline: Creative first pass + verification second pass
- Priority domain filtering: Better source material = better output
- PAA integration: FAQ sections directly answer user questions
Challenges
- Rate limiting: Serper, Jina, and Claude all have limits (see the retry sketch after this list)
- Context windows: Some competitor pages exceed 40K chars
- Hallucination risk: Requires careful prompt engineering
- Topic overlap: De-duplication necessary at scale
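A generic retry wrapper with exponential backoff covers most of the rate-limit pain across all three APIs. A minimal sketch; the delays are arbitrary, and in practice you would narrow the exception handling to each provider's rate-limit errors:
import random
import time

def with_retries(fn, max_attempts=5, base_delay=2.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # narrow to each API's rate-limit errors in practice
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())

# Example: wrap a SERP call
serp_response = with_retries(lambda: get_serp_response("What is an IPO?", 1))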
Conclusion: The New Content Playbook
The combination of LLMs, structured outputs, and SERP research enables content creation at unprecedented scale. But the real insight is preparing for GEO:
- Prioritize accuracy over creativity—AI penalizes hallucinations
- Structure content for extraction—FAQ schema, clear definitions
- Cite authoritative sources—AI trusts content that trusts good sources
- Cover the long tail—AI needs authoritative sources on specific topics
- Stay fresh—trend monitoring catches opportunities before competitors
The future of content isn't just about ranking—it's about being the source AI systems trust and cite. Build your content pipeline with that future in mind.