
Building an AI LinkedIn Sourcing Agent (Full version)


The Challenge: Round two with an old nemesis

Recruiting is broken. Finding the right candidates is like searching for needles in a haystack, and when you do find them, your generic LinkedIn message gets lost in their inbox with 50 others.

For Synapse’s AI hackathon, the challenge was to “Build a LinkedIn Sourcing Agent that finds profiles, scores candidates using AI, and generates personalized outreach messages.”

Two years ago, I tried building exactly this - a LinkedIn scraper combined with a job recommendation engine. I was going to scrape LinkedIn profiles, match them to job requirements, and build something great. LinkedIn’s anti-scraping measures crushed that dream within days.

Back then, I had less technical knowledge, but even with perfect execution it wouldn’t have worked. LinkedIn’s defenses are just too aggressive. I pivoted to scraping job postings instead (Indeed, YCombinator) and flipped the problem - instead of finding candidates for jobs, I matched jobs to user resumes using Airflow for batch processing.

That experience taught me LinkedIn scraping is essentially impossible without using paid solutions. So when this hackathon challenge came up, I was ready for round two.


What I Built

Instead of another keyword-matching tool, I built something that tries to think like a recruiter:

Job Description → Smart Search → Profile Scraping → AI Scoring → Personalized Messages

Each stage of that pipeline is a separate component; the architecture section below breaks them down.

You can check out the full code here.

The Technical Architecture

I used FastAPI for the backend with async processing throughout. The data flow looks like this:

  1. Search Query Generation: Transform job descriptions into effective search queries
  2. Profile Discovery: SerpAPI to find candidate profiles - LinkedIn and GitHub URLs
  3. Data Fetch: RapidAPI’s LinkedIn service (because scraping LinkedIn directly is a nightmare), plain HTTP calls for GitHub
  4. Data Extraction: Custom parsing logic (BeautifulSoup for GitHub)
  5. Intelligent Scoring: 6-factor algorithm with confidence levels
  6. Message Generation: Llama-powered personalized outreach
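
To make steps 1 and 2 concrete, here is a minimal sketch of query generation and profile discovery. It assumes SerpAPI’s Python SDK (the google-search-results package) and a SERPAPI_KEY environment variable; the query template and function names are illustrative, not the repo’s actual code.

import os
from serpapi import GoogleSearch

def build_query(role: str, skills: list[str], location: str) -> str:
    # X-ray search: restrict Google results to LinkedIn profile pages
    return f'site:linkedin.com/in "{role}" {" ".join(skills)} "{location}"'

def discover_profiles(query: str, limit: int = 10) -> list[str]:
    search = GoogleSearch({"q": query, "num": limit, "api_key": os.environ["SERPAPI_KEY"]})
    results = search.get_dict().get("organic_results", [])
    return [r["link"] for r in results if "linkedin.com/in/" in r.get("link", "")]

# Example: discover_profiles(build_query("ML Research Engineer", ["transformers", "LLM"], "Mountain View"))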

[Figure: data flow diagram]

[Figure: app architecture]

The Scoring Algorithm

| Factor | Weight | What It Measures |
| --- | --- | --- |
| Education | 20% | Elite schools get higher scores |
| Career Trajectory | 20% | Clear progression vs. lateral moves |
| Company Relevance | 15% | Relevant industry experience |
| Skill Match | 25% | How well skills align with job requirements |
| Location | 10% | Geographic fit for the role |
| Tenure | 10% | Stability vs. job hopping patterns |

Since it’s using LLMs, it understands context. An engineer who moved from startup → Google → senior role gets a higher trajectory score than someone who stayed at the same level for years.
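
For reference, the aggregation itself is just a weighted sum over the per-factor scores the LLM returns. A minimal sketch, assuming 0-10 scores per factor and the weights from the table above:

# Weights mirror the scoring table; per-factor scores (0-10) come from the LLM.
WEIGHTS = {
    "education": 0.20,
    "trajectory": 0.20,
    "company": 0.15,
    "skills": 0.25,
    "location": 0.10,
    "tenure": 0.10,
}

def fit_score(breakdown: dict[str, float]) -> float:
    return round(sum(WEIGHTS[factor] * breakdown[factor] for factor in WEIGHTS), 1)

# e.g. fit_score({"education": 9.2, "trajectory": 8.5, "company": 9.0,
#                 "skills": 9.1, "location": 10.0, "tenure": 7.8})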

Smart Outreach Generation

Generic LinkedIn messages get ignored. My solution uses Llama (via Groq) to create personalized messages that reference a candidate’s specific work, publications, and career moves rather than boilerplate.

Example output: “Hi John, I noticed your work at OpenAI on transformer architectures and your ICML 2023 paper on attention mechanisms. Your blend of research and production ML experience is exactly what Windsurf needs for their ML Research Engineer role…”
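
The generation step is a single chat-completion call to Groq. A rough sketch, assuming the official groq Python SDK; the model name and prompt wording are placeholders rather than the exact ones in the repo:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def generate_outreach(candidate_summary: str, job_description: str) -> str:
    response = client.chat.completions.create(
        model="llama3-70b-8192",  # any Llama model hosted on Groq
        messages=[
            {"role": "system", "content": "You write short, specific LinkedIn outreach messages for recruiters."},
            {"role": "user", "content": (
                f"Job:\n{job_description}\n\nCandidate:\n{candidate_summary}\n\n"
                "Write a 3-4 sentence personalized message referencing concrete achievements."
            )},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content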

Sample Results

Testing with the Windsurf ML Research Engineer role:

{
  "name": "John Doe",
  "fit_score": 8.7,
  "confidence": 0.91,
  "score_breakdown": {
    "education": 9.2,    // Stanford PhD in ML
    "trajectory": 8.5,   // Research → Engineering → Lead
    "company": 9.0,      // Google, OpenAI experience
    "skills": 9.1,       // Perfect LLM/transformer match
    "location": 10.0,    // Mountain View based
    "tenure": 7.8        // Healthy 2-3 year progression
  },
  "outreach_message": "Hi John, I came across your transformer optimization work at Google Research, particularly your ICML paper on efficient attention mechanisms. Your move from research to production ML at OpenAI shows the exact blend we need at Windsurf..."
}

Why this works for this JD: an elite ML education (Stanford PhD), a clear research-to-engineering trajectory, directly relevant experience at Google and OpenAI, a near-perfect LLM/transformer skill match, and a Mountain View location.


Key Technical Decisions

Why These Choices Mattered

Llama via Groq instead of OpenAI: Faster, cheaper, and surprisingly good at personalized messaging

RapidAPI for LinkedIn data: More reliable than web scraping, cleaner data extraction

Async processing with FastAPI: Can handle multiple jobs in parallel without blocking

MongoDB for storage: Perfect for flexible candidate profiles and easy scaling

Smart caching: Avoids re-fetching the same profiles, reducing overhead and cost

What I Learned

1. Focus on the Algorithm, Not the Data Collection

Anyone can scrape LinkedIn*. The value is in smart scoring that understands candidate quality beyond keywords.

Scraping LinkedIn is really difficult in practice, so everyone uses APIs from services like RapidAPI or Bright Data.

2. Personalization Actually Works

Generic outreach gets low response rates. AI-generated messages that reference specific achievements can convert far more of those leads.

As a fallback, we always have template messages.

3. Production Thinking From Day 1

Built with FastAPI, async processing, proper error handling, and caching. This is designed to scale easily.

4. Multi-Source Data is Key

Combining LinkedIn + GitHub profiles gives much richer candidate insights than either alone.

Scaling Strategy

For production use (100s of jobs daily):

  1. Async Processing: Already built with asyncio for parallel job handling; multiprocessing is worth exploring as well
  2. Queue System: A Redis/Celery integration template is implemented; the actual integration remains
  3. Database: MongoDB for caching profiles and storing results
  4. Rate Limiting: Smart backoff with API key rotation
  5. Observability: Comprehensive logging for performance tracking (more sophisticated tooling can come later)
  6. Comprehensive Testing: Including load testing, end-to-end tests, and more

The Real Challenges (And Why They Matter)

  1. LinkedIn’s War Against Scraping (Round Two)

LinkedIn really, really doesn’t want you scraping their data. Having learned this lesson the hard way two years ago, I didn’t even attempt direct scraping this time. My previous attempt involved rotating user agents, proxy pools, CAPTCHA solving - all of it failed within days.

This time I went straight to RapidAPI’s LinkedIn service. More expensive per request ($0.01 per profile), but infinitely more reliable than fighting LinkedIn’s ever-evolving bot detection. My 2022 self would have spent weeks trying to outsmart their defenses. My 2024 self just paid for the API.

Lesson learned: Sometimes the expensive solution is actually the cheap one when you factor in development time.

  2. LLM Consistency is a Myth

Groq’s Llama model was supposed to return structured JSON for scoring. In practice? It worked maybe 70% of the time. The other 30% I’d get beautifully written prose instead of the JSON structure I needed.

What I learned: Always have fallback parsing. I ended up writing regex patterns to extract scores from malformed responses, and implementing retry logic with different prompts.
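
Roughly, the fallback looks like this. A sketch, assuming the factor names from the scoring table; the real prompts and retry logic live in the repo:

import json
import re

FACTORS = ["education", "trajectory", "company", "skills", "location", "tenure"]

def parse_scores(llm_output: str) -> dict[str, float] | None:
    # First try the happy path: strict JSON.
    try:
        data = json.loads(llm_output)
        return {f: float(data[f]) for f in FACTORS}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass
    # Fallback: pull "education: 9.2" style fragments out of free-form prose.
    scores = {}
    for factor in FACTORS:
        match = re.search(rf"{factor}\D{{0,10}}(\d+(?:\.\d+)?)", llm_output, re.IGNORECASE)
        if match:
            scores[factor] = float(match.group(1))
    return scores if len(scores) == len(FACTORS) else None  # None triggers a retry with a different prompt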

  3. GitHub Profile Matching Gone Wrong

Searching for GitHub profiles is not straightforward either; I would often get company profiles suggested instead of people.

Combining LinkedIn and GitHub data seemed straightforward - match by name and see if their GitHub activity aligns with their LinkedIn experience. Reality check: turns out “John Smith” working at “Google” could match with 47 different GitHub profiles.

Current state: I built the GitHub integration but disabled it for the final demo. Sometimes the feature that sounds coolest causes the most headaches.

  4. The MongoDB Integration That Never Happened

I planned to use MongoDB with Motor for async operations.

What actually happened: I lost hours debugging data validation mismatches, so for the hackathon timeline I switched to simple JSON file caching.

Lesson: Sometimes the “better” technical choice isn’t worth the time cost, especially under deadline pressure.

  5. Data Validation

The biggest and stupidest issue that plagued me. A major chunk of my build time went to debugging and fixing data validation issues, so midway through I switched to a TDD-style workflow and made my logger verbose enough to capture plenty of context.

  6. Managing multiple models to build the repo backfired

Since I don’t have a Cursor or similar AI-IDE subscription, I ran my workflow through the web-based Claude, Gemini, and OpenAI models (and later GitHub Copilot, which I didn’t unlock Pro for until half the project was done).

Claude gave me structured code that didn’t work, and I used the Gemini and OpenAI models to fix it, which took a lot of time.

Copilot helped me iterate quickly - improving tests, working through issues, and wrapping things up fast.


What actually worked well

Smart Caching Strategy

I implemented a simple profile cache that actually saves time and API costs. Before making any external calls, the system checks if we’ve seen this LinkedIn URL before. At hackathon scale, simple file-based caching works fine; for production, I’d use Redis with proper TTL settings.
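
The cache itself is nothing fancy. A sketch of the idea, with paths and hashing chosen for illustration rather than taken from the repo:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cache/profiles")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def _cache_path(linkedin_url: str) -> Path:
    # Key each profile by a hash of its LinkedIn URL
    return CACHE_DIR / (hashlib.sha256(linkedin_url.encode()).hexdigest() + ".json")

def get_cached_profile(linkedin_url: str) -> dict | None:
    path = _cache_path(linkedin_url)
    return json.loads(path.read_text()) if path.exists() else None

def cache_profile(linkedin_url: str, profile: dict) -> None:
    _cache_path(linkedin_url).write_text(json.dumps(profile))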

Async Processing Done Right

FastAPI with asyncio lets me process multiple candidates simultaneously. Instead of waiting 30 seconds for 10 profiles sequentially, I can get them all in 5-6 seconds.
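
The fan-out is a straightforward asyncio.gather with a semaphore to stay under rate limits. A sketch, using httpx as the async HTTP client and a placeholder endpoint rather than the actual RapidAPI route:

import asyncio
import httpx

async def fetch_profile(client: httpx.AsyncClient, url: str, sem: asyncio.Semaphore) -> dict:
    async with sem:
        # Placeholder endpoint; the real call goes through RapidAPI's LinkedIn service
        resp = await client.get("https://example-rapidapi-host/profile", params={"url": url})
        resp.raise_for_status()
        return resp.json()

async def fetch_all(urls: list[str], max_concurrency: int = 10) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrency)
    async with httpx.AsyncClient(timeout=30) as client:
        return await asyncio.gather(*(fetch_profile(client, u, sem) for u in urls))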

I could have used FastAPI’s BackgroundTasks, but it wouldn’t have made a lot of difference. It would be a lot more sensible to go to a task queue based setup for scaling (using Redis + Celery).

The Scoring Algorithm That Makes Sense

Rather than just matching keywords, the LLM understands context. An engineer who went from startup → Google → senior role gets a higher trajectory score than someone who’s been at the same level for years. The LLM can recognize patterns that regex never could.


Scaling

The current system handles maybe 20-30 profiles before throttling and API rate limits kick in. For production scale (hundreds of concurrent jobs), here’s what needs to change:

Code Quality & Architecture

The current codebase is a mess of random object creation everywhere. I’m instantiating API clients, scrapers, and scoring services scattered throughout the code. This makes testing painful and concurrency unpredictable.

Dependency injection would clean this up significantly. Instead of creating LinkedInScraper() objects everywhere, I’d inject them as dependencies. For FastAPI, this means using dependency providers that create singleton instances for thread-safe operations.

from fastapi import Depends

# Current messy approach
async def score_candidates(candidates):
    scraper = LinkedInScraper()  # new instance every call
    scorer = FitScorer()         # another new instance
    ...  # rest of logic

# Better approach with DI: providers create (or reuse) the instances
def get_scraper() -> LinkedInScraper:
    return LinkedInScraper()  # in practice, return a shared/cached instance

def get_scorer() -> FitScorer:
    return FitScorer()

async def score_candidates(
    candidates,
    scraper: LinkedInScraper = Depends(get_scraper),
    scorer: FitScorer = Depends(get_scorer),
):
    ...  # clean, testable, predictable
For concurrency, dependency injection actually helps. You can inject thread-safe, connection-pooled clients rather than creating new HTTP sessions for every request. This reduces overhead and prevents connection exhaustion.

Combining DI with connection pooling is another great idea.
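
A sketch of what that could look like in FastAPI: a single pooled httpx client created at startup and handed out via a dependency. Names, limits, and the lifespan wiring are illustrative, not the repo’s current code:

from contextlib import asynccontextmanager
import httpx
from fastapi import Depends, FastAPI

http_client: httpx.AsyncClient | None = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global http_client
    http_client = httpx.AsyncClient(limits=httpx.Limits(max_connections=50))
    yield
    await http_client.aclose()

app = FastAPI(lifespan=lifespan)

def get_http_client() -> httpx.AsyncClient:
    return http_client  # one pooled client shared by all requests

@app.post("/jobs")
async def create_job(payload: dict, client: httpx.AsyncClient = Depends(get_http_client)):
    ...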

API key rotation

Though the code is set up, it isn’t wired in yet. Ideally I’d use generators to cycle through keys, which would help when rate limits for one API kick in.
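
The generator version is tiny. A sketch, assuming keys arrive as a comma-separated environment variable (the variable name is made up):

import itertools
import os

API_KEYS = [k for k in os.environ.get("RAPIDAPI_KEYS", "").split(",") if k]
_key_cycle = itertools.cycle(API_KEYS)

def next_api_key() -> str:
    return next(_key_cycle)

# On a 429 response, call next_api_key() again and retry the request
# with exponential backoff.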

Real Production Scaling

For hundreds of concurrent jobs, the architecture needs fundamental changes:

  1. Multi-Query Strategy

Instead of a single search query, I’d implement tiered searching:

Strict query: Perfect keyword matches, paginate deeply (until you don’t get results)
Medium query: Broader terms, fewer pages
Loose query: Industry + location only, limited results

This builds a large candidate pool while prioritizing the most relevant profiles.
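
A sketch of how the tiers could be expressed in code; the queries and page caps are illustrative:

# Each tier trades precision for breadth and caps how many result pages to pull.
QUERY_TIERS = [
    {"name": "strict", "query": 'site:linkedin.com/in "ML Research Engineer" "transformers" "Mountain View"', "max_pages": 10},
    {"name": "medium", "query": 'site:linkedin.com/in "ML Engineer" transformers California', "max_pages": 3},
    {"name": "loose",  "query": 'site:linkedin.com/in machine learning "San Francisco Bay Area"', "max_pages": 1},
]

def run_tiered_search(search_fn) -> list[str]:
    urls: list[str] = []
    for tier in QUERY_TIERS:
        urls.extend(search_fn(tier["query"], max_pages=tier["max_pages"]))
    return list(dict.fromkeys(urls))  # preserve order, drop duplicates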

  2. Smart Pre-filtering

Before hitting expensive LLMs:

Deduplication: Bloom filters for URL dedup at scale
Basic filtering: Years of experience, location, title keywords
Batch scoring: Group similar profiles for bulk processing
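
A sketch of what such a pre-filter could look like; the thresholds and profile field names are assumptions about the schema, not the repo’s actual fields:

def passes_prefilter(
    profile: dict,
    min_years: int = 3,
    allowed_locations: frozenset[str] = frozenset({"Mountain View", "San Francisco", "Remote"}),
    title_keywords: tuple[str, ...] = ("ml", "machine learning", "research"),
) -> bool:
    # Cheap checks only - no network calls, no LLMs
    if profile.get("years_experience", 0) < min_years:
        return False
    if profile.get("location") not in allowed_locations:
        return False
    title = profile.get("title", "").lower()
    return any(kw in title for kw in title_keywords)

# Usage: candidates = [p for p in profiles if passes_prefilter(p)]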

  3. Queue Architecture (Async Task Queue Pattern)

Job Queue → Search Workers → Filter Workers → LLM Workers → Results

Each stage handles its bottlenecks independently. Search workers can run fast and cheap, while LLM workers are expensive but fewer in number.
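
A sketch of the staged pipeline with Celery chains; the broker URL, queue names, and task bodies are placeholders:

from celery import Celery, chain

app = Celery("sourcing", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

@app.task(queue="search")
def search_stage(job_description: str) -> list[str]:
    ...  # cheap and fast: SerpAPI queries

@app.task(queue="filter")
def filter_stage(urls: list[str]) -> list[dict]:
    ...  # dedup + basic pre-filtering

@app.task(queue="llm", rate_limit="30/m")
def llm_stage(profiles: list[dict]) -> list[dict]:
    ...  # expensive: scoring + outreach generation

def enqueue_job(job_description: str):
    return chain(search_stage.s(job_description), filter_stage.s(), llm_stage.s()).apply_async()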

  4. Resource Management

API key pools: Rotate keys across workers to handle rate limits
Connection pooling: Shared HTTP clients across async workers
Circuit breakers: Fail fast when external services are down


Future Roadmap

Short Term (1-2 months)

Medium Term (3-6 months)

Long Term (6+ months)

Try It Yourself

GitHub Repository: score_profiles
API Documentation: Available at /docs when running locally

I tried using uv, but there were some issues on my laptop recently, so I switched to pip.

# Quick start
git clone https://github.com/pranshu-raj-211/score_profiles.git
cd score_profiles
pip install -r requirements.txt
cp .env.example .env  # Add your API keys
python app/main.py

API Usage:

curl -X POST "http://localhost:8000/jobs" \
  -H "Content-Type: application/json" \
  -d '{"search_query": "ML Engineer at AI startup", "max_candidates": 10}'


