Building a Python Package to Turn Unstructured Data into Financial Insights


Earnings calls are packed with crucial information, but trying to extract it from their transcripts is often a painful manual process. These documents, usually PDFs, are a nightmare of inconsistent formatting. Every company seems to do things differently, layouts change without warning, and they’re filled with noise and excessive text.

We (Jay Shah and I) needed a way to programmatically get structured data out of this mess – with the end goal of building a full pipeline that can deliver insights and reduce the time it takes to understand how a company is doing.

This resulted in a Python package, concall-parser, the code for which can be found here.

The Problem: Unstructured PDFs Break Everything

The core challenge here is turning a visually oriented PDF document into clean, logically structured text that code can understand. When you pull text out of these transcripts, you don’t just get the content; you also get artifacts that completely disrupt parsing: headers and footers repeated on every page, and words fragmented by stray internal spaces, so that management becomes m anagement.

These aren’t just minor annoyances; they break any simple pattern-matching logic you might use later. If your code expects a clean word or phrase, noisy text like m anagement will cause it to fail.


The Pipeline

We designed concall-parser as a multi-step process to handle this progressively. Here’s the basic flow:

[Figure: concall-parser workflow]

  1. Load: Get the PDF input.
  2. Extract Text: Pull the raw text out page by page (we used pdfplumber for this; see the sketch after this list).
  3. Clean & Segment: Process the raw text to fix errors, identify who’s speaking, and break the text into individual speaker turns.
  4. Categorize: Group the speaker turns into sections like Management’s opening remarks and the Q&A.
  5. Output: Provide the final structured data.
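
To make the Extract Text step concrete, here’s a minimal sketch of pulling raw text out page by page with pdfplumber. The function name and structure are illustrative, not concall-parser’s actual API.

```python
# A minimal sketch of the Extract Text step, assuming pdfplumber is installed
# (pip install pdfplumber). Illustrative only, not the package's actual API.
import pdfplumber

def extract_pages(pdf_path: str) -> list[str]:
    """Pull the raw text out of a transcript PDF, page by page."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer
            pages.append(page.extract_text() or "")
    return pages
```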

Steps 2 and 3 were where most of the technical headaches (and interesting solutions) lay.


Tackling the Hard Parts

Building this meant confronting some specific problems head-on.

Problem 1: Getting Clean Text

The text coming straight out of the PDF is just not reliable. Besides repeated headers and footers, internal text errors like m anagement mean you can’t trust the raw output for pattern matching.

Our solution here was to accept the raw text and then apply specific, deterministic heuristics after extraction to fix the most common issues we saw. For instance, I built logic to detect and reassemble fragmented words based on common patterns. It’s not perfect, but it cleans up enough of the noise to make downstream steps possible. We’re also exploring other PDF-to-text libraries and page cropping to get cleaner input from the start.
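
As an illustration of that kind of heuristic, here’s a minimal sketch that rejoins a word split by a stray space when the merged result is a known word. The fixed vocabulary is an assumption for the example; the package’s actual heuristics are pattern-based and more involved.

```python
import re

# Hypothetical vocabulary for the example; concall-parser's real heuristics
# differ and are not a fixed word list.
KNOWN_WORDS = {"management", "moderator", "operator", "questions", "thank"}

# A single lowercase letter split off from the rest of a word, e.g. "m anagement".
FRAGMENT = re.compile(r"\b([a-z])\s+([a-z]+)\b")

def reassemble_fragments(text: str) -> str:
    """Rejoin a word broken by a stray space when the merged form is known."""
    def join_if_known(match: re.Match) -> str:
        merged = match.group(1) + match.group(2)
        return merged if merged in KNOWN_WORDS else match.group(0)
    return FRAGMENT.sub(join_if_known, text)

print(reassemble_fragments("The m anagement will now take questions."))
# -> "The management will now take questions."
```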

Problem 2: Figuring Out Who Spoke

Identifying speakers and segmenting the text by speaker is critical. The pattern is usually <Name>:<Speech>, but it’s rarely that simple: names might be missing from introductions, the “Moderator” tag isn’t always used, analysts don’t appear in a speaker list, and colons appear everywhere in financial text (ratios, bullet points, etc.). A simple regex that looks for .*: on every line fails miserably.

We needed a more robust way to find speakers that wasn’t fooled by stray colons or missing intros. The solution was to develop a heuristic: scan the text for structural cues that typically indicate a speaker turn – think capitalized text at the start of a line, followed by punctuation. This gives a list of potential speaker names.

Then we apply validation logic to this list, checking potential names against criteria like capitalization consistency, length, and frequency patterns to filter out false positives and build a reliable list of the actual speakers in the call. This approach is flexible enough to work even when the formatting isn’t perfect. Once we have that list of validated speakers, segmenting the text becomes straightforward – split the text block every time a validated speaker name followed by a colon appears.
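
Here’s an illustrative sketch of that two-stage approach: scan for structural cues, validate by frequency, then split only at validated names. The patterns and thresholds are assumptions for the example, not concall-parser’s actual rules.

```python
import re
from collections import Counter

# Illustrative sketch of the heuristic described above; the pattern and
# threshold are assumptions, not concall-parser's actual implementation.

# Structural cue: one to four capitalized words at the start of a line,
# followed by a colon.
SPEAKER_CUE = re.compile(
    r"^([A-Z][A-Za-z.'-]*(?:\s+[A-Z][A-Za-z.'-]*){0,3}):", re.MULTILINE
)

def find_speakers(text: str, min_turns: int = 2) -> set[str]:
    """Collect candidate names, then keep the ones that recur like real speakers."""
    counts = Counter(SPEAKER_CUE.findall(text))
    # Real speakers show up repeatedly; stray "Note:"-style hits usually don't.
    return {name for name, n in counts.items() if n >= min_turns}

def segment_turns(text: str, speakers: set[str]) -> list[tuple[str, str]]:
    """Split the transcript into (speaker, speech) turns at validated names only."""
    if not speakers:
        return []
    # Longest names first, so "John Doe" wins over a hypothetical "John".
    names = sorted(speakers, key=len, reverse=True)
    pattern = re.compile(
        r"^(" + "|".join(re.escape(s) for s in names) + r"):", re.MULTILINE
    )
    # re.split with one capture group alternates [preamble, name, speech, ...]
    rest = pattern.split(text)[1:]
    return [(rest[i], rest[i + 1].strip()) for i in range(0, len(rest), 2)]
```

Splitting only at validated names is what makes this robust: a stray colon in a ratio or bullet point never matches a known speaker, so it can’t create a spurious turn.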

Problem 3: What Kind of Content is This?

Finally, we needed to differentiate between Management’s prepared remarks and the Analyst Q&A.

If a Moderator is present (which is usually the case), this becomes a lot easier: moderators introduce the people who are speaking and signal the start of the analyst session, and those cues can be used to separate the sections.
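
For example, one common cue is the moderator announcing the question-and-answer session. Here’s a hedged sketch of splitting speaker turns at that cue; the exact phrases and the Moderator label are assumptions, and the package’s real cues may differ.

```python
# A hedged sketch of the moderator-cue split. These phrases are common ways
# moderators open the Q&A, but the cues the package actually uses may differ.
QA_CUES = (
    "question-and-answer session",
    "question and answer session",
    "first question",
)

def split_sections(turns: list[tuple[str, str]]) -> tuple[list, list]:
    """Split speaker turns into (management remarks, Q&A) at the moderator's cue."""
    for i, (speaker, speech) in enumerate(turns):
        if speaker == "Moderator" and any(cue in speech.lower() for cue in QA_CUES):
            return turns[:i], turns[i:]
    # No cue found: treat everything as prepared remarks.
    return turns, []
```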

If there isn’t a moderator, fallback heuristics and LLM-based logic need to be applied.


The Output: Clean Data for Analysis

After all this processing, concall-parser gives you structured data: a list of identified management personnel, the text of management’s comments, and the Q&A pairs.

This structured output is the whole point. It’s clean, predictable, and ready for the next step. You can easily feed it into other scripts or libraries for NLP tasks like sentiment analysis, run simple code to pull out key numbers or metrics, or compare sections across different reports. We’ve managed to get the parser working reliably for a majority of companies in the Nifty 200 index, which shows the approach holds up against real-world variability.
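
As a quick illustration of that next step, here’s what consuming the output might look like. The field names and values below are hypothetical, not concall-parser’s actual schema.

```python
import re

# Hypothetical shape of the structured output; the field names and values
# are illustrative, not concall-parser's actual schema.
output = {
    "management": ["Jane Doe - Chief Financial Officer"],
    "remarks": "Revenue grew 14% year on year, and margins expanded to 21.5%.",
    "qna": [
        {"analyst": "John Smith",
         "question": "What drove the margin expansion?",
         "answer": "Operating leverage and a better product mix."},
    ],
}

# Pull percentage figures out of the prepared remarks.
print(re.findall(r"\d+(?:\.\d+)?%", output["remarks"]))
# -> ['14%', '21.5%']
```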


What’s Next

This is still a work in progress. We’re continuing to improve concall-parser, for instance by evaluating other PDF-to-text libraries and experimenting with page cropping for cleaner input.

Turning messy, unstructured documents into usable data is a challenging but rewarding problem to solve as an engineer.

We’re actively developing concall-parser. Try it out and let us know your feedback!

