Gathering Structured Data From Phone Calls

A lot of information these days is just a Google search away, but a surprising number of businesses still keep information like pricing locked behind phone lines. Often this is deliberate, for a variety of reasons:

  • Fluctuating prices that change based on demand, inventory, or seasonality.
  • Sales psychology that converts curious callers into customers.
  • A competitive advantage in keeping pricing opaque to competitors.
  • Personalized quotes that change based on customer need.
  • Old-school businesses that just never went digital.

Traditionally, to gather information from these businesses, you would need someone or even multiple people to work through an endless call list, navigating phone menu trees, waiting on hold, and manually transcribing conversations into spreadsheets. This is tedious, expensive, slow, and doesn’t scale.

At Setfive, we decided to look into how we could automate this.

OpenAI Realtime API

The timing couldn’t have been better. As we were exploring ways to do this, OpenAI released its Realtime API, a game-changer for voice-based AI applications. Unlike conventional text-based APIs that require separate speech-to-text and text-to-speech steps, the Realtime API works with audio natively in both directions and enables:

  • Low-latency native voice conversations.
  • Natural interruptions for more human-like interactions.
  • Built-in function calling for triggering actions mid-conversation.

This was an AI capable of having an actual over-the-phone conversation.

Building The Bridge

With the brain of the operation sorted, it was time to find a way to actually make phone calls. For this we chose Twilio, a telecommunications platform that has been well regarded for almost two decades.

Twilio’s Media Streams API made it simple to pipe audio directly to and from the OpenAI Realtime API, creating a seamless conversation flow. The business on the other end hears a responsive customer who can handle unexpected conversational turns.
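To make that concrete, here’s a minimal sketch of the bridge in Node.js using the ws package. The event names reflect the Realtime API beta, and details like the port and model name are illustrative:

    // Bridge Twilio Media Streams <-> OpenAI Realtime API over WebSockets.
    const WebSocket = require("ws");

    const server = new WebSocket.Server({ port: 8080 });

    server.on("connection", (twilioWs) => {
      let streamSid = null;

      const openaiWs = new WebSocket(
        "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, "OpenAI-Beta": "realtime=v1" } }
      );

      openaiWs.on("open", () => {
        // Twilio streams 8kHz G.711 u-law audio, which the Realtime API accepts natively.
        openaiWs.send(JSON.stringify({
          type: "session.update",
          session: { input_audio_format: "g711_ulaw", output_audio_format: "g711_ulaw" },
        }));
      });

      // Twilio -> OpenAI: forward the caller's audio frames.
      twilioWs.on("message", (raw) => {
        const msg = JSON.parse(raw);
        if (msg.event === "start") streamSid = msg.start.streamSid;
        if (msg.event === "media" && openaiWs.readyState === WebSocket.OPEN) {
          openaiWs.send(JSON.stringify({ type: "input_audio_buffer.append", audio: msg.media.payload }));
        }
      });

      // OpenAI -> Twilio: play the model's audio back onto the call.
      openaiWs.on("message", (raw) => {
        const msg = JSON.parse(raw);
        if (msg.type === "response.audio.delta") {
          twilioWs.send(JSON.stringify({ event: "media", streamSid, media: { payload: msg.delta } }));
        }
      });
    });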

Navigating The Maze

One of the first challenges we ran into? Phone trees. You know them: “Press 1 for appointments, Press 2 to speak to a customer service representative, …” These interactive voice response (IVR) systems are designed for touch-tone input, not voice commands.

We solved this by building AI tools that simulate DTMF (Dual-Tone Multi-Frequency) signals using Twilio’s API – which required some trial and error with their callback and TwiML architecture. With these tools, our AI can listen to menu options, simulate button presses, navigate complex multi-level menus, and find the fastest path to a customer service representative or front desk.
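As a rough illustration, the button press boils down to updating the live call with TwiML whose Play verb carries a digits attribute, then re-attaching the media stream – losing and reconnecting the stream is exactly the TwiML gotcha mentioned above. The function name and STREAM_URL below are placeholders:

    // Hypothetical "press digits" tool handler using the official twilio library.
    const twilio = require("twilio");
    const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

    async function pressDigits(callSid, digits) {
      // "w" waits half a second before the tones; <Connect><Stream> re-attaches
      // our WebSocket bridge, since updating a call drops the previous stream.
      const twiml = `
        <Response>
          <Play digits="w${digits}"/>
          <Connect><Stream url="${process.env.STREAM_URL}"/></Connect>
        </Response>`;
      await client.calls(callSid).update({ twiml });
    }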

From Conversations To Structured Data

Getting through to the right person is only half the battle. The real magic happens when our AI finally gets into a conversation. From there, we are able to extract structured information from free-flowing conversations in real time. Using carefully crafted prompts, our system can:

  • Identify key information even when it’s mentioned casually.
  • Ask clarifying questions about discrepancies in the information received.
  • Extract additional valuable data like availability and pricing details (first-time customer rates, minimum orders, etc.).
  • Create clean, structured data ready for your database, Excel spreadsheet, or whatever else you’re using.
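Under the hood this leans on the Realtime API’s function calling: you describe a tool with a JSON schema and the model invokes it whenever it hears qualifying details. A minimal sketch, reusing the openaiWs socket from the bridge above; the record_quote tool and its fields are illustrative rather than our production schema:

    // Register a structured-extraction tool on the Realtime session.
    const sessionUpdate = {
      type: "session.update",
      session: {
        instructions: "You are calling businesses to collect pricing. " +
          "Call record_quote whenever you learn a concrete price or availability.",
        tools: [{
          type: "function",
          name: "record_quote",
          description: "Persist a structured quote extracted from the conversation",
          parameters: {
            type: "object",
            properties: {
              service: { type: "string" },
              price: { type: "number" },
              currency: { type: "string" },
              notes: { type: "string", description: "Caveats like first-time rates or minimum orders" },
            },
            required: ["service", "price"],
          },
        }],
      },
    };
    openaiWs.send(JSON.stringify(sessionUpdate));

When the model calls the tool, the arguments arrive as JSON that drops straight into a database row or spreadsheet.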

When Nobody Answers

Here’s something we didn’t anticipate: businesses that rely heavily on phone communication are often too busy to answer their phones. Many are small operations without dedicated staff for handling calls, or with employees who wear multiple hats. They’re not sitting by the phone waiting for calls.

This was having a real effect on our success rate, and we didn’t want to make multiple calls to the same business, hoping for someone to be available. The next step was obvious: voicemail. We enhanced our system to handle a full communication cycle:

  • Voicemail detection that recognizes when we have reached a voicemail inbox.
  • Natural voicemail messages that request whatever information the AI is looking for.
  • Callback handling that naturally continues the conversation when a business calls back.
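For the detection piece, Twilio offers answering machine detection (AMD) on outbound calls, which pairs nicely with the bridge above. A sketch of placing a call with it, with placeholder URLs and variables:

    // Inside an async function; "client" is the twilio client from earlier.
    const call = await client.calls.create({
      to: businessNumber,                       // placeholder variable
      from: process.env.TWILIO_NUMBER,
      url: "https://example.com/twiml",         // returns <Connect><Stream> TwiML
      machineDetection: "DetectMessageEnd",     // wait for the voicemail beep
      asyncAmd: "true",
      asyncAmdStatusCallback: "https://example.com/amd", // receives AnsweredBy
    });

When the callback reports a machine rather than a person, the session can be prompted into leave-a-message mode instead of conversation mode.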

Ready to Build?

Interested in how this can help you? Email us at contact@setfive.com to find out more or check out our demo at voice2data.setfive.com!

Beating Anti-Scraping Software With Electron

I was recently out with a friend of mine who mentioned that he was having a tough time scraping some data off a website. After a few drinks we arrived at a barter: if I could scrape the data, he’d buy me some single malt scotch, which seemed like a great deal to me. I assumed I’d make a couple of HTTP requests, parse some HTML, grab the data, and dump it into a CSV. In the worst case, I imagined having to write some custom code to log in to a web app and maybe persist some cookies. And then I got started.

As it turned out, this site was running one of the most sophisticated anti-scraping/anti-robot packages I’ve ever encountered. In a regular browser session everything looked normal, but after a half dozen or so programmatic HTTP requests I started running into their anti-robot software. After poking around a bit, I found the blocks they were deploying were a mix of:

  • Whitelisted user agents – After a few requests from PHP cURL, the site started blocking requests from my IP that didn’t include a “regular” user agent.
  • Requiring cookies and Javascript – I thought this one was actually really clever. After a couple of requests, the site started quietly loading an intermediate page that required your browser to run Javascript to set a cookie and then complete a POST request to a URL that included a nonce in order to view the page. To a regular user this was fairly transparent since it happened so quickly, but it obviously trips up a non-browser HTTP client.
  • Soft IP rate limits – After a couple of dozen requests from my IP I started receiving “Solve this captcha” pages in order to view the target content.

Taken all together, it’s a pretty sophisticated setup for what’s effectively a niche social networking site. Given the “requires Javascript” behavior, I decided to explore using Electron for this project, and it turns out to be a perfect fit. For a quick primer: Electron is an open source project from GitHub that enables developers to build cross-platform desktop applications by merging Node.js and Chromium. Developers end up writing Javascript that can leverage the Node.js ecosystem while also using Chromium’s browser internals to render windows and widgets. Electron helps in this use case because it provides a full Chrome-class browser that’s scriptable and has access to Node’s system-level modules. For completeness, you could implement all of this in a Chrome extension, but in my experience extensions have more complicated non-privileged to privileged communication and lack access to Node, so you can’t just fire off a “fs.writeFileSync” to persist your results.

With a full browser environment, we now need to tackle the IP restrictions that cause captchas to appear. At face value, like most people, I assumed solving captchas with OCR magic would be easier than getting new IPs after a couple of requests, but it turns out that’s not true. There weren’t any usable “captcha solvers” on npm, so I decided to pursue the IP angle. The idea would be to grab a new IP address after a few requests to avoid having to solve a captcha, which would require human intervention. Following some research, I found out that it’s possible to use Tor as a SOCKS proxy from a third party application. So concretely, we can launch a Tor circuit and then push our Electron HTTP requests through Tor to get a different IP address than the one on your normal Internet connection.
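The Electron side of that is pleasantly small: Chromium accepts a proxy-server switch, so assuming Tor is listening on its default SOCKS port (9050), something like this routes all of the window’s traffic through it:

    // Route an Electron BrowserWindow through a local Tor SOCKS proxy.
    const { app, BrowserWindow } = require("electron");

    // Must be set before the app's "ready" event fires.
    app.commandLine.appendSwitch("proxy-server", "socks5://127.0.0.1:9050");

    app.whenReady().then(() => {
      const win = new BrowserWindow({ width: 1280, height: 800 });
      win.loadURL("https://check.torproject.org/"); // confirms traffic exits via Tor
    });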

Ok, enough talk, show me some code!

I set up a test “target page” at http://code.setfive.com/scraper_demo/ which randomly shows “content you want” or a “please solve this captcha” prompt. The github repository at https://github.com/adatta02/electron-scraper-skeleton has all the goodies: a runnable Electron application. The money file is injected.js.
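In spirit, injected.js is a small script that runs in the page, classifies what loaded, and reports back to the main process. A minimal sketch of the idea (not the exact file from the repository):

    // Runs inside the rendered page (assumes nodeIntegration is enabled).
    const { ipcRenderer } = require("electron");

    window.addEventListener("load", () => {
      const text = document.body.innerText;
      if (text.includes("please solve this captcha")) {
        // Hand off to a human: the main process plays the "ding!" alert.
        ipcRenderer.send("captcha-detected");
      } else if (text.includes("content you want")) {
        // Found the target data, so ship it off to be persisted.
        ipcRenderer.send("content-found", text);
      }
    });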

To run that locally, you’ll need to do the usual “npm install” and then also run a Tor instance if you want to get a new IP address on every request. The way it’s implemented, it’ll detect the “content you want” and also alert you when there’s a captcha by playing a “ding!” sound. To launch, first start Tor and let it connect. Then you should be able to run:
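Assuming the skeleton wires up a standard start script in package.json, that’s something like:

    npm start    # typically an alias for "electron ."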

Once it loads, you’ll see the test page in what looks like a Chrome window with a devtools instance. As it refreshes, you’ll notice that the IP address it displays for you keeps updating. One “gotcha” is that by default Tor will only get a new IP address each time it opens a circuit, so you’ll notice that I run “killall” after each request, which closes the Tor circuit and forces it to reopen.
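Scripted from Node, that restart dance can be as blunt as the following; treat it as a sketch of the gotcha rather than the skeleton’s exact code:

    const { execSync, spawn } = require("child_process");

    function cycleTor() {
      // Tear down the existing Tor process (and with it, the current circuit)...
      try { execSync("killall tor"); } catch (e) { /* tor may not be running yet */ }
      // ...then relaunch it detached so it builds a fresh circuit for the next request.
      spawn("tor", ["--quiet"], { detached: true, stdio: "ignore" }).unref();
    }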

And that’s about it. Using Tor with the skeleton, you should be able to build a scraper that presents a new IP frequently, scrapes data, and conveniently notifies you if human input is required.

As always questions and comments are welcomed!