Show HN: Robust LLM Extractor for Websites in TypeScript

(github.com)

14 points | by andrew_zhong 1 hour ago

6 comments

sheept 10 minutes ago
> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing tag helps it keep track of where it is during inference, so it is less error-prone.
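For illustration, here is a minimal sketch of how a pipeline could survive the "one bad bracket" case quoted above instead of crashing. It is not taken from the linked project; `safeParseJson`, `stripFence`, and `ParseResult` are hypothetical names, and the only real API used is the built-in `JSON.parse`.

```typescript
// Hypothetical helper (not from the linked repo): parse model output without throwing.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Built at runtime so this snippet does not contain a literal Markdown fence.
const FENCE = "`".repeat(3);

// Models often wrap JSON in a Markdown code fence; remove it if present.
function stripFence(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    // Drop the opening fence line, including an optional language tag.
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) {
    text = text.slice(0, -FENCE.length);
  }
  return text.trim();
}

function safeParseJson<T = unknown>(raw: string): ParseResult<T> {
  try {
    return { ok: true, value: JSON.parse(stripFence(raw)) as T };
  } catch (err) {
    // Malformed output becomes a value the caller can inspect, retry on, or log.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

A caller can branch on `ok` and re-prompt the model on failure. This only detects malformed output; it does not attempt to repair it.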
Flux159 44 minutes ago
This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to Markdown first and then passes that to the LLM, so I'm assuming the preprocessing gives a huge reduction in tokens. Do you have any data on how often this causes issues, i.e. tables or other information being lost?
Then it uses langchain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open source models work best, or do you just use Gemini in production?
Also, looking at the docs, Gemini 2.5 Flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so you might want to update that to Gemini 3 Flash in the examples.
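One rough way to put a number on the "tables being lost" question is to compare the table cells in the source HTML against the converted Markdown. The sketch below is an assumption about how such a check could look, not something from the linked project: `tableCellTexts` and `missingCells` are hypothetical names, the regex-based cell extraction is deliberately naive (no entity decoding, no nested tables), and it works on any converter's output rather than calling turndown itself.

```typescript
// Hypothetical check (not from the linked repo): which table cells from the
// original HTML no longer appear anywhere in the converted Markdown?

// Collect the visible text of every <td>/<th> cell. Regex-based and naive:
// it ignores HTML entities and nested tables.
function tableCellTexts(html: string): string[] {
  const cells: string[] = [];
  const cellPattern = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi;
  let match: RegExpExecArray | null;
  while ((match = cellPattern.exec(html)) !== null) {
    const text = (match[1] ?? "")
      .replace(/<[^>]+>/g, " ") // drop inline tags inside the cell
      .replace(/\s+/g, " ")
      .trim();
    if (text.length > 0) cells.push(text);
  }
  return cells;
}

// Return the cell texts that cannot be found in the Markdown output.
function missingCells(html: string, markdown: string): string[] {
  const normalized = markdown.replace(/\s+/g, " ");
  return tableCellTexts(html).filter((cell) => !normalized.includes(cell));
}

// Example: the second conversion dropped the price column.
const sampleHtml =
  "<table><thead><tr><th>Plan</th><th>Price</th></tr></thead>" +
  "<tbody><tr><td>Pro</td><td>$20</td></tr></tbody></table>";
const faithful = "| Plan | Price |\n| --- | --- |\n| Pro | $20 |";
const lossy = "Plan\n\nPro";

console.log(missingCells(sampleHtml, faithful)); // []
console.log(missingCells(sampleHtml, lossy)); // ["Price", "$20"]
```

Running this over a sample of pages would give a crude loss rate. Cells whose text the converter rewraps in inline formatting (bold, links) would show up as false positives with this substring match.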
plastic041 56 minutes ago
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
And it doesn't care about robots.txt.
zx8080 25 minutes ago
Robots.txt anyone?
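For reference, a minimal sketch of what a robots.txt check before fetching could look like. This is not from the linked project, and `isPathAllowed` is a hypothetical name. It assumes the simplified rules of the Robots Exclusion Protocol (RFC 9309): rules are grouped by User-agent, the longest matching path prefix wins, and Allow wins a tie. Wildcards (`*`, `$`) inside paths are not handled.

```typescript
// Hypothetical helper (not from the linked repo): simplified robots.txt check.
// Handles User-agent groups and Allow/Disallow prefixes only; no path wildcards.
interface Rule {
  allow: boolean;
  path: string;
}

function isPathAllowed(robotsTxt: string, userAgent: string, path: string): boolean {
  const agent = userAgent.toLowerCase();
  const specific: Rule[] = []; // rules from groups naming this agent
  const wildcard: Rule[] = []; // rules from "User-agent: *" groups
  let hasSpecificGroup = false;
  let groupAgents: string[] = [];
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = (rawLine.split("#")[0] ?? "").trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (!lastWasAgent) groupAgents = [];
      const name = value.toLowerCase();
      groupAgents.push(name);
      if (name !== "*" && name !== "" && agent.includes(name)) hasSpecificGroup = true;
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      lastWasAgent = false;
      if (value === "") continue; // an empty rule restricts nothing
      const rule: Rule = { allow: field === "allow", path: value };
      for (const name of groupAgents) {
        if (name === "*") wildcard.push(rule);
        else if (name !== "" && agent.includes(name)) specific.push(rule);
      }
    }
  }

  // A group naming this agent replaces the "*" group entirely.
  const rules = hasSpecificGroup ? specific : wildcard;
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = best === null || rule.path.length > best.path.length;
    const tieAllow = best !== null && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best === null ? true : best.allow;
}
```

A scraper would fetch `/robots.txt` once per host, cache it, and call a check like this before each request.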
Remi_Etien 42 minutes ago
[dead]
gautamborad 1 hour ago
[dead]