Cutting-edge web scraping techniques

24 points by simonw 8 months ago on lobsters | 4 comments

[OP] simonw | 8 months ago

This is the handout for a workshop I gave this morning at the NICAR data journalism conference. The handout is designed to be useful completely independently of the in-person workshop itself.

I wrote a bit more about this workshop here, including notes on the various new pieces of software I developed in advance of teaching at NICAR.

dubiouslittlecreature | 8 months ago

Comment removed by author

nwjsmith | 8 months ago

He wrote the guide for investigative reporters at a data journalism conference.

[OP] simonw | 8 months ago

Yeah, I have no problem at all with the ethics of helping data journalists scrape things that really don’t want to be scraped. News is what somebody does not want you to print.

I originally planned to include notes on running local models for structured data extraction - but I ran into a problem with length limits. For data extraction you need to be able to input quite a lot of text, and the current batch of local model solutions (at least the ones I've tried) tends to stop working well once you go above a few thousand input tokens. Meanwhile OpenAI handles 100,000 tokens and Gemini handles 1-2 million.
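For anyone curious what that kind of extraction looks like against a hosted model, here's a minimal sketch using the OpenAI Python SDK's JSON mode. The model name, prompt, and field names are illustrative assumptions, not anything from the workshop handout:

```python
# Minimal sketch: structured data extraction from scraped text via a hosted model.
# Assumes OPENAI_API_KEY is set; prompt, schema, and model choice are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def extract_records(raw_text: str) -> list[dict]:
    """Ask the model to pull structured records out of a blob of scraped text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # constrain output to valid JSON
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract every incident mentioned in the text and return JSON "
                    'shaped like {"records": [{"date": ..., "location": ..., "summary": ...}]}.'
                ),
            },
            {"role": "user", "content": raw_text},
        ],
    )
    return json.loads(response.choices[0].message.content)["records"]
```

The long context windows matter because raw_text here can be an entire scraped page (or many pages) pasted in as-is, with no careful chunking beforehand.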

In journalism you’re often working with public data, in which case I see no problem at all feeding it into models.

I had some interesting conversations at the conference about the much larger challenge of analyzing private leaks. Those are cases where dumping them into an online model isn't acceptable - the best solution right now may be investing in a new $10,000 Mac Studio and running capable long-context models on that.
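The local version of the same extraction pattern can look something like the sketch below, which keeps the documents on the machine by calling a locally hosted model. This assumes something like an Ollama server running on localhost; the model name, endpoint, and prompt are all assumptions for illustration:

```python
# Sketch: the same extraction idea against a locally hosted model, so leaked
# documents never leave the machine. Assumes an Ollama server on localhost;
# model name and prompt are illustrative assumptions.
import json
import urllib.request

def extract_locally(raw_text: str, model: str = "llama3.1:70b") -> dict:
    payload = json.dumps({
        "model": model,
        "prompt": "Extract the key people, dates and amounts from this text as JSON:\n\n" + raw_text,
        "format": "json",   # ask the server to constrain output to JSON
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["response"])
```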

posix_cowboy | 8 months ago

Here’s another method I’ve found helpful: use an add-on like "SingleFile" or "Save Page WE" and then run queries (using xidel --xpath or similar) to extract stuff from the saved file. It works on a surprising number of sites that would otherwise CAPTCHA and rate-limit you with their Cloudflare configurations.
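For anyone without xidel installed, here's a rough Python equivalent of the same idea using lxml as a stand-in: save the page with SingleFile (or Save Page WE) first, then run XPath queries against the local HTML file. The file name and XPath expression are made up for illustration:

```python
# Sketch: XPath extraction from a page already saved locally with SingleFile.
# lxml stands in for xidel --xpath; the file name and expression are illustrative.
from lxml import html

def extract_links(saved_page: str = "page.html") -> list[str]:
    tree = html.parse(saved_page)
    # e.g. pull every headline link out of the saved page
    return tree.xpath('//a[contains(@class, "headline")]/@href')

if __name__ == "__main__":
    for href in extract_links():
        print(href)
```

Because the queries run against a file you already saved in a normal browser session, none of this extraction traffic ever hits the site's anti-bot layer.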