Page Scraping Ethics

Discussion in 'Artificial Intelligence' started by Arnie Guitar, Jun 30, 2025 at 10:26 AM.

  1. I've only recently become aware of page scraping.

    So when I ask a...gardening question...I'm getting a summation of a bunch of web pages?

    I understand the argument is that those web pages aren't being compensated for their knowledge.

    Content creators have to know that going in, right?

    That's kinda the debate, right?
     
  2. Baron

    Baron Administrator

    That's pretty much the gist of it.

    Not only are sites getting scraped for content for use by AI, but now Google search results show an AI summary first, which causes most people to just use the AI summary instead of clicking on the web page links in the search results.

    I learned the other day that inbound traffic from Google searches to news sites is down almost 50% over the past year, since the AI summaries have been running.

    In the past, website owners allowed search engines to scrape their sites because that meant the content was indexed, so those web pages could be found by people doing searches. It was a reciprocal arrangement. But these AI companies are essentially taking that same data and displaying a summary of it when queried, so the sites that provided the data are cut out of the search and discovery process altogether.
     
  3. If only the content they were providing were original and they hadn't copied it from someone else...
    What they're really complaining about is that their ads no longer get the impressions they used to.
    People prefer AIs not only because they offer synthesized information, but also because they don't have to wade through all sorts of stupid ads while reading a text.

    As this post says, just 10% of the internet is likely original.

     
  4. Peter8519

    Peter8519

    Just do it with caution, as scraping can eat into a site's bandwidth. Most sites publish a robots policy, e.g. nasdaq.com/robots.txt. Stringent sites will IP-ban you for a certain period if you abuse them. Start simple and go slow, e.g. with a 5-second delay between scrapes. Most sites will block bots now, even EDGAR, so automating a browser for scraping is the best bet. Excel VBA IE automation is a good place to start. Having all the ratios of your stock watchlist in a single sheet is handy.
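    The robots-policy check and fixed delay described above can be sketched with Python's standard library. This is a minimal offline illustration: the robots.txt body, the example.com URLs, and the helper names are all hypothetical, and the actual HTTP request is left out.

    ```python
    import time
    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt body. In practice you would fetch the site's
    # real policy, e.g. https://example.com/robots.txt, before scraping.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /private/
    Crawl-delay: 5
    """

    def make_parser(robots_text: str) -> RobotFileParser:
        """Parse a robots.txt body so rules can be checked without a network call."""
        rp = RobotFileParser()
        rp.parse(robots_text.splitlines())
        return rp

    def polite_fetch(rp: RobotFileParser, url: str, agent: str = "*") -> bool:
        """Respect the robots policy: skip disallowed URLs, wait before allowed ones."""
        if not rp.can_fetch(agent, url):
            return False  # disallowed by robots.txt; don't request it
        delay = rp.crawl_delay(agent) or 5  # default to a 5-second gap
        time.sleep(delay)
        # ... perform the actual HTTP request here ...
        return True
    ```

    Under this policy, `/private/` pages are skipped entirely and everything else is fetched no faster than one request every five seconds, which is the "start simple and go slow" approach above.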
     
  5. Baron

    Baron Administrator

    I just saw this in a newsletter I received this morning:

    Web infrastructure giant Cloudflare just made a major change to automatically block AI crawlers by default on new websites, alongside the launch of a marketplace where publishers can charge bots micropayments for accessing content.

    The details:

    • Cloudflare will require AI companies to get explicit permission before scraping any of the 20% of websites it protects, reversing decades of open web policies.

    • Publishers can set individual prices for AI crawlers through Pay per Crawl, choosing whether bots pay for training data, search results, or other uses.

    • Media outlets like Condé Nast, TIME, and The Atlantic joined the initiative, citing traffic losses due to AI answering queries without the original sources.

    • Data shows OpenAI’s crawlers scrape sites 1,700 times per referral sent back, with Anthropic at 73,000 times per referral, versus 14-to-1 for Google.

    Why it matters: This potentially positions Cloudflare as one of the gatekeepers for the data needed for a coming wave of agents that browse on our behalf. The marketplace could force healthier AI-publisher relationships, but it might also create an internet divided between premium content and free sites that become AI's default sources.
     
  6. I'm a little slow,
    I'm finally putting 2 and 2 together.

    This is why I'm asked if I'm a bot?
     
  7. BMK

    BMK

    You're not a bot?
     
    Arnie Guitar likes this.
  8. Now I don't care who you are,
    That's funny right there...