Getting HTML via Browser OK, via wget ERROR. Why?

  1. thecoder


    The following URL works in the web browser:
    But trying to fetch the HTML file programmatically with the command-line tool "wget" from a script file fails:
    wget -O "earnings.html"  ""
    What follows is the output of wget:
    --2021-11-06 00:16:57--
    Connecting to connected.
    Proxy request sent, awaiting response... 404 Not Found
    2021-11-06 00:16:57 ERROR 404: Not Found.
    wget normally functions well, but not with this URL. :-(
    What's missing?
  2. DaveV


    Yahoo is probably trying to prevent webscraping. Try adding the wget option

    "--user-agent=Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

    to trick Yahoo into thinking that the request is coming from a browser.
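
    A minimal sketch of that suggestion, wrapped in a small shell function so it can be reused from a script. The function name fetch_with_ua and the example.com URL in the usage comment are placeholders, not part of the original question; the User-Agent string is the one quoted above, and whether a given site accepts it is not guaranteed.

    ```shell
    #!/bin/sh
    # fetch_with_ua is a hypothetical helper name.
    # $1 = page URL, $2 = output file.
    fetch_with_ua() {
        url=$1
        out=$2
        # Send a browser-like User-Agent instead of wget's default one.
        wget --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" \
             -O "$out" "$url"
    }

    # Usage (replace the placeholder with the real page URL):
    # fetch_with_ua "https://example.com/page" "earnings.html"
    ```

    Quoting the whole --user-agent value matters here, because the string contains spaces, semicolons and parentheses that the shell would otherwise interpret.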
  3. WiktorK


    I may be wrong here, but you are trying to download a file when there is none. Yahoo does not provide anything "downloadable" here.

    If you want to get the full content of this website, you can do so with:

    curl > output.html
    This will output the full page content to a file, which can be scraped later :)
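
    A hedged variant of the same idea for use in a script, assuming you want the script to notice an HTTP error such as the 404 from the original post instead of silently saving an error page. The function name save_page and the example.com URL are placeholders.

    ```shell
    #!/bin/sh
    # save_page is a hypothetical helper name.
    # $1 = page URL, $2 = output file.
    save_page() {
        url=$1
        out=$2
        # --fail:       return a non-zero exit status on HTTP errors (e.g. 404)
        # --location:   follow redirects
        # --silent --show-error: hide the progress meter but still print errors
        # --output:     write the body to the given file instead of stdout
        curl --fail --location --silent --show-error --output "$out" "$url"
    }

    # Usage (replace the placeholder with the real page URL):
    # save_page "https://example.com/page" "output.html" || echo "download failed" >&2
    ```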
  4. thecoder


    Both of the above suggestions by @DaveV and @WiktorK worked here under Linux. Thanks to all; you saved my day :)
  5. d08


    Modern websites have all sorts of scraping protection. Previously, Yahoo had a key-value pair sent and received by JavaScript; maybe they've gotten rid of that. Selenium (not headless) comes to the rescue by mimicking actual human behavior.
  6. DaveV


    I webscrape over 2,000 web pages a day to update my database. In the past I used Selenium, but switched two years ago to Google's Puppeteer. I have yet to find a webpage that Puppeteer cannot handle. I even have one site where I have to click 4 buttons, then simulate a Save-As to get the data.
  7. d08


    I'm sure Puppeteer looks fine as well. I've never seen a website that has beaten Selenium, as that's technically impossible: if a user can see it, Selenium can get it. I've noticed that people who claim it cannot scrape a site usually don't know how the site's JavaScript works.
