Getting HTML via Browser OK, via wget ERROR. Why?

Discussion in 'Data Sets and Feeds' started by thecoder, Nov 5, 2021.

  1. thecoder

    Hi,
    the following URL works in the web browser:
    https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
    But trying to get the html file programmatically with the command-line tool "wget" via a script file fails:
    Code:
    wget -O "earnings.html"  "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05"
    
    What follows is the output of wget:
    Code:
    --2021-11-06 00:16:57--  https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
    Connecting to 192.168.20.1:8118... connected.
    Proxy request sent, awaiting response... 404 Not Found
    2021-11-06 00:16:57 ERROR 404: Not Found.
    
    wget normally functions well, but not with this URL. :-(
    What's missing?
     
  2. DaveV

    Yahoo is probably trying to prevent webscraping. Try adding the wget option

    "--user-agent=Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

    to trick Yahoo into thinking that the request is coming from a browser.
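For reference, the option above combined with the original command might look like the sketch below. The helper name `fetch_with_wget` is hypothetical, the user-agent string is simply the one suggested in this post, and whether Yahoo accepts it at any given time is an assumption.

```shell
#!/bin/sh
# URL from the original post; single quotes stop the shell from interpreting & and ?
URL='https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05'

# User-agent string suggested in the thread (assumption: any browser-like value may do)
UA='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'

# fetch_with_wget OUTFILE  (hypothetical helper name)
# Downloads $URL to OUTFILE while presenting a browser-like user agent.
fetch_with_wget() {
    wget --user-agent="$UA" -O "$1" "$URL"
}
```

Calling `fetch_with_wget earnings.html` would then run the download; nothing is fetched until the function is called.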
     
  3. WiktorK

    I may be wrong here, but you are trying to download a file when there is none. Yahoo does not provide anything "downloadable" here.

    If you want to get the full content of this website, you can do so with:

    Code:
    curl "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05" > output.html
    This will write the full page content to a file, which can be scraped later :)
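A note on quoting: in a shell, an unquoted `&` ends the command and runs it in the background, so a URL with query parameters is best wrapped in quotes. The sketch below is one way to do that with curl and also makes HTTP errors visible; the helper name `fetch_with_curl` and the choice of flags are assumptions, not part of the original suggestion.

```shell
#!/bin/sh
# URL from the thread, single-quoted so that & is passed to curl literally
URL='https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05'

# Browser-like user agent as suggested earlier in the thread (assumption: may not be required)
UA='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'

# fetch_with_curl OUTFILE  (hypothetical helper name)
fetch_with_curl() {
    # -f  return a non-zero exit status on HTTP errors (400 and above)
    # -sS hide the progress meter but still print error messages
    # -L  follow redirects
    # -A  set the User-Agent header
    # -o  write the response body to the given file
    curl -fsSL -A "$UA" -o "$1" "$URL"
}
```

With this in place, `fetch_with_curl output.html || echo "download failed"` would report a failed request such as a 404 instead of quietly saving an error page.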
     
  4. thecoder

    Both of the above suggestions by @DaveV and @WiktorK worked here under Linux. Thanks to all; you saved my day :)
     
  5. d08

    Modern websites have all sorts of scraping protection. Previously, Yahoo had a key-value pair sent and received by JavaScript; maybe they've gotten rid of that. Selenium (not headless) comes to the rescue by mimicking actual human behavior.
     
  6. DaveV

    I webscrape over 2,000 web pages a day to update my database. In the past I used Selenium, but switched two years ago to Google's Puppeteer. I have yet to find a webpage that Puppeteer cannot handle. I even have one site where I have to click 4 buttons, then simulate a Save-As to get the data.
     
  7. d08

    I'm sure Puppeteer works fine as well. I've never seen a website that has beaten Selenium, as that's technically impossible: if a user can see it, Selenium can get it. I've noticed that people who claim it cannot scrape a site usually don't know how the site's JavaScript works.
     