Hi, the following URL works in the web browser: https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05 But trying to get the html file programmatically with the command-line tool "wget" via a script file fails: Code: wget -O "earnings.html" "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05" What follows is the output of wget: Code: --2021-11-06 00:16:57-- https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05 Connecting to 192.168.20.1:8118... connected. Proxy request sent, awaiting response... 404 Not Found 2021-11-06 00:16:57 ERROR 404: Not Found. wget normally functions well, but not with this URL. :-( What's missing?
Yahoo is probably trying to prevent webscraping. Try adding the wget option "--user-agent=Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" to trick Yahoo that the request is coming from a browser.
I may be wrong here, but you are trying to download a file when there is none. Yahoo does not provide anything "downloadable" here. If you want to get the full content of this website, you can do so with: Code: curl https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05 > output.html This will output the full page content to a file, which can be later scrapped
Both of the above suggestions by @DaveV and @WiktorK have worked here under Linux. Thx to all; saved me the day
Modern websites have all sorts of scraping protection. Previously Yahoo had a key value pair sent/received by javascript, maybe they've gotten rid of that. Selenium (not headless) comes to rescue to mimic actual human behavior.
I webscrape over 2,000 web pages a day to update my database. In the past I used Selenium, but switched two years ago to Google's Puppeteer. I have yet to find a webpage that Puppeteer cannot handle. I even have one site where I have to click 4 buttons, then simulate a Save-As to get the data.
I'm sure Puppeteer looks fine as well. Never seen a website that has beaten Selenium as it's technically impossible, if an user can see it, it's able to get it. I've noticed people who claim it cannot scrape a site usually don't know how the site's javascript works.