XMLHttpRequest domain

blueraincap · Jun 11, 2020

I just wanted to get an idea about the basics of getting data off a random website, one without an API, like scraping headlines off some random news website, say a small one called Bloomberg or the FT. I see XMLHttpRequest is used, but with cross-domain security we need to either hack around it with JSONP or rely on cross-origin resource sharing (CORS). I think a server can be closed to CORS, right? I just want a very basic idea of how to retrieve data (its HTML) off a site that is not structured to share data.

Since that would be a cross-origin request, I don't think the approach below would work, right?
https://flylib.com/books/en/4.254.1.64/1/

alphadistribution · Jun 11, 2020

Cross-domain security is only an issue if your script is running in a browser on one website while reading data from another website. If you're sending requests with a non-browser tool or language (e.g. cURL, Python, Node.js) then you can just send the requests. If you're using Chrome, open the developer tools, click the Network tab, find the request you want, then right-click and choose Copy > Copy as cURL or Copy as Node.js fetch to get some example code with all the headers set.

Also, if you really must have the data displayed in a browser because that's where you're developing your UI, you can set up a local web server that takes a request from the browser and sends it off to the original host. Once the web server gets a response, it sends it back to your browser. There are no cross-domain requests, since it's all proxied by your local server.
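To make the local-proxy idea concrete, here is a minimal sketch using only Python's standard library. The `/fetch?url=...` route, the `parse_target` helper, the port and the User-Agent string are all made up for illustration; they are assumptions, not part of any particular tool. The page and the proxy would both be served from your own machine, so the browser only ever talks to one origin.

```python
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional


def parse_target(path: str) -> Optional[str]:
    """Return the target URL from '/fetch?url=<target>', or None if invalid."""
    parsed = urllib.parse.urlparse(path)
    if parsed.path != "/fetch":
        return None
    target = urllib.parse.parse_qs(parsed.query).get("url", [None])[0]
    # Only forward plain web requests (hypothetical safety rule for this sketch).
    if target is None or urllib.parse.urlparse(target).scheme not in ("http", "https"):
        return None
    return target


class ProxyHandler(BaseHTTPRequestHandler):
    """Relays GET /fetch?url=<target> to the target host and returns its body."""

    def do_GET(self):
        target = parse_target(self.path)
        if target is None:
            self.send_error(400, "expected /fetch?url=<http(s) target>")
            return
        try:
            # The request is made by this server, not by the browser,
            # so the browser's same-origin policy does not apply to it.
            request = urllib.request.Request(target, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(request, timeout=10) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "text/html")
        except OSError:
            self.send_error(502, "upstream request failed")
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the proxy on localhost until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), ProxyHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/fetch?url=<percent-encoded target>` from a page served by the same local server should return the remote HTML. This is a development aid only; it has no caching, no rate limiting and no real access control.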

blueraincap · Jun 11, 2020

Right. I read about XMLHttpRequest in some JavaScript books and R books. The JS books talked about cross-origin restrictions and how to use JSONP to hack around them, which just sounds wrong, but the R books talk as if cross-origin isn't a thing. So in JS I am running scripts from my domain against another one, whereas with R/Java I should be able to just send requests. So if I write a scraper in R, I need not keep origin in mind?

alphadistribution · Jun 11, 2020

Correct, if you're scraping in R you don't need to worry about cross-origin settings.
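The same holds for any non-browser client. As a rough illustration in Python (standard library only), fetching a page is a plain HTTP request with no origin check on the client side. The function name `fetch_html` and the User-Agent string are placeholders invented for this sketch, and a given site may still block or throttle automated requests on its own terms.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body decoded as text."""
    # Some sites reject requests without a browser-like User-Agent;
    # the value here is an arbitrary example.
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Something like `html = fetch_html("https://example.com/")` would then hand you the raw HTML string to pass to a parser.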

blueraincap · Jun 11, 2020

Right, I just tried reading some HTML using Java's URLConnection.getInputStream() with no issue. It's just that getting plain HTML content off a site is difficult these days, with almost all sites generated by software. I wanted to try JS because I thought jQuery would make it easier to parse the DOM compared to a normal language.

alphadistribution · Jun 11, 2020

Makes sense. I'd check out some of the libraries that give you jQuery-like selectors within your preferred programming language.

Java: https://jsoup.org/
Node.js: https://github.com/jsdom/jsdom
Python: https://pypi.org/project/beautifulsoup4/
R: https://rdrr.io/cran/rvest/man/html_nodes.html

The basic pattern is to search Google for "<language> html parser css selector".
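For a feel of what those libraries do for you, here is a small standard-library sketch that pulls the text out of every occurrence of one tag. It uses Python's `html.parser` rather than a CSS-selector library, so it is much cruder than jsoup or Beautiful Soup; the class name `TagTextCollector`, the helper `extract_text` and the sample markup are all invented for the example.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text content of every element with the given tag name."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags we are currently inside
        self.texts = []     # finished text fragments, one per matched element
        self._buffer = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.texts.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_text(html: str, tag: str) -> list:
    """Return the text of each <tag> element found in the HTML string."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.texts


sample = '<h2><a href="#">First  headline</a></h2><p>body</p><h2>Second &amp; last</h2>'
print(extract_text(sample, "h2"))
```

With a selector library the same job is usually a one-liner along the lines of "select all `h2` elements", which is why those libraries are worth reaching for once the pages get more complicated.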

analytics · Jun 12, 2020

Any idea why a URL stream sometimes returns HTML and sometimes CSS?