I am working on a project for my company... I want to be able to use PHP to log in to a few different websites, then go to specific pages and parse the HTML. What would be the best way to do this? Once I get the HTML I'm good to go. They both use https and log in using a small FORM on the login page. From there, once the session is set, I'd like to browse to specific, already-known pages. PHP on Linux.
5/1/2007 3:20:26 PM
http://us2.php.net/manual/en/ref.curl.php ? [Edited on May 1, 2007 at 3:46 PM. Reason: then http://us.php.net/dom to parse through it]
5/1/2007 3:43:50 PM
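For reference, a rough sketch of what the curl route could look like: log in by POSTing the form, keep the session cookie, then GET the pages you already know. The URLs, form field names and cookie file path in the usage comment are made up for illustration; the real ones depend on each site's login form. Nothing here runs a request until you call the functions.

```php
<?php
// Sketch only: assumes the site uses a plain form POST and a session cookie.

// One cURL handle that remembers cookies between requests.
function make_session_handle(string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect many logins send
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies (also turns the cookie engine on)
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies back when the handle closes
        CURLOPT_SSL_VERIFYPEER => true,        // keep certificate checks on for https
    ]);
    return $ch;
}

// Encode form fields the way a browser submits them.
function encode_form(array $fields): string
{
    return http_build_query($fields);
}

// POST the login form; returns the response body, or false on failure.
function post_form($ch, string $url, array $fields)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, encode_form($fields));
    return curl_exec($ch);
}

// GET a page with the same handle, so the session cookie is sent along.
function get_page($ch, string $url)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back from POST to GET
    return curl_exec($ch);
}

// Hypothetical usage (placeholder URLs and field names):
// $ch   = make_session_handle('/tmp/site1_cookies.txt');
// post_form($ch, 'https://example.com/login', ['username' => 'me', 'password' => 'secret']);
// $html = get_page($ch, 'https://example.com/reports/daily');
```

Check the login form's HTML for the actual field names and any hidden inputs; those usually have to be posted too.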
curl! that's what i was trying to remember. thanks!
5/1/2007 3:49:03 PM
i've never used the DOM functions to parse html. i've always just written my own regular expressions. i knew there were various html parsing libraries and functions out there, but they seemed under-developed and kinda boxed you in on what you could do. are they that much better now? are they easy to work with, and flexible enough to correctly handle malformed html?
5/1/2007 4:10:55 PM
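On the malformed html question, a small sketch with PHP's DOM extension may help. The markup below is made up and deliberately sloppy (unclosed td and p tags); the idea is that libxml tries to repair it while loading, and libxml_use_internal_errors() keeps the parser warnings out of your output. How well it copes with a given real page is something you'd have to try.

```php
<?php
// Made-up, sloppy markup: the <td> and <p> tags are never closed.
$html = '<table><tr><td>cell one<td>cell two</tr></table><p>unclosed paragraph';

libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);             // libxml repairs what it can while parsing
$warnings = libxml_get_errors();   // inspect these if you want to see what was wrong
libxml_clear_errors();

// Walk the repaired tree just like with well-formed markup.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}
print_r($cells);
```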
i'm gonna parse using regexp
5/1/2007 8:01:01 PM
pagescraping is almost impossible to maintain once you're done... unless this is some sort of one-off tool or the like, I would highly suggest finding another solution to your problem. Perhaps there isn't one; just warning that it can be a real PITA.
5/1/2007 8:03:05 PM
it will be about 4 websites that don't really change except for certain numbers on the page
5/1/2007 10:33:17 PM
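If the pages really only change in a few numbers, the regexp route can stay small. A sketch, assuming each number sits right after a fixed label; the label, the markup and the helper name extract_number_after are invented for the example. Returning null when nothing matches gives you a cheap signal that the page layout changed.

```php
<?php
// Made-up snippet standing in for a fetched page.
$html = '<td class="lbl">Total orders:</td><td> 1,284 </td>';

// Find the number that follows a known label, skipping any tags in between.
// Returns null when the label or number is not found.
function extract_number_after(string $html, string $label): ?float
{
    $pattern = '/' . preg_quote($label, '/')
             . '\s*(?:<[^>]+>\s*)*(\d[\d,]*(?:\.\d+)?)/i';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    return (float) str_replace(',', '', $m[1]); // drop thousands separators
}

var_dump(extract_number_after($html, 'Total orders:'));
```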
I'd use PHP's HTTP request extension: http://www.php.net/manual/en/ref.http.php and, for what you're doing, specifically the HttpRequest class (http://www.php.net/manual/en/http.HttpRequest.php). gonna need pear tho (it's a PECL extension, installed with the pecl command).
5/2/2007 4:22:48 PM
anybody else use DOM packages to parse html?
6/29/2007 10:37:16 AM
ha, i was pagescraping espn.com and tsn.ca for hockey scores and news updates for a LONG time, until my server was hitting them about a billion times a minute (didn't know enough php/mysql at the time to cache that shit) and they blocked my traffic. much better ways to accomplish this stuff
6/29/2007 1:30:51 PM
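The caching part is worth spelling out, since it is what keeps a scraper from hammering the remote site. A minimal file-based sketch, assuming a short time-to-live is acceptable for the data; cached_fetch, the cache path and the TTL are all invented names and values, and $fetch stands for whatever actually downloads the page.

```php
<?php
// Return the cached copy if it is younger than $ttlSeconds;
// otherwise call $fetch once and store what it returns.
function cached_fetch(string $cacheFile, int $ttlSeconds, callable $fetch): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        return file_get_contents($cacheFile);
    }
    $fresh = $fetch();                               // the only place the remote site is hit
    file_put_contents($cacheFile, $fresh, LOCK_EX);  // lock so concurrent writers don't interleave
    return $fresh;
}

// Hypothetical usage: at most one real request every 5 minutes.
// $html = cached_fetch('/tmp/scores.html', 300, function () {
//     return file_get_contents('https://example.com/scores');
// });
```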
such as?
6/29/2007 4:50:45 PM
^^ did you have a better method?
7/20/2007 2:07:59 PM
a regular expression.
7/20/2007 2:22:27 PM
yeah, i've always used regular expressions, but the page i'm currently scraping has 7-deep nested tables with no IDs or anything and it's a PITA. javascript-style getElementsByTagName et al. are so much easier to use
7/20/2007 2:38:30 PM
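For the nested-tables case, a sketch of the DOM approach with DOMXPath on top. The markup and the team/score values are made up (only three levels deep to keep it short), and picking "tables that contain no other table" is just one assumed way to reach the innermost data without any IDs.

```php
<?php
// Hypothetical fragment: tables inside tables, no ids or classes anywhere.
$html = '<table><tr><td>
           <table><tr><td>
             <table>
               <tr><td>Team</td><td>Score</td></tr>
               <tr><td>Canes</td><td>3</td></tr>
             </table>
           </td></tr></table>
         </td></tr></table>';

libxml_use_internal_errors(true);  // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows  = [];
// Innermost tables only: a table with no table nested inside it.
foreach ($xpath->query('//table[not(.//table)]//tr') as $tr) {
    $row = [];
    foreach ($xpath->query('td', $tr) as $td) {   // direct td children of this row
        $row[] = trim($td->textContent);
    }
    $rows[] = $row;
}
print_r($rows);
```

The same loop works with getElementsByTagName('table') if you'd rather count your way down to the table you want by index.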