so i have a database populated with partner organizations, groups that help fund our research. i'm trying to automate a registration process for our intranet, but my regex skills are lacking.

what i'd like to do is pull the organization's website out of the database, strip out the http:// and www. if they exist, and compare the remaining string against the registrant's email address to see if they're one of our supporters. pulling the information isn't really hard, nor is the comparison, but i'm running into problems based on the websites.

for example, MIT is a partner and the website we have on file is http://www.mit.edu/ (this is exactly as it appears in the database). i'd like to strip all except for the mit.edu (exactly) and compare that against their email address (username@mit.edu).

that's great, but we have funding partners that also might have international addresses like .co.uk or .com.au or us.company.com/ca.company.com, so i can't just write a generic script that strips out everything except the domain and the block of text preceding it (i mean, that would work great for the us.company.com/ca.company.com example, but not for the others).

suggestions? any of you PHP regex gurus want to give me some pointers?
6/8/2009 12:08:19 PM
$string = preg_replace('#^(https?://)?([w]{3}\.)?([^/]+)(.*)#', "$3", $orgString);
[Edited on June 8, 2009 at 12:16 PM. Reason : works and now tested!]
6/8/2009 12:12:28 PM
remind me what this [^/]+ does?
6/8/2009 12:18:09 PM
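It matches a run of one or more characters that aren't forward slashes, so it grabs everything up to the first / (which is the host once the scheme and www. are out of the way).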
6/8/2009 12:22:05 PM
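For anyone following along, here is a minimal sketch of that pattern wrapped in a function and run against a few stored URLs. The function name stripToHost and every sample URL other than http://www.mit.edu/ are made up for illustration, and [w]{3} is written as www, which matches the same text.

```php
<?php
// Hypothetical helper: reduce a stored website URL to its bare host.
function stripToHost($url)
{
    // (https?://)?  optional scheme
    // (www\.)?      optional leading "www."
    // ([^/]+)       the host: one or more characters that are not "/"
    // (.*)          whatever path follows, which is discarded
    return preg_replace('#^(https?://)?(www\.)?([^/]+)(.*)#', '$3', trim($url));
}

echo stripToHost('http://www.mit.edu/'), "\n";          // mit.edu
echo stripToHost('https://us.company.com/about'), "\n"; // us.company.com
echo stripToHost('www.example.co.uk'), "\n";            // example.co.uk
```

Note that this only removes the scheme, the www. prefix, and the path. It makes no attempt to decide which part of us.company.com or example.co.uk is the "real" domain.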
that's actually not what i was thinking of, but then i had a duh moment and realized it was easier to do that, split their email address at the "@", and then use in_array to compare their email domain against an array of the authorized domains.

while i'm in this thread, might as well ask my next question. i've never had to do this before, but how does one pass session variables out of a function that redirects to another page? i've got a redirect function that takes in a generated error message and a redirect URL, assigns the message to a session variable, and then redirects via the provided URL. the function is what requires the special case, since if i use the code outside of a function in each instance it works swimmingly.

page1.php
<?php
function redirect($message, $link) {
    $_SESSION['error'] = $message;
    session_write_close();
    header("location:$link");
    exit();
}

$error = "Oh, bollocks.";
$gohere = "page2.php";
redirect($error, $gohere);
?>

page2.php
<?php
echo $_SESSION['error'];
?>
6/10/2009 10:36:45 AM
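The "split at the @ and use in_array" idea from this post could be sketched roughly like this. The function name isPartnerEmail and the $partners list are assumptions for illustration, not the poster's actual code or data.

```php
<?php
// Hypothetical helper: is the address's domain in the list of partner domains?
function isPartnerEmail($email, array $authorizedDomains)
{
    $at = strrpos($email, '@');          // split at the last "@"
    if ($at === false) {
        return false;                    // no "@", so not an email address
    }
    $domain = strtolower(substr($email, $at + 1));
    // strict comparison against the lower-cased partner domains
    return in_array($domain, array_map('strtolower', $authorizedDomains), true);
}

$partners = array('mit.edu', 'sponsor.com');   // sample data only
var_dump(isPartnerEmail('username@mit.edu', $partners));        // bool(true)
var_dump(isPartnerEmail('username@us.sponsor.com', $partners)); // bool(false)
```

The second call returns false because in_array only does an exact match, so an address on a subdomain of a partner domain is not recognized.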
global $_SESSION['error'] = $message;

good job on not using session_register(), but if you set a session var from within a function you need to make sure it goes to the global namespace.

also, make sure you're calling session_start() before anything else on those pages if you're using cookies to handle the SIDs
6/10/2009 10:41:22 AM
^ i tried using global before, but i got this, so i assumed that wasn't it:
6/10/2009 10:54:03 AM
hm, i stand corrected
6/10/2009 11:11:56 AM
Ignore evan. Sessions are superglobal, so $_SESSION is already available inside functions without the global keyword.

I've never used session_write_close(), but the manual page does show some people having issues similar to yours. http://us2.php.net/manual/en/function.session-write-close.php#86791 seems to be a solution to your problem.
6/10/2009 11:14:12 AM
ah, session_regenerate_id(true) is what was needed to get it to pass those session variables

thanks, y'all
6/10/2009 11:16:01 AM
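Since $_SESSION is a superglobal, a function can write to it directly. A minimal sketch of the "set a message, redirect, show it once" pattern follows. The names setFlash, takeFlash, and redirectWithMessage are hypothetical, and the sketch assumes both pages call session_start() before sending any output. Whether session_regenerate_id(true) is also needed seems to depend on the setup, as it was here.

```php
<?php
// Assumption: each page starts the session before any output, for example
//   if (session_status() === PHP_SESSION_NONE) { session_start(); }

// Store a one-time message. $_SESSION is a superglobal, so no "global" is needed.
function setFlash($message)
{
    $_SESSION['error'] = $message;
}

// Read the message once and clear it so a reload does not show it again.
function takeFlash()
{
    $message = isset($_SESSION['error']) ? $_SESSION['error'] : null;
    unset($_SESSION['error']);
    return $message;
}

// Hypothetical redirect helper for page1.php.
function redirectWithMessage($message, $link)
{
    setFlash($message);
    session_write_close();          // flush session data before leaving the page
    header('Location: ' . $link);
    exit();
}

// page1.php would call: redirectWithMessage('Oh, bollocks.', 'page2.php');
// page2.php would call: echo htmlspecialchars((string) takeFlash());
```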
^^ it wasn't in previous versions. i rarely use sessions, so i wasn't aware of this. sorry.

also, sounds like this is your solution:
6/10/2009 11:17:36 AM
6/10/2009 11:33:17 AM
which is right around the time i started learning php. i learned on 3, 4 wasn't out then. i haven't had a need to register session variables as i do it a different way, so, yeah.

btw, you actually were being a dick.
6/10/2009 11:36:01 AM
Previous versions being PHP 3.0.
[Edited on June 10, 2009 at 11:37 AM. Reason : ha I was just joshin']
[Edited on June 10, 2009 at 11:40 AM. Reason : and ha super globals were introduced in 4.1, late 2001]
6/10/2009 11:37:00 AM
is there a better way (more concise or best practice) to clean up file names than this (strip everything but alphanumeric, including periods, but keep the file extension and last period)?
function cleanFilename($str) {
    $str = strtolower(trim(basename($str)));

    // gets file extension
    $i = strrpos($str, ".");
    if (!$i) { return ""; }
    $l = strlen($str) - $i;
    $ext = substr($str, $i + 1, $l);

    // replaces characters
    $pos = strrpos($str, "."); // position of last . in string (strpos does the first)
    $str = preg_replace("/[^a-zA-Z0-9\s]/", "", substr($str, 0, $pos)); // remove all non-alphanumeric characters before last . in string
    $str = preg_replace("/\s+/", "_", $str); // compress internal whitespace and replace with _
    $str = preg_replace("/\W-/", "", $str); // remove all non-alphanumeric characters except _ and -
    return $str . "." . $ext;
}
6/19/2009 10:51:49 AM
I would go with the version below:
function cleanFilename($str) {
    $str = basename($str);
    $fileExtensionPosition = strrpos($str, ".");
    if ($fileExtensionPosition) {
        // strip everything except letters, digits, and whitespace
        $patterns[0] = '/[^a-zA-Z0-9\s]/';
        $replacements[0] = '';
        // collapse each run of whitespace into a single underscore
        $patterns[1] = '/\s+/';
        $replacements[1] = '_';
        $fileName = preg_replace($patterns, $replacements, substr($str, 0, $fileExtensionPosition));
        $fileExtension = substr($str, $fileExtensionPosition);
        return $fileName . $fileExtension;
    }
    return false; // no extension found (or the name starts with a period)
}
6/19/2009 4:30:52 PM
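Another way to write the same cleanup, offered as a sketch and not as the definitive answer: lower-case the name, split at the last period, scrub each side, and bail out when either side ends up empty. The name cleanFilenameSketch and the sample path are made up for illustration.

```php
<?php
// Hypothetical variant of cleanFilename(); returns false when there is no usable extension.
function cleanFilenameSketch($path)
{
    $name = strtolower(trim(basename($path)));
    $dot = strrpos($name, '.');
    if ($dot === false || $dot === 0) {
        return false;                      // no period, or a dotfile such as ".htaccess"
    }
    $ext  = preg_replace('/[^a-z0-9]/', '', substr($name, $dot + 1));
    $base = preg_replace('/[^a-z0-9\s]/', '', substr($name, 0, $dot)); // also drops inner periods
    $base = preg_replace('/\s+/', '_', trim($base));                   // whitespace runs become "_"
    if ($ext === '' || $base === '') {
        return false;                      // nothing left on one side of the period
    }
    return $base . '.' . $ext;
}

echo cleanFilenameSketch('/uploads/My Cool  File (v2).final.JPG'), "\n"; // my_cool_file_v2final.jpg
```

Comparing strrpos() against false explicitly keeps "no period at all" separate from "period at position 0", which a plain truthiness check lumps together.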
okay, so now my question is related to my first post in this thread. similar situation, but again my regex skills are lacking.

we might have on record http://www.sponsor.com/ and i can get just the sponsor.com (which is what i want), but i've just come across a case where the email address of the user is something like username@us.sponsor.com, so that when i do the compare, it tries to compare us.sponsor.com to sponsor.com and it fails (obviously).

i could do a reverse compare (where i check for sponsor.com inside us.sponsor.com), but i'm trying to avoid that. what i want to do is take the user's domain from their email address (us.sponsor.com) and strip out everything EXCEPT sponsor.com. so if their email was username@we.are.a.sponsor.com or username@us.sponsor.com, regardless of the number of subdomains, it will always return JUST sponsor.com.

suggestions?
9/8/2009 3:32:22 PM
$whatever = preg_replace('#(.*(\.|@))?([^\.]+\.[^\.]+)$#', "$3", $whatever);
[Edited on September 8, 2009 at 3:40 PM. Reason : there we go]
9/8/2009 3:36:20 PM
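One caveat worth spelling out: keeping only the last two labels works for sponsor.com but collapses something like mail.example.co.uk down to co.uk, which is the international case raised in the first post. The sketch below shows that behavior with a plain explode() version, plus a suffix comparison as one possible alternative. Both function names and the sample domains are hypothetical.

```php
<?php
// Keep only the last two dot-separated labels of a host name.
function lastTwoLabels($host)
{
    $labels = explode('.', strtolower($host));
    return implode('.', array_slice($labels, -2));
}

// Alternative: accept the partner domain itself or any subdomain of it.
function isSameOrSubdomain($emailDomain, $partnerDomain)
{
    $emailDomain = strtolower($emailDomain);
    $partnerDomain = strtolower($partnerDomain);
    if ($emailDomain === $partnerDomain) {
        return true;
    }
    $suffix = '.' . $partnerDomain;
    // the leading "." stops "notsponsor.com" from matching "sponsor.com"
    return substr($emailDomain, -strlen($suffix)) === $suffix;
}

echo lastTwoLabels('we.are.a.sponsor.com'), "\n";  // sponsor.com
echo lastTwoLabels('mail.example.co.uk'), "\n";    // co.uk
var_dump(isSameOrSubdomain('mail.example.co.uk', 'example.co.uk')); // bool(true)
var_dump(isSameOrSubdomain('notsponsor.com', 'sponsor.com'));       // bool(false)
```

The suffix check is a form of the "reverse compare", but anchoring it on a leading period avoids the false positives a bare substring search would allow.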
^ thanks! i really need to brush up on my regex
9/8/2009 3:41:49 PM
http://www.sellsbrothers.com/tools/#regexd is a great little tool for testing out regex btw
[Edited on September 8, 2009 at 3:46 PM. Reason : it's built using .net regex, which is mostly the same as php. but still helpful]
9/8/2009 3:44:30 PM
^ that's actually pretty cool...thanks for the heads up
9/9/2009 7:59:21 AM
Bump
4/27/2011 9:35:25 AM
i suck at regex...i have this function to automatically parse text for email addresses:
function emailit($str) {
    $regex = '/(\S+@\S+\.\S+)/i';
    $replace = "<a href='mailto:$1'>$1</a>";
    $str = preg_replace($regex, $replace, $str);
    return $str;
    // never reached: the function has already returned above
    $str = preg_match($regex, $str);
    return $str;
}
blah blah blah myemailgoeshere@fakemail.com blah blah blah
blah blah blah <a href="mailto:myemailgoeshere@fakemail.com">myemailgoeshere</a> blah blah blah
4/27/2011 9:40:50 AM
Why are you parsing the presentation layer?
4/27/2011 10:37:42 AM
$regex = '/([^"\'\s]+@\S+\.[^"\'\s]+)/i';
4/27/2011 10:46:17 AM
preg_match_all('#[a-zA-Z0-9\-_\.]+@[a-zA-Z0-9\-_\.]+#', $testString, $matches);

You can make it more specific if you really want by adding something that ensures the back half actually has a valid domain, but I mean, this will basically work.
[Edited on April 27, 2011 at 11:06 AM. Reason : .]
4/27/2011 11:05:22 AM
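A sketch of what "more specific" might look like: require at least one dot-separated label after the @ and a final label of two or more letters. The function name findEmails is made up, and the pattern is deliberately loose. It is nowhere near a full RFC-style validator.

```php
<?php
// Hypothetical helper: return every email-looking token found in a string.
function findEmails($text)
{
    // local part: letters, digits, ".", "_", "-"
    // domain: one or more labels followed by a final label of 2+ letters
    $pattern = '#[a-zA-Z0-9._-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}#';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];   // element 0 holds the full matches
}

print_r(findEmails('blah blah blah myemailgoeshere@fakemail.com blah blah blah'));
```

Because the final label must be letters only, a sentence-ending period right after the address is left out of the match.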
^^^ i'm not...not exactly, anyway
^^ that did it...thxu
[Edited on April 27, 2011 at 11:07 AM. Reason : carats]
4/27/2011 11:07:17 AM
So you're screen-scraping. Just because it's not your presentation layer doesn't mean it's not the presentation layer.
4/27/2011 11:14:09 AM
i'm working with pre-existing data and i'm trying to clean it up to serve my purposes

once again, contributing to a thread by not contributing to it...thanks for your input
4/27/2011 12:24:52 PM
actually, BigMan157, that didn't do it...at least, it takes care of the condition i mentioned, but now the other condition is ignored
4/27/2011 1:02:45 PM
The first question you should always ask yourself is whether there's a better approach than the one that has led you to the problem you're currently dealing with. I don't know why that is so hard for you to appreciate.

If you're wanting to scrape e-mail addresses out of HTML and you're using PHP, why don't you just strip out the HREF attributes of any A elements in the document prior to parsing for e-mail addresses? Jesus.
[Edited on April 27, 2011 at 1:10 PM. Reason : The idea being that PHP has easy-to-use DOM traversal and manipulation.]
4/27/2011 1:07:50 PM
okay, i'll bite

database entry is exactly this (minus any changes tww's crazy code makes):
My name is Bob. My email address is bob@email.com.
4/27/2011 1:19:32 PM
You don't have control over what's in the database record, but what do you expect to find in a record? Is it reasonably reliable that if the record contains a mailto link, the closing tag will be included? If so, you could just run strip_tags() on it prior to regexing for an e-mail address. That would obviate the need to test for an e-mail address in an anchor's HREF entirely.
[Edited on April 27, 2011 at 1:26 PM. Reason : ...]
4/27/2011 1:26:07 PM
reasonably, yes...but the content isn't always that (it was just an example, though a realistic one)

sometimes there will be HTML character entities and sometimes the tags will be encoded as their entity name/number

my thought is to create a function to convert all entity names/numbers to their character and then search for URLs and email addresses to convert to their appropriate links
4/27/2011 1:31:29 PM
You could run html_entity_decode() then strip_tags() then run your regex, then. The first function shouldn't hurt anything if there aren't any HTML character references in the input string.
4/27/2011 1:35:17 PM
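The decode, strip, then match sequence suggested above could be sketched like this. The function name extractEmailsFromHtml and the email pattern are assumptions for illustration. The three built-in calls (html_entity_decode, strip_tags, preg_match_all) are standard PHP.

```php
<?php
// Hypothetical helper: decode entities, drop markup, then collect unique addresses.
function extractEmailsFromHtml($html)
{
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8'); // "&lt;p&gt;" becomes "<p>"
    $plain = strip_tags($decoded);                              // removes tags and their attributes
    preg_match_all('#[a-zA-Z0-9._-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}#', $plain, $matches);
    return array_values(array_unique($matches[0]));
}

$encoded = '&lt;p&gt;My email address is &lt;a href="mailto:fred@email.com"&gt;fred@email.com&lt;/a&gt;.&lt;/p&gt;';
print_r(extractEmailsFromHtml($encoded));
```

One tradeoff of this order of operations: an address that appears only inside an href (with link text such as "here") disappears along with the tag, so this suits extraction better than in-place linking.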
i suppose i'm not sure what that will accomplish

html_entity_decode() is obvious, and i'm doing that already...but why would i WANT to strip tags? i want to keep them there (and yes, i realize i could except certain tags, but i want to keep them all)
4/27/2011 1:39:36 PM
Maybe I'm not fully understanding the issue. I thought you were saying your regex wasn't working as expected when the input string contained an anchor with a mailto HREF. Are you actually trying to replace instances of e-mail addresses in your input with a different e-mail address?
4/27/2011 1:43:13 PM
okay, i'll try to do a better job of explaining...below are possible entries:
My name is Bob. My email address is bob@email.com
My name is John. You can email me <a href="mailto:john@email.com">here</a>.
<p>My name is Fred. My email address is <a href="mailto:fred@email.com">fred@email.com</a>.</p>
<p>My name is Mary.<br /><br />You should email me at mary@email.com!</p>
My name is Anna. My website is http://www.mywebsite.com/.
4/27/2011 1:50:40 PM
^ So those are the possible entries, but I'm not sure what you want:
1. to strip out everything except the email address and return it, or
2. replace the email address with a mailto tag and return that?
[Edited on April 27, 2011 at 2:03 PM. Reason : ]
4/27/2011 1:56:57 PM
Okay, so for e-mail addresses, couldn't you just include the colon as a potential starting character?
4/27/2011 2:04:02 PM
nevermind, i think i've got it all in one function now:
function linkylinky($str) {
    $str = ' ' . $str; // pad so a match at the very start still has a leading character
    // scheme://... URLs
    $str = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t<]*)#is", "\\1<a href=\"\\2\" >\\2</a>", $str);
    // bare www. or ftp. hosts
    $str = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r<]*)#is", "\\1<a href=\"http://\\2\" >\\2</a>", $str);
    // email addresses
    $str = preg_replace("#(^|[\n ])([a-z0-9&\-_\.]+?)@([\w\-]+\.([\w\-\.]+\.)*[\w]+)#i", "\\1<a href=\"mailto:\\2@\\3\">\\2@\\3</a>", $str);
    $str = substr($str, 1); // drop the padding space
    return $str;
}
4/27/2011 2:09:37 PM
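The part of that function that keeps already-linked addresses from being wrapped a second time is the leading (^|[\n ]) group: an address only counts when it sits at the start of the string or right after whitespace. A reduced sketch of just the email step, run against the sample entries posted earlier, might look like this. The function name linkEmailsSketch and the simplified pattern are assumptions, not the poster's code.

```php
<?php
// Hypothetical, email-only version of the linking step.
function linkEmailsSketch($text)
{
    // (^|\s) requires start-of-string or whitespace before the address,
    // so "mailto:fred@..." and ">fred@..." inside existing links are skipped.
    $pattern = '#(^|\s)([a-z0-9._-]+@(?:[a-z0-9-]+\.)+[a-z]{2,})#i';
    return preg_replace($pattern, '$1<a href="mailto:$2">$2</a>', $text);
}

echo linkEmailsSketch('My name is Bob. My email address is bob@email.com'), "\n";
echo linkEmailsSketch('<p>My name is Fred. My email address is <a href="mailto:fred@email.com">fred@email.com</a>.</p>'), "\n";
```

The first line gets a new mailto link. The second is returned unchanged because neither occurrence of the address follows whitespace. An address that directly follows a tag, such as <br />mary@email.com, would also be skipped, which may or may not be what is wanted.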
This question relates to the second post, but just out of curiosity, why would you use the session header to pass a message between pages instead of a form?
4/27/2011 2:26:14 PM
oh, that was a long time ago
page1 (front-end): form, submit to page2
page2 (back-end): process form variables, generate message (success or fail)
page3 (front-end): display message
[Edited on April 27, 2011 at 2:49 PM. Reason : is there something wrong with that process?]
4/27/2011 2:48:58 PM
It's just sort of a silly way to do it if you're not passing anything you're planning on displaying.
4/27/2011 2:51:25 PM
what do you mean? the message is displayed
4/27/2011 2:57:00 PM
You made it sound like you're just sending "Success" or "Failed", which is something you could just pass in the URL and then use an if statement to actually display whatever message you wanted to show.

If that's the case, you're creating additional server overhead using a session for no real reason.

Now if you're transmitting a whole error message, like "The operation failed for X, Y, Z" reason, that's a different story.
4/27/2011 3:03:12 PM
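A rough sketch of the pass-it-in-the-URL approach, using a lookup table where the post suggests an if statement. The function name messageForStatus, the status parameter name, and the message wording are all made up for illustration.

```php
<?php
// Hypothetical lookup for page3.php: map a short status code from the query
// string to display text, so no session is needed to carry the message.
function messageForStatus($status)
{
    $messages = array(
        'success' => 'Your registration was submitted.',
        'failed'  => 'Something went wrong. Please try again.',
    );
    if (!is_string($status)) {
        return '';                  // e.g. ?status[]=x arrives as an array
    }
    // unknown or missing codes fall back to an empty message
    return isset($messages[$status]) ? $messages[$status] : '';
}

// page2.php would redirect with: header('Location: page3.php?status=success');
// page3.php would display:
$status = isset($_GET['status']) ? $_GET['status'] : '';
echo htmlspecialchars(messageForStatus($status));
```

Only a fixed set of codes maps to text, so nothing the visitor types into the URL is echoed back to the page.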
4/27/2011 3:08:37 PM