PHP regex question
quagmire02
All American
44225 Posts

so i have a database populated with partner organizations, groups that help fund our research...i'm trying to automate a registration process for our intranet, but my regex skills are lacking...what i'd like to do is pull the organization's website out of the database, strip out the http:// and www. if they exist, and compare the remaining string against the registrant's email address to see if they're one of our supporters...pulling the information isn't really hard, nor is the comparison, but i'm running into problems based on the websites

for example, MIT is a partner and the website we have on file is http://www.mit.edu/ (this is exactly as it appears in the database)...i'd like to strip all except for the mit.edu (exactly) and compare that against their email address (username@mit.edu)

that's great, but we have funding partners that also might have international addresses like .co.uk or .com.au or us.company.com/ca.company.com so i can't just write a generic script that strips out everything except the domain and the block of text preceding it (i mean, that would work great for the us.company.com/ca.company.com example, but not for the others)

suggestions? any of you PHP regex gurus want to give me some pointers?

6/8/2009 12:08:19 PM

Stein
All American
19842 Posts

$string = preg_replace('#^(https?://)?([w]{3}\.)?([^/]+)(.*)#', "$3", $orgString);

[Edited on June 8, 2009 at 12:16 PM. Reason : works and now tested!]

6/8/2009 12:12:28 PM

Fail Boat
Suspended
3567 Posts

remind me what this

[^/]+

does?

6/8/2009 12:18:09 PM

Stein
All American
19842 Posts

Matches one or more characters that are not forward slashes (the + means at least one).
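For reference, here is a commented breakdown of the pattern from the earlier reply. It is only a sketch: the delimiter is switched from # to ~ so that # can start comments under the x modifier, and the variable names $pattern and $domain are illustrative rather than taken from the thread.

```php
<?php
// Same pattern as posted above, spread over several lines with the x modifier.
$pattern = '~^
    (https?://)?   # group 1: optional scheme
    ([w]{3}\.)?    # group 2: optional "www."
    ([^/]+)        # group 3: host, one or more characters that are not "/"
    (.*)           # group 4: the rest of the URL, which is discarded
~x';

// Replace the whole URL with group 3 only.
$domain = preg_replace($pattern, '$3', 'http://www.mit.edu/');
```

With the example URL from the first post, $domain ends up as mit.edu.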

6/8/2009 12:22:05 PM

quagmire02
All American
44225 Posts

that's actually not what i was thinking of, but then had a duh moment and realized it was easier to do that, split their email address at the "@", and then use in_array to compare their email domain against an array of the authorized domains
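A rough sketch of that approach, assuming the stored websites look like the examples in this thread. The helper names siteToDomain and emailToDomain and the sample $sites list are made up for illustration, and parse_url() is swapped in for the regex to pull out the host.

```php
<?php
// Hypothetical helper: reduce a stored website URL to a bare host name.
function siteToDomain($url) {
    $url = trim($url);
    // parse_url() only reports a host when a scheme is present, so add one if missing
    if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null;
    }
    // drop a leading "www." only; other subdomains are kept as they are
    return preg_replace('/^www\./', '', strtolower($host));
}

// Hypothetical helper: everything after the last "@", lowercased.
function emailToDomain($email) {
    $at = strrpos($email, '@');
    return $at === false ? null : strtolower(substr($email, $at + 1));
}

// Sample data standing in for the database rows.
$sites = array('http://www.mit.edu/', 'www.company.co.uk', 'https://us.company.com/about');
$authorized = array_filter(array_map('siteToDomain', $sites));

$domain = emailToDomain('username@mit.edu');
$isPartner = $domain !== null && in_array($domain, $authorized, true);
```

Because the whole host is compared, suffixes like .co.uk need no special handling here; the cost is that an address at a subdomain not on file will not match.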

while i'm in this thread, might as well ask my next question...i've never had to do this before, but how does one pass session variables out of a function that redirects to another page? i've got a redirect function that takes in a generated error message and a redirect URL, assigns the message to a session variable, and then redirects via the provided URL...the function is what requires the special case, since if i use the code outside of a function in each instance it works swimmingly

page1.php

<?php
session_start(); // must run before any output so the session exists

function redirect($message,$link) {
    $_SESSION['error'] = $message;
    session_write_close();
    header("location:$link");
    exit();
}
$error = "Oh, bollocks.";
$gohere = "page2.php";
redirect($error,$gohere);
?>

page2.php
<?php
session_start(); // resume the session started on page1.php
echo $_SESSION['error'];
?>

6/10/2009 10:36:45 AM

evan
All American
27701 Posts

global $_SESSION['error'] = $message;

good job on not using session_register(), but if you set a session var from within a function you need to make sure it goes to the global namespace.

also, make sure you're calling session_start() before anything else on those pages if you're using cookies to handle the SIDs

6/10/2009 10:41:22 AM

quagmire02
All American
44225 Posts

^ i tried using global before, but i got this, so i assumed that wasn't it:

Quote :
"Parse error: syntax error, unexpected '[', expecting ',' or ';'"


also, i'm calling session_start() on both pages

6/10/2009 10:54:03 AM

evan
All American
27701 Posts

hm, i stand corrected

Quote :
"Note: As of PHP 4.1.0, $_SESSION is available as a global variable just like $_POST, $_GET, $_REQUEST and so on. Unlike $HTTP_SESSION_VARS, $_SESSION is always global. Therefore, you do not need to use the global keyword for $_SESSION."


are you sure it's actually registering a session?

try using the URL method instead:

header("location:" . $link . "?" . SID);

see if it sticks a long md5 hash in the URL.

6/10/2009 11:11:56 AM

Stein
All American
19842 Posts

Ignore evan. Sessions are superglobal.

I've never used session_write_close() but the manual page does show some people having similar issues as you're having. http://us2.php.net/manual/en/function.session-write-close.php#86791 seems to be a solution to your problem.

6/10/2009 11:14:12 AM

quagmire02
All American
44225 Posts

ah, session_regenerate_id(true) is what was needed to get it to pass those session variables

thanks, y'all
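Pulling the pieces of this exchange together, a minimal sketch of a one-time message passed through the session might look like the following. The function names setFlash, takeFlash and redirectWithMessage are hypothetical, and the code assumes session_start() has already been called at the top of each page.

```php
<?php
// Store the message for the next page.
function setFlash($message) {
    $_SESSION['error'] = $message;
}

// Read the message and clear it so it is shown only once.
function takeFlash() {
    if (!isset($_SESSION['error'])) {
        return null;
    }
    $message = $_SESSION['error'];
    unset($_SESSION['error']);
    return $message;
}

// Set the message, then redirect. Assumes session_start() already ran.
function redirectWithMessage($message, $link) {
    setFlash($message);
    session_regenerate_id(true); // the call reported above to make the data stick
    session_write_close();       // flush session data before redirecting
    header('Location: ' . $link);
    exit();
}
```

On the receiving page, something like `$msg = takeFlash();` after session_start() would return the message once and null on later requests.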

6/10/2009 11:16:01 AM

evan
All American
27701 Posts

^^it wasn't in previous versions. i rarely use sessions, so i wasn't aware of this. sorry.

also, sounds like this is your solution:

Quote :
"I was having the same problem as many here regarding setting session data just before a header location redirect and having the session data just not be there. I tried everything people here said, and none of their combinations worked. What did finally work for me was to fire off a session_regenerate_id(true) call just prior to the header() and die() calls.

session_regenerate_id(true);
header('location: blah blah');
die();

Without the regenerate id call, the write close did not seem to do anything. session_write_close() doesn't seem to matter at all. It certainly didn't fix anything on its own for me.

This is a rather annoying issue with php sessions that I've never run into before. I store my sessions to /dev/shm (which is RAM) so file IO blocking can't be the problem. Now I'm nervous that some other session data might not be getting updated prior to a header() location change, which is extremely important and common in any web app."


[Edited on June 10, 2009 at 11:17 AM. Reason : heh, never mind, you already saw it]

6/10/2009 11:17:36 AM

Stein
All American
19842 Posts

Quote :
"^^it wasn't in previous versions. i rarely use sessions, so i wasn't aware of this. sorry.
"


Not to be a dick, but it's been like this for almost 7.5 years.

6/10/2009 11:33:17 AM

evan
All American
27701 Posts

which is right around the time i started learning php. i learned on 3, 4 wasn't out then. i haven't had a need to register session variables as i do it a different way, so, yeah.

btw, you actually were being a dick.

6/10/2009 11:36:01 AM

Ernie
All American
45943 Posts

Previous versions being PHP 3.0.

[Edited on June 10, 2009 at 11:37 AM. Reason : ha I was just joshin']

[Edited on June 10, 2009 at 11:40 AM. Reason : and ha super globals were introduced in 4.1, late 2001]

6/10/2009 11:37:00 AM

quagmire02
All American
44225 Posts

is there a better way (more concise or best practice) to clean up file names than this (strip everything but alphanumeric, including periods, but keep the file extension and last period)?

function cleanFilename($str) {
$str = strtolower(trim(basename($str)));
// gets file extension
$i = strrpos($str,".");
if(!$i){return "";}
$l = strlen($str)-$i;
$ext = substr($str,$i+1,$l);
// replaces characters
$pos = strrpos($str,"."); // position of last . in string (strpos does the first)
$str = preg_replace("/[^a-zA-Z0-9\s]/","",substr($str,0,$pos)); // remove all non-alphanumeric characters before last . in string
$str = preg_replace("/\s+/","_",$str); // compress internal whitespace and replace with _
$str = preg_replace("/\W-/","",$str); // remove all non-alphanumeric characters except _ and -
return $str.".".$ext;
}


[Edited on June 19, 2009 at 10:53 AM. Reason : .]

6/19/2009 10:51:49 AM

Noen
All American
31346 Posts

I would go with below:


function cleanFilename ($str) {
    $str = basename($str);
    $fileExtensionPosition = strrpos($str, ".");

    // note: a leading-dot name such as ".htaccess" gives position 0 and is rejected too
    if($fileExtensionPosition) {
        $patterns[0] = '/[^a-zA-Z0-9\s]/'; // strip everything except alphanumerics and whitespace
        $replacements[0] = '';
        $patterns[1] = '/\s+/'; // collapse each run of whitespace into a single _
        $replacements[1] = '_';
        $fileName = preg_replace($patterns, $replacements, substr($str,0,$fileExtensionPosition));
        $fileExtension = substr($str,$fileExtensionPosition);
        return $fileName.$fileExtension;
    }
    return false;
}


Differences:

-extension check is the primary pass/fail logic of the function, it should be blocked to prevent boundary return conditions.
-offload as much computation as possible until after you do the extension check.
-string length is an optional parameter of substr, you can kill all that
-why chop out the "." of the ext, and then manually reinsert it, when you aren't doing any transforms on the extension?
-grouped the reg expressions into arrays (best practice)
-slight tweak to your whitespace replace, to also remove redundant whitespace ( _____ goes to _); note that this needs \s+ rather than \s\s+, which would leave single spaces untouched
-the last replace should be redundant, as you already removed *all* non alphanumeric characters in the first replacement, then inserted _'s.
-variable naming, return value for failure should never be "", as it doesn't tell you if that's a false return, or an empty filename (ie: --- is a valid filename, but would return as "" in your function.)
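As a further variation on the points above, here is a sketch that lets pathinfo() split the name from the extension. The function name cleanFilenameSketch is hypothetical, and it makes two assumptions of its own that differ from the code above: names are lowercased (as in the original question), and _ and - are kept in the name.

```php
<?php
function cleanFilenameSketch($path) {
    // pathinfo() splits at the last "." into 'filename' and 'extension'
    $info = pathinfo(strtolower(trim(basename($path))));

    // no extension at all: signal failure with false rather than ""
    if (!isset($info['extension']) || $info['extension'] === '') {
        return false;
    }

    // whitespace runs become a single _, then anything outside a-z 0-9 _ - is dropped
    $name = preg_replace('/\s+/', '_', trim($info['filename']));
    $name = preg_replace('/[^a-z0-9_\-]/', '', $name);
    $ext  = preg_replace('/[^a-z0-9]/', '', $info['extension']);

    if ($name === '' || $ext === '') {
        return false;
    }
    return $name . '.' . $ext;
}
```

For instance, `cleanFilenameSketch('/tmp/My  Report (final).v2.PDF')` gives my_report_finalv2.pdf, while a name with no extension gives false.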

[Edited on June 19, 2009 at 4:33 PM. Reason : .]

6/19/2009 4:30:52 PM

quagmire02
All American
44225 Posts

okay, so now my question is related to my first post in this thread...similar situation, but again my regex skills are lacking

we might have on record http://www.sponsor.com/ and i can get just the sponsor.com (which is what i want), but i've just come across a case where the email address of the user is something like username@us.sponsor.com so that when i do the compare, it tries to compare us.sponsor.com to sponsor.com and it fails (obviously)

i could do a reverse compare (where i check for sponsor.com inside us.sponsor.com), but i'm trying to avoid that...what i want to do is take the user's domain from their email address (us.sponsor.com) and strip out everything EXCEPT sponsor.com...so if their email was username@we.are.a.sponsor.com or username@us.sponsor.com or regardless of the number of subdomains, it will always return JUST sponsor.com

suggestions?

9/8/2009 3:32:22 PM

Stein
All American
19842 Posts

$whatever = preg_replace('#(.*(\.|@))?([^\.]+\.[^\.]+)$#', "$3", $whatever);

[Edited on September 8, 2009 at 3:40 PM. Reason : there we go]
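The regex above keeps the last two labels, which covers sponsor.com but not the .co.uk and .com.au cases raised at the start of the thread. A sketch of one way to handle both follows; the function name baseDomain and the short $twoPartSuffixes list are illustrative assumptions only, and a complete solution would need a full list such as the Public Suffix List.

```php
<?php
// Hypothetical helper: keep the registrable part of a host name.
function baseDomain($host, array $twoPartSuffixes = array('co.uk', 'com.au')) {
    $labels = explode('.', strtolower(trim($host, ". \t\n\r")));
    if (count($labels) <= 2) {
        return implode('.', $labels);
    }
    // if the host ends in a known two-part suffix, keep three labels, otherwise two
    $lastTwo = implode('.', array_slice($labels, -2));
    $keep = in_array($lastTwo, $twoPartSuffixes, true) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}
```

With this, both us.sponsor.com and we.are.a.sponsor.com reduce to sponsor.com, and us.sponsor.co.uk reduces to sponsor.co.uk.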

9/8/2009 3:36:20 PM

quagmire02
All American
44225 Posts

^ thanks!

i really need to brush up on my regex

9/8/2009 3:41:49 PM

qntmfred
retired
40726 Posts

http://www.sellsbrothers.com/tools/#regexd is a great little tool for testing out regex btw

[Edited on September 8, 2009 at 3:46 PM. Reason : it's built using .net regex, which is mostly the same as php. but still helpful]

9/8/2009 3:44:30 PM

quagmire02
All American
44225 Posts

^ that's actually pretty cool...thanks for the heads up

9/9/2009 7:59:21 AM

qntmfred
retired
40726 Posts

Bump

4/27/2011 9:35:25 AM

quagmire02
All American
44225 Posts

i suck at regex...i have this function to automatically parse text for email addresses:

function emailit($str) {
    $regex = '/(\S+@\S+\.\S+)/i';
    $replace = "<a href='mailto:$1'>$1</a>";
    $str = preg_replace($regex, $replace, $str);
    return $str;
}
and it works great as long as the string only has the email address and not the href tag...so it works well for:
blah blah blah myemailgoeshere@fakemail.com blah blah blah
but not:
blah blah blah <a href="mailto:myemailgoeshere@fakemail.com">myemailgoeshere</a> blah blah blah
suggestions?

4/27/2011 9:40:50 AM

FroshKiller
All American
51911 Posts

Why are you parsing the presentation layer?

4/27/2011 10:37:42 AM

BigMan157
no u
103354 Posts

$regex = '/([^"\'\s]+@\S+\.[^"\'\s]+)/i';


maybe

4/27/2011 10:46:17 AM

Stein
All American
19842 Posts

preg_match_all('#[a-zA-Z0-9\-_\.]+@[a-zA-Z0-9\-_\.]+#', $testString, $matches);

You can make it more specific if you really want by adding something that ensure the backhalf actually has a valid domain, but I mean, this will basically work.

[Edited on April 27, 2011 at 11:06 AM. Reason : .]

4/27/2011 11:05:22 AM

quagmire02
All American
44225 Posts

^^^ i'm not...not exactly, anyway

^^ that did it...thxu

[Edited on April 27, 2011 at 11:07 AM. Reason : carats]

4/27/2011 11:07:17 AM

FroshKiller
All American
51911 Posts

So you're screen-scraping. Just because it's not your presentation layer doesn't mean it's not the presentation layer.

4/27/2011 11:14:09 AM

quagmire02
All American
44225 Posts

i'm working with pre-existing data and i'm trying to clean it up to serve my purposes

once again, contributing to a thread by not contributing to it...thanks for your input

4/27/2011 12:24:52 PM

quagmire02
All American
44225 Posts

actually, BigMan157, that didn't do it...at least, it takes care of the condition i mentioned, but now the other condition is ignored

4/27/2011 1:02:45 PM

FroshKiller
All American
51911 Posts

The first question you should always ask yourself is whether there's a better approach than the one that has led you to the problem you're currently dealing with. I don't know why that is so hard for you to appreciate.

If you're wanting to scrape e-mail addresses out of HTML and you're using PHP, why don't you just strip out the HREF attributes of any A elements in the document prior to parsing for e-mail addresses? Jesus.

[Edited on April 27, 2011 at 1:10 PM. Reason : The idea being that PHP has easy-to-use DOM traversal and manipulation.]

4/27/2011 1:07:50 PM

quagmire02
All American
44225 Posts
user info
edit post

okay, i'll bite

database entry is exactly this (minus any changes tww's crazy code makes):

My name is Bob.  My email address is bob@email.com.
i have no control over the content of the database, just the display...what is your suggestion as to the best way, using PHP, to make that email address into a mailto link?

4/27/2011 1:19:32 PM

FroshKiller
All American
51911 Posts

You don't have control over what's in the database record, but what do you expect to find in a record? Is it reasonably reliable that if the record contains a mailto link, the closing tag will be included? If so, you could just run strip_tags() on it prior to regexing for an e-mail address. That would obviate the need to test for an e-mail address in an anchor's HREF entirely.

[Edited on April 27, 2011 at 1:26 PM. Reason : ...]

4/27/2011 1:26:07 PM

quagmire02
All American
44225 Posts

reasonably, yes...but the content isn't always that (it was just an example, though a realistic one)

sometimes there will be HTML character entities and sometimes the tags will be encoded as their entity name/number

my thought is to create a function to convert all entity names/numbers to their character and then search for URLs and email addresses to convert to their appropriate links

4/27/2011 1:31:29 PM

FroshKiller
All American
51911 Posts

You could run html_entity_decode() then strip_tags() then run your regex, then. The first function shouldn't hurt anything if there aren't any HTML character references in the input string.
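That sequence could be sketched roughly as below. The function name extractEmails and the email pattern are assumptions for illustration, not something posted in the thread. Note that strip_tags() also throws away attributes, so an address that appears only inside an href is lost.

```php
<?php
// Hypothetical helper: decode entities, drop tags, then collect email addresses.
function extractEmails($raw) {
    $text = strip_tags(html_entity_decode($raw, ENT_QUOTES, 'UTF-8'));
    // deliberately simple pattern: local part, "@", then dot-separated labels
    preg_match_all('/[a-z0-9._%+\-]+@[a-z0-9\-]+(?:\.[a-z0-9\-]+)+/i', $text, $matches);
    // remove duplicates and reindex
    return array_values(array_unique($matches[0]));
}
```

This only extracts the addresses; it does not preserve the original markup, which matters if the goal is to display the record with its tags intact.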

4/27/2011 1:35:17 PM

quagmire02
All American
44225 Posts

i suppose i'm not sure what that will accomplish

html_entity_decode() is obvious, and i'm doing that already...but why would i WANT to strip tags? i want to keep them there (and yes, i realize i could except certain tags, but i want to keep them all)

4/27/2011 1:39:36 PM

FroshKiller
All American
51911 Posts

Maybe I'm not fully understanding the issue. I thought you were saying your regex wasn't working as expected when the input string contained an anchor with a mailto HREF. Are you actually trying to replace instances of e-mail addresses in your input with a different e-mail address?

4/27/2011 1:43:13 PM

quagmire02
All American
44225 Posts

okay, i'll try to do a better job of explaining...below are possible entries:

My name is Bob. My email address is bob@email.com
My name is John. You can email me <a href="mailto:john@email.com">here</a>.
<p>My name is Fred. My email address is <a href="mailto:fred@email.com">fred@email.com</a>.</p>
<p>My name is Mary.<br /><br />You should email me at mary@email.com!</p>
My name is Anna. My website is http://www.mywebsite.com/.
or any variation

if HTML characters are there, i want them displayed...if not, i want to convert email address and URLs into the appropriate link

[Edited on April 27, 2011 at 1:53 PM. Reason : imagine that one of those examples has & l t ; and & g t ; since TWW converted them]

4/27/2011 1:50:40 PM

AstralEngine
All American
3864 Posts

^ So those are the possible entries, but I'm not sure what you want:

1. to strip out everything except the email address and return it, or

2. replace the email address with a mailto tag and return that?

[Edited on April 27, 2011 at 2:03 PM. Reason : ]

4/27/2011 1:56:57 PM

FroshKiller
All American
51911 Posts

Okay, so for e-mail addresses, couldn't you just include the colon as a potential starting character?

4/27/2011 2:04:02 PM

quagmire02
All American
44225 Posts

nevermind, i think i've got it all in one function now:

function linkylinky($str) {
    // pad with a leading space so a link at the very start still matches (^|[\n ])
    $str = ' '.$str;
    // the e modifier is dropped: plain replacement strings are enough here, and /e was removed in PHP 7
    $str = preg_replace("#(^|[\n ])([\w]+?://[\w]+[^ \"\n\r\t<]*)#is", "\\1<a href=\"\\2\">\\2</a>", $str);
    $str = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r<]*)#is", "\\1<a href=\"http://\\2\">\\2</a>", $str);
    $str = preg_replace("#(^|[\n ])([a-z0-9&\-_\.]+?)@([\w\-]+\.([\w\-\.]+\.)*[\w]+)#i", "\\1<a href=\"mailto:\\2@\\3\">\\2@\\3</a>", $str);
    // remove the padding space again
    return substr($str, 1);
}


[Edited on April 27, 2011 at 2:20 PM. Reason : code...i had the first two lines working fine, but couldn't get the third...now it works, i think]
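A different way to get at the same requirement is to leave existing anchors alone and only link addresses in the text between them. The sketch below is an assumption-laden illustration: the function name linkEmailsOutsideAnchors and the email pattern are made up, it handles email addresses only (not URLs), and it treats HTML with a regex, so unusual markup may trip it up.

```php
<?php
function linkEmailsOutsideAnchors($html) {
    // split into existing <a>...</a> elements (captured) and the text around them
    $parts = preg_split('#(<a\b[^>]*>.*?</a>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // odd indexes hold the captured anchors; leave those untouched
        if ($i % 2 === 1) {
            continue;
        }
        $parts[$i] = preg_replace(
            '/[a-z0-9._%+\-]+@[a-z0-9\-]+(?:\.[a-z0-9\-]+)+/i',
            '<a href="mailto:$0">$0</a>',
            $part
        );
    }
    return implode('', $parts);
}
```

Run against the sample entries earlier in the thread, the Bob and Mary lines gain a mailto link and the John and Fred lines come back unchanged.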

4/27/2011 2:09:37 PM

moron
All American
34142 Posts

This question relates to the second post, but just out of curiosity, why would you use the session header to pass a message between pages instead of a form?

4/27/2011 2:26:14 PM

quagmire02
All American
44225 Posts

oh, that was a long time ago

page1 (front-end): form, submit to page2
page2 (back-end): process form variables, generate message (success or fail)
page3 (front-end): display message

[Edited on April 27, 2011 at 2:49 PM. Reason : is there something wrong with that process?]

4/27/2011 2:48:58 PM

Stein
All American
19842 Posts

It's just sort of a silly way to do it if you're not passing anything you're planning on displaying.

4/27/2011 2:51:25 PM

quagmire02
All American
44225 Posts

what do you mean? the message is displayed

4/27/2011 2:57:00 PM

Stein
All American
19842 Posts

You made it sound like you're just sending "Success" or "Failed", which is something you could just pass in the URL and then use an if statement to actually display whatever message you wanted to show.

If that's the case, you're creating additional server overhead using a session for no real reason.

Now if you're transmitting a whole error message, like "The operation failed for X, Y, Z" reason, that's a different story.

4/27/2011 3:03:12 PM

quagmire02
All American
44225 Posts

Quote :
"Now if you're transmitting a whole error message, like "The operation failed for X, Y, Z" reason, that's a different story."

exactly...it's not common, but when it happens, it's usually a paragraph or two

4/27/2011 3:08:37 PM
