By Rocky1138
via johnrockefeller.net
Published: Jun 30 2008 / 04:28

I wanted to scrub any characters out of a string that were not alphanumeric. So, I wrote this function that uses a simple regular expression to detect the unwanted characters.
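The function itself lives on the linked site and is not reproduced here. As a rough sketch of the idea only (the name `strip_non_alphanumeric` is hypothetical and this is not the author's code), a single `preg_replace` call can drop every character outside `a-z`, `A-Z` and `0-9`:

```php
<?php
// Hypothetical helper, not the original author's function:
// removes every character that is not an ASCII letter or digit.
function strip_non_alphanumeric($input)
{
    return preg_replace('/[^a-zA-Z0-9]/', '', $input);
}

echo strip_non_alphanumeric('Hello, World! 2008'), "\n"; // prints "HelloWorld2008"
```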

Comments


Sven Arild Helleland replied:


Umm... Why reinvent the wheel? Even more so when the new wheel performs terribly compared to the old one.

You should use a regex for this. The regex below will work in any language that supports PCRE. (Obviously the syntax for variables and function calls needs to be adapted, but the regex itself is valid.)

$find = array('|[ ]{1,}|', '|[^0-9a-z_\-.]|i', '|[.]{2,}|');
$replace = array('_', '', '.');

$valid_filename = strtolower(preg_replace($find, $replace, $filename));

The code above is smaller, faster and better than the function you created.

You can also easily wrap it into a function:

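function convert_to_filename ($string) {
    // Spaces become "_", disallowed characters are stripped, repeated dots collapse to one.
    $find = array('|[ ]{1,}|', '|[^0-9a-z_\-.]|i', '|[.]{2,}|');
    $replace = array('_', '', '.');

    // Apply the patterns in order, then lowercase the result.
    return strtolower(preg_replace($find, $replace, $string));
}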

Though, you're better off setting up a real OOP structure from the start than using procedural code.

Just to show why the code above is better to use:

If you take this filename:
this is 2&023 .../. 3DDD my file.zip

This is what your function returns:
this_is_2023__..._3ddd_my_file.zip

This is what the regex above returns:
this_is_2023_._3ddd_my_file.zip

As you can see, your function has problems with multiple spaces in a row, as well as with runs of dots, whether consecutive or separated only by illegal characters.
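To see where the difference comes from, the three patterns can be applied one at a time to the sample filename. This is only a trace of the regex version shown above; the intermediate variable names are made up for illustration.

```php
<?php
// Trace each pass of the three-pattern cleanup on the sample filename.
$filename = 'this is 2&023 .../. 3DDD my file.zip';

// Pass 1: collapse each run of spaces into a single underscore.
$step1 = preg_replace('|[ ]{1,}|', '_', $filename);
// "this_is_2&023_.../._3DDD_my_file.zip"

// Pass 2: drop everything except letters, digits, "_", "-" and ".".
$step2 = preg_replace('|[^0-9a-z_\-.]|i', '', $step1);
// "this_is_2023_...._3DDD_my_file.zip"

// Pass 3: collapse each run of two or more dots into a single dot.
$step3 = preg_replace('|[.]{2,}|', '.', $step2);
// "this_is_2023_._3DDD_my_file.zip"

echo strtolower($step3), "\n"; // prints "this_is_2023_._3ddd_my_file.zip"
```

Because the illegal characters are stripped before the dots are collapsed, dots that were separated only by a "/" end up adjacent and get merged in the last pass.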

On a side note, unless these links should be visible to users (public), you would be better off just using an ID as the file name on disk and saving the real filename to the DB.
Then, when someone downloads the file, you just give it the required filename using the header. This only works, though, if you stream the files to users through a PHP function, or by using a symlink.
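One way to picture the ID-based approach: the file sits on disk under its numeric ID, and the human-readable name is sent only at download time in a `Content-Disposition` header. The sketch below is illustrative; the function names, the `<id>.bin` storage layout and the idea that the display name comes from a database lookup are all assumptions, not part of the comment above.

```php
<?php
// Hypothetical sketch: files are stored on disk as "<id>.bin" and the
// display name comes from a database lookup (not shown here).
function build_download_headers($displayName, $sizeInBytes)
{
    // Strip characters that would break the quoted header value.
    $safeName = str_replace(array('"', '\\', "\r", "\n"), '', $displayName);

    return array(
        'Content-Type: application/octet-stream',
        'Content-Disposition: attachment; filename="' . $safeName . '"',
        'Content-Length: ' . (int) $sizeInBytes,
    );
}

function send_stored_file($storageDir, $id, $displayName)
{
    $path = $storageDir . '/' . (int) $id . '.bin';
    if (!is_file($path)) {
        return false; // nothing stored under this ID
    }
    foreach (build_download_headers($displayName, filesize($path)) as $line) {
        header($line);
    }
    readfile($path); // stream the file contents to the client
    return true;
}
```

Keeping the header construction in its own function makes it possible to check the output without actually sending a response.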


John Rockefeller replied:


Hi Sven, thanks for the ownage yet again.

A couple of points might make this post seem less bad than it looks. First and foremost, the post was aimed at newbies who might not have extensive knowledge of regular expressions. I planned on doing a follow-up post later on that showed a better, quicker way to do it. I remember trying to learn regular expressions while learning PHP, and a few hours were wasted on simple things.

This is also why I used multiple str_replace functions instead of doing it in one fell swoop with regular expressions. I wanted to make it simple for the layman to follow. Speed is not important as this code is only run once whenever you're saving a new video (maybe 10 times per week tops) so the difference in performance between preg_replace and str_replace is negligible. Seems contradictory to programming logic, but for some reason the PHP str_replace page instructs users to use str_replace instead of preg_replace. Do you know why?

Also, I do realize that it's better to use an ID for the filename but there is a specific reason I need the filename to be determined by the user: SEO. I am working on building a video portal and the filenames of the videos must match the name of the video so that the search ranking will improve. These videos will be keyword-heavy (e.g., kid_gets_hit_with_baseball_and_cries.flv).

Finally, it's not such a big deal if there are multiple periods, it's the one at the end that's important. Same goes for multiple underscores. Unless of course you think that it's a violation of naming rules for UNIX or something like that.

Thank you for the feedback!


Sven Arild Helleland replied:


Hello John,

I don't quite agree with your reasoning: if you show a non-optimal approach, there will always be someone who starts using it. If you show the optimal one, they would have started using that instead, even if they could not completely grasp how it worked.

For your questions:
str_replace is faster than preg_replace off the bat for a simple text replace, but when you need more complex replaces or can combine several ones by using a regex, preg is faster.

You can also pass arrays of find/replace arguments to str_replace. Doing so will be slightly faster than calling str_replace multiple times, not to mention it will make the code easier to read.
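As a small illustration of the array form (the sample title and variable names are made up), `str_replace` accepts parallel search and replace arrays and applies the pairs in order, so the result matches a chain of single calls:

```php
<?php
$title = 'Kid gets hit / with baseball & cries';

// Several separate calls, one per replacement...
$a = str_replace(' ', '_', $title);
$a = str_replace('&', '', $a);
$a = str_replace('/', '', $a);

// ...versus one call with parallel search/replace arrays.
$b = str_replace(array(' ', '&', '/'), array('_', '', ''), $title);

var_dump($a === $b); // bool(true)
echo $b, "\n";       // prints "Kid_gets_hit__with_baseball__cries"
```

Note the doubled underscores in the output: plain `str_replace` cannot collapse runs, which is where the regex version has the advantage.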

Note: it is the preg_match_all call and the foreach loop you use that make your code take around twice as long, not the str_replace calls themselves.

As for the periods and multiple underscores: no, they do not violate any rules. But imo it looks better if they are stripped, since they are not needed :)

Btw, if you're using the filename for a possible SEO gain, you want to use dashes instead of underscores. How much better it is I don't know, but basically every expert in the field says dashes are better.
Example: kid-gets-hit-with-baseball-and-cries.flv

Here is the updated regex you would use in that case:
$find = array('|[_]{1,}|', '|[ ]{1,}|', '|[^0-9a-z\-.]|i', '|[.]{2,}|', '|[-]{2,}|');
$replace = array('-', '-', '', '.', '-');
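A possible way to wrap the dash-based patterns into a function, mirroring the earlier underscore version. The function name `convert_to_seo_filename` and the sample input are assumptions for illustration only.

```php
<?php
// Hypothetical wrapper around the dash-based patterns.
function convert_to_seo_filename($string)
{
    $find = array(
        '|[_]{1,}|',        // runs of underscores -> one dash
        '|[ ]{1,}|',        // runs of spaces      -> one dash
        '|[^0-9a-z\-.]|i',  // strip anything that is not a letter, digit, "-" or "."
        '|[.]{2,}|',        // runs of dots        -> one dot
        '|[-]{2,}|',        // runs of dashes      -> one dash
    );
    $replace = array('-', '-', '', '.', '-');

    return strtolower(preg_replace($find, $replace, $string));
}

echo convert_to_seo_filename('Kid gets hit_with  baseball & cries.flv'), "\n";
// prints "kid-gets-hit-with-baseball-cries.flv"
```

The final dash-collapsing pattern matters here: stripping the "&" leaves two dashes side by side, and the last pass merges them.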

