PHP Remove Non ASCII Characters from a String
I hate the stupid odd characters that sometimes seem to magically appear from nowhere in text. Some call them odd characters, some call them special characters, I call them non ASCII characters, but whatever you call them they are inconvenient. Often they come along with text pasted into a content management system from Microsoft Word. Today I had a large galley of copy filled with these invalid characters that I needed to clean up. Instead of going through and removing each instance, I simply wrote a regular expression that takes the text string and replaces them all with nothing. It works great; the only problem is that in some instances these may represent real characters you need, like ‘ or “, so you have to be careful using a function like this. If you don’t need them, give it a try:
<?php
$output = "Clean this copy of invalid non ASCII äócharacters.";
$output = preg_replace('/[^(\x20-\x7F)]*/', '', $output);
echo($output);
?>
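Because characters like ‘ or “ may be worth keeping, one option is to translate them to plain ASCII before stripping everything else. The following is only a minimal sketch, not the method used in this post: the function name `clean_to_ascii` and the replacement map are hypothetical, and it assumes UTF-8 input and PHP 7 or later (for the `\u{...}` escapes).

```php
<?php
// Hypothetical helper; assumes $text is UTF-8 encoded.
function clean_to_ascii(string $text): string
{
    // Translate common typographic characters to ASCII equivalents first.
    $map = [
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // horizontal ellipsis
    ];
    $text = strtr($text, $map);

    // Then drop every byte that is not LF, CR, or printable ASCII.
    return preg_replace('/[^\x0A\x0D\x20-\x7E]/', '', $text);
}

echo clean_to_ascii("\u{201C}Quoted\u{201D} caf\u{00E9}"); // "Quoted" caf
```

Anything not listed in the map (such as the é above) is still removed, so the map would need to be extended for whatever characters matter in your copy.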
Comments
Wow. This helped me tremendously! I was having problems with people copying/pasting “non-ascii” text, mostly from MS Word into content editor database. This simple function fixed it! Thanks!
I actually figured it out right after I posted the above comment.
$output = preg_replace('/[^(\x20-\x7F)\x0A]*/','', $output);
I just added \x0A. This works on Unix/Linux. For Windows or Mac, you may have to use \x0D or a combination of both.
You can find an ascii table here:
http://www.asciitable.com/
Thanks a lot, that saved me a lot of time… I was having a problem with a query; I didn’t realize that the data had invalid ASCII chars… so I put your regex in and it made me happy with a “No Error” result.
thanks
Supposing you want to be sure that $string is composed of valid UTF-8 characters:
$string = iconv("UTF-8","UTF-8//IGNORE",$string);
If you want to remove non ISO-8859-1 characters from this string:
$string = iconv("UTF-8","ISO-8859-1//IGNORE",$string);
$string = iconv("ISO-8859-1","UTF-8",$string);
If you still have characters you want to be ignored, try this:
$string = recode_string("us..flat", $string);
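If you only want to test whether a string is valid UTF-8 before deciding how to clean it, a small check like the one below may help. This is a sketch, not part of the comment above: the function name `is_valid_utf8` is hypothetical, and it relies on `preg_match()` returning false when the `/u` modifier is given a subject that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true when $string is well-formed UTF-8.
// The empty pattern matches anything; the /u modifier makes PCRE
// validate the subject, and preg_match() returns false if it is invalid.
function is_valid_utf8(string $string): bool
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\u{00E9}")); // bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): a lone ISO-8859-1 byte
```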
I must agree this was such a simple and good solution. I had hit a mental road block and this helped me a lot! Thanks!
Good stuff! Thanks! For a long time I’ve been attempting to remove specific garbage chars from text, but this is a much better solution, and easy too.
I can’t even articulate how thrilled I am that Google kicked you to the top of the results because you just fixed in seconds what I’ve been struggling with for hours.
Just fantastic!
Thanks man, this little code snippet helped me tremendously while cleaning up feeds from Google News. Nice work!
@Andrew. Try this site: http://www.ascii.cl/ – much better, not images where you can’t copy/paste like that other one!
THANK YOU !!!!! I had troubles with “non-readable” characters…Thanks to you, my problem is over !
See you !
Can someone help me with URL cleaning? I just want to use the characters A to Z, a to z and -. I want to clean up the rest. I have absolutely no idea how regular expressions work.
Thanks for posting! Saved me some hassle with a database import script I was writing where the non-ASCII characters showed up.
The method shown in the article removes accented vowels (à) and vowels with umlauts (ä), so it doesn’t help! It also removes any line breaks (\n).
This is my solution, which works perfectly for me (using the suggestion above for line breaks):
function f_remove_odd_characters($string){
    $string = str_replace("\n", "[NEWLINE]", $string);
    $string = htmlentities($string);
    $string = preg_replace('/[^(\x20-\x7F)]*/', '', $string);
    $string = html_entity_decode($string);
    $string = str_replace("[NEWLINE]", "\n", $string);
    return $string;
}
Thanks to all of you!
SERGI
Dude, awesome! have been working so hard on something like this and was using way more lines of code.
crazy sick!
thx, Shawn
I’ve come to terms with the fact that I will never understand regular expressions. Thankfully the internet has people like you to do it for me.
Thanks,
“The method shown in the article removes accented vowels (à) and vowels with umlauts (ä), so it doesn’t help! It also removes any line breaks (\n).”
For that, use this:
$string = preg_replace('/[^(\x20-\xFE)]*/','', $string);
You could also use the PHP filter extension introduced in PHP 5.2:
http://uk3.php.net/manual/en/book.filter.php
Your example would become the following, and would also have tags automatically stripped.
$output = filter_var($output, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
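Note that FILTER_SANITIZE_STRING is deprecated as of PHP 8.1. A possible alternative, sketched below under that assumption, is FILTER_UNSAFE_RAW with the same two flags. Unlike the line above, it does not strip tags, so `strip_tags()` would have to be called separately if that is needed; the sample input is made up for illustration.

```php
<?php
// FILTER_UNSAFE_RAW leaves the string alone except for what the flags strip:
// STRIP_LOW removes bytes below 0x20 (including tabs and newlines),
// STRIP_HIGH removes bytes above 0x7F.
$input  = "caf\u{00E9}\tbar";
$output = filter_var(
    $input,
    FILTER_UNSAFE_RAW,
    FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH
);
echo $output; // cafbar
```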
Hey dude!
Love the way you name it “non ASCII characters” … that’s straight !
The line of code saved me a lot of work =]
Thanks
The () are not needed there. It works because ( and ) are \x28 and \x29, so they already fall between \x20 and \x7F.
For example, if you needed to keep only the digits 0123456789, then
echo preg_replace('/[^(\x30-\x39)]*/','', '()0123456789');
would give you ()0123456789
echo preg_replace('/[^\x30-\x39]*/','', '()0123456789');
would give you 0123456789
I needed to keep only a-z, A-Z, 0-9, so I wrote
preg_replace('/[^a-zA-Z0-9]*/','', $output);
It is much easier to read.
I was just thinking what you wrote about those annoying characters, and you have a webpage about it. awesome.
You people are crazy: you don’t need the * either, since preg_replace already replaces every character that matches the [] class.
$output = preg_replace('/[^\x20-\x7F]/','', $output);
Your regular expression is horrible. There is no need for multiple statements to handle newlines, nor any capturing brackets.
Here’s the proper way to do it. You’re free to update your original post with this fixed version:
preg_replace( "/[^\x0A\x0D\x20-\x7E]/", "", $string );
It preserves Linefeed \n (0x0A), Carriage Return \r (0x0D), and all printable ASCII characters (0x20 to 0x7E).
You can verify the validity of what I am saying at http://www.asciitable.com/
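To see how this pattern differs from the one in the original post, here is a small comparison on a made-up input containing DEL (0x7F), a Windows line ending, and a tab. It is only an illustrative sketch; the variable names are arbitrary.

```php
<?php
// Input with DEL (0x7F), a Windows line ending (CR LF), and a tab (0x09).
$input = "a\x7Fb\r\nc\t";

// Pattern from the original post: keeps DEL, drops CR, LF and tab.
$original = preg_replace('/[^(\x20-\x7F)]*/', '', $input);

// Pattern from the comment above: drops DEL and tab, keeps CR and LF.
$fixed = preg_replace('/[^\x0A\x0D\x20-\x7E]/', '', $input);

var_dump($original === "a\x7Fbc"); // bool(true)
var_dump($fixed === "ab\r\nc");    // bool(true)
```

Neither pattern keeps the tab character; add \x09 to the class if tabs should survive.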


Works really well! Simple but effective. Thanks for the tip!