PHP Remove Non ASCII Characters from a String

kennypoo

I hate the stupid odd characters that sometimes seem to magically appear from nowhere in text. Some call them odd characters, some call them special characters, I call them non-ASCII characters, but whatever you call them, they are inconvenient. Often they come along with text pasted into a content management system from Microsoft Word. Today I had a large galley of copy filled with these invalid characters that I needed to clean up. Instead of going through and removing each instance, I simply wrote a regular expression that takes the text string and replaces them all with nothing. It works great; the only problem is that in some instances these may represent real characters you need, like ‘ or “, so you have to be careful using a function like this. If you don’t need them, give it a try:

<?php
    $output = "Clean this copy of invalid non ASCII äócharacters.";
    $output = preg_replace('/[^(\x20-\x7F)]*/','', $output);
    echo($output);
?>
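For the case the post warns about, where curly quotes and dashes pasted from Word carry meaning, one option is to swap them for ASCII equivalents before stripping. The following is a minimal sketch rather than the author's method: it assumes the input is UTF-8 and PHP 7.0 or later (for the `\u{...}` escapes), and `clean_with_smart_quotes` is a hypothetical helper name.

```php
<?php
// Hypothetical helper (not from the original post): map common Word "smart"
// punctuation to ASCII, then strip whatever non-ASCII bytes remain.
// Assumes $text is UTF-8 encoded.
function clean_with_smart_quotes(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // ellipsis
    ];
    $text = strtr($text, $map);

    // Keep printable ASCII only (space through tilde).
    return preg_replace('/[^\x20-\x7E]/', '', $text);
}

echo clean_with_smart_quotes("\u{201C}Don\u{2019}t panic\u{201D} \u{2013} ok\u{2026}");
// prints: "Don't panic" - ok...
```

The map is deliberately short; extend it with whatever characters actually show up in your copy.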


Comments

Works really well! Simple but effective. Thanks for the tip!

Wow. This helped me tremendously! I was having problems with people copying/pasting “non-ascii” text, mostly from MS Word into content editor database. This simple function fixed it! Thanks!

Any easy way to modify this to not remove newline characters?

An easy way to do it would be to replace the new line characters first before running the regular expression. For example:

<?php
    $output = "Clean this copy of invalid \n non ASCII äócharacters.";
    $output = str_replace("\n","[NEWLINE]",$output);
    $output = preg_replace('/[^(\x20-\x7F)]*/','', $output);
    $output = str_replace("[NEWLINE]","\n",$output);
    echo($output);
?>

I actually figured it out right after I posted the above comment.

$output = preg_replace('/[^(\x20-\x7F)\x0A]*/','', $output);

I just added \x0A. This works on Unix/Linux. For Windows or Mac, you may have to use \x0D or a combination of both.

Hello.
Please tell me, where can I find character codes such as \x0A?

You can find an ascii table here:
http://www.asciitable.com/

Thanks a lot, that saved me a lot of time… I was having a problem with a query, i didn’t realize that the data had ASCII invalid chars… so, here your regex got inside and made me happy with a “No Error” result.

thanks

Supposing you want to be sure that $string is composed of valid UTF-8 characters:

$string = iconv("UTF-8","UTF-8//IGNORE",$string);

If you want to remove non ISO-8859-1 characters from this string:

$string = iconv("UTF-8","ISO-8859-1//IGNORE",$string);
$string = iconv("ISO-8859-1","UTF-8",$string);

If you still have characters you want to be ignored, try this:

$string = recode_string("us..flat", $string);

This is really awesome. It helped me a lot. Thank you very much.

I must agree this was such a simple and good solution. I had hit a mental road block and this helped me alot! Thanks!

Thanks, huge time saver.

Thanks a lot man, this save me a lot of time…

Good stuff! Thanks! For a long time I’ve been attempting to remove specific garbage chars from text, but this is a much better solution, and easy too.

Hi
Very nice article.
Your code is a life saver for me.
Thanks a lot.

Thanks again
Avi

I can’t even articulate how thrilled I am that Google kicked you to the top of the results because you just fixed in seconds what I’ve been struggling with for hours.

Just fantastic!

Thanks!!

Thanks man, this little code snippet helped me tremendously while cleaning up feeds from Google News. Nice work!

@Andrew. Try this site: http://www.ascii.cl/ – much better, not images where you can’t copy/paste like that other one! :)

Thanks for the code. Invalid HTML chars have been causing XML issues on my end, so this will help.

THANK YOU !!!!! I had troubles with “non-readable” characters…Thanks to you, my problem is over !
See you !

Thank you!

The simplest, most effective way I’ve found of doing this – thank you.

Can someone help me with URL cleaning? I just want to keep the characters A to Z, a to z and -. I want to clean up the rest. I have absolutely no idea how regular expressions work.
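One way to approach the URL question above, using the same negated character class idea as the post. This is only a sketch: `clean_url_part` is a hypothetical name, and turning whitespace into hyphens first is an assumption about what the result should look like.

```php
<?php
// Hypothetical helper: keeps only A-Z, a-z and the hyphen.
function clean_url_part(string $text): string
{
    // Turn runs of whitespace into a single hyphen so words stay separated.
    $text = preg_replace('/\s+/', '-', trim($text));

    // Drop every character that is not a letter or a hyphen.
    // A hyphen placed last inside [] is treated as a literal hyphen.
    return preg_replace('/[^A-Za-z-]/', '', $text);
}

echo clean_url_part("My First Post!");
// prints: My-First-Post
```

Note that digits are removed too, since the question asked for letters and hyphens only; add 0-9 to the class to keep them.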

Avoid all these problems by switching to UTF-8.

Hi,

How can I replace them with ASCII equivalents instead of removing them,
like WordPress does?

regards
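For replacing rather than removing, one commonly used option is `iconv` with the `//TRANSLIT` suffix, which asks the converter to approximate characters it cannot represent. This is a hedged sketch, not how WordPress does it: it assumes UTF-8 input, `to_ascii` is a hypothetical name, and the exact replacements depend on the system's iconv implementation and locale, so no output is shown.

```php
<?php
// Hypothetical helper: approximate non-ASCII characters with ASCII where the
// local iconv implementation can, then strip anything that is still left.
// Assumes $text is UTF-8 encoded.
function to_ascii(string $text): string
{
    $converted = function_exists('iconv')
        ? @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text)
        : false;

    // If iconv is unavailable or fails, fall back to the unconverted text.
    if ($converted === false) {
        $converted = $text;
    }

    // Keep line feed, carriage return and printable ASCII only.
    return preg_replace('/[^\x0A\x0D\x20-\x7E]/', '', $converted);
}

echo to_ascii("Caf\u{E9} in M\u{FC}nchen");
```

Because the transliteration result varies between systems, a fixed lookup table is the safer choice when the output has to be predictable.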

Thanks for posting! Saved me some hassle with a database import script I was writing where the non-ASCII characters showed up.

Thank you!. Great solution. Muchas Gracias

The method shown in the article removes accented vowels (à) and letters with umlauts (ä), so it doesn’t help! It also removes line breaks (\n).

This is my solution, which works perfectly for me (using the suggestion above for line breaks):

function f_remove_odd_characters($string){
    $string = str_replace("\n","[NEWLINE]",$string);
    $string = htmlentities($string);
    $string = preg_replace('/[^(\x20-\x7F)]*/','',$string);
    $string = html_entity_decode($string);
    $string = str_replace("[NEWLINE]","\n",$string);
    return $string;
}

Thanks to all you! ;)
SERGI

Thank you very much.

This was a magic script

Dude, awesome! have been working so hard on something like this and was using way more lines of code.

crazy sick!

thx, Shawn

Thanks u so much ….

U saved my days…

I’ve come to terms with the fact that I will never understand regular expressions. Thankfully the internet has people like you to do it for me.

Thanks,

The method shown in the article removes accented vowels (à) and letters with umlauts (ä), so it doesn’t help! It also removes line breaks (\n).

For that, use this:

$string = preg_replace('/[^(\x20-\xFE)]*/','', $string);


You could also use the PHP filter options introduced in PHP 5.2

http://uk3.php.net/manual/en/book.filter.php

Your example would become the following and also have tags automatically stripped.

$output = filter_var($output, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
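`FILTER_SANITIZE_STRING` is deprecated as of PHP 8.1, so on current PHP versions a rough equivalent is `strip_tags()` followed by `FILTER_UNSAFE_RAW` with the same two flags. This is a sketch of that substitution, not the commenter's code, and the sample string is made up for illustration.

```php
<?php
// Made-up sample input: HTML tags, a UTF-8 "é" (bytes C3 A9) and a newline.
$output = "Clean <b>this</b> copy of caf\xC3\xA9\n";

// Remove tags first, since FILTER_UNSAFE_RAW does not touch them.
$output = strip_tags($output);

// STRIP_LOW removes bytes below 32 (including \n and \r),
// STRIP_HIGH removes bytes above 127.
$output = filter_var(
    $output,
    FILTER_UNSAFE_RAW,
    FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH
);

var_dump($output === "Clean this copy of caf"); // bool(true)
```

Note that `FILTER_FLAG_STRIP_LOW` also drops newlines and tabs, so this behaves like the original regular expression in the post rather than the newline-preserving variants.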

Thanks for such a magic script, really simple but effective.Bookmarked :P cheers!!!

Hey dude!

Love the way you name it “non ASCII characters” … that’s straight !

The line of code save me a lot of work =]

Thanks

thanks, its usefull one :)

() are not needed there. It works because ( and ) are x28 and x29, so they are between x20 and x7F.

For example, if you needed to keep only the digits 0123456789, then

echo preg_replace('/[^(\x30-\x39)]*/','', '()0123456789');

would give you ()0123456789

echo preg_replace('/[^\x30-\x39]*/','', '()0123456789');

would give you 0123456789

I needed to keep only a-z, A-Z, 0-9, so I wrote

preg_replace('/[^a-zA-Z0-9]*/','', $output);

It is much better to read.

Very useful for my php work.. thanks

I was just thinking what you wrote about those annoying characters, and you have a webpage about it. awesome.

you people be crazy, you don’t need the * either as the [] alone just selects all instances.

$output = preg_replace('/[^\x20-\x7F]/','', $output);

Thanks, it works well.
These kept popping up in my xml.

Your regular expression is horrible. There is no need for multiple statements to handle newlines, nor any capturing brackets.

Here’s the proper way to do it. You’re free to update your original post with this fixed version:
preg_replace( "/[^\x0A\x0D\x20-\x7E]/", "", $string );

It preserves Linefeed \n (0x0A), Carriage Return \r (0x0D), and all printable ASCII characters (0x20 to 0x7E).

You can verify the validity of what I am saying at http://www.asciitable.com/
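A quick way to check what that last pattern keeps and drops. The sample string is made up for illustration; it contains a Windows line ending, a UTF-8 "é" (bytes C3 A9) and a tab.

```php
<?php
// Made-up sample: CRLF line ending, UTF-8 "é" and a tab character.
$string = "Line one\r\nLine two with caf\xC3\xA9 and a tab\tend";

// Keep line feed (0x0A), carriage return (0x0D) and printable ASCII (0x20-0x7E).
$clean = preg_replace('/[^\x0A\x0D\x20-\x7E]/', '', $string);

// The CRLF survives; the two bytes of "é" and the tab (0x09) are removed.
var_dump($clean === "Line one\r\nLine two with caf and a tabend"); // bool(true)
```

If tabs should survive as well, add \x09 to the character class.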
