Remove All Punctuation From String, Except If It's Between Digits
Solution 1:
You can use regex lookarounds like this:
(?<!\d)[.,;:](?!\d)
The idea is to put the punctuation you want to remove in a character class and use lookarounds so that a mark only matches when it is neither preceded nor followed by a digit.
import re

regex = r"(?<!\d)[.,;:](?!\d)"
test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"
result = re.sub(regex, "", test_str)
print(result)
Result is:
This is a 1example of the text But it only is 2.5 percent of all data
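Note that this pattern keeps any mark that has a digit on either side, so the final period in "not 3." would survive. If the goal is to keep a mark only when digits sit on both sides, one possible variant uses alternation. This is only a sketch: the `strip_punct` name and the sample sentence are made up for illustration.

```python
import re

# Remove a mark if it is NOT preceded by a digit, or NOT followed by one.
# Only marks with digits on both sides (as in "2.5") are left alone.
STRICT = re.compile(r"(?<!\d)[.,;:]|[.,;:](?!\d)")

def strip_punct(text):
    return STRICT.sub("", text)

print(strip_punct("It costs 2.5, not 3."))  # It costs 2.5 not 3
```

Extend the character class if you need more marks than `.,;:`.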
Solution 2:
Okay folks, here is an answer (the best? I don't know, but it seems to work):
import re

item = "This is a 1example 2Ex of the text.But, it only is 2.5 percent of all data?"
# If two sentences are concatenated and the second starts with a capital letter, put a space before each capital
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# If a word starts with a digit, like "1example", split the digits from the letters
item = ' '.join(re.split(r'(\d+)([A-Za-z]+)', item))
# Magical line that strips leading and trailing punctuation from each token, so floats like 2.5 survive
item = re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), item)
# Collapse the double spaces left over from the split
item = item.replace("  ", " ")
print(item)
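One caveat with the "magical line": if a token consists only of punctuation (a lone "?" for example), the inner `re.match` returns `None` and the `.group(1)` call raises `AttributeError`. A guarded version might look like the sketch below. The `strip_token_edges` helper name is hypothetical, and dropping punctuation-only tokens is an assumption about what you want.

```python
import re

def strip_token_edges(text):
    """Strip leading/trailing non-word characters from every token."""
    def clean(match):
        inner = re.match(r'^\W*(.*\w)\W*$', match.group())
        # Tokens without any word character (e.g. "?") are dropped entirely
        return inner.group(1) if inner else ''
    cleaned = re.sub(r'\S+', clean, text)
    # Collapse whatever whitespace the removals left behind
    return ' '.join(cleaned.split())

print(strip_token_edges("the text. But, it is 2.5 percent (roughly) ?"))
# the text But it is 2.5 percent roughly
```

Punctuation inside a token is left untouched, which is why "2.5" survives.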
Solution 3:
I am out of touch with Python, but have some insight into the regexps.
May I suggest using alternation (|)?
I would use this regexp: "(\d+)([a-zA-Z])|([a-zA-Z])(\d+)"
and then use "\1 \2" as the replacement string.
If some corner cases plague you, you can pass the match to a procedure and handle the cases one by one, probably by checking whether "\1\2" can be converted to a float. Tcl has such built-in functionality; Python should too.
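In Python, `re.sub` accepts a function as the replacement, which fits the "pass it to a procedure" idea. It also sidesteps a catch with this pattern: when the second alternative matches, the text lands in groups 3 and 4, so a plain "\1 \2" replacement would insert only a space. A rough sketch follows; the `add_space` callback and the sample string are illustrative, not part of the original answer.

```python
import re

PATTERN = re.compile(r"(\d+)([a-zA-Z])|([a-zA-Z])(\d+)")

def add_space(match):
    # Exactly one alternative matched, so exactly two groups are not None
    left, right = [g for g in match.groups() if g is not None]
    return f"{left} {right}"

print(PATTERN.sub(add_space, "a 1example and item2 at 2.5"))
# a 1 example and item 2 at 2.5
```

The callback is also the place to add special cases, such as leaving a match alone when it is part of a number.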
Solution 4:
I tried this and it worked very well.
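a = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Strings are immutable, so keep the value that replace() returns
a = a.replace(". ", " ").replace(", ", " ")
print(a)
Notice that in each replace call there is a space after the punctuation mark: punctuation followed by a space is replaced with just a space, so the period in 2.5 is left alone.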
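To cover more marks without writing one `.replace()` per mark, the same idea can be put in a loop. This is a minimal sketch; the `drop_punct_before_space` name and the default set of marks are assumptions.

```python
def drop_punct_before_space(text, marks=".,;:!?"):
    """Replace each 'mark + space' pair with a single space."""
    for mark in marks:
        text = text.replace(mark + " ", " ")
    return text

print(drop_punct_before_space("the text. But, it only is 2.5 percent"))
# the text But it only is 2.5 percent
```

Because only a mark followed by a space is replaced, punctuation at the very end of the string stays in place and needs separate handling.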
Solution 5:
Code:
from itertools import groupby

s1 = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# Split the string into alternating runs of letters and non-letters
s2 = [''.join(g) for _, g in groupby(s1, str.isalpha)]
# Rejoin with spaces, then collapse the triple and double spaces this creates
s3 = ' '.join(s2).replace("   ", " ").replace("  ", " ")
# You can keep adding a replace for each punctuation mark ('... ' goes first so that '. ' does not break it up)
s4 = s3.replace('... ', " ").replace(". ", " ").replace(", ", " ").replace("; ", " ").replace("- ", " ").replace("? ", " ").replace("! ", " ").replace(" (", " ").replace(") ", " ").replace('" ', " ").replace(' "', " ").replace('/ ', " ").replace(' “', " ").replace('” ', " ").replace('] ', " ").replace(' [', " ")
# Collapse the double spaces left by the replacements
s5 = s4.replace("  ", " ")
print(s5)
Output:
This is a 1 example of the text But it only is 2.5 percent of all data
P.S.: You can look up a list of punctuation marks and keep adding them to the .replace() chain.