
Extract Numbers, Letters, Or Punctuation From Left Side Of String Column In Python

Say I have the following data frame, produced by OCR, whose company_info column contains numbers, letters, or punctuation followed by Chinese characters:

import pandas as pd
data = '''\

Solution 1:

Use Series.str.extract together with DataFrame.pop, which removes the original column and returns it so that it can be split into two new columns:

pat = r'([\x00-\x7F]+)([\u4e00-\u9fff]+.*$)'
df[['office_name','company_info']] = df.pop('company_info').str.extract(pat)
print(df)
   id  office_name                company_info
0   1        05B01  北京企商联登记注册代理事务所(通合伙)
1   2   Unit-D 608     华夏启商(北京企业管理有限公司)
2   3    1004-1005       北京中睿智诚商业管理有限公司
3   4   17/F(1706)        北京美泰德商务咨询有限公司
4   5  A2006~A2007        北京新曙光会计服务有限公司
5   6      2906-10          中国建筑与室内设计师网
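
For reference, here is a minimal self-contained sketch of the same approach. The original input data is not shown in full, so the rows below are an assumption: they are rebuilt by joining the office_name and company_info values from the printed output.

```python
import pandas as pd

# Hypothetical input, rebuilt from the printed output (the real data is not shown in full)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "company_info": [
        "05B01北京企商联登记注册代理事务所(通合伙)",
        "Unit-D 608华夏启商(北京企业管理有限公司)",
        "17/F(1706)北京美泰德商务咨询有限公司",
    ],
})

# Group 1: leading run of ASCII characters (digits, letters, punctuation, spaces)
# Group 2: everything from the first Chinese character to the end of the string
pat = r'([\x00-\x7F]+)([\u4e00-\u9fff]+.*$)'

# pop() removes company_info and returns it; extract() returns one column per group
df[['office_name', 'company_info']] = df.pop('company_info').str.extract(pat)
print(df)
```

Rows that do not match the pattern would come back as NaN in both new columns, so it may be worth checking for missing values afterwards.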

Solution 2:

You can use this pattern:

^(\d+),\s+([^\u4e00-\u9fff]+).*$
  • ^ - Start of string
  • (\d+) - Matches one or more digits (captured as group 1)
  • ,\s+ - Matches , followed by one or more whitespace characters
  • ([^\u4e00-\u9fff]+) - Matches one or more characters that are not Chinese characters (captured as group 2)
  • .* - Matches any character except a newline, zero or more times
  • $ - End of string
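
As a rough sketch, the pattern can be tried with Python's re module. The input lines below are hypothetical: they assume each raw line has the form "id, company_info", which is what the leading (\d+),\s+ part of the pattern implies.

```python
import re

pattern = re.compile(r'^(\d+),\s+([^\u4e00-\u9fff]+).*$')

# Hypothetical raw lines in the assumed "id, company_info" layout
lines = [
    "1, 05B01北京企商联登记注册代理事务所(通合伙)",
    "2, Unit-D 608华夏启商(北京企业管理有限公司)",
]

results = []
for line in lines:
    m = pattern.match(line)
    if m:
        # group(1) is the id, group(2) is the text before the first Chinese character
        results.append((m.group(1), m.group(2)))

print(results)
```

Note that this pattern captures only the id and the leading non-Chinese part; the Chinese remainder is matched by .* but not captured.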

