
Split Text In Text File On The Basis Of Comma And Space (python)

I need to parse the text of a text file into two categories: University and Location (example: Lahore, Peshawar, Jamshoro, Faisalabad), but the text file contains the following text: 'Imperial Co

Solution 1:

It would appear that you can only be certain that a line has a location if there is a comma. So it would make sense to parse the file in two passes. The first pass can build a set holding all known locations. You can start this off with some known examples or problem cases.

Pass two could then also use the comma to match known locations but if there is no comma, the line is split into a set of words. The intersection of these with the location set should give you the location. If there is no intersection then it is flagged as "unknown".

import re

path = "places.txt"  # assumed file name; point this at your input file
locations = set(["London", "Faisalabad"])

with open(path, 'r') as f_input:
    unknown = 0

    # Pass 1, build a set of locations
    for line in f_input:
        line = line.strip(' ,"\n')
        if ',' in line:
            loc = line.rsplit(",", 1)[1].strip()
            locations.add(loc)

    # Pass 2, try and find location in line
    f_input.seek(0)

    for line in f_input:
        line = line.strip(' "\n')
        if ',' in line:
            uni, loc = line.rsplit(",", 1)
            loc = loc.strip()
        else:
            uni = line
            # Words on this line that are also known locations
            loc_matches = set(re.findall(r"\b(\w+)\b", line)).intersection(locations)

            if loc_matches:
                loc = list(loc_matches)[0]
            else:
                loc = "<unknown location>"
                unknown += 1

        uni = uni.strip()

        print("uni:", uni)
        print("Loc:", loc)

    print("Unknown locations:", unknown)

Output would be:

uni: Imperial College of Business Studies
Loc: Lahore
uni: Government College University Faisalabad
Loc: Faisalabad
uni: Imperial College of Business Studies Lahore
Loc: Lahore
uni: University of Peshawar
Loc: Peshawar
uni: University of Sindh
Loc: Jamshoro
uni: London School of Economics
Loc: London
uni: Lahore School of Economics
Loc: Lahore
Unknown locations: 0
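
The no-comma fallback in pass two can also be tried on its own, without a file. This is only a minimal sketch: the function name find_location and the sample set are made up for illustration, and it picks the alphabetically first match when several known locations appear on one line, which is a choice the code above does not make.

```python
import re

def find_location(line, locations):
    """Return a known location mentioned in line, or None if there is none."""
    # Split the line into words and keep those that are known locations
    words = set(re.findall(r"\b(\w+)\b", line))
    matches = words & set(locations)
    # sorted() makes the choice deterministic when several locations match
    return sorted(matches)[0] if matches else None

known = {"London", "Faisalabad"}  # illustrative sample set
print(find_location("Government College University Faisalabad", known))
print(find_location("University of Sindh", known))
```

The second call returns None, which corresponds to the "<unknown location>" case.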

Solution 2:

Your input file does not have commas on every line, causing the code to fail. You could do something like

if ',' in line:
    loc = rep.split(',')[1].strip()
else:
    loc = rep.split()[-1].strip()

to handle the lines without a comma differently, or simply reformat the input.
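
As a rough sketch, the same branching could be wrapped in a small helper. The name extract_location is hypothetical, and it uses rsplit to take the text after the last comma rather than index 1, which is an assumption about what the input looks like.

```python
def extract_location(line):
    """Guess the location: text after the last comma, else the last word."""
    # Drop surrounding spaces, commas, quotes and the newline
    line = line.strip(' ,"\n')
    if ',' in line:
        return line.rsplit(',', 1)[1].strip()
    # No comma: fall back to the last whitespace-separated word
    words = line.split()
    return words[-1] if words else ''

print(extract_location('"University of Sindh, Jamshoro"'))
print(extract_location('University of Peshawar'))
```

Note that the last-word fallback is only a guess: a line such as "London School of Economics" would give "Economics", not "London".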

Solution 3:

You can split on a comma; the result is always a list, so you can check its length. If it has more than one element, the line contained at least one comma; if it has exactly one element, there was no comma.

>>> word = "something without a comma"
>>> afterSplit = word.split(',')
>>> afterSplit
['something without a comma']
>>> word2 = "something with, just one comma"
>>> afterSplit2 = word2.split(',')
>>> afterSplit2
['something with', ' just one comma']
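
Applied to the university lines, the length check might look like the sketch below. The function name split_uni_and_location is made up for illustration, and treating the last piece as the location is an assumption.

```python
def split_uni_and_location(line):
    """Return (university, location); location is None when there is no comma."""
    parts = line.split(',')
    if len(parts) > 1:
        # At least one comma: treat the last piece as the location
        return ','.join(parts[:-1]).strip(), parts[-1].strip()
    # Exactly one piece: the line had no comma
    return line.strip(), None

print(split_uni_and_location("University of Sindh, Jamshoro"))
print(split_uni_and_location("University of Peshawar"))
```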

Solution 4:

I hope this works, though I couldn't get 'London' out of it. Maybe the data needs to be in a consistent format.

with open('places.txt') as f:
    f_data = f.readlines()

stop_words = ['school', 'Economics', 'University', 'College']
places = []
for p in f_data:
    p = p.replace('"', '')
    if ',' in p:
        # Text after the last comma
        city = p.split(',')[-1].strip()
    else:
        # No comma: fall back to the last word
        city = p.split(' ')[-1].strip()
    if city not in places and city not in stop_words:
        places.append(city)
print(places)

o/p [' Lahore', ' Faisalabad', 'Lahore', 'Peshawar', ' Jamshoro']
