Monday, July 6, 2020

Parsing city of origin and destination city from a string

TL;DR

Pretty much impossible at first glance, unless you have access to some API that contains pretty sophisticated components.

In Long

At first look, it seems like you're asking to solve a natural language problem magically. But let's break it down and scope it to a point where something is buildable.

First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json

At the top of the search results, we find https://datahub.io/core/world-cities, which leads to the world-cities.json file. Now we load its entries into sets of countries and cities.

import requests
import json

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])

Now, given the data, let's try to build component ONE:

  • Task: Detect if any substring in the texts matches a city/country.
  • Tool: https://github.com/vi3k6i5/flashtext (a fast string search/match)
  • Metric: No. of correctly identified cities/countries in string
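
To make the task concrete before reaching for a library, here is a minimal pure-Python sketch of whole-word, longest-match lookup against a set of names. It is only a stand-in for illustration and not how flashtext is implemented; `extract_places` and the tiny `sample_places` set are hypothetical names made up for this example.

```python
import re

def extract_places(text, gazetteer):
    """Greedy, case-insensitive, longest-match lookup of known names in text."""
    # Map lowercased token tuples to the canonical spelling of each name.
    index = {tuple(name.lower().split()): name for name in gazetteer}
    max_len = max((len(key) for key in index), default=0)
    tokens = re.findall(r"\w+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            name = index.get(tuple(tokens[i:i + n]))
            if name is not None:
                found.append(name)
                i += n
                break
        else:
            i += 1  # No known name starts at this token.
    return found

# Hypothetical sample data, not taken from the downloaded file.
sample_places = {"Los Angeles", "Angeles", "Guadalajara", "Mexico"}
print(extract_places("from los angeles to guadalajara, mexico for usd191", sample_places))
# ['Los Angeles', 'Guadalajara', 'Mexico']
```

Trying the longest phrase first is what lets "Los Angeles" win over "Angeles". It also means a name can only be found if it is in the list in the first place.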

Let's put them together.

import requests
import json
from flashtext import KeywordProcessor

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])


keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))


texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])

[out]:

['York', 'Venice', 'Italy']

Hey, what went wrong?!

Doing due diligence, the first hunch is that "New York" is not in the data:

>>> "New York" in cities
False

What the?! #$%^&* For sanity's sake, we check these:

>>> len(countries)
244
>>> len(cities)
21940

Yes, you cannot just trust a single data source, so let's try to fetch more data sources.

From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you find another link, https://github.com/dr5hn/countries-states-cities-database, so let's munge this...

import requests
import json

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries1 = set([city['country'] for city in cities1_json])
cities1 = set([city['name'] for city in cities1_json])

dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"

cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))

countries2 = set([c['name'] for c in countries2_json])
cities2 = set([c['name'] for c in cities2_json])

countries = countries2.union(countries1)
cities = cities2.union(cities1)

And now that we are neurotic, we do sanity checks.

>>> len(countries)
282
>>> len(cities)
127793

Wow, that's a lot more cities than previously.

Let's try the flashtext code again.

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']

keyword_processor.extract_keywords(texts[0])

[out]:

['York', 'Venice', 'Italy']

Seriously?! There is no New York?! $%^&*

Okay, for more sanity checks, let's just look for "york" in the list of cities.

>>> [c for c in cities if 'york' in c.lower()]
['Yorklyn',
 'West York',
 'West New York',
 'Yorktown Heights',
 'East Riding of Yorkshire',
 'Yorke Peninsula',
 'Yorke Hill',
 'Yorktown',
 'Jefferson Valley-Yorktown',
 'New York Mills',
 'City of York',
 'Yorkville',
 'Yorkton',
 'New York County',
 'East York',
 'East New York',
 'York Castle',
 'York County',
 'Yorketown',
 'New York City',
 'York Beach',
 'Yorkshire',
 'North Yorkshire',
 'Yorkeys Knob',
 'York',
 'York Town',
 'York Harbor',
 'North York']

Eureka! It's because it's called "New York City" and not "New York"!

You: What kind of prank is this?!

Linguist: Welcome to the world of natural language processing, where natural language is a social construct subject to communal and idiolectal variation.

You: Cut the crap, tell me how to solve this.

NLP Practitioner (a real one who works on noisy user-generated texts): You just have to add to the list. But before that, check your metric given the list you already have.

For every text in your sample "test set", you should provide some truth labels so that you can "measure your metric".

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of 
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()

Actually, it doesn't look that bad. We recover 90% of the labeled terms (strictly speaking this is recall, not accuracy):

>>> true_positives / total_truth
0.9
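
The loop above only credits a term when it sits at the same position as its label, and it never reports how many extracted terms were wrong. As a hedged alternative, the sketch below scores the same kind of pairs while ignoring order and returns both precision and recall. The function `score` is a hypothetical helper. Only the first extraction is shown verbatim earlier in this post; the other three are assumed here, chosen to be consistent with the 0.9 figure.

```python
from collections import Counter

def score(extracted_and_labels):
    """Return (precision, recall) over (extracted, label) pairs, ignoring order."""
    true_pos = n_extracted = n_truth = 0
    for extracted, label in extracted_and_labels:
        # Multiset intersection: a term counts at most as often as it is labeled.
        overlap = Counter(extracted) & Counter(label)
        true_pos += sum(overlap.values())
        n_extracted += len(extracted)
        n_truth += len(label)
    precision = true_pos / n_extracted if n_extracted else 0.0
    recall = true_pos / n_truth if n_truth else 0.0
    return precision, recall

# (extracted, label) pairs; extractions for texts 2-4 are assumptions.
pairs = [
    (['York', 'Venice', 'Italy'], ('New York', 'Venice', 'Italy')),
    (['Brussels', 'Bangkok'], ('Brussels', 'Bangkok')),
    (['Los Angeles', 'Guadalajara', 'Mexico'], ('Los Angeles', 'Guadalajara')),
    (['Australia', 'New Zealand', 'Paris'], ('Australia', 'New Zealand', 'Paris')),
]
precision, recall = score(pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.82 recall=0.90
```

Under these assumptions, precision is lower than recall because 'York' and the unlabeled 'Mexico' are counted as wrong extractions.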

But I %^&*(-ing want 100% extraction!!

Alright, alright. Look at the "only" error that the above approach is making: it's simply that "New York" isn't in the list of cities.

You: Why don't we just add "New York" to the list of cities, i.e.

keyword_processor.add_keyword('New York')

print(texts[0])
print(keyword_processor.extract_keywords(texts[0]))

[out]:

['New York', 'Venice', 'Italy']

You: See, I did it!!! Now I deserve a beer.

Linguist: How about 'I live in Marawi'?

>>> keyword_processor.extract_keywords('I live in Marawi')
[]

NLP Practitioner (chiming in): How about 'I live in Jeju'?

>>> keyword_processor.extract_keywords('I live in Jeju')
[]

A Raymond Hettinger fan (from far away): "There must be a better way!"

Yes, there is. What if we just try something silly, like taking the cities whose names end with "City" and adding each name without that suffix to our keyword_processor?

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
            print(c[:-5])

It works!
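
What the loop does is easier to see on a handful of names. The sketch below applies the same idea to a tiny hypothetical set (`sample_cities` and `strip_city_suffix` are made up for illustration). One deliberate difference from the loop above: it checks for the suffix " City" with the leading space, so a name that merely ends in the letters "City" is left alone.

```python
# Hypothetical sample data, not taken from the downloaded files.
sample_cities = {"New York City", "Jeju City", "Kansas City", "Kansas", "City"}

def strip_city_suffix(names):
    """Return short forms of names ending in ' City' that are not already known."""
    suffix = " City"
    aliases = set()
    for name in names:
        if name.endswith(suffix):
            short = name[:-len(suffix)].strip()
            # Skip empty results and names the list already contains.
            if short and short not in names:
                aliases.add(short)
    return aliases

print(sorted(strip_city_suffix(sample_cities)))
# ['Jeju', 'New York']
```

"Kansas" is skipped because it is already in the set, and the bare "City" produces no alias at all.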

Now let's retry our regression test examples:

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
# Note the trailing commas: ('Florida') is just a string, ('Florida',) is a one-element tuple.
('I live in Florida', ('Florida',)),
('I live in Marawi', ('Marawi',)),
('I live in jeju', ('Jeju',))]

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()

[out]:

new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
('New York', 'Venice', 'Italy')

return flights from brussels to bangkok with etihad from €407
['Brussels', 'Bangkok']
('Brussels', 'Bangkok')

from los angeles to guadalajara, mexico for usd191
['Los Angeles', 'Guadalajara', 'Mexico']
('Los Angeles', 'Guadalajara')

fly to australia new zealand from paris from €422 return including 2 checked bags
['Australia', 'New Zealand', 'Paris']
('Australia', 'New Zealand', 'Paris')

I live in Florida
['Florida']
('Florida',)

I live in Marawi
['Marawi']
('Marawi',)

I live in jeju
['Jeju']
('Jeju',)

100% Yeah, NLP-bunga !!!

But seriously, this is only the tip of the problem. What happens if you have a sentence like this:

>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
['Adam', 'Bangkok', 'Singapore', 'China']

WHY is Adam extracted as a city?!

Then you do some more neurotic checks:

>>> 'Adam' in cities
True

Congratulations, you've jumped into another NLP rabbit hole: polysemy, where the same word has different meanings. In this case, "Adam" most probably refers to a person in the sentence, but it is coincidentally also the name of a city (according to the data you've pulled).

I see what you did there... Even if we ignore this polysemy nonsense, you are still not giving me the desired output:

[in]:

['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]

[out]:

Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)

Linguist: Even with the assumption that the preposition (e.g. from, to) preceding the city gives you the "origin" / "destination" tag, how are you going to handle the case of "multi-leg" flights, e.g.

>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')

What's the desired output of this sentence:

> Adam flew to Bangkok from Singapore and then to China

Perhaps like this? What is the specification? How (un-)structured is your input text?

> Origin: Singapore
> Destination: Bangkok
> Destination: China

Let's try to build component TWO to detect prepositions.

Let's take that assumption you have and try some hacks with the same flashtext methods.

What if we add "to" and "from" to the list?

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']


for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)
    print(extracted)
    print()

[out]:

new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']

return flights from brussels to bangkok with etihad from €407
['from', 'Brussels', 'to', 'Bangkok', 'from']

from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

fly to australia new zealand from paris from €422 return including 2 checked bags
['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']

Heh, that's a pretty crappy rule, using to/from:

  1. What if the "from" is referring to the price of the ticket?
  2. What if there's no "to/from" preceding the country/city?

Okay, let's work with the above output and see what we can do about problem 1. Maybe check whether the term after the "from" is a city or country and, if not, remove the "from"?

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']


for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)

    new_extracted = []
    extracted_next = extracted[1:]
    for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
        if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
            print(e_i, e_iplus1)
            continue
        elif e_i == 'from' and e_iplus1 is None:  # Last word in the list (already caught by the branch above).
            continue
        else:
            new_extracted.append(e_i)

    print(new_extracted)
    print()

That seems to do the trick and removes the "from" that doesn't precede a city/country.

[out]:

new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']

return flights from brussels to bangkok with etihad from €407
from None
['from', 'Brussels', 'to', 'Bangkok']

from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

fly to australia new zealand from paris from €422 return including 2 checked bags
from None
['to', 'Australia', 'New Zealand', 'from', 'Paris']

But the "from New York" still isn't solved!!
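
Assuming, as above, that the preposition preceding a place gives its role, the filtered lists could be turned into origin/destination tags along the lines of the sketch below. `tag_places` and the "Unknown" role are hypothetical names introduced for illustration, and the rule itself is only the naive assumption under discussion, not a real solution.

```python
def tag_places(tokens):
    """Label each place with the role implied by the nearest preceding preposition."""
    roles = {"from": "Origin", "to": "Destination"}
    current = "Unknown"  # No preposition has been seen yet.
    tagged = []
    for token in tokens:
        if token in roles:
            current = roles[token]  # Switch role for the places that follow.
        else:
            tagged.append((current, token))
    return tagged

print(tag_places(['from', 'Brussels', 'to', 'Bangkok']))
# [('Origin', 'Brussels'), ('Destination', 'Bangkok')]
print(tag_places(['New York', 'to', 'Venice', 'Italy']))
# [('Unknown', 'New York'), ('Destination', 'Venice'), ('Destination', 'Italy')]
```

The second call shows the case that remains open: nothing precedes "New York", so the rule has no basis for calling it the origin, and a human reader only knows it from the overall shape of the sentence.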

Linguist: Think carefully: should ambiguity be resolved by making an informed decision that makes the ambiguous phrase obvious? If so, what is the "information" in the informed decision? Should it follow a certain template first to detect the information before filling in the ambiguity?

You: I'm losing my patience with you... You're bringing me in circles and circles, where's that AI that can understand human language that I keep hearing from the news and Google and Facebook and all?!

You: What you gave me is rule-based. Where's the AI in all this?

NLP Practitioner: Didn't you want 100%? Writing "business logic" or rule-based systems would be the only way to really achieve that "100%" on a specific data set, when there is no preset data set that one can use for "training an AI".

You: What do you mean by training an AI? Why can't I just use Google or Facebook or Amazon or Microsoft or even IBM's AI?

NLP Practitioner: Let me introduce you to

Welcome to the world of Computational Linguistics and NLP!

In Short

Yes, there's no real ready-made magical solution, and if you want to use an "AI" or machine learning algorithm, you would most probably need a lot more training data like the texts_labels pairs shown in the above example.



from Hacker News https://ift.tt/2O8lw5H
