How Computers “Read”

Strings

Text data always starts out in the form of a string. A string is a sequence (list) of characters, which includes the (English) alphanumeric characters as well as special characters like white space and punctuation. Other symbols, such as emoticons, can also appear in your data, especially in data from social networks. Besides that, you may find letters from other languages (Spanish, Portuguese, etc.) or other alphabets (Cyrillic, hiragana, etc.), and even ideographs (such as Han characters).

Characters as numbers

There is a basic set of 128 characters, called the ASCII characters, which are encoded in the same way by all computers, so you will never have trouble with them. They include the English letters (without accents), the digits, basic punctuation (but not curly quote marks or long dashes), white space, control characters such as the new line, represented in programming languages as \n, and other symbols familiar to you, such as the dollar ($) and the hash (#) symbols. The complete list is easy to find on the Internet.

Non-ASCII characters can be encoded in different ways by different computers or different text editors. If you capture string data on your own, you will probably find some of these characters in your data. Even documents that are expected to be in English can be contaminated by other languages: Han characters, German umlauts, the Spanish eñe, etc.

The preferred encoding is UTF-8 (utf-8), which is the default encoding on Macintosh computers. When reading and writing text files in Pandas, the encoding argument lets you manage both UTF-8 and the alternative encoding Latin-1 (latin1). Windows computers use their own system, which is region specific. In the US and Western Europe this is Windows-1252, which is very close to Latin-1, though not exactly the same.
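For instance, a minimal sketch of forcing an encoding when reading a CSV file in Pandas (the file name is hypothetical, just for illustration):

import pandas as pd

df = pd.read_csv('jobs.csv', encoding='utf-8')    # the usual choice
df = pd.read_csv('jobs.csv', encoding='latin1')   # for Latin-1 / Windows-1252 files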

the_letter_A = chr(65)  # this
the_letter_A = 'A'      # is the same as this

the_letter_A 
'A'
for i in range(97,97+26):
    print(i, chr(i))
97 a
98 b
99 c
100 d
101 e
102 f
103 g
104 h
105 i
106 j
107 k
108 l
109 m
110 n
111 o
112 p
113 q
114 r
115 s
116 t
117 u
118 v
119 w
120 x
121 y
122 z
# This one is from the Arabic script; it may not render on your computer
print(chr(0x0604))
؄

Strings as lists

Both lists and strings are sequences, so they share some core methods. A string can be regarded as a list of characters (letters, numbers, punctuation, etc.). For instance, the function len gives you the number of characters in a string, and slicing works just as it does for lists:

iese = 'IESE Business School'
iese
'IESE Business School'
len(iese)
20
iese[:4]
'IESE'

You try it

Write your name as a string, then convert it to a list of numbers using ord() (the inverse of chr).

...
Ellipsis
Solution
myname = "Enric Junque de Fortuny"
[ ord(c) for c in myname]
[69,
 110,
 114,
 105,
 99,
 32,
 74,
 117,
 110,
 113,
 117,
 101,
 32,
 100,
 101,
 32,
 70,
 111,
 114,
 116,
 117,
 110,
 121]

Python string methods

Besides the methods shared with lists, Python has a collection of methods for manipulating strings:

  • lower(): makes the string all lowercase. This is useful when you are comparing strings and want the comparison to be case insensitive.
  • upper(): the same, but all uppercase.
  • replace(): replaces a substring with something else, just like the usual “Find and Replace” of text editors (see the sketch below).
  • split(): splits a string into segments (creating a list of segment strings). The split can be based on any separator; if no separator is specified, any whitespace string (containing only white space, line breaks, or tabs) is a separator.
  • count(): counts the number of occurrences of a pattern within a string:
name = 'Enric Junque de Fortuny'

print(name.upper(),name.lower())
ENRIC JUNQUE DE FORTUNY enric junque de fortuny
print("Number of e's:",name.count('e'),"---",name.lower().count('e'))
Number of e's: 2 --- 3
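replace() and split() were not shown above; a quick sketch, reusing the same name variable:

print(name.replace('Junque', 'J.'))   # 'Enric J. de Fortuny'
print(name.split(' '))                # ['Enric', 'Junque', 'de', 'Fortuny']
print(name.split())                   # same here: no separator means any whitespace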

Pandas string methods

When a Pandas column contains strings, we can access its string methods through the .str accessor. Pandas string methods are vectorized versions of the methods above: they return a series of the same length, in which each term results from applying the string function to the corresponding term of s. The syntax is typically s.str.fname(args), where s is a Pandas series of strings (or a list of strings) and args is a set of ad hoc arguments.
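A minimal sketch on a small made-up series (the values are just for illustration):

import pandas as pd

s = pd.Series(['IESE Business School', 'Google', 'YouTube'])
s.str.lower()          # vectorized lower(): one lowercased string per term
s.str.contains('oo')   # Boolean mask: True where the pattern appears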

Application: Google Job Post Analytics

As a recent graduate, Niyamat Ullah was eager to know which languages, skills, and experience would be beneficial to add to his toolbox in order to increase his chances of getting a job at Google. Fortunately, Google provides a comprehensive list of job opportunities on its official careers website at careers.google.com. To gather valuable insights, Niyamat employed Selenium, a tool capable of capturing (crawling) the source code of each job page, enabling the extraction of essential information such as job titles, locations, responsibilities, and minimum and preferred qualifications.

The data set

The file google.csv contains data for 1,250 jobs. The variables are:

  • company: either Google or YouTube.

  • title: the title of the job.

  • category: the category of the job.

  • location: the location of the job.

  • responsibilities: the responsibilities for the job.

  • minqual: the minimum qualifications for the job.

  • prefqual: the preferred qualifications for the job.

Source: Kaggle.

Importing the data

First, download the dataset from here; I’ve already cleaned it up for you (no NA’s and no duplicates). We then load the CSV file that contains the data. The encoding is typically specified for files that contain text, to prevent problems with systems that do not use UTF-8 (such as Windows). This is probably not needed here, but it may save trouble in other cases.

import pandas as pd
df = pd.read_csv('./resources/google_clean.csv', encoding='utf_8')

Exploring the company & job titles

The first column contains the company, which is either Google or YouTube (only a few cases), so the analysis is very simple. The method value_counts does the job.

df['company'].value_counts()
company
Google     1092
YouTube      20
Name: count, dtype: int64
df['title'].value_counts().head(5)
title
Business Intern 2018                        33
Field Sales Representative, Google Cloud    17
Interaction Designer                        12
MBA Intern, Summer 2018                     10
User Experience Researcher                   9
Name: count, dtype: int64

The interns seem to dominate the picture but, with 794 different titles, we are missing quite a few here. So let’s check a couple of other possibilities.

You try it

Remember that the method str.contains returns a Boolean mask indicating whether a given pattern is contained in each term of the series. Use it to find the number of jobs that have the following keywords in the title: ‘Intern’, ‘Sales’, ‘Cloud’, ‘Google Cloud’.

...
Ellipsis
Solution
f_intern      = df['title'].str.lower().str.contains('intern')
f_sales       = df['title'].str.lower().str.contains('sales')
f_cloud       = df['title'].str.lower().str.contains('cloud')
f_googlecloud = df['title'].str.lower().str.contains('google cloud')

# The individual breakdown:
print(f_intern.sum(), f_sales.sum(), f_cloud.sum(), f_googlecloud.sum())

# What's the difference between this approach:
print(f_intern.sum() + f_sales.sum() + f_cloud.sum() + f_googlecloud.sum())


# And this one?
matches_any = f_intern | f_sales | f_cloud | f_googlecloud  # avoid shadowing the built-in all()
print(matches_any.sum())

# Note that in all of these, 'internet' is counted as an 'intern'!
# To avoid this issue, use a \b (word boundary) at the end:
f_intern      = df['title'].str.lower().str.contains("intern\\b")
matches_any = f_intern | f_sales | f_cloud | f_googlecloud
print(matches_any.sum())
108 135 266 248
757
430
416

Exploring the location

df['location'].head()
0                           Singapore
1                     Shanghai, China
2         New York, NY, United States
3    Mountain View, CA, United States
4        Sunnyvale, CA, United States
Name: location, dtype: object

It is typical of these data sets that the location comes as ‘city, state, country’ for the US and Canada (sometimes also for India and others), but as ‘city, country’ for other countries. The country can be obtained by splitting the string into comma-separated pieces and taking the last one.

splitted = df['location'].str.split(', ')

splitted.head()
0                           [Singapore]
1                     [Shanghai, China]
2         [New York, NY, United States]
3    [Mountain View, CA, United States]
4        [Sunnyvale, CA, United States]
Name: location, dtype: object

The Pandas .str accessor can also operate on a Series of lists of strings, though with a limited set of operations. Supported operations include:

  • .len() to get the length of each list
  • .join(delimiter) to combine the elements of each list into a single string
  • [index] to access specific elements within each list

splitted.str.len().head()
0    1
1    2
2    3
3    3
4    3
Name: location, dtype: int64
splitted.str[-1].head()
0        Singapore
1            China
2    United States
3    United States
4    United States
Name: location, dtype: object
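The .join(delimiter) operation was not demonstrated; a quick sketch that reassembles each list into a single string:

splitted.str.join(' | ').head()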

You try it

Extract the country from the location field, then count the number of job openings in each country.

Bonus: plot the top 10.

...
Ellipsis
Simple Solution
country = df['location'].str.split(', ').str[-1]
country.value_counts()
country.value_counts()[:15][::-1].plot.barh();

On a map (this requires that you first download countries.geojson and then install geopandas with !pip install geopandas):

import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt

# We want to plot the country job ad counts; we'll take the log
# of the counts to get a nicer result. The third line renames the U.S.
# so that it is recognized by the map commands.
country_counts = country.value_counts().rename('job_count').rename_axis('country').reset_index()
country_counts["log_job_count"] = np.log(country_counts.job_count)
country_counts.loc[country_counts.country == 'United States','country'] = 'United States of America'

# Here, we load the boundaries of countries to plot and merge the
# count data with the geographical boundary data.
world    = gpd.read_file("./resources/countries.geojson")
merged   = world.merge(country_counts, left_on='ADMIN', right_on='country', how='left').fillna(0)

# Now, we plot the countries and fill in the countries with a color
# proportional to the (log) count
fig, ax  = plt.subplots(figsize=(12, 8),dpi=200,facecolor='black')
merged.plot(column='log_job_count', cmap='Blues', linewidth=0.8, ax=ax, edgecolor='grey', legend=False)
ax.set_facecolor('black')
plt.tight_layout()
plt.show()

Analysis of the preferred qualifications

What do they expect a candidate to know? What qualifications are they looking for? The preferred qualifications are in the column ‘prefqual’; let’s take a look:

# explore
df.prefqual[0]
'Experience in the business technology market as a program manager in SaaS cloud computing and or emerging technologies Significant cross functional experience across engineering sales and marketing teams in cloud computing or related technical fields Proven successful program outcomes from idea to launch in multiple contexts throughout your career Ability to manage the expectations demands and priorities of multiple internal stakeholders based on overarching vision and success for global team health Ability to work under pressure and possess flexibility with changing needs and direction in a rapidly growing organization Strong organization and communication skills'

Let’s create a word-cloud of all the terms mentioned to get a better view. The steps are:

  1. Convert to lower case, so that Experience and experience are treated the same.
  2. Split each string into its individual words using .split().
  3. Bunch them together in one big super-list.
  4. Analyze / plot.
# Step 1 and 2
experience_splitted = df.prefqual.str.lower().str.split(' ')
experience_splitted[:5]
0    [experience, in, the, business, technology, ma...
1    [bsee, bsme, or, bsie, degree, experience, of,...
2    [experience, partnering, or, consulting, cross...
3    [experience, as, a, software, developer, archi...
4    [cts, certification, experience, in, the, cons...
Name: prefqual, dtype: object
# Step 3
# experience_splitted[0] + experience_splitted[1] + ...

all_words = experience_splitted.sum()
all_words[:5]
['experience', 'in', 'the', 'business', 'technology']

The top words are …

pd.Series(all_words).value_counts().head()
and           5800
to            2676
in            2214
with          2202
experience    2168
Name: count, dtype: int64

As could be expected, at the top we find terms that do not convey information, which is typical. These terms are called stopwords. Lists of stopwords for many languages can be found in online sources. Let us load the one provided for this session, called stopwords.csv (download it from here). There is only one word in every row.

# stopwords
stopwords = pd.read_csv("./resources/stopwords.csv")

As it is just a list of words, we might as well convert it to a plain Python list now:

stopwords = stopwords.iloc[:,0].values.tolist()
stopwords[:5]
['a', "a's", 'able', 'about', 'above']

Next, we want to drop all the stopwords from the list all_words:

terms_wo_stopwords = [t for t in all_words if t not in stopwords]
terms_wo_stopwords[:5]
['experience', 'business', 'technology', 'market', 'program']
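A small aside: t not in stopwords scans the whole list for every word, so this comprehension gets slow when all_words is long. A sketch of the usual fix, converting the stopword list to a set for constant-time lookups:

stopword_set = set(stopwords)   # hashed membership tests instead of a linear scan
terms_wo_stopwords = [t for t in all_words if t not in stopword_set]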

You try it

What are the 10 most frequent terms used after filtering out stopwords?

...
Ellipsis
Solution
pd.Series(terms_wo_stopwords).value_counts().head(10)
experience       2168
ability          1618
skills           1224
management        645
demonstrated      542
excellent         505
work              504
business          501
communication     501
strong            453
Name: count, dtype: int64

Let’s try to plot this as a wordcloud. The beauty and power of Python is that, regardless of your objective, you can always find some package that will get close to what you want:

! pip3 install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Plot the word cloud
def plot_wordcloud(words):
    wordcloud = WordCloud(
        width=1200, 
        height=400, 
        background_color='black', 
        colormap='cividis',
        collocations=False,
    ).generate(words)

    plt.figure(figsize=(15, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

plot_wordcloud(' '.join(terms_wo_stopwords))

The top term seems to be experience. How often do the preferred qualifications include it?

prefqual = df['prefqual']
prefqual.str.contains('experience').mean().round(3)
0.48

Pretty often: the mean of a Boolean mask is the proportion of True values, so 48% of the postings mention it. Let’s check a few more terms:

print(prefqual.str.contains('ability').mean().round(3)*100)
print(prefqual.str.contains('skills').mean().round(3)*100)
print(prefqual.str.contains('management').mean().round(3)*100)
51.6
69.69999999999999
41.4
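Repeating one print per term gets tedious; a compact alternative (just a sketch) loops over a list of terms:

for term in ['ability', 'skills', 'management']:
    print(term, round(prefqual.str.contains(term).mean() * 100, 1), '%')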

Let’s check out programming skills:

print(prefqual.str.contains('python').mean().round(3)*100)
print(prefqual.str.contains('sql').mean().round(3)*100)
0.0
0.0

You try it

Zero? Note that str.contains is case sensitive, and the postings write ‘Python’ and ‘SQL’ with capital letters. After lowercasing, about 6% of jobs prefer a qualification of Python or SQL. What about the MINIMUM qualifications?

...
Ellipsis
Solution
minqual = df.minqual.str.lower()
minqual.str.contains('python').mean().round(3), minqual.str.contains('sql').mean().round(3)
(0.082, 0.058)

References

Adapted and expanded from: Miguel Ángel Canela