the_letter_A = chr(65) # this
the_letter_A = 'A' # is the same as this
the_letter_A
'A'
Text data always starts out in the form of a string. A string is a sequence (list) of characters. This includes the (English) alphanumeric characters and also special characters like white space, punctuation, etc. Other symbols, like emoticons, can also appear in your data, especially in social network data. Besides that, you can also find letters from other languages (Spanish, Portuguese, etc.) or alphabets (Cyrillic, hiragana, etc.), and even ideographs (such as Han characters).
There is a basic set of 128 characters, called the ASCII characters, which are encoded in the same way by all computers, so you will never have trouble with them. They include the English letters (without accents), the numbers, basic punctuation (not curly quote marks or long dashes), white space, control characters such as the new line, represented in programming languages as \n, and other symbols familiar to you, such as the dollar ($) and the hash (#) symbols. The complete list can easily be found on the Internet.
Non-ASCII characters can be encoded by different computers or different text editors in different ways. Mind that, if you capture string data on your own, you will probably find some of these characters in your data. Even when the documents are expected to be in English, they can be contaminated by other languages: Han characters, German umlauts, the Spanish eñe, etc.
The preferred encoding is UTF-8 (utf-8), which is the default encoding on Macintosh computers. When reading and writing text files in Pandas, the argument encoding allows you to manage both UTF-8 and the alternative encoding Latin-1 (latin1). Windows computers use their own system, which is region-specific. In the US and Western Europe, this is Windows-1252, which is very close to Latin-1, though not exactly the same.
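The table below lists the ASCII codes of the lowercase letters. It was presumably produced by a cell along these lines (a minimal sketch, pairing each code with its character via chr()):

# Print the ASCII code and the corresponding character
# for the lowercase letters (codes 97 to 122):
for i in range(97, 123):
    print(i, chr(i))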
97 a
98 b
99 c
100 d
101 e
102 f
103 g
104 h
105 i
106 j
107 k
108 l
109 m
110 n
111 o
112 p
113 q
114 r
115 s
116 t
117 u
118 v
119 w
120 x
121 y
122 z
Both lists and strings are sequences, so they share some core methods. A string can be regarded as a list of characters (letters, numbers, punctuation, etc.). For instance, the function len gives you the number of characters of a string:
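An illustrative example (not from the original notebook):

len('Google')
6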
You try it
Write your name as a string, then convert it to a list of numbers using ord() (this is the inverse of chr()).
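A possible solution, sketched with a made-up name:

# ord() maps each character to its code; a list comprehension
# applies it to every character of the string:
name = 'Enric'
[ord(c) for c in name]
[69, 110, 114, 105, 99]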
Besides the methods shared with lists, Python has a collection of methods for manipulating strings:
- lower(): makes the string all lowercase; this is useful when you’re comparing strings and want the comparison to be case insensitive.
- upper(): same, but makes it all uppercase.
- replace(): replaces a substring with something else, just like the usual “Find and Replace” feature of text editors.
- split(): splits a string into segments (creating a list of segment strings). The split can be based on any separator. If no separator is specified, any whitespace string (containing only white space, line breaks or tabs) is a separator.
- count(): counts the number of occurrences of a pattern within a string.
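The output below was presumably produced by a cell along these lines (the name is just an example):

name = 'Enric Junque de Fortuny'
print(name.upper(), name.lower())
ENRIC JUNQUE DE FORTUNY enric junque de fortuny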
When a Pandas column contains strings, we can access its string methods. Pandas string methods are vectorized versions of the methods above. They return a series of the same length, in which each term results from applying a string function to the corresponding term of s. The syntax of these functions is typically s.str.fname(args), where s is a Pandas series of strings (or a list of strings) and args is a set of ad hoc arguments.
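An illustrative example (not from the original notebook):

import pandas as pd

s = pd.Series(['New York', 'SINGAPORE'])
s.str.lower()
0     new york
1    singapore
dtype: object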
As a recent graduate, Niyamat Ullah is eager to know which languages, skills, and experience would be beneficial to add to his toolbox in order to increase his chances of getting a job at Google. Fortunately, Google provides a comprehensive list of job opportunities on their official careers website at careers.google.com. To gather valuable insights, Niyamat employed Selenium, a tool capable of capturing (crawling) the source code of each job page, enabling the extraction of essential information such as job titles, locations, responsibilities, and minimum and preferred qualifications.
The file google.csv contains data for 1,250 jobs. The variables are:
- company: either Google or YouTube.
- title: the title of the job.
- category: the category of the job.
- location: the location of the job.
- responsibilities: the responsibilities for the job.
- minqual: the minimum qualifications for the job.
- prefqual: the preferred qualifications for the job.
Source: Kaggle.
First download the dataset from here - I’ve already cleaned it up for you (no NA’s and no duplicates). We then load the CSV file that contains the data. The encoding is typically specified for files that contain text, to prevent problems with systems that do not use UTF-8 (Windows). This is probably not needed here, but it may save trouble in other cases.
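A sketch of the loading step (the file path is an assumption):

import pandas as pd

df = pd.read_csv('google.csv', encoding='utf-8')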
The first column has the company. The analysis is very simple, since it is either Google or YouTube (a few cases), and the method value_counts does the job. The titles are more varied; the same method, applied to that column, gives the most frequent ones:
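A minimal sketch of that step:

df['title'].value_counts().head()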
title
Business Intern 2018 33
Field Sales Representative, Google Cloud 17
Interaction Designer 12
MBA Intern, Summer 2018 10
User Experience Researcher 9
Name: count, dtype: int64
The interns seem to dominate the picture, but, with 794 different titles, we are missing out on quite a few here. So, let’s check a couple of other possibilities.
You try it
Remember that the method str.contains returns a Boolean mask indicating, for each term of the series, whether it contains a given pattern. Use it to find the number of jobs which have the following keywords in the title: ‘Intern’, ‘Sales’, ‘Cloud’, ‘Google Cloud’.
f_intern = df['title'].str.lower().str.contains('intern')
f_sales = df['title'].str.lower().str.contains('sales')
f_cloud = df['title'].str.lower().str.contains('cloud')
f_googlecloud = df['title'].str.lower().str.contains('google cloud')
# The individual breakdown:
print(f_intern.sum(), f_sales.sum(), f_cloud.sum(), f_googlecloud.sum())
# What's the difference between this approach:
print(f_intern.sum() + f_sales.sum() + f_cloud.sum() + f_googlecloud.sum())
# And this one? The union counts each title once, even if it matches
# several keywords (named matches_any rather than all, which would
# shadow the Python built-in):
matches_any = f_intern | f_sales | f_cloud | f_googlecloud
print(matches_any.sum())
# Note that in all of these: internet is counted as an intern!
# To avoid this issue, use a \b (word boundary) at the end of the
# pattern (str.contains interprets it as a regular expression):
f_intern = df['title'].str.lower().str.contains(r'intern\b')
matches_any = f_intern | f_sales | f_cloud | f_googlecloud
print(matches_any.sum())
108 135 266 248
757
430
416
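Next, the location column. A sketch of the inspection step:

df['location'].head()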
0 Singapore
1 Shanghai, China
2 New York, NY, United States
3 Mountain View, CA, United States
4 Sunnyvale, CA, United States
Name: location, dtype: object
It is typical of these data sets that the location comes as ‘city, state, country’ for the US and Canada (sometimes also for India and others), but as ‘city, country’ for other countries. The country can be extracted by splitting the string into comma-separated parts and then taking the last one.
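A sketch of the splitting step:

df['location'].str.split(', ').head()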
0 [Singapore]
1 [Shanghai, China]
2 [New York, NY, United States]
3 [Mountain View, CA, United States]
4 [Sunnyvale, CA, United States]
Name: location, dtype: object
The Pandas .str accessor can also operate on a Series of lists of strings, though with a limited set of operations. Supported operations include:
- .len() to get the length of each list
- .join(delimiter) to combine the elements of each list into a single string
- [index] to access specific elements within each list
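Taking the last element of each list gives the country. A sketch of that step:

df['location'].str.split(', ').str[-1].head()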
0 Singapore
1 China
2 United States
3 United States
4 United States
Name: location, dtype: object
You try it
Extract the country from the location field, then count the number of job openings in each country.
Bonus: plot the top 10 on a map. This requires downloading countries.geojson first and then installing geopandas (!pip install geopandas):
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
# We want to plot the country job ad counts; we take the log
# of that to get a nicer result. The third line renames the U.S.
# so that it is recognized by the map commands.
country_counts = country.value_counts().rename('job_count').rename_axis('country').reset_index()
country_counts["log_job_count"] = np.log(country_counts.job_count)
country_counts.loc[country_counts.country == 'United States','country'] = 'United States of America'
# Here, we load the boundaries of countries to plot and merge the
# count data with the geographical boundary data.
world = gpd.read_file("./resources/countries.geojson")
merged = world.merge(country_counts, left_on='ADMIN', right_on='country', how='left').fillna(0)
# Now, we plot the countries and fill in the countries with a color
# proportional to the (log) count
fig, ax = plt.subplots(figsize=(12, 8), dpi=200, facecolor='black')
merged.plot(column='log_job_count', cmap='Blues', linewidth=0.8, ax=ax, edgecolor='grey', legend=False)
ax.set_facecolor('black')
plt.tight_layout()
plt.show()
What do they expect a candidate to know? What qualifications are they looking for? The preferred qualifications are in the column ‘prefqual’. Let’s take a look:
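A sketch of the inspection, showing the first entry of the column:

df['prefqual'][0]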
'Experience in the business technology market as a program manager in SaaS cloud computing and or emerging technologies Significant cross functional experience across engineering sales and marketing teams in cloud computing or related technical fields Proven successful program outcomes from idea to launch in multiple contexts throughout your career Ability to manage the expectations demands and priorities of multiple internal stakeholders based on overarching vision and success for global team health Ability to work under pressure and possess flexibility with changing needs and direction in a rapidly growing organization Strong organization and communication skills'
Let’s create a word cloud of all the terms mentioned to get a better view. The steps are:
1. Convert everything to lowercase, so that Experience and experience are treated the same.
2. Split each entry into a list of words with .split().
3. Concatenate all the word lists into a single list.
# Steps 1 and 2: lowercase, then split on spaces
experience_splitted = df.prefqual.str.lower().str.split(' ')
experience_splitted[:5]
0 [experience, in, the, business, technology, ma...
1 [bsee, bsme, or, bsie, degree, experience, of,...
2 [experience, partnering, or, consulting, cross...
3 [experience, as, a, software, developer, archi...
4 [cts, certification, experience, in, the, cons...
Name: prefqual, dtype: object
# Step 3
# experience_splitted[0] + experience_splitted[1] + ...
all_words = experience_splitted.sum()
all_words[:5]
['experience', 'in', 'the', 'business', 'technology']
The top words are …
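A sketch of the counting step, converting the list of words to a Series:

pd.Series(all_words).value_counts().head()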
and 5800
to 2676
in 2214
with 2202
experience 2168
Name: count, dtype: int64
As could be expected, we find at the top terms that do not convey information, which is typical. These terms are called stopwords. Lists of stopwords for many languages can be found in online sources. Let us load the one provided for this session, called stopwords.csv (download it from here). There is one word in each row.
As it’s just a list of words, we might as well convert it to that representation now:
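A sketch of the loading and conversion (the exact read arguments are an assumption):

stopwords = pd.read_csv('stopwords.csv', header=None)[0].tolist()
stopwords[:5]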
['a', "a's", 'able', 'about', 'above']
Next, we want to drop all the stopwords from the list all_words:
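A sketch of the filtering step, using a list comprehension (the name terms_wo_stopwords is the one used by the word-cloud cell below):

terms_wo_stopwords = [w for w in all_words if w not in stopwords]
terms_wo_stopwords[:5]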
['experience', 'business', 'technology', 'market', 'program']
You try it
What are the 10 most frequent terms used after filtering out stopwords?
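A possible solution (a sketch):

pd.Series(terms_wo_stopwords).value_counts().head(10)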
experience 2168
ability 1618
skills 1224
management 645
demonstrated 542
excellent 505
work 504
business 501
communication 501
strong 453
Name: count, dtype: int64
Let’s try to plot this as a wordcloud. The beauty and power of Python is that - regardless of your objective - you can always find some package that gets close to what you want:
!pip install wordcloud
Requirement already satisfied: wordcloud in /home/enric/miniconda3/envs/dsfb/lib/python3.11/site-packages (1.9.4)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Plot the word cloud
def plot_wordcloud(words):
    wordcloud = WordCloud(
        width=1200,
        height=400,
        background_color='black',
        colormap='cividis',
        collocations=False,
    ).generate(words)
    plt.figure(figsize=(15, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

plot_wordcloud(' '.join(terms_wo_stopwords))
The top term seems to be experience. How often do the preferred qualifications include it?
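A sketch of that check (assuming prefqual, used in the cells below, is simply the column):

prefqual = df['prefqual']
print(prefqual.str.contains('experience').mean().round(3)*100)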
Pretty often. Let’s check a few more terms:
print(prefqual.str.contains('ability').mean().round(3)*100)
print(prefqual.str.contains('skills').mean().round(3)*100)
print(prefqual.str.contains('management').mean().round(3)*100)
51.6
69.69999999999999
41.4
Let’s check out programming skills:
print(prefqual.str.contains('python').mean().round(3)*100)
print(prefqual.str.contains('sql').mean().round(3)*100)
0.0
0.0
You try it
The zeros are due to case sensitivity: str.contains matches exactly, and the postings write ‘Python’ and ‘SQL’. After lowercasing, about 6% of jobs prefer a qualification of Python or SQL. What about MINIMUM qualifications?
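A possible solution (a sketch, lowercasing first to make the match case-insensitive):

(df['minqual'].str.lower().str.contains('python').mean().round(3),
 df['minqual'].str.lower().str.contains('sql').mean().round(3))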
(0.082, 0.058)
Adapted and expanded from: Miguel Ángel Canela