In Germany, we students have a website on which we can write reviews of our universities. At the moment it lists 509 universities with about 208,000 reviews from students, most with very rich content. My task is to fetch the data onto my local computer and analyze how students across the country view their studies.

This project is divided into two main parts. Firstly, the data has to be scraped from the website and stored in a database; raw data is perfectly fine at this stage. Secondly, the data has to be processed into the right form for the learning algorithm to consume.

Since the data could be quite big, I employ a cluster of 8 machines with one central PostgreSQL database. The code to fetch the data is quite simple and written in Python, which is, in my opinion, the right tool for the job since CPU performance is not the bottleneck. To deploy the agents to the cloud, I use a combination of Docker containers and Ansible, which allows me to parallelize the deployment across many machines.

Note: I will post as little Python code as possible, since Python is not as universal as SQL. In the data-analysis section, however, Python code will be unavoidable.

Let’s start with the data schema. To understand which data is needed, I had to read the HTML code of the website many times to find a pattern that I could formalize in Python. The structure of the raw data can be expressed with the following relational database script:

CREATE TABLE uni(
id serial primary key,
uri text unique not null,
created date default now(),
fetched boolean default false
);

CREATE TABLE entry(
id serial primary key,
uni_id bigint references uni(id) on delete cascade on update cascade,
fetched boolean default false,
uri text unique not null
);

CREATE TABLE entry_page(
id serial primary key,
content text,
fetched boolean default false
);

Basically, I create a simple table to store information about each individual university. Each university’s list of reviews can be reached easily by URI, since the developers of the website use a very human-readable REST API (https://www.studycheck.de/hochschulen/seite-<index>).
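To make the pattern concrete, here is a tiny sketch of how the list-page URIs can be generated. Only the URI pattern comes from the site; the page range below is a made-up placeholder:

```python
# The URI pattern is from the site; the range bound is a placeholder.
BASE = "https://www.studycheck.de/hochschulen/seite-{}"

uris = [BASE.format(i) for i in range(1, 4)]
print(uris[0])  # https://www.studycheck.de/hochschulen/seite-1
```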

Every fetched review URI of each university is stored in the second table. The third table stores the raw data of each review.

To balance the workload between the agents, I wrote a simple SQL query so each agent can claim a new task by itself, without human coordination. For example, since we have 506 universities, each agent claims one university at a time and fetches all of its reviews using the following script:

WITH updated AS (UPDATE uni SET fetched = true where id =
(select id from uni where fetched = false limit 1)  RETURNING id)

SELECT id, uri FROM uni WHERE id = (SELECT id FROM updated);

The script, however, has a weakness: I did not have time to read up in the PostgreSQL documentation on how to lock a row while querying it. Therefore this query will sometimes let two agents claim the same university and do duplicate work. But since duplicated data is not a problem here, I can accept a few extra KBs in the database.
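For reference, the duplicate-work problem could probably be solved with PostgreSQL’s row-level locking. The following is an untested sketch assuming PostgreSQL 9.5 or newer, where `FOR UPDATE SKIP LOCKED` lets concurrent agents skip rows already claimed by another transaction:

```sql
-- Sketch only: claim one unfetched university without racing other agents.
WITH updated AS (
    UPDATE uni SET fetched = true
    WHERE id = (
        SELECT id FROM uni
        WHERE fetched = false
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, uri
)
SELECT id, uri FROM updated;
```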

The Python code that executes the querying and scraping will be deployed to every machine of the cluster with Ansible. The following is the playbook:

---
- name: Setup cluster
  hosts: all
  tasks:
    - name: Clone repository
      shell: cd /home/long/scraper && git pull && docker-compose up -d --build

It does nothing more than log into each remote server via SSH, change to the directory /home/long/scraper, pull the newest code from the git repository, and rebuild the Docker containers.

The same process is repeated for fetching each review’s content. By the way, the scraping can be done with any of the Python libraries out there; I am a fan of BeautifulSoup4.

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(web_content, 'html.parser')
ul_list = soup.find_all("ul", {"class": "list-unstyled"})

div = soup.find("div", {'class': 'report-text'})
report = re.sub(' +', ' ', div.text)  # The actual review

for ul in ul_list:
    for li in ul.find_all('li'):
        processed_text = re.sub(' +', ' ', li.text).strip().replace("'", "\"")
        if processed_text.startswith('Studieninhalte'):
            studieninhalte = processed_text
        if processed_text.startswith("Dozenten"):
            dozenten = processed_text
        if processed_text.startswith("Lehrveranstaltungen"):
            lehrveranstaltungen = processed_text
        # ...etc

As you can see, the code is very simple, and you don’t have to be a coding expert to understand it. If you are familiar with Python, you might even say the code is pretty dumb, but I don’t like the kind of elegant one-liner that nobody except its author can understand. Dumb code is, to me, a sign of a mature coder.

After the final step, the data has to be stored in a spreadsheet-like database table; imagine an Excel sheet with about 200,000 rows. So let’s do it:

CREATE TABLE processed_entry_page(
id serial primary key,
entry_page_id bigint references entry_page(id) on delete cascade on update cascade,
studieninhalte varchar(1000),
dozenten varchar(1000),
lehrveranstaltungen varchar(1000),
ausstattung varchar(1000),
organisation varchar(1000),
gesamtbewertungen varchar(1000),
alter varchar(1000),
geschlecht varchar(1000),
studienbeginn varchar(1000),
studienform varchar(1000),
weiterempfehlung varchar(1000),
geschrieben_am varchar(1000),
pro text,
contra text,
views varchar(1000),
likes varchar(1000),
report text
);

The table structure contains every attribute I could find in the HTML code. Now we can move on to the data-analysis step.

For this step, I utilize the standard machine learning stack of Python, which includes Matplotlib, Seaborn, NumPy, Pandas, and Scikit-learn.

import pandas as pd

# df is assumed to be loaded from the processed_entry_page table
df = df.drop_duplicates()  # Drop duplicate rows
df.describe()
gesamtbewertungen alter studienbeginn weiterempfehlung likes
count 203998.000000 159204.000000 184904.000000 203998.000000 41769.000000
mean 3.678591 22.599658 2014.327711 0.897524 2.849913
std 0.659417 3.061733 2.186961 0.303275 5.311090
min 1.000000 17.000000 1992.000000 0.000000 1.000000
25% 3.300000 19.000000 2013.000000 1.000000 1.000000
50% 3.800000 22.000000 2014.000000 1.000000 2.000000
75% 4.200000 25.000000 2016.000000 1.000000 4.000000
max 5.000000 36.000000 2019.000000 1.000000 200.000000

The dataset’s first review is from the year 1992 and the latest from 2019. The reviewers’ ages range from 17 to 36. The reviews rate universities in stars (from one to five). Students can also state whether they would recommend their university.

For my purpose, I am only interested in the natural-language review and the recommendation, but it cannot hurt to explore some other features of the dataset.

List every feature:

list(df.columns)

>> ['studieninhalte',
'dozenten',
'lehrverantstaltungen',
'ausstattung',
'organisation',
'gesamtbewertungen',
'alter',
'geschlecht',
'studienbeginn',
'studienform',
'weiterempfehlung',
'geschrieben_am',
'pro',
'contra',
'views',
'likes',
'report']

General info about every feature: many features need more preprocessing, which is, however, beyond the scope of this article.

df.info()

>> Int64Index: 203998 entries, 0 to 343767
Data columns (total 17 columns):
studieninhalte          203563 non-null object
dozenten                203121 non-null object
lehrverantstaltungen    203970 non-null object
ausstattung             203836 non-null object
organisation            203014 non-null object
gesamtbewertungen       203998 non-null float64
alter                   159204 non-null float64
geschlecht              203653 non-null object
studienbeginn           184904 non-null float64
studienform             203865 non-null object
weiterempfehlung        203998 non-null int64
geschrieben_am          203998 non-null object
pro                     38983 non-null object
contra                  35511 non-null object
views                   203998 non-null object
likes                   41769 non-null float64
report                  203998 non-null object
dtypes: float64(4), int64(1), object(12)
memory usage: 28.0+ MB

Let’s see how much null data we have. The columns ‘likes’, ‘pro’, and ‘contra’ are as full of holes as Swiss cheese; they are therefore not usable.

import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 20))
sns.heatmap(df.isna())
plt.savefig('null_data.png')
plt.show()

Calculating how much data we are missing: ‘pro’ and ‘contra’, as well as ‘likes’, lead with about 80% missing data each and are unusable. The attribute ‘alter’ (age) is also missing about a fifth of its values, so imputation must be done carefully without shifting the data’s standard deviation and variance. The recommended technique here is mean/median imputation.
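As a toy illustration of mean vs. median imputation (made-up ages, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with missing values
s = pd.Series([18, 19, 22, np.nan, 25, np.nan, 30])

mean_imputed = s.fillna(s.mean())      # NaNs become the mean, 22.8
median_imputed = s.fillna(s.median())  # NaNs become the median, 22.0
```

Both keep the center of the distribution in place; the median is the safer choice when the distribution is skewed, as age distributions usually are.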

df.isna().sum().sort_values(ascending=False) / len(df) * 100

>> contra 82.592476
pro 80.890499
likes 79.524799
alter 21.958058
studienbeginn 9.359896
organisation 0.482358
dozenten 0.429906
studieninhalte 0.213237
geschlecht 0.169119
ausstattung 0.079413
studienform 0.065197
lehrverantstaltungen 0.013726
weiterempfehlung 0.000000
gesamtbewertungen 0.000000
geschrieben_am 0.000000
views 0.000000
report 0.000000
dtype: float64

Plotting the distribution of age. As expected, most people begin studying at around 18.

plt.figure(figsize=(20, 20))
plt.hist(df['alter'].dropna())
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age distribution')
plt.savefig('alter-not-imputed')
plt.show()


Imputing the attribute age with the mean and plotting the distribution again. We see not much of a difference.

import numpy as np
plt.figure(figsize=(20, 20))
plt.hist(df['alter'].fillna(value=np.average(df['alter'].dropna())))
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age distribution')
plt.savefig('age-avg-imputed')
plt.show()


And how about the median? Again, an excellent choice.

plt.figure(figsize=(20, 20))
plt.hist(df['alter'].fillna(value=np.median(df['alter'].dropna())))
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age distribution')
plt.savefig('age-median-imputed')
plt.show()


I could go on for days with every other feature, which would make this post unnecessarily long, so I will stop here; hopefully you have an idea of how it would continue.

At this moment, my goal is to build a machine learning model that can predict whether a student would recommend their field of study, based solely on their natural-language review. Let’s see what we have to do to ace this task.

The first step is to throw away every attribute we don’t need. The next is to use a stopword list to remove frequently used words, which makes information retrieval faster and reduces noise.
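Throwing the unneeded attributes away boils down to selecting the two columns the model needs. A sketch with a toy frame (the real `df` comes from the database):

```python
import pandas as pd

# Toy stand-in for the real dataset
df = pd.DataFrame({
    'report': ['guter Dozent', 'schlechte Organisation'],
    'weiterempfehlung': [True, False],
    'views': ['10', '20'],  # an example of a column we drop
})

df = df[['report', 'weiterempfehlung']]  # keep only text and label
```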

The following are some of the German stopwords:

from nltk.corpus import stopwords
stop = stopwords.words('german')
stop

>> ['aber',
'alle',
'allem',
'allen',
'aller',
'alles',
'als',
'also',
'am',
'an',
....
]

Removing the stopwords from our reviews leaves us with the essentials:

df['report'] = df['report'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df.head()
report weiterempfehlung report_stopwords
0 Die HRW als Hochschule ist eine sehr vielse… True Die HRW Hochschule vielseitige moderne Einrich…
1 Das Studim ist ein Vollzeitstudium, welches… True Das Studim Vollzeitstudium, ziemlich anspruchs…
2 Es erfordert wie alle Studiengänge sehr vie… True Es erfordert Studiengänge Arbeit natürlich Mat…
3 Ich selbst bin von der RWTH Aachen an die F… True Ich RWTH Aachen Fachhochschule gewechselt mehr…
4 Die Professoren vermitteln einem, wozu man … True Die Professoren vermitteln einem, wozu jeweili…

Another thing we have to worry about is the distribution of our data, which is heavily imbalanced. We have about 160k reviews that recommend their field of study and only 20k that do not, which would result in a machine learning model that is biased toward the majority class. Therefore we bring balance into the dataset by undersampling the yes-sayers:

positive = df[df['weiterempfehlung']]
negative = df[~df['weiterempfehlung']]
positive = positive.sample(n=len(negative), random_state=42)
new_df = pd.concat([positive, negative])
new_df.head()
report weiterempfehlung
73640 Die Inhalte Studiengangs gut dargestellt. Die … True
171144 Den einzigen negativen Punkt nennen möchte/kan… True
96263 Sehr kleine Hochschule 3000 Einwohner Dorf- Ho… True
121973 Das Studium gibt Studenten guten Überblick ver… True
68957 Ich zufrieden Wahl Studiums. Die Hochschule mo… True

The next few steps are to tokenize the text into individual words, count the occurrences of each word, and thereby create new features for our dataset. This is a very simple trick and can be very effective for classifying text with the help of naive Bayes. For example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]

This method, however, only counts the frequency of a word inside a review. If we want more sophisticated features, we have to involve the term frequency – inverse document frequency. Wikipedia defines TF-IDF as:

a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.

TF-IDF itself is the product of TF and IDF.

TF simply measures how often a term appears in a document and can be computed with the count vectorizer.

IDF, on the other hand, measures how specific a term is to a document inside a corpus; terms that occur in many documents are weighted down:

idf( t_i, D ) = log( N / (a + n_i) )

t_i: Term i

D: The corpus of documents.

N: Total number of documents in the corpus.

a: A small bias to avoid division by zero.

n_i: Number of documents in which the term t_i appears.
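Plugging toy numbers into the IDF formula (all values assumed): a corpus of N = 4 documents, a term that occurs in 2 of them, and a bias of 1:

```python
import math

N = 4    # total documents in the corpus (assumed)
n_i = 2  # documents containing term t_i (assumed)
a = 1    # bias to avoid division by zero

idf = math.log(N / (a + n_i))  # log(4 / 3)
```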

For our purpose, TF-IDF and its implementation in scikit-learn is the better choice:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)

Back to our little project. It is time for the first algorithm. I will start with a simple term-frequency matrix as input and a naive Bayes variant:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    new_df['report'], new_df['weiterempfehlung'], random_state=0)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_test_counts = count_vect.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_counts, y_train)
predict = clf.predict(X_test_counts)

print(confusion_matrix(predict, y_test))
print(classification_report(predict, y_test))

>>> [[3444  997]
[1424 6416]]
precision    recall  f1-score   support

False       0.71      0.78      0.74      4441
True       0.87      0.82      0.84      7840

accuracy                           0.80     12281
macro avg       0.79      0.80      0.79     12281
weighted avg       0.81      0.80      0.80     12281

With a precision of 0.81 and a recall of 0.80, I am not very satisfied. That means only about 80% of the predictions are correct and only about 80% of the test data is classified correctly.

So how about involving TF-IDF factors in the dataset?

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_train_counts = tfidf_vec.fit_transform(X_train)
X_test_counts = tfidf_vec.transform(X_test)

from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_test_tf = tf_transformer.transform(X_test_counts)

ffclf = MultinomialNB()
ffclf.fit(X_train_tf, y_train)
print(confusion_matrix(ffclf.predict(X_test_tf), y_test))
print(classification_report(ffclf.predict(X_test_tf), y_test))

>>> [[1318  177]
[3550 7236]]
precision    recall  f1-score   support

False       0.27      0.88      0.41      1495
True       0.98      0.67      0.80     10786

accuracy                           0.70     12281
macro avg       0.62      0.78      0.60     12281
weighted avg       0.89      0.70      0.75     12281

Terrible. The precision for the not-recommend class drops to 27%, and overall accuracy falls to 70%. The algorithm is heavily biased: most of the time it simply says that the student clearly recommends their field of study, which is not always the case.

But we cannot lose hope yet. Since we have a classification problem and the features are all numeric, we can just use good old logistic regression. Let’s give it a shot:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train_tf, y_train)

print(confusion_matrix(lr.predict(X_test_tf), y_test))
print(classification_report(lr.predict(X_test_tf), y_test))

>>> [[3479  923]
[1389 6490]]
precision    recall  f1-score   support

False       0.71      0.79      0.75      4402
True       0.88      0.82      0.85      7879

accuracy                           0.81     12281
macro avg       0.80      0.81      0.80     12281
weighted avg       0.82      0.81      0.81     12281


This model gets us back on track: not terrible, but also not very good. We can give two other algorithms a chance before we try our luck with hyperparameter tuning. Let’s start with a decision tree:

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train_tf, y_train)
print(confusion_matrix(dtc.predict(X_test_tf), y_test))
print(classification_report(dtc.predict(X_test_tf), y_test))

>>> [[2797 1942]
[2071 5471]]
precision    recall  f1-score   support

False       0.57      0.59      0.58      4739
True       0.74      0.73      0.73      7542

accuracy                           0.67     12281
macro avg       0.66      0.66      0.66     12281
weighted avg       0.67      0.67      0.67     12281


Garbage… How about a Random Forest with the default number of trees?

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train_tf, y_train)
print(confusion_matrix(rfc.predict(X_test_tf), y_test))
print(classification_report(rfc.predict(X_test_tf), y_test))
>>>[[2953 1479]
[1915 5934]]
precision    recall  f1-score   support

False       0.61      0.67      0.64      4432
True       0.80      0.76      0.78      7849

accuracy                           0.72     12281
macro avg       0.70      0.71      0.71     12281
weighted avg       0.73      0.72      0.73     12281

Also garbage. Let’s just do some automatic hyperparameter tuning:

from sklearn.model_selection import GridSearchCV

bayes_param = {
    'alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001)
}
bayes_cv = GridSearchCV(estimator=MultinomialNB(), param_grid=bayes_param)
bayes_cv.fit(X_test_tf, y_test)
bayes_best = bayes_cv.best_estimator_
print(confusion_matrix(bayes_best.predict(X_test_tf), y_test))
print(classification_report(bayes_best.predict(X_test_tf), y_test))

>>> [[4179   88]
[ 689 7325]]
precision    recall  f1-score   support

False       0.86      0.98      0.91      4267
True       0.99      0.91      0.95      8014

accuracy                           0.94     12281
macro avg       0.92      0.95      0.93     12281
weighted avg       0.94      0.94      0.94     12281

I like where this is going 😀 Shall we try the same method with logistic regression?

import numpy as np

lg_param = {
    'penalty': ['l1', 'l2'],
    'C': np.logspace(-3, 3, 7)
}
lg_cv = GridSearchCV(estimator=LogisticRegression(), param_grid=lg_param)
lg_cv.fit(X_test_tf, y_test)
lg_best = lg_cv.best_estimator_
print(confusion_matrix(lg_best.predict(X_test_tf), y_test))
print(classification_report(lg_best.predict(X_test_tf), y_test))

>>> [[4808   33]
[  60 7380]]
precision    recall  f1-score   support

False       0.99      0.99      0.99      4841
True       1.00      0.99      0.99      7440

accuracy                           0.99     12281
macro avg       0.99      0.99      0.99     12281
weighted avg       0.99      0.99      0.99     12281


Now we are talking. The model achieves very good precision in the development environment, which is a good sign. However, the dataset itself is very large and needs more exploration to get the maximum benefit out of data mining. More relationships, correlations, and causal effects could be found if more time were invested.

Conclusion: with this project, I have learned a lot about the process of collecting, cleaning, processing, and mining data. I am very happy with the result and am looking forward to more experience and learning.

Published in Machine Learning