# -*- coding: utf-8 -*-
"""
Created on Thu Jan 17 21:36:10 2019

@author: Koffi Moïse AGBENYA

CONTENT-BASED FILTERING

Recommendation systems are a collection of algorithms used to recommend items
to users based on information taken from the user. These systems have become
ubiquitous and can commonly be seen in online stores, movie databases and job
finders. In this notebook, we will explore content-based recommendation systems
and implement a simple version of one using Python and the pandas library.

ABOUT DATASET

This dataset (ml-latest) describes 5-star rating and free-text tagging activity
from [MovieLens](http://movielens.org), a movie recommendation service. It
contains 22884377 ratings and 586994 tag applications across 34208 movies.
These data were created by 247753 users between January 09, 1995 and January
29, 2016. This dataset was generated on January 29, 2016.

Users were selected at random for inclusion. All selected users had rated at
least 1 movie. No demographic information is included. Each user is
represented by an id, and no other information is provided.

The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv`
and `tags.csv`. More details about the contents and use of all these files
follow.

This is a *development* dataset. As such, it may change over time and is not an
appropriate dataset for shared research results.

"""

#Dataframe manipulation library
import pandas as pd
#Math functions, we'll only need the sqrt function so let's import only that
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

#Storing the movie information into a pandas dataframe
movies_df = pd.read_csv('movies.csv')
#Storing the user information into a pandas dataframe
ratings_df = pd.read_csv('ratings.csv')
#Head is a function that gets the first N rows of a dataframe. N's default is 5.
movies_df.head()

#Let's extract the year from the title column into a new year column, then
#remove it from the title by using pandas' replace function.

#Using regular expressions to find a year stored between parentheses
#We specify the parentheses so we don't conflict with movies that have years in their titles
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#Removing the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
#Removing the years from the 'title' column (regex=True is needed because
#pandas 2.0 and later treat the pattern as a literal string by default)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Applying the strip function to get rid of any trailing whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df.head()
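
#A minimal sketch of the same year/title split on a few made-up titles, so the
#regular expressions can be checked in isolation. The toy_* names and the
#sample titles are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical titles in the MovieLens "Title (Year)" style.
toy_titles = pd.Series(['Toy Story (1995)', 'Blade Runner 2049 (2017)', 'No Year Here'])

# Capture only the four digits that sit inside parentheses, so the "2049"
# in the title itself is not mistaken for the release year.
toy_years = toy_titles.str.extract(r'\((\d{4})\)', expand=False)

# Remove the parenthesised year and any whitespace left around it.
toy_clean = toy_titles.str.replace(r'\s*\(\d{4}\)', '', regex=True).str.strip()

print(toy_years.tolist())
print(toy_clean.tolist())
```

#Titles without a parenthesised year end up with a missing value in the year
#column and keep their title unchanged.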

#Every genre is separated by a | so we simply have to call the split function on |
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

"""

Since keeping genres in a list format isn't optimal for the content-based
recommendation system technique, we will use the One Hot Encoding technique to
convert the list of genres to a vector where each column corresponds to one
possible value of the feature. This encoding is needed for feeding categorical
data. In this case, we store every different genre in columns that contain
either 1 or 0. 1 shows that a movie has that genre and 0 shows that it doesn't.
Let's also store this dataframe in another variable since genres won't be
important for our first recommendation system.

"""

#Copying the movie dataframe into a new one since we won't need to use the
#genre information in our first case.
moviesWithGenres_df = movies_df.copy()

#For every row in the dataframe, iterate through the list of genres and place a
#1 into the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
#Filling in the NaN values with 0 to show that a movie doesn't have that
#column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()
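
#The loop above is easy to follow but can be slow on large dataframes. One
#possible alternative, sketched here on made-up data, is pandas'
#Series.str.get_dummies, which works on the raw pipe-separated genre string
#(that is, before the split into lists). The toy_* names are hypothetical.

```python
import pandas as pd

# Hypothetical movies with genres still stored as pipe-separated strings.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Children', 'Comedy', 'Adventure|Comedy'],
})

# One column per genre, holding 1 if the movie has that genre and 0 otherwise.
toy_onehot = toy_movies['genres'].str.get_dummies(sep='|')

# Put the movie ids back next to the indicator columns.
toy_encoded = pd.concat([toy_movies[['movieId']], toy_onehot], axis=1)
print(toy_encoded)
```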

#Lets look at the ratings dataframe
ratings_df.head()

#Every row in the ratings dataframe holds a user id, a movie id, a rating and
#a timestamp showing when the user reviewed the movie. We won't be needing the
#timestamp column, so let's drop it to save on memory.

#Drop removes a specified row or column from a dataframe
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()

#Content-Based recommendation system

"""

Now, let's take a look at how to implement Content-Based or Item-Item
recommendation systems. This technique attempts to figure out what a user's
favorite aspects of an item are, and then recommends items that present those
aspects. In our case, we're going to try to figure out the input's favorite
genres from the movies and ratings given.

Let's begin by creating an input user to recommend movies to:

Notice: To add more movies, simply increase the number of elements in the
userInput. Feel free to add more in! Just be sure to capitalise each title as
it appears in the dataset, and if a movie starts with a "The", like
"The Matrix", then write it like this: 'Matrix, The' .

"""

userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ]
inputMovies = pd.DataFrame(userInput)
inputMovies

"""

Add movieId to input user
With the input complete, let's extract the input movies' IDs from the movies
dataframe and add them to it.

We can achieve this by first filtering out the rows that contain the input
movies' title and then merging this subset with the input dataframe. We also
drop unnecessary columns for the input to save memory space.

"""

#Filtering out the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])
#Final input dataframe
#If a movie you added in above isn't here, then it might not be in the original
#dataframe or it might be spelled differently, please check capitalisation.
inputMovies
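
#A small sketch, on made-up data, of one way to spot input titles that did not
#match anything in the catalogue. It uses a left merge with pandas'
#indicator=True option; the toy_* names and titles are hypothetical.

```python
import pandas as pd

# Hypothetical catalogue and user input.
toy_catalog = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story', 'Jumanji', 'Heat'],
})
toy_input = pd.DataFrame({
    'title': ['Jumanji', 'Toy Story', 'Unknown Film'],
    'rating': [2.0, 3.5, 4.0],
})

# A left merge keeps every input row; the _merge column records whether a
# matching title was found in the catalogue.
toy_merged = pd.merge(toy_input, toy_catalog, on='title', how='left', indicator=True)

# Titles that exist only on the input side were not found.
toy_missing = toy_merged.loc[toy_merged['_merge'] == 'left_only', 'title'].tolist()
print(toy_missing)
```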

#We're going to start by learning the input's preferences, so let's get the
#subset of movies that the input has watched from the Dataframe containing
#genres defined with binary values.
#Filtering out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

#We'll only need the actual genre table, so let's clean this up a bit by
#resetting the index and dropping the movieId, title, genres and year columns.

#Resetting the index to avoid future issues
userMovies = userMovies.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
userGenreTable

"""

Now we're ready to start learning the input's preferences!

To do this, we're going to turn each genre into weights. We can do this by
using the input's reviews and multiplying them into the input's genre table and
then summing up the resulting table by column. This operation is actually a dot
product between a matrix and a vector, so we can simply accomplish this by
calling pandas' "dot" function.

"""

inputMovies['rating']

#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
#The user profile
userProfile
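
#A worked toy example of the dot product above, assuming three rated movies
#and two genres. The toy_* names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical genre indicators for three rated movies (one row per movie).
toy_genres = pd.DataFrame({
    'Action': [1, 0, 1],
    'Comedy': [0, 1, 1],
})
# Ratings the user gave to those same three movies, in the same row order.
toy_ratings = pd.Series([5.0, 2.0, 4.0])

# Each genre weight is the sum of the ratings of the movies having that genre:
# Action = 5.0 + 4.0, Comedy = 2.0 + 4.0.
toy_profile = toy_genres.transpose().dot(toy_ratings)
print(toy_profile)
```

#Note that dot aligns on the index, so the genre table and the ratings must
#share the same row labels (here 0, 1, 2).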

#Now, we have the weights for each of the user's preferences. This is known as
#the User Profile. Using this, we can recommend movies that satisfy the user's
#preferences.
#Let's start by extracting the genre table from the original dataframe:

#Index the genre table by movieId so that scores can be traced back to movies
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
#And drop the unnecessary information
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
genreTable.head()

genreTable.shape

#With the input's profile and the complete list of movies and their genres in
#hand, we're going to take the weighted average of every movie based on the
#input profile and recommend the top twenty movies that most satisfy it.

#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head()
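
#A worked toy example of the weighted average above, assuming a two-genre
#user profile and three candidate movies. All toy_* names and values are
#hypothetical.

```python
import pandas as pd

# Hypothetical user profile: one weight per genre.
toy_profile = pd.Series({'Action': 9.0, 'Comedy': 6.0})

# Hypothetical genre table indexed by movieId.
toy_table = pd.DataFrame(
    {'Action': [1, 0, 1], 'Comedy': [0, 1, 1]},
    index=pd.Index([10, 20, 30], name='movieId'),
)

# Multiplying aligns the profile with the genre columns; summing per row and
# dividing by the total weight gives a score between 0 and 1.
toy_scores = (toy_table * toy_profile).sum(axis=1) / toy_profile.sum()
toy_ranked = toy_scores.sort_values(ascending=False)
print(toy_ranked)
```

#A movie that covers every genre the user has weighted gets a score of 1.0,
#while a movie sharing no genre with the profile gets 0.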

#The final recommendation table: the twenty highest-scoring movies. Note that
#isin keeps the rows in movies_df order, not in score order.
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]
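
#Selecting rows with isin returns them in the order of the movies dataframe.
#If the ranking should be preserved, one option is to merge the sorted scores
#with the movie details, sketched here on made-up data. The toy_* names and
#the 'score' column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical movie details.
toy_movies = pd.DataFrame({'movieId': [10, 20, 30], 'title': ['A', 'B', 'C']})

# Hypothetical scores, already sorted in descending order.
toy_scores = pd.Series({30: 1.0, 10: 0.6, 20: 0.4}, name='score')
toy_scores.index.name = 'movieId'

# Turn the top scores into a dataframe and left-merge the details onto it;
# a left merge keeps the row order of the left (ranked) side.
toy_top = toy_scores.head(2).reset_index().merge(toy_movies, on='movieId', how='left')
print(toy_top)
```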