This notebook shows the process of cleaning the Wikipedia ratings data, downloaded from:
https://ckannet-storage.commondatastorage.googleapis.com/2012-10-22T184507/aft4.tsv.gz
The data comes in a database-like format where each rating spans several rows, one for each dimension of the rating. The dimensions are (according to the data dictionary):
1: trustworthy, 2: objective, 3: complete, 4: well-written
(Abbreviated here: trust, obj, comp, writ)
At the end of this file, we want a dataset that has one row for each rating. It should include:
- page_id, so we can connect it to other page info from Wikipedia's database
- rev_id (revision ID), so we can connect it to information about that particular version of the page, and thus compare it to other versions of the page (the changes from one version to another being, of course, edits)
- rating values for each of the four dimensions (trust, obj, comp, writ)
- a datetime field, which will be a Pandas Datetime version of the timestamp variable, indicating when (in time, as opposed to in the page history) a rating was made
- a logged_in field indicating whether the rater was logged in or not (based on the anonymized version of user_id included in the data)
- a page_title field, which is just nice to have

We will also compute some of our own metrics, namely:
- page_rate_count - the number of ratings for each page (also versions for each dimension)
- rev_rate_count - the number of ratings for each version (i.e. revision) of each page (also versions for each dimension)
- rating_dim_count - the number of dimensions rated for each rating (0-4)
- rating_any_mean - the mean of all rated dimensions for each rating, regardless of how many are missing
- rating_all_mean - the mean of all rated dimensions only when all dimensions are nonmissing
- page_*_mean - the mean for each page of all ratings of that page, in each dimension
- rev_*_mean - the mean for each page version of all ratings of that version, in each dimension

import pandas as pd
import numpy as np
We leave it gzipped because it is quite large.
df = pd.read_csv('data/aft4.tsv.gz', compression='gzip', delim_whitespace=True)
df = df[df.page_namespace==0] # only want main wiki pages
df.info()
The TSV file is actually malformed. timestamp should be an int64, but it's an object because a number of rows have string values.
df['int_timestamp'] = pd.to_numeric(df.timestamp, errors='coerce') # non-numeric values become NaN
df[df.int_timestamp.isnull()]
Judging by the fact that user_id == 1 here, I suspect that when the data was created, there was an issue with logged-in users' user names (which have since been scrubbed for anonymity) being stored in a way that spilled across columns, pushing the other columns around. To check this, we can look at the original file:
import gzip
with gzip.open('data/aft4.tsv.gz', 'rt') as f:
    for i in range(50):
        print("%s: %s" % (i, f.readline()), end='')
Results (last lines, line numbers inserted by me):
44: 20110722000020 1148043 New_Zealand_cuisine 0 437274630 0 4 5
45: Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 1
46: Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 2
47: Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 3
48: Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 4
49: 20110722000022 32490439 NULL NULL 440744849 1 1 0
Since the actual rating data is overwritten, there's no way to reconstruct these records, so we have to drop them. First, though, let's make sure they all have user_id == 1:
df[df.int_timestamp.isnull() & (df.user_id != 1)] # should return no rows
Now let's drop these observations.
df = df[df.int_timestamp.notnull()]
df = df.drop('int_timestamp', axis=1) # drop returns a new DataFrame, so reassign to actually drop the helper column
df.info()
Ok, but now there are still some problematic cases. timestamp should always be a 14-digit number.
df[df.timestamp.map(lambda x: len(str(x))<14)]
And, rating_key should never be greater than 4.
df.rating_key = pd.to_numeric(df.rating_key, errors='coerce')
df[df.rating_key > 4]
Let's drop these cases.
df = df[(df.timestamp.map(lambda x: len(str(x)) == 14)) & (df.rating_key <= 4)]
df.info()
Ok, so we dropped around 10,000 cases but we're still left with...EGAD. 46+ million rows! Well, the data is structured so that each rating is 4 separate rows, but still. Let's take a look at the data, first.
df.head(10)
Ok. So here we can see that each page is rated in groups of four observations. Here is the data dictionary:
The following fields are included:
- timestamp: 14-digit timestamp (UTC)
- page_id: article ID
- page_title: full article title
- page_namespace: article namespace (0 is for Wikipedia's main article namespace)
- rev_id: revision ID (article version that was rated)
- user_id: boolean field set to 0 for anonymous raters and to 1 for registered users
- rating_key: numeric key of the rating dimension (1: trustworthy, 2: objective, 3: complete, 4: well-written)
- rating_value: rating value on a 1-5 scale. A value set to 0 indicates that the rater didn't rate the article on the corresponding dimension.
Some things to note:
- user_id doesn't contain an actual identifier; it's just a flag for whether the editor was registered.
- timestamp is the time that the rating was made.
- Each rating is spread across four rows, one per rating_key (i.e. one per rating dimension). (More on this in a bit.)

So, now that we have a better idea of our data, let's start by cutting down the size of the file to a reasonable level. We'll take a sample of the observations. However, we can't just take a random chunk of rows -- then we would overcount pages with lots of ratings, not to mention the fact that we wouldn't get all the data about those pages; in fact, what we want is a random sample of pages.
page_ids = pd.unique(df.page_id) # unique page_ids
np.random.seed(2014) # set the seed
sample_pids = np.random.choice(page_ids, 10000, replace=False) # sample of 10,000 pages
# Now we'll cut the data to only ratings for those pages
s = df[df.page_id.isin(sample_pids)]
s.info()
OK! Now we're down to about 315k. That's manageable. Note that this means that we have about:
s.timestamp.count() / 4. / len(pd.unique(s.page_id)) # N, divided by 4 rows per rating, divided by number of pages in sample =
7.85 ratings per page, which is pretty good, as we'll be able to trace them over time. (We'll deal with the distribution of observations per page later.)
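If we want a quick peek at that distribution now, here is a rough sketch (a side check, not part of the main cleaning flow), dividing by the 4 rows per rating:
# Distribution of ratings per page in the sample (rows per page / 4 rows per rating)
ratings_per_page = s.groupby('page_id').size() / 4.
ratings_per_page.describe()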
Now what we want is a dataset that contains one observation per rating. But how do we identify a particular rating? What's our unique identifier? timestamp and rev_id, of course! ...Right? It would make sense, assuming that two people didn't rate the same article at exactly the same moment. But let's make sure.
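One quick way to check this directly (a sketch, similar in spirit to the obs_count approach built below): if (rev_id, timestamp) identified a rating, every such group would have exactly four rows.
# Number of (rev_id, timestamp) groups that do NOT have exactly the 4 rows of a single rating
(s.groupby(['rev_id', 'timestamp']).size() != 4).sum()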
del(df) # first let's unload df, since it's taking up a lot of RAM
s = s.drop('page_namespace', axis=1) # drop page_namespace, since we don't need it any more
s['obs_count'] = s.groupby(['rev_id', 'timestamp']).rating_value.transform('count')
s.head(20)
So if all ratings have 4 rows, then there should be no observations where s.obs_count != 4:
(s.obs_count != 4).sum()
s[s.obs_count != 4].head(20)
Ah! So here we see that, in fact, some ratings were saved to the database across seconds, meaning that the four rows of a single rating don't always share the same timestamp. Bummer.
Since we can't use timestamp as an ID variable, let's just use rev_id combined with sorting by the index. For each rev_id group, we'll construct a variable equal to 1 for the first foursome (i.e. the first rating), 2 for the second, etc.
# Grouping by rev_id, create the cumulative count (i.e. 0 for first obs with that rev_id, 1 for second, etc.)
# Then, group by fours, and add one (so that the first is 1, the second 2, etc.)
s['nth_rating'] = s.groupby('rev_id').cumcount() // 4 + 1
Let's take a look again at our sample:
s.sort_values(['rev_id', 'nth_rating', 'rating_key'])
Now nth_rating, when combined with rev_id and rating_key, should give us a unique ID for each rating (which we'll need when we want to reshape the data).
s.duplicated(['rev_id', 'nth_rating', 'rating_key']).sum()
Uh oh. Let's view those 4 cases:
s[s.duplicated(['rev_id', 'nth_rating', 'rating_key'])]
s[(s.rev_id == 434558075) & (s.nth_rating.isin([3,4]))]
Here it appears that the same rating was inserted twice into the database at exactly the same time. Note that the timestamps are identical, as, indeed, are all fields. Thus, this appears to be a genuine duplicate, one copy of which we should drop. Because of the grouping of nth_rating, we'll need to change that as well.
s.info()
s.drop_duplicates().info()
s = s.drop_duplicates()
So, we only dropped the 4 observations.
s.loc[(s.rev_id == 434558075) & (s.nth_rating.isin([3, 4])), 'nth_rating'] = 3 # use .loc to avoid chained assignment
s[(s.rev_id == 434558075)]
Finally, to make it easier to deal with the index and reshaping, let's convert the rating keys to meaningful names.
s.rating_key = s.rating_key.replace([1, 2, 3, 4], ['trust','obj','comp','writ'])
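As a quick sanity check (not part of the original flow), each label should now appear exactly as often as the others, with no numeric keys left over:
# Every rating contributes one row per dimension, so the four labels should have identical counts
s.rating_key.value_counts()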
Now we're ready to reshape. We start by dropping variables we no longer need. Next, we set the index to all the variables we don't want to be reshaped (i.e. that don't change for our new observations), with the variable we want to reshape on, rating_key, last.
s = s.drop('obs_count', axis=1) # int_timestamp was already dropped earlier
s = s.set_index(['rev_id', 'nth_rating', 'page_id', 'page_title', 'user_id', 'rating_key'])
Now we do the actual reshaping, using unstack.
stacked = s # keep a reference to the stacked version, for convenience (s is rebound to the unstacked frame below)
#s = stacked # to reset
s = s.unstack(level='rating_key')
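To see what unstack did, we can inspect the columns (a quick check, not part of the original flow): they are now a MultiIndex whose inner level is the rating dimension.
# Each original column (timestamp, rating_value) now has one sub-column per rating dimension
s.columns.tolist()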
First, let's create a datetime variable holding a date object for each rating. Since the unstack gave us four timestamp columns (one per rating dimension), we'll just use the first of them. We create a series, which we will pd.concat with the rest of the data later.
# Create a series of datetimes
datetime = pd.to_datetime(s.timestamp.comp.astype(str), format='%Y%m%d%H%M%S', errors='coerce') # "comp" (complete) is the first of the four rating_keys
datetime.name = 'datetime'
Now that we have the datetime (i.e. the information we care about from timestamp), the only other data we want to keep is either already in the index, or else is in rating_value. Since we don't want to keep a MultiIndex if we don't need it (it makes accessing columns a pain), we can just set our data to be s.rating_value (which has all the other things we care about in the index), and add all the other measures to that.
rating_values = s.rating_value
Before we can compute means of ratings, etc., we need to clean up the rating_value fields. In particular, we need to:
- convert them to numeric (they currently have dtype object)
- replace 0s with NaN (i.e. missing or NA), since, as the data dictionary states: "A value set to 0 indicates that the rater didn't rate the article on the corresponding dimension."

rating_values = rating_values.apply(pd.to_numeric, errors='coerce') # convert rating value columns to numeric
rating_values = rating_values.replace(0, np.nan)
rating_values.head(20)
If all of a rating's dimension values are zero/missing, then we obviously can't get much information from the rating, so drop these cases.
print("BEFORE: %d cases with all missing rating values" % rating_values.isnull().all(axis=1).sum())
rating_values.dropna(how='all', inplace=True) # drop in place if _all_ values are NA
print("AFTER: %d cases with all missing rating values" % rating_values.isnull().all(axis=1).sum())
Now we set s to be rating_values, which contains in its index the other info we want, concatenated with the datetime series we created earlier.
s = pd.concat([pd.DataFrame(datetime), rating_values], axis=1, join='inner') # inner join so the all-missing ratings we just dropped don't come back as empty rows
s.info()
s.head()
Create a variable counting the number of dimensions actually rated for each rating.
s['rating_dim_count'] = rating_values.count(axis=1)
s.head()
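For a quick look at how many of the four dimensions raters typically fill in (a side check, not part of the original flow):
# Distribution of the number of dimensions actually rated
s.rating_dim_count.value_counts().sort_index()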
Create two variables with the mean of all rating dimensions:
- rating_any_mean -- mean of all dimensions, regardless of how many missing dimensions
- rating_all_mean -- mean of all dimensions only when all dimensions are nonmissing

s['rating_any_mean'] = rating_values.mean(axis=1)
s['rating_all_mean'] = rating_values[rating_values.notnull().all(axis=1)].mean(axis=1)
s.head(20)
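To compare the two means (another side check): rating_all_mean should be missing whenever rating_dim_count is less than 4, and equal to rating_any_mean otherwise.
# The 'count' row of describe() will be lower for rating_all_mean, reflecting the partial ratings
s[['rating_dim_count', 'rating_any_mean', 'rating_all_mean']].describe()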
We're about ready to reset_index so we can use groupby to create revision- and page-wide statistics, but before we do, let's create an array of all the rating value dimensions.
rating_dims = rating_values.columns
Now we can reset_index and delete rating_values.
del(rating_values)
s = s.reset_index()
Rename user_id to logged_in, which is clearer about what the variable is actually measuring.
s.rename(columns={'user_id': 'logged_in'}, inplace=True)
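As a quick illustration of the renamed field (a side check), the share of ratings that came from logged-in vs. anonymous raters:
# Proportion of ratings from logged-in (1) vs. anonymous (0) raters
s.logged_in.value_counts(normalize=True)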
Create count variables for the number of ratings for each page and for each version.
page_groups = s.groupby(['page_id'])
s['page_rate_count'] = page_groups.page_id.transform('count')
rev_groups = s.groupby(['rev_id'])
s['rev_rate_count'] = rev_groups.rev_id.transform('count')
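A quick peek (a side check, not needed for the cleaning itself) at which pages in the sample attracted the most ratings:
# One row per page, sorted by how many ratings that page received
(s.drop_duplicates('page_id')[['page_title', 'page_rate_count']]
   .sort_values('page_rate_count', ascending=False)
   .head())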
And create a mean within each page/revision for each of the four dimensions.
for var in rating_dims:
s['page_%s_mean' % var] = page_groups[var].transform('mean')
s['rev_%s_mean' % var] = rev_groups[var].transform('mean')
And now we're almost done! We just need to sort it, and save.
s = s.sort_values(['page_id', 'rev_id', 'datetime'])
s.to_csv("cleaned_wiki_ratings.csv")
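As a final sanity check (a sketch; the filename is the one used above), the saved file should read back with one row per rating and a parseable datetime column:
# Read the cleaned file back in, parsing the datetime column, to confirm the round trip
check = pd.read_csv("cleaned_wiki_ratings.csv", parse_dates=['datetime'])
check.info()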