Clean Wiki Ratings

This notebook shows the process of cleaning the Wikipedia ratings data, downloaded from:

https://ckannet-storage.commondatastorage.googleapis.com/2012-10-22T184507/aft4.tsv.gz

The data comes in a database-like format where each rating has several rows, one for each dimension of the rating. The dimensions are (according to the data dictionary):

1: trustworthy, 2: objective, 3: complete, 4: well-written

(Abbreviated here: trust, obj, comp, writ)

At the end of this notebook, we want a dataset that has one row for each rating. It should include:

  1. The page_id, so we can connect it to other page info from Wikipedia's database
  2. The rev_id (revision ID), so we can connect it to information about that particular version of the page, and thus compare it to other versions of the page (the changes from one version to another being, of course, edits)
  3. The rating_values for each of the four dimensions (*_rating)
  4. A datetime field, which will be a Pandas-Datetime version of the timestamp variable, indicating when (in time, as opposed to in the page history) a rating was made
  5. A logged_in field indicating whether the rater was logged in or not (based on the anonymized version of user_id included in the data)
  6. A page_title field, which is just nice to have

We will also compute some of our own metrics, namely:

  1. page_rate_count - the number of ratings for each page (also versions for each dimension)
  2. rev_rate_count - the number of ratings for each version (i.e. revision) of each page (also versions for each dimension)
  3. rating_dim_count - the number of dimensions rated for each rating (0-4)
  4. rating_any_mean - the mean of all rated dimensions for each rating, regardless of how many are missing
  5. rating_all_mean - the mean of all rated dimensions only when all dimensions are nonmissing
  6. page_*_mean - the mean, for each page, of all ratings of that page in each dimension (e.g. page_trust_mean)
  7. rev_*_mean - the mean, for each page version, of all ratings of that version in each dimension (e.g. rev_trust_mean)
In [1]:
import pandas as pd
import numpy as np

Load in the Data

We leave it gzipped because it is quite large.

In [2]:
df = pd.read_csv('data/aft4.tsv.gz', compression='gzip', delim_whitespace=True)
df = df[df.page_namespace==0] # only want main wiki pages
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46545752 entries, 0 to 47207443
Data columns (total 8 columns):
timestamp         object
page_id           int64
page_title        object
page_namespace    float64
rev_id            float64
user_id           float64
rating_key        object
rating_value      object
dtypes: float64(3), int64(1), object(4)
/Users/bkarfunkel/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

Deal with File Malformation

The TSV file is actually malformed. timestamp should be an int64, but it's an object because a number of rows have string values.

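Note: convert_objects, used in the next cell, has since been deprecated in pandas. On a newer version, a rough equivalent (a sketch, not what was run here) would be:

df['int_timestamp'] = pd.to_numeric(df.timestamp, errors='coerce') # non-numeric strings become NaN
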
In [3]:
df['int_timestamp'] = df.timestamp.convert_objects(convert_numeric=True)
df[df.int_timestamp.isnull()]
Out[3]:
timestamp page_id page_title page_namespace rev_id user_id rating_key rating_value int_timestamp
44 Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 1 NaN
45 Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 2 NaN
46 Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 3 NaN
47 Brockway 12707285 Wild_Bill_Hickok_–_Davis_Tutt_shootout 0 440740702 1 20110722000021 4 NaN
1312 Agent99 2027638 Alabina 0 440746057 1 20110722000952 1 NaN
1313 Agent99 2027638 Alabina 0 440746057 1 20110722000952 2 NaN
1314 Agent99 2027638 Alabina 0 440746057 1 20110722000952 3 NaN
1315 Agent99 2027638 Alabina 0 440746057 1 20110722000952 4 NaN
1336 Christensen 15240614 Port_of_Miami_Tunnel 0 440745935 1 20110722000957 1 NaN
1337 Christensen 15240614 Port_of_Miami_Tunnel 0 440745935 1 20110722000957 2 NaN
1338 Christensen 15240614 Port_of_Miami_Tunnel 0 440745935 1 20110722000957 3 NaN
1339 Christensen 15240614 Port_of_Miami_Tunnel 0 440745935 1 20110722000957 4 NaN
3988 Asian 27423599 Time's_List_of_the_100_Best_Novels 0 434945513 1 Guy 20110722002902 NaN
3989 Asian 27423599 Time's_List_of_the_100_Best_Novels 0 434945513 1 Guy 20110722002902 NaN
3990 Asian 27423599 Time's_List_of_the_100_Best_Novels 0 434945513 1 Guy 20110722002902 NaN
3991 Asian 27423599 Time's_List_of_the_100_Best_Novels 0 434945513 1 Guy 20110722002902 NaN
4836 dreamer 28388618 Da_Vinci_Learning 0 438423151 1 20110722003455 1 NaN
4837 dreamer 28388618 Da_Vinci_Learning 0 438423151 1 20110722003455 2 NaN
4838 dreamer 28388618 Da_Vinci_Learning 0 438423151 1 20110722003455 3 NaN
4839 dreamer 28388618 Da_Vinci_Learning 0 438423151 1 20110722003455 4 NaN
5100 Billy 4689209 Soccer_in_Australia 0 440617748 1 20110722003625 1 NaN
5101 Billy 4689209 Soccer_in_Australia 0 440617748 1 20110722003625 2 NaN
5102 Billy 4689209 Soccer_in_Australia 0 440617748 1 20110722003625 3 NaN
5103 Billy 4689209 Soccer_in_Australia 0 440617748 1 20110722003625 4 NaN
5132 dreamer 232026 Chiang_Mai 0 438450083 1 20110722003638 1 NaN
5133 dreamer 232026 Chiang_Mai 0 438450083 1 20110722003638 2 NaN
5134 dreamer 232026 Chiang_Mai 0 438450083 1 20110722003638 3 NaN
5135 dreamer 232026 Chiang_Mai 0 438450083 1 20110722003638 4 NaN
5420 Sublette 1481631 John_A._Stormer 0 430247366 1 20110722003828 1 NaN
5421 Sublette 1481631 John_A._Stormer 0 430247366 1 20110722003828 2 NaN
5422 Sublette 1481631 John_A._Stormer 0 430247366 1 20110722003828 3 NaN
5423 Sublette 1481631 John_A._Stormer 0 430247366 1 20110722003828 4 NaN
5664 ashton 1611646 Ashton_United_F.C. 0 440488396 1 20110722003953 1 NaN
5665 ashton 1611646 Ashton_United_F.C. 0 440488396 1 20110722003953 2 NaN
5666 ashton 1611646 Ashton_United_F.C. 0 440488396 1 20110722003953 3 NaN
5667 ashton 1611646 Ashton_United_F.C. 0 440488396 1 20110722003953 4 NaN
7296 Andres 20647537 Emily_Browning 0 440188744 1 Gomez Martinez NaN
7297 Andres 20647537 Emily_Browning 0 440188744 1 Gomez Martinez NaN
7298 Andres 20647537 Emily_Browning 0 440188744 1 Gomez Martinez NaN
7299 Andres 20647537 Emily_Browning 0 440188744 1 Gomez Martinez NaN
7932 Ruq 20958639 Trailer_Park_of_Terror 0 404230589 1 20110722005511 1 NaN
7933 Ruq 20958639 Trailer_Park_of_Terror 0 404230589 1 20110722005511 2 NaN
7934 Ruq 20958639 Trailer_Park_of_Terror 0 404230589 1 20110722005511 3 NaN
7935 Ruq 20958639 Trailer_Park_of_Terror 0 404230589 1 20110722005511 4 NaN
8552 Wolfowitz 15952370 Robots_Have_No_Tails 0 440752133 1 20110722005914 1 NaN
8553 Wolfowitz 15952370 Robots_Have_No_Tails 0 440752133 1 20110722005914 2 NaN
8554 Wolfowitz 15952370 Robots_Have_No_Tails 0 440752133 1 20110722005914 3 NaN
8555 Wolfowitz 15952370 Robots_Have_No_Tails 0 440752133 1 20110722005914 4 NaN
8896 Alfarrobinha 18962647 Canadian_heraldry 0 440752086 1 20110722010209 1 NaN
8897 Alfarrobinha 18962647 Canadian_heraldry 0 440752086 1 20110722010209 2 NaN
8898 Alfarrobinha 18962647 Canadian_heraldry 0 440752086 1 20110722010209 3 NaN
8899 Alfarrobinha 18962647 Canadian_heraldry 0 440752086 1 20110722010209 4 NaN
9316 Bratland 3839461 Aptera_Motors 0 440752865 1 20110722010521 1 NaN
9317 Bratland 3839461 Aptera_Motors 0 440752865 1 20110722010521 2 NaN
9318 Bratland 3839461 Aptera_Motors 0 440752865 1 20110722010521 3 NaN
9319 Bratland 3839461 Aptera_Motors 0 440752865 1 20110722010521 4 NaN
10256 Syutfu 32364156 2011_East_Africa_drought 0 440752203 1 20110722011245 1 NaN
10257 Syutfu 32364156 2011_East_Africa_drought 0 440752203 1 20110722011245 2 NaN
10258 Syutfu 32364156 2011_East_Africa_drought 0 440752203 1 20110722011245 3 NaN
10259 Syutfu 32364156 2011_East_Africa_drought 0 440752203 1 20110722011245 4 NaN
... ... ... ... ... ... ... ... ...

191972 rows × 9 columns

Judging by the fact that user_id == 1 here, I suspect that when the data was created, there was an issue with logged-in users' user names (since scrubbed for anonymity) being written in a way that spilled across columns, pushing the other fields out of position. To check this, we can look at the original file:

import gzip
f = gzip.open('data/aft4.tsv.gz','rb')
for i in range(0,50):
    print "%s: %s" % (i, f.readline()),

Results (last lines, line numbers inserted by me):

44: 20110722000020    1148043    New_Zealand_cuisine    0    437274630    0    4    5
45: Brockway    12707285    Wild_Bill_Hickok_–_Davis_Tutt_shootout    0    440740702    1    20110722000021    1
46: Brockway    12707285    Wild_Bill_Hickok_–_Davis_Tutt_shootout    0    440740702    1    20110722000021    2
47: Brockway    12707285    Wild_Bill_Hickok_–_Davis_Tutt_shootout    0    440740702    1    20110722000021    3
48: Brockway    12707285    Wild_Bill_Hickok_–_Davis_Tutt_shootout    0    440740702    1    20110722000021    4
49: 20110722000022    32490439    NULL    NULL    440744849    1    1    0

Since the actual rating data is overwritten, there's no way to reconstruct these records, so we have to drop them. First, though, let's make sure they all have user_id == 1:

In [4]:
df[df.int_timestamp.isnull()][df.user_id!=1]
/Users/bkarfunkel/anaconda/lib/python2.7/site-packages/pandas/core/frame.py:1686: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)

Out[4]:
Empty DataFrame
Int64Index([], dtype='int64')

0 rows × 9 columns

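Aside: the chained selection df[mask1][mask2] above is what triggers the reindexing warning (the same pattern appears again in In [8]). A cleaner equivalent (a sketch) combines the masks with & and, as above, returns no rows:

df[df.int_timestamp.isnull() & (df.user_id != 1)]
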
Now let's drop these observations.

In [5]:
df = df[df.int_timestamp.notnull()] # keep int_timestamp around for now; we drop it later
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46353780 entries, 0 to 47207443
Data columns (total 9 columns):
timestamp         object
page_id           int64
page_title        object
page_namespace    float64
rev_id            float64
user_id           float64
rating_key        object
rating_value      object
int_timestamp     float64
dtypes: float64(4), int64(1), object(4)

Ok, but now there are still some problematic cases. timestamp should always be a 14-digit number.

In [6]:
df[df.timestamp.map(lambda x: len(str(x))<14)]
Out[6]:
timestamp page_id page_title page_namespace rev_id user_id rating_key rating_value int_timestamp
11128 27 4404343 Ziynet_Sali 0 439459700 1 20110722011928 1 27
11129 27 4404343 Ziynet_Sali 0 439459700 1 20110722011928 2 27
11130 27 4404343 Ziynet_Sali 0 439459700 1 20110722011928 3 27
11131 27 4404343 Ziynet_Sali 0 439459700 1 20110722011928 4 27
24676 09 489856 Valenzuela,_Philippines 0 436564449 1 20110722025521 1 9
24677 09 489856 Valenzuela,_Philippines 0 436564449 1 20110722025521 2 9
24678 09 489856 Valenzuela,_Philippines 0 436564449 1 20110722025521 3 9
24679 09 489856 Valenzuela,_Philippines 0 436564449 1 20110722025521 4 9
25600 20 3152105 Joan_Capdevila 0 440743979 1 20110722030150 1 20
25601 20 3152105 Joan_Capdevila 0 440743979 1 20110722030150 2 20
25602 20 3152105 Joan_Capdevila 0 440743979 1 20110722030150 3 20
25603 20 3152105 Joan_Capdevila 0 440743979 1 20110722030150 4 20
33024 2010 19616015 Creampie_(sexual_act) 0 440681588 1 20110722035733 1 2010
33025 2010 19616015 Creampie_(sexual_act) 0 440681588 1 20110722035733 2 2010
33026 2010 19616015 Creampie_(sexual_act) 0 440681588 1 20110722035733 3 2010
33027 2010 19616015 Creampie_(sexual_act) 0 440681588 1 20110722035733 4 2010
34708 2805 4942332 St_John_Passion 0 431451664 1 20110722040927 1 2805
34709 2805 4942332 St_John_Passion 0 431451664 1 20110722040927 2 2805
34710 2805 4942332 St_John_Passion 0 431451664 1 20110722040927 3 2805
34711 2805 4942332 St_John_Passion 0 431451664 1 20110722040927 4 2805
36024 52 31937595 Americium_smoke_detector 0 439331352 1 20110722041827 1 52
36025 52 31937595 Americium_smoke_detector 0 439331352 1 20110722041827 2 52
36026 52 31937595 Americium_smoke_detector 0 439331352 1 20110722041827 3 52
36027 52 31937595 Americium_smoke_detector 0 439331352 1 20110722041827 4 52
50768 85 148537 General_officer 0 440607407 1 20110722061527 1 85
50769 85 148537 General_officer 0 440607407 1 20110722061527 2 85
50770 85 148537 General_officer 0 440607407 1 20110722061527 3 85
50771 85 148537 General_officer 0 440607407 1 20110722061527 4 85
53280 73 740900 Club_Atlético_River_Plate 0 440774459 1 20110722063345 1 73
53281 73 740900 Club_Atlético_River_Plate 0 440774459 1 20110722063345 2 73
53282 73 740900 Club_Atlético_River_Plate 0 440774459 1 20110722063345 3 73
53283 73 740900 Club_Atlético_River_Plate 0 440774459 1 20110722063345 4 73
73404 28 551710 Jack_Cornwell 0 439120205 1 20110722090836 1 28
73405 28 551710 Jack_Cornwell 0 439120205 1 20110722090836 2 28
73406 28 551710 Jack_Cornwell 0 439120205 1 20110722090836 3 28
73407 28 551710 Jack_Cornwell 0 439120205 1 20110722090836 4 28
78324 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 20110722094348 1 8029
78325 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 20110722094348 2 8029
78326 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 20110722094348 3 8029
78327 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 20110722094348 4 8029
79744 07 11568427 Guru_Ghasidas_University 0 432968172 1 20110722095422 1 7
79745 07 11568427 Guru_Ghasidas_University 0 432968172 1 20110722095422 2 7
79746 07 11568427 Guru_Ghasidas_University 0 432968172 1 20110722095422 3 7
79747 07 11568427 Guru_Ghasidas_University 0 432968172 1 20110722095422 4 7
82660 28 69476 Rear_admiral 0 435052661 1 20110722101354 1 28
82661 28 69476 Rear_admiral 0 435052661 1 20110722101354 2 28
82662 28 69476 Rear_admiral 0 435052661 1 20110722101354 3 28
82663 28 69476 Rear_admiral 0 435052661 1 20110722101354 4 28
85080 28 2388291 Admiral_(Sweden) 0 308999077 1 20110722103116 1 28
85081 28 2388291 Admiral_(Sweden) 0 308999077 1 20110722103116 2 28
85082 28 2388291 Admiral_(Sweden) 0 308999077 1 20110722103116 3 28
85083 28 2388291 Admiral_(Sweden) 0 308999077 1 20110722103116 4 28
85212 28 1452817 Flotilla_admiral 0 384534905 1 20110722103221 1 28
85213 28 1452817 Flotilla_admiral 0 384534905 1 20110722103221 2 28
85214 28 1452817 Flotilla_admiral 0 384534905 1 20110722103221 3 28
85215 28 1452817 Flotilla_admiral 0 384534905 1 20110722103221 4 28
91396 28 170320 Riksdag_of_the_Estates 0 416921168 1 20110722111555 1 28
91397 28 170320 Riksdag_of_the_Estates 0 416921168 1 20110722111555 2 28
91398 28 170320 Riksdag_of_the_Estates 0 416921168 1 20110722111555 3 28
91399 28 170320 Riksdag_of_the_Estates 0 416921168 1 20110722111555 4 28
... ... ... ... ... ... ... ... ...

10052 rows × 9 columns

And rating_key should never be greater than 4.

In [7]:
df.rating_key = df.rating_key.convert_objects(convert_numeric=True)
df[df.rating_key > 4]
Out[7]:
timestamp page_id page_title page_namespace rev_id user_id rating_key rating_value int_timestamp
11128 27 4404343 Ziynet_Sali 0 439459700 1 2.011072e+13 1 27
11129 27 4404343 Ziynet_Sali 0 439459700 1 2.011072e+13 2 27
11130 27 4404343 Ziynet_Sali 0 439459700 1 2.011072e+13 3 27
11131 27 4404343 Ziynet_Sali 0 439459700 1 2.011072e+13 4 27
24676 09 489856 Valenzuela,_Philippines 0 436564449 1 2.011072e+13 1 9
24677 09 489856 Valenzuela,_Philippines 0 436564449 1 2.011072e+13 2 9
24678 09 489856 Valenzuela,_Philippines 0 436564449 1 2.011072e+13 3 9
24679 09 489856 Valenzuela,_Philippines 0 436564449 1 2.011072e+13 4 9
25600 20 3152105 Joan_Capdevila 0 440743979 1 2.011072e+13 1 20
25601 20 3152105 Joan_Capdevila 0 440743979 1 2.011072e+13 2 20
25602 20 3152105 Joan_Capdevila 0 440743979 1 2.011072e+13 3 20
25603 20 3152105 Joan_Capdevila 0 440743979 1 2.011072e+13 4 20
33024 2010 19616015 Creampie_(sexual_act) 0 440681588 1 2.011072e+13 1 2010
33025 2010 19616015 Creampie_(sexual_act) 0 440681588 1 2.011072e+13 2 2010
33026 2010 19616015 Creampie_(sexual_act) 0 440681588 1 2.011072e+13 3 2010
33027 2010 19616015 Creampie_(sexual_act) 0 440681588 1 2.011072e+13 4 2010
34708 2805 4942332 St_John_Passion 0 431451664 1 2.011072e+13 1 2805
34709 2805 4942332 St_John_Passion 0 431451664 1 2.011072e+13 2 2805
34710 2805 4942332 St_John_Passion 0 431451664 1 2.011072e+13 3 2805
34711 2805 4942332 St_John_Passion 0 431451664 1 2.011072e+13 4 2805
36024 52 31937595 Americium_smoke_detector 0 439331352 1 2.011072e+13 1 52
36025 52 31937595 Americium_smoke_detector 0 439331352 1 2.011072e+13 2 52
36026 52 31937595 Americium_smoke_detector 0 439331352 1 2.011072e+13 3 52
36027 52 31937595 Americium_smoke_detector 0 439331352 1 2.011072e+13 4 52
50768 85 148537 General_officer 0 440607407 1 2.011072e+13 1 85
50769 85 148537 General_officer 0 440607407 1 2.011072e+13 2 85
50770 85 148537 General_officer 0 440607407 1 2.011072e+13 3 85
50771 85 148537 General_officer 0 440607407 1 2.011072e+13 4 85
53280 73 740900 Club_Atlético_River_Plate 0 440774459 1 2.011072e+13 1 73
53281 73 740900 Club_Atlético_River_Plate 0 440774459 1 2.011072e+13 2 73
53282 73 740900 Club_Atlético_River_Plate 0 440774459 1 2.011072e+13 3 73
53283 73 740900 Club_Atlético_River_Plate 0 440774459 1 2.011072e+13 4 73
73404 28 551710 Jack_Cornwell 0 439120205 1 2.011072e+13 1 28
73405 28 551710 Jack_Cornwell 0 439120205 1 2.011072e+13 2 28
73406 28 551710 Jack_Cornwell 0 439120205 1 2.011072e+13 3 28
73407 28 551710 Jack_Cornwell 0 439120205 1 2.011072e+13 4 28
78324 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 2.011072e+13 1 8029
78325 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 2.011072e+13 2 8029
78326 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 2.011072e+13 3 8029
78327 8029 1587291 Harbour_Line_(Mumbai_Suburban_Railway) 0 395523873 1 2.011072e+13 4 8029
79744 07 11568427 Guru_Ghasidas_University 0 432968172 1 2.011072e+13 1 7
79745 07 11568427 Guru_Ghasidas_University 0 432968172 1 2.011072e+13 2 7
79746 07 11568427 Guru_Ghasidas_University 0 432968172 1 2.011072e+13 3 7
79747 07 11568427 Guru_Ghasidas_University 0 432968172 1 2.011072e+13 4 7
82660 28 69476 Rear_admiral 0 435052661 1 2.011072e+13 1 28
82661 28 69476 Rear_admiral 0 435052661 1 2.011072e+13 2 28
82662 28 69476 Rear_admiral 0 435052661 1 2.011072e+13 3 28
82663 28 69476 Rear_admiral 0 435052661 1 2.011072e+13 4 28
85080 28 2388291 Admiral_(Sweden) 0 308999077 1 2.011072e+13 1 28
85081 28 2388291 Admiral_(Sweden) 0 308999077 1 2.011072e+13 2 28
85082 28 2388291 Admiral_(Sweden) 0 308999077 1 2.011072e+13 3 28
85083 28 2388291 Admiral_(Sweden) 0 308999077 1 2.011072e+13 4 28
85212 28 1452817 Flotilla_admiral 0 384534905 1 2.011072e+13 1 28
85213 28 1452817 Flotilla_admiral 0 384534905 1 2.011072e+13 2 28
85214 28 1452817 Flotilla_admiral 0 384534905 1 2.011072e+13 3 28
85215 28 1452817 Flotilla_admiral 0 384534905 1 2.011072e+13 4 28
91396 28 170320 Riksdag_of_the_Estates 0 416921168 1 2.011072e+13 1 28
91397 28 170320 Riksdag_of_the_Estates 0 416921168 1 2.011072e+13 2 28
91398 28 170320 Riksdag_of_the_Estates 0 416921168 1 2.011072e+13 3 28
91399 28 170320 Riksdag_of_the_Estates 0 416921168 1 2.011072e+13 4 28
... ... ... ... ... ... ... ... ...

9836 rows × 9 columns

Let's drop these cases.

In [8]:
df = df[df.timestamp.map(lambda x: len(str(x)) == 14)][df.rating_key <= 4]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46343728 entries, 0 to 47207443
Data columns (total 9 columns):
timestamp         object
page_id           int64
page_title        object
page_namespace    float64
rev_id            float64
user_id           float64
rating_key        float64
rating_value      object
int_timestamp     float64
dtypes: float64(5), int64(1), object(3)

OK, so we dropped another ~10,000 rows, but we're still left with...EGAD. 46+ million rows! Well, the data is structured so that each rating is 4 separate rows, but still. Let's take a look at the data first.

In [9]:
df.head(10)
Out[9]:
timestamp page_id page_title page_namespace rev_id user_id rating_key rating_value int_timestamp
0 20110722000002 29543332 RC_Timişoara 0 419784624 0 1 5 2.011072e+13
1 20110722000002 29543332 RC_Timişoara 0 419784624 0 2 5 2.011072e+13
2 20110722000002 29543332 RC_Timişoara 0 419784624 0 3 5 2.011072e+13
3 20110722000002 29543332 RC_Timişoara 0 419784624 0 4 5 2.011072e+13
4 20110722000002 68453 Basset_Hound 0 439346501 0 1 5 2.011072e+13
5 20110722000002 68453 Basset_Hound 0 439346501 0 2 0 2.011072e+13
6 20110722000002 68453 Basset_Hound 0 439346501 0 3 0 2.011072e+13
7 20110722000002 68453 Basset_Hound 0 439346501 0 4 0 2.011072e+13
8 20110722000005 180473 Symbolic_link 0 439383163 0 1 5 2.011072e+13
9 20110722000005 180473 Symbolic_link 0 439383163 0 2 5 2.011072e+13

10 rows × 9 columns

OK. So here we can see that each rating is stored as a group of four rows, one per dimension. Here is the data dictionary:

The following fields are included:

  • timestamp: 14-digit timestamp (UTC)
  • page_id: article ID
  • page_title: full article title
  • page_namespace: article namespace (0 is for Wikipedia's main article namespace)
  • rev_id: revision ID (article version that was rated)
  • user_id: boolean field set to 0 for anonymous raters and to 1 for registered users
  • rating_key: numeric key of the rating dimension (1: trustworthy, 2: objective, 3: complete, 4: well-written)
  • rating_value: rating value on a 1-5 scale. A value set to 0 indicates that the rater didn't rate the article on the corresponding dimension.

Some things to note:

  1. The user_id doesn't contain an actual identifier; it's just a flag for whether the rater was registered (a quick check is sketched after this list).
  2. The timestamp is the time that the rating was made.
  3. For each rating, there are four rows, one for each rating_key (i.e. each rating dimension). (More on this in a bit)

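As a quick check of note 1 (a sketch; output not shown), we can look at the distribution of user_id, which per the data dictionary should contain only 0s and 1s:

df.user_id.value_counts(dropna=False)
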
Sample Data

So, now that we have a better idea of our data, let's start by cutting the file down to a reasonable size. We'll take a sample of the observations. However, we can't just take a random chunk of rows -- that would overrepresent pages with lots of ratings, and we wouldn't get all the ratings for the pages we did keep. What we really want is a random sample of pages.

In [10]:
page_ids = pd.unique(df.page_id) # unique page_ids
np.random.seed(2014) # set the seed
sample_pids = np.random.choice(page_ids, 10000, replace=False) # sample of 10,000 pages

# Now we'll cut the data to only ratings for those pages
s = df[df.page_id.isin(sample_pids)]
s.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 314100 entries, 888 to 47206771
Data columns (total 9 columns):
timestamp         314100 non-null object
page_id           314100 non-null int64
page_title        314100 non-null object
page_namespace    314100 non-null float64
rev_id            314100 non-null float64
user_id           314100 non-null float64
rating_key        314100 non-null float64
rating_value      314100 non-null object
int_timestamp     314100 non-null float64
dtypes: float64(5), int64(1), object(3)

OK! Now we're down to about 314k rows. That's manageable. Note that this means that we have about:

In [11]:
s.timestamp.count() / 4. / len(pd.unique(s.page_id)) # N, divided by 4 rows per rating, divided by number of pages in sample =
Out[11]:
7.8525

7.85 ratings per page, which is pretty good, since we'll be able to trace ratings over time. (We'll deal with the distribution of observations per page later; a quick way to peek at it is sketched below.)

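As a quick peek at that distribution (a sketch; each rating is still stored as 4 rows at this point, hence the division by 4):

ratings_per_page = s.groupby('page_id').size() / 4.
ratings_per_page.describe()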

Prepare for Reshaping

Now what we want is a dataset that contains one observation per rating. But how do we identify a particular rating? What's our unique identifier?

timestamp and rev_id, of course!...Right? It would make sense, assuming that two people didn't rate the same article at exactly the same moment. But let's make sure.

In [12]:
del(df) # first let's unload df, since it's taking up a lot of RAM
s = s.drop('page_namespace', 1) # drop page_namespace, since we don't need it any more
s['obs_count'] = s.groupby(['rev_id', 'timestamp']).rating_value.transform('count')
s.head(20)
Out[12]:
timestamp page_id page_title rev_id user_id rating_key rating_value int_timestamp obs_count
888 20110722000638 25862453 The_Amazing_Spider-Man_(2012_film) 440732801 0 1 1 2.011072e+13 4
889 20110722000638 25862453 The_Amazing_Spider-Man_(2012_film) 440732801 0 2 1 2.011072e+13 4
890 20110722000638 25862453 The_Amazing_Spider-Man_(2012_film) 440732801 0 3 1 2.011072e+13 4
891 20110722000638 25862453 The_Amazing_Spider-Man_(2012_film) 440732801 0 4 1 2.011072e+13 4
1132 20110722000822 1384672 God_Save_the_South 421037003 1 1 2 2.011072e+13 4
1133 20110722000822 1384672 God_Save_the_South 421037003 1 2 4 2.011072e+13 4
1134 20110722000822 1384672 God_Save_the_South 421037003 1 3 2 2.011072e+13 4
1135 20110722000822 1384672 God_Save_the_South 421037003 1 4 3 2.011072e+13 4
1368 20110722001012 22188070 Monte_Pisanino 418490010 0 1 3 2.011072e+13 4
1369 20110722001012 22188070 Monte_Pisanino 418490010 0 2 2 2.011072e+13 4
1370 20110722001012 22188070 Monte_Pisanino 418490010 0 3 3 2.011072e+13 4
1371 20110722001012 22188070 Monte_Pisanino 418490010 0 4 1 2.011072e+13 4
1980 20110722001444 32330134 Bill_S.978 440742665 0 1 2 2.011072e+13 4
1981 20110722001444 32330134 Bill_S.978 440742665 0 2 4 2.011072e+13 4
1982 20110722001444 32330134 Bill_S.978 440742665 0 3 3 2.011072e+13 4
1983 20110722001444 32330134 Bill_S.978 440742665 0 4 3 2.011072e+13 4
3040 20110722002205 1099515 Squall_Leonhart 434283951 0 1 5 2.011072e+13 4
3041 20110722002205 1099515 Squall_Leonhart 434283951 0 2 5 2.011072e+13 4
3042 20110722002205 1099515 Squall_Leonhart 434283951 0 3 4 2.011072e+13 4
3043 20110722002205 1099515 Squall_Leonhart 434283951 0 4 5 2.011072e+13 4

20 rows × 9 columns

So if rev_id plus timestamp really does identify a rating (i.e. all four rows of a rating share one timestamp, and no two ratings of the same revision land on the same second), there should be no rows where s.obs_count != 4:

In [13]:
(s.obs_count != 4).sum()
s[s.obs_count != 4].head(20)
Out[13]:
timestamp page_id page_title rev_id user_id rating_key rating_value int_timestamp obs_count
99976 20110722121723 39184 NTFS 437513502 0 1 0 2.011072e+13 2
99977 20110722121723 39184 NTFS 437513502 0 2 0 2.011072e+13 2
99978 20110722121724 39184 NTFS 437513502 0 3 0 2.011072e+13 2
99979 20110722121724 39184 NTFS 437513502 0 4 4 2.011072e+13 2
127228 20110722145634 23869776 Thirteen_Reasons_Why 440734013 1 1 0 2.011072e+13 1
127229 20110722145635 23869776 Thirteen_Reasons_Why 440734013 1 2 0 2.011072e+13 3
127230 20110722145635 23869776 Thirteen_Reasons_Why 440734013 1 3 0 2.011072e+13 3
127231 20110722145635 23869776 Thirteen_Reasons_Why 440734013 1 4 4 2.011072e+13 3
150074 20110722170052 972095 Lick_(music) 433396260 0 1 5 2.011072e+13 1
150086 20110722170107 19819307 Blacksand 436834538 0 1 1 2.011072e+13 1
150247 20110722170119 19819307 Blacksand 436834538 0 2 1 2.011072e+13 3
150248 20110722170119 19819307 Blacksand 436834538 0 3 1 2.011072e+13 3
150249 20110722170119 19819307 Blacksand 436834538 0 4 1 2.011072e+13 3
150275 20110722170119 972095 Lick_(music) 433396260 0 2 5 2.011072e+13 3
150276 20110722170119 972095 Lick_(music) 433396260 0 3 5 2.011072e+13 3
150277 20110722170119 972095 Lick_(music) 433396260 0 4 5 2.011072e+13 3
192904 20110722210445 386327 Law_School_Admission_Test 439972076 0 1 5 2.011072e+13 1
192905 20110722210446 386327 Law_School_Admission_Test 439972076 0 2 3 2.011072e+13 3
192906 20110722210446 386327 Law_School_Admission_Test 439972076 0 3 5 2.011072e+13 3
192907 20110722210446 386327 Law_School_Admission_Test 439972076 0 4 3 2.011072e+13 3

20 rows × 9 columns

Ah! So here we see that, in fact, some ratings were written to the database across a second boundary, meaning that the four rows of a rating don't always share the same timestamp. Bummer.

Since we can't use timestamp as an ID variable, let's just use the rev_id combined with sorting by the index. For each rev_id group, we'll construct a variable equal to 1 for the first foursome (i.e. the first rating), 2 for the second, etc.

In [14]:
# Grouping by rev_id, create the cumulative count (i.e. 0 for first obs with that rev_id, 1 for second, etc.)
# Then, group by fours, and add one (so that the first is 1, the second 2, etc.)
s['nth_rating'] = s.groupby('rev_id').cumcount().apply(lambda x: int(x / 4) + 1)

Let's take a look again at our sample:

In [15]:
s.sort(['rev_id', 'nth_rating', 'rating_key'])
Out[15]:
timestamp page_id page_title rev_id user_id rating_key rating_value int_timestamp obs_count nth_rating
6321972 20110825164758 7688277 Information_Harvesting 132017785 0 1 0 2.011083e+13 4 1
6321973 20110825164758 7688277 Information_Harvesting 132017785 0 2 0 2.011083e+13 4 1
6321974 20110825164758 7688277 Information_Harvesting 132017785 0 3 1 2.011083e+13 4 1
6321975 20110825164758 7688277 Information_Harvesting 132017785 0 4 0 2.011083e+13 4 1
6606408 20110827162937 988664 Hygd 177504140 0 1 4 2.011083e+13 4 1
6606409 20110827162937 988664 Hygd 177504140 0 2 5 2.011083e+13 4 1
6606410 20110827162937 988664 Hygd 177504140 0 3 3 2.011083e+13 4 1
6606411 20110827162937 988664 Hygd 177504140 0 4 5 2.011083e+13 4 1
21592108 20111208180546 15306408 Northern_Territory_Certificate_of_Education 200528301 0 1 1 2.011121e+13 4 1
21592109 20111208180546 15306408 Northern_Territory_Certificate_of_Education 200528301 0 2 1 2.011121e+13 4 1
21592110 20111208180546 15306408 Northern_Territory_Certificate_of_Education 200528301 0 3 1 2.011121e+13 4 1
21592111 20111208180546 15306408 Northern_Territory_Certificate_of_Education 200528301 0 4 1 2.011121e+13 4 1
6122456 20110824130602 11191944 Methionine_adenosyltransferase 204958352 0 1 1 2.011082e+13 4 1
6122457 20110824130602 11191944 Methionine_adenosyltransferase 204958352 0 2 0 2.011082e+13 4 1
6122458 20110824130602 11191944 Methionine_adenosyltransferase 204958352 0 3 1 2.011082e+13 4 1
6122459 20110824130602 11191944 Methionine_adenosyltransferase 204958352 0 4 0 2.011082e+13 4 1
2852844 20110805002329 637164 Zuph 216938513 0 1 5 2.011081e+13 4 1
2852845 20110805002329 637164 Zuph 216938513 0 2 5 2.011081e+13 4 1
2852846 20110805002329 637164 Zuph 216938513 0 3 4 2.011081e+13 4 1
2852847 20110805002329 637164 Zuph 216938513 0 4 4 2.011081e+13 4 1
1302312 20110728024506 3283625 Course_(medicine) 220802711 0 1 0 2.011073e+13 4 1
1302313 20110728024506 3283625 Course_(medicine) 220802711 0 2 0 2.011073e+13 4 1
1302314 20110728024506 3283625 Course_(medicine) 220802711 0 3 0 2.011073e+13 4 1
1302315 20110728024506 3283625 Course_(medicine) 220802711 0 4 4 2.011073e+13 4 1
5536324 20110820231000 3283625 Course_(medicine) 220802711 0 1 1 2.011082e+13 4 2
5536325 20110820231000 3283625 Course_(medicine) 220802711 0 2 1 2.011082e+13 4 2
5536326 20110820231000 3283625 Course_(medicine) 220802711 0 3 1 2.011082e+13 4 2
5536327 20110820231000 3283625 Course_(medicine) 220802711 0 4 1 2.011082e+13 4 2
108288 20110722130957 6689720 Integrated_bottom_line 229691947 1 1 1 2.011072e+13 4 1
108289 20110722130957 6689720 Integrated_bottom_line 229691947 1 2 1 2.011072e+13 4 1
108290 20110722130957 6689720 Integrated_bottom_line 229691947 1 3 1 2.011072e+13 4 1
108291 20110722130957 6689720 Integrated_bottom_line 229691947 1 4 2 2.011072e+13 4 1
27712704 20120131021928 13485005 Tangipahoa 231669276 0 1 5 2.012013e+13 4 1
27712705 20120131021928 13485005 Tangipahoa 231669276 0 2 5 2.012013e+13 4 1
27712706 20120131021928 13485005 Tangipahoa 231669276 0 3 0 2.012013e+13 4 1
27712707 20120131021928 13485005 Tangipahoa 231669276 0 4 5 2.012013e+13 4 1
35669136 20120404224906 13485005 Tangipahoa 231669276 0 1 4 2.012040e+13 4 2
35669137 20120404224906 13485005 Tangipahoa 231669276 0 2 4 2.012040e+13 4 2
35669138 20120404224906 13485005 Tangipahoa 231669276 0 3 4 2.012040e+13 4 2
35669139 20120404224906 13485005 Tangipahoa 231669276 0 4 4 2.012040e+13 4 2
9452360 20110914194324 9574456 Velikonda_Range 235410872 0 1 1 2.011091e+13 4 1
9452361 20110914194324 9574456 Velikonda_Range 235410872 0 2 1 2.011091e+13 4 1
9452362 20110914194324 9574456 Velikonda_Range 235410872 0 3 1 2.011091e+13 4 1
9452363 20110914194324 9574456 Velikonda_Range 235410872 0 4 1 2.011091e+13 4 1
29187900 20120211122213 9574456 Velikonda_Range 235410872 0 1 1 2.012021e+13 4 2
29187901 20120211122213 9574456 Velikonda_Range 235410872 0 2 1 2.012021e+13 4 2
29187902 20120211122213 9574456 Velikonda_Range 235410872 0 3 1 2.012021e+13 4 2
29187903 20120211122213 9574456 Velikonda_Range 235410872 0 4 1 2.012021e+13 4 2
31192136 20120227142735 9574456 Velikonda_Range 235410872 0 1 1 2.012023e+13 4 3
31192137 20120227142735 9574456 Velikonda_Range 235410872 0 2 1 2.012023e+13 4 3
31192138 20120227142735 9574456 Velikonda_Range 235410872 0 3 1 2.012023e+13 4 3
31192139 20120227142735 9574456 Velikonda_Range 235410872 0 4 1 2.012023e+13 4 3
1001688 20110726191624 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 1 1 2.011073e+13 4 1
1001689 20110726191624 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 2 4 2.011073e+13 4 1
1001690 20110726191624 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 3 1 2.011073e+13 4 1
1001691 20110726191624 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 4 3 2.011073e+13 4 1
21750440 20111209231300 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 1 4 2.011121e+13 4 2
21750441 20111209231300 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 2 4 2.011121e+13 4 2
21750442 20111209231300 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 3 1 2.011121e+13 4 2
21750443 20111209231300 14245160 Intestinal_cancer_in_cats_and_dogs 240879859 0 4 4 2.011121e+13 4 2
... ... ... ... ... ... ... ... ... ...

314100 rows × 10 columns

Now rev_id and nth_rating together should uniquely identify a rating, and adding rating_key should uniquely identify each row (which we'll need when we reshape the data).

In [16]:
s.duplicated(['rev_id', 'nth_rating', 'rating_key']).sum()
Out[16]:
4

Uh oh. Let's view those 4 cases:

In [17]:
s[s.duplicated(['rev_id', 'nth_rating', 'rating_key'])]
Out[17]:
timestamp page_id page_title rev_id user_id rating_key rating_value int_timestamp obs_count nth_rating
2771857 20110804150404 31877713 Software4Students 434558075 0 1 5 2.011080e+13 8 3
2771859 20110804150404 31877713 Software4Students 434558075 0 2 5 2.011080e+13 8 3
2771861 20110804150404 31877713 Software4Students 434558075 0 3 5 2.011080e+13 8 4
2771863 20110804150404 31877713 Software4Students 434558075 0 4 5 2.011080e+13 8 4

4 rows × 10 columns

In [18]:
s[(s.rev_id == 434558075) & (s.nth_rating.isin([3,4]))]
Out[18]:
timestamp page_id page_title rev_id user_id rating_key rating_value int_timestamp obs_count nth_rating
2771856 20110804150404 31877713 Software4Students 434558075 0 1 5 2.011080e+13 8 3
2771857 20110804150404 31877713 Software4Students 434558075 0 1 5 2.011080e+13 8 3
2771858 20110804150404 31877713 Software4Students 434558075 0 2 5 2.011080e+13 8 3
2771859 20110804150404 31877713 Software4Students 434558075 0 2 5 2.011080e+13 8 3
2771860 20110804150404 31877713 Software4Students 434558075 0 3 5 2.011080e+13 8 4
2771861 20110804150404 31877713 Software4Students 434558075 0 3 5 2.011080e+13 8 4
2771862 20110804150404 31877713 Software4Students 434558075 0 4 5 2.011080e+13 8 4
2771863 20110804150404 31877713 Software4Students 434558075 0 4 5 2.011080e+13 8 4

8 rows × 10 columns

Here it appears that the same rating was inserted into the database twice at exactly the same time. The timestamps are identical, as, indeed, are all the other fields, so this looks like a genuine duplicate and we should drop one copy. Because nth_rating was assigned by counting rows off in fours, we'll also need to fix it for this revision.

In [19]:
s.info()
s.drop_duplicates().info()
s = s.drop_duplicates()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 314100 entries, 888 to 47206771
Data columns (total 10 columns):
timestamp        314100 non-null object
page_id          314100 non-null int64
page_title       314100 non-null object
rev_id           314100 non-null float64
user_id          314100 non-null float64
rating_key       314100 non-null float64
rating_value     314100 non-null object
int_timestamp    314100 non-null float64
obs_count        314100 non-null object
nth_rating       314100 non-null int64
dtypes: float64(4), int64(2), object(4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 314096 entries, 888 to 47206771
Data columns (total 10 columns):
timestamp        314096 non-null object
page_id          314096 non-null int64
page_title       314096 non-null object
rev_id           314096 non-null float64
user_id          314096 non-null float64
rating_key       314096 non-null float64
rating_value     314096 non-null object
int_timestamp    314096 non-null float64
obs_count        314096 non-null object
nth_rating       314096 non-null int64
dtypes: float64(4), int64(2), object(4)

So we dropped only the 4 duplicate rows. Now let's fix nth_rating for the affected revision:

In [20]:
s.nth_rating[(s.rev_id == 434558075) & (s.nth_rating.isin([3,4]))] = 3
s[(s.rev_id == 434558075)]
Out[20]:
timestamp page_id page_title rev_id user_id rating_key rating_value int_timestamp obs_count nth_rating
2770336 20110804145454 31877713 Software4Students 434558075 0 1 4 2.011080e+13 4 1
2770337 20110804145454 31877713 Software4Students 434558075 0 2 4 2.011080e+13 4 1
2770338 20110804145454 31877713 Software4Students 434558075 0 3 3 2.011080e+13 4 1
2770339 20110804145454 31877713 Software4Students 434558075 0 4 4 2.011080e+13 4 1
2771688 20110804150259 31877713 Software4Students 434558075 0 1 5 2.011080e+13 4 2
2771689 20110804150259 31877713 Software4Students 434558075 0 2 5 2.011080e+13 4 2
2771690 20110804150259 31877713 Software4Students 434558075 0 3 5 2.011080e+13 4 2
2771691 20110804150259 31877713 Software4Students 434558075 0 4 5 2.011080e+13 4 2
2771856 20110804150404 31877713 Software4Students 434558075 0 1 5 2.011080e+13 8 3
2771858 20110804150404 31877713 Software4Students 434558075 0 2 5 2.011080e+13 8 3
2771860 20110804150404 31877713 Software4Students 434558075 0 3 5 2.011080e+13 8 3
2771862 20110804150404 31877713 Software4Students 434558075 0 4 5 2.011080e+13 8 3
2772092 20110804150527 31877713 Software4Students 434558075 0 1 5 2.011080e+13 4 5
2772093 20110804150527 31877713 Software4Students 434558075 0 2 5 2.011080e+13 4 5
2772094 20110804150527 31877713 Software4Students 434558075 0 3 5 2.011080e+13 4 5
2772095 20110804150527 31877713 Software4Students 434558075 0 4 5 2.011080e+13 4 5
2772824 20110804151007 31877713 Software4Students 434558075 1 1 5 2.011080e+13 4 6
2772825 20110804151007 31877713 Software4Students 434558075 1 2 5 2.011080e+13 4 6
2772826 20110804151007 31877713 Software4Students 434558075 1 3 5 2.011080e+13 4 6
2772827 20110804151007 31877713 Software4Students 434558075 1 4 5 2.011080e+13 4 6
2772924 20110804151034 31877713 Software4Students 434558075 0 1 5 2.011080e+13 4 7
2772925 20110804151034 31877713 Software4Students 434558075 0 2 5 2.011080e+13 4 7
2772926 20110804151034 31877713 Software4Students 434558075 0 3 5 2.011080e+13 4 7
2772927 20110804151034 31877713 Software4Students 434558075 0 4 5 2.011080e+13 4 7
2774520 20110804152040 31877713 Software4Students 434558075 0 1 5 2.011080e+13 4 8
2774521 20110804152040 31877713 Software4Students 434558075 0 2 3 2.011080e+13 4 8
2774522 20110804152040 31877713 Software4Students 434558075 0 3 3 2.011080e+13 4 8
2774523 20110804152040 31877713 Software4Students 434558075 0 4 5 2.011080e+13 4 8
2785984 20110804162749 31877713 Software4Students 434558075 0 1 5 2.011080e+13 4 9
2785985 20110804162749 31877713 Software4Students 434558075 0 2 5 2.011080e+13 4 9
2785986 20110804162749 31877713 Software4Students 434558075 0 3 5 2.011080e+13 4 9
2785987 20110804162749 31877713 Software4Students 434558075 0 4 5 2.011080e+13 4 9
5767556 20110822115429 31877713 Software4Students 434558075 0 1 5 2.011082e+13 4 10
5767557 20110822115429 31877713 Software4Students 434558075 0 2 5 2.011082e+13 4 10
5767558 20110822115429 31877713 Software4Students 434558075 0 3 5 2.011082e+13 4 10
5767559 20110822115429 31877713 Software4Students 434558075 0 4 5 2.011082e+13 4 10

36 rows × 10 columns

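Aside: the assignment in In [20] above uses chained indexing (s.nth_rating[mask] = ...), which works here but is the pattern that newer pandas flags with a SettingWithCopyWarning. A more robust equivalent (a sketch) uses .loc:

s.loc[(s.rev_id == 434558075) & s.nth_rating.isin([3, 4]), 'nth_rating'] = 3
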
Finally, to make it easier to deal with the index and reshaping, let's convert the rating keys to meaningful names.

In [21]:
s.rating_key = s.rating_key.replace([1, 2, 3, 4], ['trust','obj','comp','writ'])

Reshape Data

Now we're ready to reshape. We start by dropping variables we no longer need. Next, we set the index to all the variables we don't want to be reshaped (i.e. that don't change for our new observations), with the variable we want to reshape on, rating_key, last.

In [22]:
s = s.drop(['int_timestamp', 'obs_count'], 1)
s = s.set_index(['rev_id', 'nth_rating', 'page_id', 'page_title', 'user_id', 'rating_key'])

Now we do the actual reshaping, using unstack.

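As a toy illustration of what unstack does (independent of our data; the names here are made up):

tiny = pd.DataFrame({'rating_id': [1, 1, 2, 2],
                     'dim': ['trust', 'obj', 'trust', 'obj'],
                     'value': [5, 4, 2, 3]}).set_index(['rating_id', 'dim'])
tiny.unstack(level='dim') # one row per rating_id, one 'value' column per dim
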
In [23]:
stacked = s # keep a reference to the stacked version, for convenience
In [24]:
#s = stacked # to reset
In [25]:
s = s.unstack(level='rating_key') 

Create Additional Fields

First, let's create a datetime variable, parsing the timestamp into a proper date object. We create a series, which we will pd.concat with the rest of the data later.

In [26]:
# Create a series of datetimes
datetime = pd.to_datetime(s.timestamp.comp) # take the 'comp' column (the first of the four after unstacking; the four timestamps can differ by a second or two)
datetime.name = 'datetime'

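Note: pd.to_datetime parses these 14-digit strings here, but on newer pandas versions it is safer (and faster) to pass the format explicitly, e.g. (a sketch):

datetime = pd.to_datetime(s.timestamp['comp'], format='%Y%m%d%H%M%S')
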
Now that we have the datetime (i.e. the information we care about from timestamp), everything else we want to keep is either already in the index or lives in rating_value. Since we don't want to keep a MultiIndex on the columns if we don't need one (it makes accessing columns a pain), we can just take s.rating_value (whose index carries all the other things we care about) and add our other measures to that.

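Concretely, the columns of the unstacked frame are a MultiIndex, which we can inspect before selecting the sub-frame (a sketch):

s.columns # MultiIndex: ('timestamp', 'comp'), ..., ('rating_value', 'writ')
s['rating_value'] # the ratings sub-frame, with plain comp/obj/trust/writ columns
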
In [27]:
rating_values = s.rating_value

Before we can compute mean ratings and the like, we need to clean up the rating_value fields (a newer-pandas sketch of both steps follows the list). In particular, we need to:

  1. Convert values to numeric (from object)
  2. Replace 0s with NaN (i.e. missing or NA), since, as the data dictionary states: "A value set to 0 indicates that the rater didn't rate the article on the corresponding dimension."
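
On newer pandas, these two steps could be written together as (a sketch, not what was run here):

rating_values = rating_values.apply(pd.to_numeric, errors='coerce').replace(0, np.nan)
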
In [28]:
rating_values = rating_values.convert_objects(convert_numeric=True) # convert rating value columns to numeric
In [29]:
rating_values = rating_values.replace(0, np.nan)
rating_values.head(20)
Out[29]:
rating_key comp obj trust writ
rev_id nth_rating page_id page_title user_id
132017785 1 7688277 Information_Harvesting 0 1 NaN NaN NaN
177504140 1 988664 Hygd 0 3 5 4 5
200528301 1 15306408 Northern_Territory_Certificate_of_Education 0 1 1 1 1
204958352 1 11191944 Methionine_adenosyltransferase 0 1 NaN 1 NaN
216938513 1 637164 Zuph 0 4 5 5 4
220802711 1 3283625 Course_(medicine) 0 NaN NaN NaN 4
2 3283625 Course_(medicine) 0 1 1 1 1
229691947 1 6689720 Integrated_bottom_line 1 1 1 1 2
231669276 1 13485005 Tangipahoa 0 NaN 5 5 5
2 13485005 Tangipahoa 0 4 4 4 4
235410872 1 9574456 Velikonda_Range 0 1 1 1 1
2 9574456 Velikonda_Range 0 1 1 1 1
3 9574456 Velikonda_Range 0 1 1 1 1
240879859 1 14245160 Intestinal_cancer_in_cats_and_dogs 0 1 4 1 3
2 14245160 Intestinal_cancer_in_cats_and_dogs 0 1 4 4 4
3 14245160 Intestinal_cancer_in_cats_and_dogs 0 1 1 1 1
242526672 1 11758353 Perceptual_attack_time 1 5 4 5 5
244967736 1 14714488 Southeast_Peninsula_(Saint_Kitts) 0 1 3 4 1
248234025 1 10223439 NGOSS-TNA 0 NaN 3 NaN NaN
250841642 1 550298 Groupaction 0 1 3 NaN 3

20 rows × 4 columns

If all four dimensions are zero/missing, then we obviously can't get much information from the rating, so drop these cases.

In [30]:
print("BEFORE: %d cases with all missing rating values" % rating_values.isnull().all(axis=1).sum())
rating_values.dropna(how='all', inplace=True) # drop in place if _all_ values are NA
print("AFTER: %d cases with all missing rating values" % rating_values.isnull().all(axis=1).sum())
BEFORE: 787 cases with all missing rating values
AFTER: 0 cases with all missing rating values

Now we set s to be rating_values (whose index contains the other info we want) concatenated with the datetime series we created earlier.

In [31]:
s = pd.concat([pd.DataFrame(datetime), rating_values], axis=1)
s.info()
s.head()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 78524 entries, (132017785.0, 1, 7688277, Information_Harvesting, 0.0) to (503626343.0, 1, 3741746, Emma_Stone, 0.0)
Data columns (total 5 columns):
datetime    78524 non-null datetime64[ns]
comp        60263 non-null float64
obj         57496 non-null float64
trust       61993 non-null float64
writ        65568 non-null float64
dtypes: datetime64[ns](1), float64(4)
Out[31]:
datetime comp obj trust writ
rev_id nth_rating page_id page_title user_id
132017785 1 7688277 Information_Harvesting 0 2011-08-25 16:47:58 1 NaN NaN NaN
177504140 1 988664 Hygd 0 2011-08-27 16:29:37 3 5 4 5
200528301 1 15306408 Northern_Territory_Certificate_of_Education 0 2011-12-08 18:05:46 1 1 1 1
204958352 1 11191944 Methionine_adenosyltransferase 0 2011-08-24 13:06:02 1 NaN 1 NaN
216938513 1 637164 Zuph 0 2011-08-05 00:23:29 4 5 5 4

5 rows × 5 columns

Create a variable counting the number of nonmissing dimensions for each rating.

In [32]:
s['rating_dim_count'] = rating_values.count(axis=1)
s.head()
Out[32]:
datetime comp obj trust writ rating_dim_count
rev_id nth_rating page_id page_title user_id
132017785 1 7688277 Information_Harvesting 0 2011-08-25 16:47:58 1 NaN NaN NaN 1
177504140 1 988664 Hygd 0 2011-08-27 16:29:37 3 5 4 5 4
200528301 1 15306408 Northern_Territory_Certificate_of_Education 0 2011-12-08 18:05:46 1 1 1 1 4
204958352 1 11191944 Methionine_adenosyltransferase 0 2011-08-24 13:06:02 1 NaN 1 NaN 2
216938513 1 637164 Zuph 0 2011-08-05 00:23:29 4 5 5 4 4

5 rows × 6 columns

Create two variables with the mean of all rating dimensions:

  • rating_any_mean -- mean of all rated dimensions, regardless of how many dimensions are missing
  • rating_all_mean -- mean of all dimensions only when all dimensions are nonmissing
In [33]:
s['rating_any_mean'] = rating_values.mean(axis=1)
s['rating_all_mean'] = rating_values[rating_values.notnull().all(axis=1)].mean(axis=1)
s.head(20)
Out[33]:
datetime comp obj trust writ rating_dim_count rating_any_mean rating_all_mean
rev_id nth_rating page_id page_title user_id
132017785 1 7688277 Information_Harvesting 0 2011-08-25 16:47:58 1 NaN NaN NaN 1 1.000000 NaN
177504140 1 988664 Hygd 0 2011-08-27 16:29:37 3 5 4 5 4 4.250000 4.25
200528301 1 15306408 Northern_Territory_Certificate_of_Education 0 2011-12-08 18:05:46 1 1 1 1 4 1.000000 1.00
204958352 1 11191944 Methionine_adenosyltransferase 0 2011-08-24 13:06:02 1 NaN 1 NaN 2 1.000000 NaN
216938513 1 637164 Zuph 0 2011-08-05 00:23:29 4 5 5 4 4 4.500000 4.50
220802711 1 3283625 Course_(medicine) 0 2011-07-28 02:45:06 NaN NaN NaN 4 1 4.000000 NaN
2 3283625 Course_(medicine) 0 2011-08-20 23:10:00 1 1 1 1 4 1.000000 1.00
229691947 1 6689720 Integrated_bottom_line 1 2011-07-22 13:09:57 1 1 1 2 4 1.250000 1.25
231669276 1 13485005 Tangipahoa 0 2012-01-31 02:19:28 NaN 5 5 5 3 5.000000 NaN
2 13485005 Tangipahoa 0 2012-04-04 22:49:06 4 4 4 4 4 4.000000 4.00
235410872 1 9574456 Velikonda_Range 0 2011-09-14 19:43:24 1 1 1 1 4 1.000000 1.00
2 9574456 Velikonda_Range 0 2012-02-11 12:22:13 1 1 1 1 4 1.000000 1.00
3 9574456 Velikonda_Range 0 2012-02-27 14:27:35 1 1 1 1 4 1.000000 1.00
240879859 1 14245160 Intestinal_cancer_in_cats_and_dogs 0 2011-07-26 19:16:24 1 4 1 3 4 2.250000 2.25
2 14245160 Intestinal_cancer_in_cats_and_dogs 0 2011-12-09 23:13:00 1 4 4 4 4 3.250000 3.25
3 14245160 Intestinal_cancer_in_cats_and_dogs 0 2012-03-17 08:58:16 1 1 1 1 4 1.000000 1.00
242526672 1 11758353 Perceptual_attack_time 1 2011-11-20 16:17:01 5 4 5 5 4 4.750000 4.75
244967736 1 14714488 Southeast_Peninsula_(Saint_Kitts) 0 2011-11-16 04:08:36 1 3 4 1 4 2.250000 2.25
248234025 1 10223439 NGOSS-TNA 0 2011-09-12 13:02:35 NaN 3 NaN NaN 1 3.000000 NaN
250841642 1 550298 Groupaction 0 2011-08-22 18:51:45 1 3 NaN 3 3 2.333333 NaN

20 rows × 8 columns

We're about ready to reset_index so we can use groupby to create revision- and page-wide statistics, but before we do, let's save the names of the rating value dimensions.

In [34]:
rating_dims = rating_values.columns

Now we can reset_index and delete rating_values.

In [35]:
del(rating_values)
s = s.reset_index()

Rename user_id to logged_in, which better describes what the variable actually measures.

In [36]:
s.rename(columns={'user_id': 'logged_in'}, inplace=True)

Create count variables for the number of ratings for each page and version.

In [37]:
page_groups = s.groupby(['page_id'])
In [38]:
s['page_rate_count'] = page_groups.page_id.transform('count')
In [39]:
rev_groups = s.groupby(['rev_id'])
In [40]:
s['rev_rate_count'] = rev_groups.rev_id.transform('count')

And create a mean within each page/revision for each of the four dimensions.

In [41]:
for var in rating_dims:
    s['page_%s_mean' % var] = page_groups[var].transform('mean')
    s['rev_%s_mean' % var] = rev_groups[var].transform('mean')

And now we're almost done! We just need to sort the data and save it.

In [42]:
s = s.sort(['page_id', 'rev_id', 'datetime'])
In [43]:
s.to_csv("cleaned_wiki_ratings.csv")
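
Note: DataFrame.sort, used in In [42], was removed in later pandas versions; on a newer version the equivalent (a sketch) would be:

s = s.sort_values(['page_id', 'rev_id', 'datetime'])
s.to_csv("cleaned_wiki_ratings.csv", index=False) # index=False skips writing the integer index as an extra column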