Rating Wikipedia

Analysis of Wikipedia ratings data


Overview

What makes a good Wikipedia page?

Using data on ratings for Wikipedia pages, I try to explore that question.

About the data

Wikimedia provides a one-year dump of all ratings on Wikipedia, from July 22, 2011 up to July 22, 2012. This file contains almost 50 million records of almost 12 million unique ratings (NB: there are 4 dimensions for each rating, so one person rating an article will provide up to 4 data points).

The file (approx 441.6 MB gzipped) is hosted here: http://datahub.io/dataset/wikipedia-article-ratings/resource/8a218330-6ac3-40d1-ae0d-4224f57db214

There are 4 dimensions on which a page could be rated:

  1. Trustworthy (trust in my cleaned dataset)
  2. Objective (obj)
  3. Complete (comp)
  4. Well-written (writ)

A user could rate any or all dimensions for each page. In addition to the values for each rating dimension, the data identify the page and version (revision) being rated and whether the rater was logged in.

Because the data are anonymized, it is impossible to know whether a user rated a page more than once; since most of the ratings come from non-logged-in users, it is certainly possible.

Motivating questions

My goal for this analysis was to first analyze the ratings data itself to see whether the ratings seemed to provide a useful measure of page quality. Then, I wanted to combine the ratings data with data on the actual page versions rated to look at how different kinds of edits affect the page's rating. Namely, is there a way to quantify which edits do the most to improve a page in any or all of the four rating dimensions? For example, are big edits (based on number of characters edited) better than small edits? Do small edits improve score for quality of writing (writ), while big edits improve completeness (comp)? Are logged-in users better editors than anonymous users?

If there were a way to identify the edits that do the most to improve article quality, Wikipedia could put that information to use.

Cleaning Data

The first step was cleaning the ratings data. The IPython Notebook showing all the cleaning code and output is here. In addition to reshaping the data so that each rating is a single observation rather than four separate rows, I cut the data to a sample of some 79,000 ratings for 10,000 different pages.
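As a rough illustration of the reshaping step, here is a minimal sketch assuming a long-format dump with one row per rated dimension. The column names (page_id, rev_id, key, rating, logged_in, etc.) and the dimension labels are my own placeholders, not necessarily the actual dump schema:

```python
import pandas as pd

# Hypothetical column names for the long-format dump (one row per rated dimension);
# the real dump schema may differ.
cols = ["page_id", "page_title", "rev_id", "timestamp", "logged_in", "key", "rating"]
raw = pd.read_csv("ratings_sample.tsv", sep="\t", names=cols)

# Pivot so each rating event becomes one observation with up to four dimension columns.
# Grouping the four rows of one rating event by (page, revision, timestamp, logged_in)
# is an assumption about how events are identified.
wide = (raw
        .pivot_table(index=["page_id", "page_title", "rev_id", "timestamp", "logged_in"],
                     columns="key", values="rating", aggfunc="first")
        .rename(columns={"trustworthy": "trust", "objective": "obj",
                         "complete": "comp", "well-written": "writ"})
        .reset_index())
```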

Exploring Data

My exploratory analysis is documented here. I start by looking at the ratings themselves.

1. How is the number of ratings distributed across pages and versions (revisions) of pages?

In order for ratings to be any kind of measure of page quality, people have to actually rate pages. How many ratings do pages get in the year-long sample? And do some pages get significantly more than others?

As with a lot of data dealing with popularity (e.g. the population of cities), the data appear to follow Zipf's law: a small number of pages get a huge number of ratings, the count drops off quickly, and the distribution ends with a long right tail. Here is a chart illustrating the distribution:

Figure 1

Chart of distribution of ratings per page (in sample)

The average page in the sample has 7.85 ratings in the year-long period (standard deviation of 42.1), though of course the sample is of pages with at least one rating in that period. The most-rated page in my sample is "The Hunger Games" with 2,713 ratings. The average page has 3.66 different versions in the sample, with each version being rated an average of 1.63 times.
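These summary figures could be computed directly from the reshaped data; a sketch, reusing the hypothetical `wide` frame from the cleaning step above:

```python
# Ratings per page and per version, using the hypothetical `wide` frame.
ratings_per_page = wide.groupby("page_id").size()
ratings_per_version = wide.groupby(["page_id", "rev_id"]).size()
versions_per_page = wide.groupby("page_id")["rev_id"].nunique()

print(ratings_per_page.describe())    # the text reports mean ~7.85, std ~42.1
print(versions_per_page.mean())       # the text reports ~3.66 versions per page
print(ratings_per_version.mean())     # the text reports ~1.63 ratings per version
print(ratings_per_page.idxmax(), ratings_per_page.max())  # most-rated page in the sample
```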

2. How is the value of ratings distributed across pages and versions?

Most ratings tend to be high on the 1-5 scale.

Figure 2

Chart of distribution of rating values

The above chart shows the distribution of average rating value for each observation. The first plot is the distribution of rating_all_mean, which is the mean of the 4 rating dimensions when all 4 dimensions are rated (and NA otherwise), while the second plot is the distribution of rating_any_mean, which is the mean of all rating dimensions that are present (whether there is only 1 dimension rated, or 4, etc.). Note that, for both of these histograms (and all similar histograms below), each bin does not include the right-most boundary except the last, which includes ratings with means from 4.5 to 5, inclusive. The third plot shows the proportions of each rating observation that have 1, 2, 3, or 4 dimensions rated.
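A sketch of how these two measures and the half-point binning could be computed, again assuming the hypothetical dimension columns trust, obj, comp, and writ:

```python
import numpy as np

dims = ["trust", "obj", "comp", "writ"]

# Mean of all four dimensions, NA unless all four were rated.
wide["rating_all_mean"] = wide[dims].mean(axis=1).where(wide[dims].notna().all(axis=1))
# Mean of whichever dimensions are present (1 to 4 of them).
wide["rating_any_mean"] = wide[dims].mean(axis=1)
wide["n_dims_rated"] = wide[dims].notna().sum(axis=1)

# Half-point bins: numpy treats each bin as right-open except the last,
# so the final bin is [4.5, 5] inclusive, matching the histograms described above.
counts, edges = np.histogram(wide["rating_any_mean"].dropna(),
                             bins=np.arange(1, 5.5, 0.5))
```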

Over 40% of observations have ratings that average 4.5 or greater on a scale of 5. Furthermore, almost 70% of users rated all 4 dimensions, while about 25% rated only one dimension, and very few rated 2 or 3. When all four dimensions are rated, the mean rating is 3.66, with a standard deviation of 1.34, and the median is 4.0.

Given the rest of the distribution, there is a larger-than-expected frequency of low ratings (namely, 1s). This bimodality, where there is a peak in the ratings distribution around [4.5, 5] and then around [1, 1.5), is relatively consistent across rating dimensions.

Figure 3

Chart of distribution of each rating dimension

While the rightmost peak (at rating value of 5) is lower for "Complete", all 4 dimensions have another peak at a rating value of 1.

Thus, it appears that raters tend to rate pages at the extremes, either very high or very low. This is true regardless of whether the user rates only 1 dimension or all 4 (see the complete exploration for charts illustrating this).

3. Are rating dimensions meaningful?

Is it worth asking users to rate 4 dimensions instead of simply giving each page a 1-5 rating? Each dimension is highly correlated with each other dimension, indicating that a user who gives a high rating to "Complete", for example, is very likely to give a high rating to "Well-written". Here is a table of Pearson correlations between each dimension:

Table 1
Correlation by Dimension | Complete | Objective | Trustworthy | Well-written
-------------------------|----------|-----------|-------------|-------------
Complete                 | 1.000000 | 0.715450  | 0.749500    | 0.764108
Objective                | 0.715450 | 1.000000  | 0.774930    | 0.738221
Trustworthy              | 0.749500 | 0.774930  | 1.000000    | 0.753312
Well-written             | 0.764108 | 0.738221  | 0.753312    | 1.000000

The correlations are not perfect, however. So while each dimension adds only a modest amount of new information, it does add some, and depending on the cost to the user of rating each dimension, it may still make sense to have separate categories.
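For reference, a correlation table like the one above could be produced in one line from the reshaped data (pandas computes pairwise Pearson correlations over the non-missing values by default):

```python
# Pairwise Pearson correlations between the four rating dimensions
# (column names follow the hypothetical `wide` frame above).
corr = wide[["comp", "obj", "trust", "writ"]].corr(method="pearson")
print(corr.round(6))
```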

Nevertheless, paired difference tests suggest that the means of the dimensions generally do differ significantly. For example, here are the p-values from a Wilcoxon signed-rank test comparing each dimension to each other dimension:

Table 2
Wilcoxon Signed-Rank Test | Complete | Objective | Trustworthy | Well-written
--------------------------|----------|-----------|-------------|-------------
Complete                  | X        | 0.000     | 0.000       | 0.000
Objective                 | 0.000    | X         | 0.000       | 0.153
Trustworthy               | 0.000    | 0.000     | X           | 0.000
Well-written              | 0.000    | 0.153     | 0.000       | X

Figures reported are p-values from a Wilcoxon Signed-Rank test performed for each combination of rating dimensions.

All pairs are significantly different with the exception of "Objective" and "Well-written", whose comparison has a p-value of 0.153.
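A minimal sketch of how the pairwise comparisons could be run with SciPy; the original notebook may use different options or a different implementation:

```python
from itertools import combinations
from scipy.stats import wilcoxon

dims = ["comp", "obj", "trust", "writ"]
for a, b in combinations(dims, 2):
    paired = wide[[a, b]].dropna()            # keep observations rating both dimensions
    stat, p = wilcoxon(paired[a], paired[b])  # paired, non-parametric test
    print(f"{a} vs. {b}: p = {p:.3f}")
```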

4. How consistent are the ratings for a single page or version?

If ratings are indeed useful measures of page quality, and if page quality is something that is consistent for various users of Wikipedia, then we would hope that ratings cluster around the mean. That is, we would want most people to tend to give ratings that are close to a certain average. The bimodality reflected in #2 above might indicate that this is not the case, however.

There are many ways to measure the spread of data like this, but I will focus on the mean absolute deviation (MAD), which is simply the mean of the absolute value of the difference between each observation's value and the mean or median value. Given that both the mean and median rating are quite high, and that rating values are capped at 5, the distance from the point of central tendency (mean or median) can be larger for low ratings than for high ones. It therefore makes less sense to use the standard deviation, which weights bigger deviations more heavily (since it is the square root of the mean squared deviation). The MAD can be easily understood as the average difference between a particular rating and the mean/median of all ratings for that page/version.
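A sketch of the page-level computation behind Table 3 below, restricted, as in the tables, to pages with at least two ratings (column names again follow the hypothetical `wide` frame):

```python
# MAD of the all-dimension mean rating, by page, for pages with >= 2 such ratings.
rated_all = wide.dropna(subset=["rating_all_mean"])
sizes = rated_all.groupby("page_id")["rating_all_mean"].transform("size")
multi = rated_all[sizes >= 2]

page_mean = multi.groupby("page_id")["rating_all_mean"].transform("mean")
page_median = multi.groupby("page_id")["rating_all_mean"].transform("median")

abs_dev_from_mean = (multi["rating_all_mean"] - page_mean).abs()
abs_dev_from_median = (multi["rating_all_mean"] - page_median).abs()
print(abs_dev_from_mean.describe())     # the mean of this series is the MAD
print(abs_dev_from_median.describe())
```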

Table 3
Distribution of MAD by Page                   | Count  | Mean  | 25%   | 50%   | 75%
----------------------------------------------|--------|-------|-------|-------|------
Abs. Dev. from Page Mean - Mean of All Dim.   | 49,712 | 0.956 | 0.395 | 0.820 | 1.333
Abs. Dev. from Page Median - Mean of All Dim. | 49,712 | 0.905 | 0.250 | 0.625 | 1.250
Abs. Dev. from Page Mean - "Complete"         | 55,500 | 1.126 | 0.500 | 1.059 | 1.624
Abs. Dev. from Page Mean - "Objective"        | 53,265 | 1.103 | 0.500 | 1.000 | 1.500
Abs. Dev. from Page Mean - "Trustworthy"      | 56,986 | 1.117 | 0.500 | 0.981 | 1.509
Abs. Dev. from Page Mean - "Well-Written"     | 60,508 | 1.040 | 0.500 | 0.893 | 1.418

Only pages with at least 2 ratings were included for this table. The computation for the above table is as follows: 1) compute the mean/median rating across all observations for a given page; 2) for each observation, calculate the absolute value of the difference between that observation's rating for the particular dimension (or the mean of all dimensions) and the page-level mean/median from step 1. Thus, the Mean column constitutes the MAD for the given dimension. Note that for the first two rows, labeled Mean of All Dim., all observations with fewer than all 4 dimensions rated are excluded; for the remaining rows, all non-missing ratings for that dimension are included, regardless of whether other dimensions were also rated in that observation.


Table 4
Distribution of MAD by Version                   | Count  | Mean  | 25%   | 50%   | 75%
-------------------------------------------------|--------|-------|-------|-------|------
Abs. Dev. from Version Mean - Mean of All Dim.   | 34,515 | 0.817 | 0.275 | 0.649 | 1.200
Abs. Dev. from Version Median - Mean of All Dim. | 34,515 | 0.758 | 0.125 | 0.500 | 1.125
Abs. Dev. from Version Mean - "Complete"         | 37,403 | 0.960 | 0.400 | 0.826 | 1.500
Abs. Dev. from Version Mean - "Objective"        | 36,317 | 0.933 | 0.333 | 0.750 | 1.400
Abs. Dev. from Version Mean - "Trustworthy"      | 38,216 | 0.958 | 0.333 | 0.800 | 1.484
Abs. Dev. from Version Mean - "Well-Written"     | 39,920 | 0.902 | 0.333 | 0.722 | 1.333

Only versions with at least 2 ratings were included for this table. See also note to Table 3 above.

Even though the sample only covers a year, it is of course possible that pages would differ significantly between ratings. However, when looking at versions of pages, we see MADs that are smaller, but still relatively large in magnitude. To put this in perspective, consider a version of an article that has only 2 ratings; Table 4 indicates that the average of all 4 rating dimensions for those two ratings would tend to be about 1.6 points apart on a 1-5 scale, meaning that if one rating was a 5 (very high), on average the other rating would be a 3.4 (very mediocre, given the fact that most ratings are, in fact, 4s or 5s).

To further illustrate this, here are plots of the distributions of the non-absolute differences between each observation and the average for that version:

Figure 4

Chart of distribution of deviation from version average (versions with at least 2 ratings)

Figure 5

Chart of distribution of deviation from version average (versions with at least 4 ratings)

Figure 6

Chart of distribution of deviation from version average (versions with at least 6 ratings)

The cutoffs of 2, 4, and 6 were chosen because they represent 100%, 25%, and 10%, respectively, of the sample of versions with at least 2 ratings.

Though there initially appears to be just a rightward skew to the distributions, as we get more ratings per version, the familiar bimodal pattern begins to show up again. Furthermore, the bimodality becomes much more pronounced when looking at versions with even higher numbers of ratings, and it is also more pronounced when looking at pages, rather than versions of pages (no doubt in part because pages tend to have higher numbers of ratings per page, and thus there is more information).

On the one hand, the fact that this bimodality is clearer with a higher N (i.e. more ratings per version) could indicate that we simply have more information and thus a clearer picture of what could be a fundamental, bimodal pattern. However, it could also mean that pages with many ratings are in some way different; for example, pages with more ratings could be more controversial, leading some users to give low ratings simply because they disagree with the content of the page itself, rather than judging its completeness, trustworthiness, objectivity, or quality of writing.

5. Do different kinds of users rate pages differently?

In the dataset, the only information we have about the user rating the page is whether they are logged in or not. Still, given that most users are not logged in, being logged in could indicate that the user is more aware of the standards of Wikipedia, and thus is a better judge of article quality. It could also be that logged-in users are more invested in the quality of Wikipedia and thus will put more effort into providing a full and accurate rating. Still another option is that logged-in users are more likely to have edited the pages themselves, and thus may be either rating their own work or giving a low rating before deciding to edit the page.

Figure 7

Chart of distribution of ratings based on logged-in status

Here it appears that logged-in users are less likely to give low ratings. When I look at the MAD for logged-in vs. not-logged-in users, logged-in users have MADs that are significantly (p-value < 0.001) lower than those of users who are not logged in, meaning that not only are logged-in users less likely to give low ratings, but they also tend to give ratings that are closer to those of other logged-in users.
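The write-up does not name the test used for this comparison; one way it could be sketched is with a Mann-Whitney U test on the absolute deviations, assuming a boolean logged_in column in the hypothetical `wide` frame:

```python
from scipy.stats import mannwhitneyu

# Absolute deviation of each observation's all-dimension mean from its version mean.
ver_mean = wide.groupby(["page_id", "rev_id"])["rating_all_mean"].transform("mean")
abs_dev = (wide["rating_all_mean"] - ver_mean).abs()

is_logged_in = wide["logged_in"].astype(bool)   # assumed boolean flag
dev_logged_in = abs_dev[is_logged_in].dropna()
dev_anonymous = abs_dev[~is_logged_in].dropna()

# One-sided test: are logged-in users' deviations stochastically smaller?
stat, p = mannwhitneyu(dev_logged_in, dev_anonymous, alternative="less")
print(p)
```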

6. Do pages increase in quality over time?

In my exploratory analysis, I look at the rating history for the most frequently edited and rated pages, trying to find a pattern of improving quality over time. Here, for example, are plots of the history for three pages that have many versions and high numbers of ratings per version:

Figure 8

Chart of page rating history: "Arithmetic Progression"

Figure 9

Chart of page rating history: "List of spells in Harry Potter"

Figure 10

Chart of page rating history: "United States Declaration of Independence"

The boxplot for each page shows the distribution of rating_all_mean (i.e. the mean rating when all 4 dimensions are recorded) for each version of the page in our dataset, indexed by version number (rather than, say, date the version was made). The blue line and circles trace the mean, while the red lines are the median.

There certainly does not appear to be a broad upward trend. Indeed, when I look at correlations for all pages in the data set, there is almost no correlation between version number and rating value:

Table 5
Correlation  | Mean Rating, Any Dimensions | Mean Rating, All Dimensions | "Complete" | "Objective" | "Trustworthy" | "Well-Written"
-------------|-----------------------------|-----------------------------|------------|-------------|---------------|---------------
nth_version  | 0.063059                    | 0.063983                    | 0.096140   | 0.035822    | 0.051972      | 0.049783

It is possible that a different trend would emerge if I combined versions that are essentially the same because the edits between them were very minor. But as of now, there is not much evidence that pages get better over time when quality is measured by user ratings.
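For reference, the version-number correlations in Table 5 could be computed along these lines, assuming revision IDs increase over time within a page (another simplification on the hypothetical `wide` frame):

```python
# Index each rated version within its page (1 = earliest revision in the sample),
# then correlate the version index with each rating measure.
wide["nth_version"] = wide.groupby("page_id")["rev_id"].rank(method="dense")

measures = ["rating_any_mean", "rating_all_mean", "comp", "obj", "trust", "writ"]
print(wide[measures].corrwith(wide["nth_version"]))
```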

Next Steps

I am currently working on merging in data about the actual edits (and other data) from the MediaWiki API. Once I do that, I'll be able to see if certain kinds of edits affect ratings in different ways. For example: