To install the project:

```bash
git clone https://github.com/Felihong/wikidata-sequence-analysis.git
cd wikidata-sequence-analysis
pip install -r requirements.txt
```
The sample data are collected mainly for the following two analysis perspectives:
- Descriptive statistics of the collected data
- Behaviour patterns, extracted with the help of sequence pattern mining (a toy sketch follows below)
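As a toy illustration of the second perspective, the sketch below counts frequent contiguous edit-type pairs (2-grams) in per-item edit sequences. The sequences and edit-type names are placeholders; in the project they would be built from the tables described below.

```python
from collections import Counter

# Placeholder edit-type sequences; in the project these would come
# from the revision/comment tables described below.
sequences = [
    ["wbsetlabel", "wbsetdescription", "wbsetclaim"],
    ["wbsetlabel", "wbsetclaim", "wbsetclaim"],
    ["wbsetdescription", "wbsetclaim", "wbsetclaim"],
]

# Count contiguous pairs of edit types across all sequences.
pair_counts = Counter()
for seq in sequences:
    pair_counts.update(zip(seq, seq[1:]))

# The most common pairs hint at recurring behaviour patterns.
for pair, count in pair_counts.most_common(5):
    print(pair, count)
```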
The data are collected as follows:
- Randomly identify 100 items per current quality prediction class (A, B, C, D, E); these are retrieved from the `wikidatawiki_p` `page` table (`page_latest`) and the ORES API
- The edit histories of all items are retrieved from the `wikidatawiki_p` `revision` table
- All data above are then combined with the respective editor information from the `wikidatawiki_p` `user` table, together with edit comments from the `wikidatawiki_p` `comment` table and user group information from the `wikidatawiki_p` `user_groups` table
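For reference, here is a minimal sketch of querying the public ORES API for an itemquality prediction. The revision ID is a placeholder, and the indexing below assumes the layout of the ORES v3 scores endpoint; verify against an actual response first.

```python
import requests

rev_id = 123456789  # placeholder; in the project, page_latest values are used
url = f"https://ores.wikimedia.org/v3/scores/wikidatawiki/{rev_id}/itemquality"

# Wikimedia APIs expect a descriptive User-Agent.
response = requests.get(url, headers={"User-Agent": "example@example.org"}, timeout=30)
response.raise_for_status()

# Assumed v3 response layout: {"wikidatawiki": {"scores": {<rev_id>: {...}}}}
score = response.json()["wikidatawiki"]["scores"][str(rev_id)]["itemquality"]["score"]
print(score["prediction"], score["probability"])
```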
Article table:

| article_id | item_id | item_title | label | category |
|---|---|---|---|---|
| table ID, primary key | edited item page ID | respective item page name | English label of the item page | classified content category based on label and description |
Editor table:

| editor_id | user_id | user_name | user_group | user_editcount | user_registration |
|---|---|---|---|---|---|
| table ID, primary key | editor ID | editor name | editor's user group and the corresponding user rights | rough number of edits and edit-like actions the user has performed | editor registration timestamp |
Item quality table:

| rev_id | prediction | itemquality_A | itemquality_B | itemquality_C | itemquality_D | itemquality_E | js_distance |
|---|---|---|---|---|---|---|---|
| revision (edit) ID, primary key | quality prediction for this revision, chosen as the class with the highest probability | probability of quality level A | probability of quality level B | probability of quality level C | probability of quality level D | probability of quality level E | Jensen-Shannon divergence value based on the given quality distribution |
Edit comment table:

| rev_id | comment | edit_summary | edit_type | paraphrase |
|---|---|---|---|---|
| revision (edit) ID, primary key | original comment information for this edit | comment information simplified with regular expressions | schematized and classified edit summary for ease of use | paraphrase of the edit summary according to the Wikibase API |
Revision table:

| rev_id | parent_id | editor_id | article_id | rev_timestamp |
|---|---|---|---|---|
| revision (edit) ID, primary key | preceding revision (edit) ID | foreign key to table editor | foreign key to table article | revision timestamp |
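The sketch below shows how these tables could be combined into per-item edit sequences for the pattern mining step, assuming each table has been exported to a CSV file with the column names above (the file names are hypothetical):

```python
import pandas as pd

# Hypothetical CSV exports of the revision and edit comment tables.
revision = pd.read_csv("revision.csv")
comment = pd.read_csv("comment.csv")

# Join edits with their classified edit types and order them per item.
edits = revision.merge(comment, on="rev_id")
edits = edits.sort_values(["article_id", "rev_timestamp"])

# One ordered edit-type sequence per item, ready for sequence mining.
sequences = edits.groupby("article_id")["edit_type"].apply(list)
print(sequences.head())
```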
It is strongly suggested to install ORES inside a virtual environment. First install Python's virtualenv, create a directory named python-environments, and navigate into it:

```bash
sudo apt install virtualenv
mkdir python-environments
cd python-environments
```
Create a Python 3 virtual environment named project_ores, then activate it:

```bash
virtualenv -p python3 project_ores
source project_ores/bin/activate
```
Alternatively, you may also create the virtual environment with Anaconda:

```bash
conda create --name project_ores python=3.5.0
conda activate project_ores
```
Now install the ORES package in the virtual environment:

```bash
pip install ores
```
The following steps show how to score the given input revisions with the ORES itemquality model; the data can be fetched from the command line using the ORES built-in tools.
To pull a sample, start with:

```bash
cat revision_id.csv | tsv2json int | ores score_revisions https://ores.wikimedia.org \
    'Example app, here should be your user agent' \
    wikidatawiki \
    itemquality \
    --input-format=plain \
    --parallel-requests=4 \
    > result.jsonlines
```
Please make sure that the input file revision_id.csv begins with a header `rev_id`.
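For example, a small input file could be generated like this (the revision IDs are placeholders):

```python
import csv

# Write a single-column input file with the required "rev_id" header.
# With one column the delimiter hardly matters; tab matches the
# tsv2json step in the pipeline above.
with open("revision_id.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["rev_id"])
    for rev_id in [123456789, 123456790]:
        writer.writerow([rev_id])
```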
After this, the script itemquality_scores_to_csv.py can be used to parse the results into CSV:

```bash
python itemquality_scores_to_csv.py < result.jsonlines > result.csv
```
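In case you want to adapt the parsing, the sketch below shows the general idea. It assumes each JSON line carries a rev_id and an itemquality score with a prediction and per-class probabilities; the exact field layout depends on the ORES version, so inspect one line of result.jsonlines first.

```python
import csv
import json
import sys

CLASSES = ["A", "B", "C", "D", "E"]

writer = csv.writer(sys.stdout)
writer.writerow(["rev_id", "prediction"] + [f"itemquality_{c}" for c in CLASSES])

for line in sys.stdin:
    doc = json.loads(line)
    score = doc["score"]["itemquality"]["score"]  # assumed layout; verify first
    probs = score["probability"]
    writer.writerow(
        [doc["rev_id"], score["prediction"]] + [probs.get(c, "") for c in CLASSES]
    )
```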