Using the Data
If you have SSH access to the production server, all of the tools used below should already be installed for you.
Most simple data processing can be done in the MongoDB shell, mongosh.
The MongoDB shell can return data directly, but its output is based on BSON (an extended format that includes binary fields) rather than JSON, so it cannot be processed by jq. See the section below for jq usage.
Once you have started an authenticated mongosh session, queries can be made using the MongoDB query language:
mongosh "mongodb://<server address>:27017/github?authSource=admin" --username <username> --password <password> --quiet
github> db.getCollection('users').find({ company: RegExp('.*University') }).count()
Where the required processing can be done in mongosh, this will usually be substantially faster, because the data does not need to be transferred out of the database to be processed.
For more details on mongosh, including installation instructions, see https://docs.mongodb.com/mongodb-shell/.
In order to get JSON data suitable for processing at the command line with a tool like jq, we need to use mongoexport.
This is part of the MongoDB Database Tools collection, which can be installed following the instructions at https://docs.mongodb.com/database-tools/installation/installation/.
First, we need to create a config file containing the connection details:
# file: mongoexport.conf
uri: mongodb://<username>@<server address>:27017/github?authSource=admin
password: <password>
Now we can use mongoexport to fetch some data, using -c to specify the collection (table) name and, optionally, -q to filter with a MongoDB query.
For example, select all users whose company field satisfies the regex .*University, pass these records to jq and count them:
mongoexport --config=mongoexport.conf --quiet -c users -q '{ "company": { "$regex": ".*University" } }' | jq -s '. | length'
For more details on MongoDB query syntax see https://docs.mongodb.com/manual/tutorial/query-documents/.
For more details on mongoexport see https://docs.mongodb.com/database-tools/mongoexport/.
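Where jq is not available, the same count can be computed from the exported records in Python. The following is a minimal sketch, not part of the project's tooling: it assumes mongoexport is run with its default output of one JSON document per line (that is, without --jsonArray), and the function name count_matching and the sample records are made up for illustration. Python's re module is not identical to MongoDB's regex engine, so complex patterns may behave differently.

```python
import json
import re
from typing import Iterable


def count_matching(lines: Iterable[str], field: str, pattern: str) -> int:
    """Count JSON records whose `field` is a string that matches `pattern`.

    `lines` is any iterable of JSON documents, one per line, which is
    assumed to be the shape of mongoexport's default output.
    """
    regex = re.compile(pattern)
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        value = record.get(field)
        # Unanchored search, like an unanchored MongoDB $regex filter
        if isinstance(value, str) and regex.search(value):
            count += 1
    return count


# Hypothetical records standing in for exported users
sample = [
    '{"login": "alice", "company": "Example University"}',
    '{"login": "bob", "company": "Acme Ltd"}',
    '{"login": "carol", "company": null}',
]
print(count_matching(sample, "company", r".*University"))  # prints 1
```

To run this against real data, the output of mongoexport could be piped into a script that passes sys.stdin as the lines argument. Filtering with -q on the server side, as shown above, is still preferable when possible, since fewer records have to leave the database.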
To label subsets of the data, you need the gha tool installed and configured with the connection details of the MongoDB server.
If you have SSH access to the server on which the database is deployed, this should already have been done for you. After SSHing in, you can check with:
which gha
If the tool is not installed, or to update it, use:
python3.8 -m pip install --user --upgrade pip
python3.8 -m pip install --user --upgrade /srv/app/.
Then create a .env file specifying the database connection details:
# file: .env
DATABASE_URL=mongodb://<username>:<password>@<server address>:27017/github?authSource=admin
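A malformed connection string is an easy mistake to make here, so it can help to sanity-check the file before running gha. The sketch below is only an illustration and says nothing about how gha itself reads the file: the helper names read_database_url and describe are hypothetical, and the example values are placeholders. It assumes a simple KEY=value file with # comments, as shown above.

```python
from urllib.parse import urlsplit


def read_database_url(env_text: str) -> str:
    """Return the value of DATABASE_URL from the text of a .env file."""
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first "=" only; the URL itself may contain "="
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DATABASE_URL":
            return value.strip()
    raise KeyError("DATABASE_URL not found")


def describe(url: str) -> dict:
    """Split a MongoDB connection string into its main components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "query": parts.query,
    }


# Placeholder credentials and host, for illustration only
example = (
    "# file: .env\n"
    "DATABASE_URL=mongodb://alice:secret@db.example.org:27017/github?authSource=admin\n"
)
info = describe(read_database_url(example))
print(info["host"], info["port"], info["database"])
```

Note that characters such as @, : or / in a username or password must be percent-encoded in a MongoDB connection string, otherwise the URL will not split as intended.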
To label a subset of the data, use:
gha name-set -f <repos file> --set-name <subset name>
This will add the subset name to all records belonging to each repo listed in the repos file.
Once labelled, the subset name can be used as part of a query filter or find operation, e.g.:
mongosh "mongodb://<server address>:27017/github?authSource=admin" --username <username> --password <password> --quiet
github> db.getCollection('repos').find({ sets: '<subset name>' }).count()