Research‎ > ‎GNOME Archive Data‎ > ‎

GNOME Sample SQL Queries

Getting an Overview of the Data


The first thing you'll want to do is to get an idea of the amount of data in the database. These queries get general information about the major data sources in the database:

cvsminer=# select count(*) from person;
 count 
-------
 10145
(1 row)

cvsminer=# select count(*) from mail_message;
 count  
--------
 857369
(1 row)

cvsminer=# select count(*) from cvs_commit;
 count  
--------
 726560
(1 row)

cvsminer=# select count(*) from bz_bugs;
 count  
--------
 411517
(1 row)

cvsminer=# select count(*) from feed_entry;
 count  
--------
 150153
(1 row)

From this it should be very obvious that the data contained in the database are not the complete record of everything that ever happened in the GNOME community. The data are focused around questions that I found most interesting and was focusing on for my own research. Thus, not all projects have equal amounts of information in the database.

What Are the Email Addresses of People Who Have Contributed to Evolution?

This query shows how much of the data is linked up. We first take an email address (email_address) and map that to person (person), which can be mapped to a cvs account (users), which may have created multiple cvs commits (cvs_commit), on multiple projects (project), which belong to the large evolution master project (master_project). I realize that it is possible to compress this query some by eliminating the person table and various other optimizations, but it's left fully verbose here to make the process clearer.

cvsminer=# select distinct(email_address.email) 
             from email_address, person, person_user, users, cvs_commit, project, master_project
            where email_address.person_id=person.person_id 
                  and person.person_id=person_user.person_id
                  and person_user.user_id=users.user_id
                  and users.user_id=cvs_commit.user_id
                  and cvs_commit.project_id=project.project_id
                  and project.master_project_id=master_project.master_project_id
                  and master_project.master_project_name='gnome/evolution';

                   email                   
-------------------------------------------
 efgbr@terra.com.br
 hp@redhat.com
 spapadim@cs.cmu.edu
 michael@helixcode.com
 gilosher@gmail.com
 damon@gnome.org
 bulkmeel@yahoo.co.uk
[and several hundred more]

What Mailing Lists Do Evolution Developers Post To?

This is an expansion of the previous query where we continue the join two steps further to join find the all of the mailing list posts that the email_address has done (mail_message) and then figure out which lists those posts were done on (mailing_list).

cvsminer=# select distinct(mailing_list.name) 
             from mailing_list, mail_message, email_address, person, 
                  person_user, users, cvs_commit, project, master_project
            where mailing_list.mailing_list_id=mail_message.mailing_list_id
                  and mail_message.email_address_id=email_address.email_address_id
                  and email_address.person_id=person.person_id 
                  and person.person_id=person_user.person_id
                  and person_user.user_id=users.user_id
                  and users.user_id=cvs_commit.user_id
                  and cvs_commit.project_id=project.project_id
                  and project.master_project_id=master_project.master_project_id
                  and master_project.master_project_name='gnome/evolution';

           name            
---------------------------
 libsoup
 muine
 gnome-planning-summit
 atomato
 evince
 gtk-app-devel
 gnome-cyr
 straw
 xml-i18n-tools
 xml-bindings
[and many more, 154 total]

How Many Posts Were on Made to Each Blog Aggregator

The blog engine was something I wrote for a one off weekend project. From there it just ran for a couple of years. That's the power of cron. The basic premise was that at some points I crawled the various blog planets for their OPML lists of subscriptions. I then visited all of those URLs religiously. The planet lists were not frequently updated, so this doesn't represent and entirely true image, but it might be interesting to natural language people or something like that.

cvsminer=# select aggregator.title, count(feed_entry.feed_entry_id)
             from feed_entry, feed, feed_aggregator, aggregator
            where feed_entry.feed_id=feed.feed_id
                  and feed.feed_id=feed_aggregator.feed_id
                  and feed_aggregator.aggregator_id=aggregator.aggregator_id
         group by aggregator.title;

          title           | count 
--------------------------+-------
 Fedora People            |  9037
 Planet GNOME             | 23344
 planet.freedesktop.org   |  4490
 PlanetJava.org           | 36626
 Planet Mozilla           | 11043
 planet n                 |  8959
 Planet SuSE              |  7639
 Planet Ubuntu            |  8898
 Planet Arslinux          |   966
 Planet Perl              |  5599
 Planet GStreamer         |  3645
 Planet KDE               | 11009
 Planet Beagle            |  1122
 Planet Debian            | 22234
 Unofficial Planet Python | 23834
(15 rows)
Comments