Sunday, April 6, 2008

Ambassador counting by region

This is my approach to getting more (and hopefully better) metrics about the Ambassadors Project with my poor Python programming skills. During March I added the new ambassadors to the statistics page by hand, and that's not the way we want to go... Francesco Crippa made a Python script to count the members of the Fedora Ambassadors Project. His idea was to collect the diffs of the Country List page. Unfortunately that page is updated manually and therefore not very reliable. IMHO the better way is to get the data from the various category pages (for example CategoryAmbassadorsFrance). The personal wiki pages of the ambassadors are usually more up to date, because Thomas Chung adds the category after verification.

The script is quite simple and at the moment just a working prototype. It takes a list of countries, fetches all links from the corresponding category pages (including the wiki names), does some work with regular expressions, and prints out a number: the count of all ambassadors in the selected countries.
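Before the full listing, the link-collection step can be sketched with nothing but the standard library (this is current Python 3, while the script below is Python 2 with BeautifulSoup; the HTML snippet and the names in it are made up, and the real MoinMoin markup differs):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the text of every <a> tag on a page -- roughly what
    the BeautifulSoup loop in countryList() below does."""
    def __init__(self):
        super().__init__()
        self.links = []          # text content of each anchor
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.links.append(data)

# A made-up snippet of a category page; the real markup differs.
sample = ('<p><a href="/wiki/GeroldKassube">GeroldKassube</a> '
          '<a href="/wiki/JohnDoe">JohnDoe</a></p>')
collector = LinkCollector()
collector.feed(sample)
names = sorted(set(collector.links))   # unique names, like doubleNames() below
```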

import sys, urllib, os, re
from BeautifulSoup import BeautifulSoup

def countryList(countries):
  """
  Collect the raw link tags from the Fedora Project category pages.
  """
  urlList = []
  for country in countries:
    URL = urllib.urlopen('http://fedoraproject.org/wiki/CategoryAmbassadors' + country)
    soup = BeautifulSoup(URL)
    for tag in soup("a"):
      urlList.append(str(tag))
  # Return after *all* countries have been fetched, not inside the loop
  return urlList

def cleanList(rawHTML):
  """
  Keep only the links that point to ambassador pages.
  """
  # %28CategoryCategory%5Cb%29 is the URL-encoded category search
  # pattern that MoinMoin puts into category member links.
  result = []
  for line in rawHTML:
    if "%28CategoryCategory%5Cb%29" in line:
      result.append(line)
  return result

def get_names(cleanHTML):
  """
  Extract the names of the ambassadors from the link tags.
  """
  re1 = '.*?'                # Non-greedy filler between the words
  re2 = '(?:[a-z][a-z]+)'    # A word we skip over ("href", "wiki", ...)
  re3 = '((?:[a-z][a-z]+))'  # The word we capture: the wiki name
  result = []
  for line in cleanHTML:
    rg = re.compile(re1+re2+re1+re2+re1+re3, re.IGNORECASE|re.DOTALL)
    A_name_temp = rg.search(line)
    # Skip lines that don't match instead of crashing on None
    if A_name_temp:
      result.append(A_name_temp.group(1))
  return result

def doubleNames(all_names):
  """
  Remove duplicate entries... Gerold for example ;-)
  """
  dic = {}
  for name in all_names:
    dic[name] = ''

  # Dictionary keys are unique, so duplicates collapse automatically
  names = dic.keys()
  names.sort()
  return names

def lengthList(names):
  """
  Get the number of ambassadors
  """
  length = len(names)
  return length

def main():
  # Countries in EMEA, only a few for testing (56 ambassadors in these 5 countries)
  countries = ['France', 'Italy', 'Switzerland', 'Germany', 'Austria']
  countryHTML = countryList(countries)
  cleanHTML = cleanList(countryHTML)
  all_names = get_names(cleanHTML)
  names = doubleNames(all_names)
  # If you want to know the names of the ambassadors in the selected region:
  #print names
  print lengthList(names)

if __name__ == "__main__":
  sys.exit(main())
Download: amb_count.py

If you are a Python pro, feel free to tell me what I can do better in this script. Any kind of suggestion is welcome. Maybe I will expand this script: I would like to have a log function (writing the data to a file with date and time), selection of the region through user input, and some other small things like error handling and a check whether the script got all data from the Fedora wiki. At the moment, with MoinMoin, it's a pain and it takes more than one run to get all the data (502 Proxy Error)... sooner or later everything will be fine with MediaWiki.
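Two of those wishes, the timestamped log and a retry around the flaky fetch, could look something like this in current Python (the function names, the log format, and the `opener` parameter are my own invention, not part of the script above):

```python
import time
import urllib.request
import urllib.error
from datetime import datetime

def fetch_with_retry(url, attempts=3, delay=2.0, opener=urllib.request.urlopen):
    """Try a flaky wiki URL a few times before giving up
    (the 502 Proxy Error case)."""
    last_error = None
    for _ in range(attempts):
        try:
            return opener(url).read()
        except urllib.error.URLError as err:
            last_error = err
            time.sleep(delay)
    raise last_error

def log_count(path, region, count):
    """Append one timestamped line per run, e.g. '2008-04-06 12:00:00 EMEA 56'."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as log:
        log.write("%s %s %d\n" % (stamp, region, count))
```

The `opener` parameter is only there so the fetch can be exercised without network access.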

1 comment:

Unknown said…

Don't have time to offer up too many suggestions, but if you don't compile your regex in the for loop each time it should run faster -- compile it outside the "for". The idea is to build the regular expression once and use it repeatedly for different data.

--Michael DeHaan
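Michael's suggestion could be sketched like this against the `get_names()` function above (same pattern, but compiled once at module level and reused for every line):

```python
import re

# Compiled once, outside the loop, and reused for every line.
AMB_NAME = re.compile('.*?(?:[a-z][a-z]+).*?(?:[a-z][a-z]+).*?((?:[a-z][a-z]+))',
                      re.IGNORECASE | re.DOTALL)

def get_names(cleanHTML):
    result = []
    for line in cleanHTML:
        match = AMB_NAME.search(line)
        if match:
            result.append(match.group(1))
    return result
```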