Visualizations Census 2013


Creation date: 2015-08-10

Tags: kaggle, Visualization

In this blog entry you will learn how to visualize datasets on a map using Python.

Unfortunately, I had a lot to do during the last month so this blog post comes a bit late. This entry is not about machine learning directly but I think it's sometimes interesting and fun to first visualize the dataset.

Kaggle has an new challenge about the US Census 2013, which gives several information about the US population. The goal is to visualize a part of the huge dataset.

Internet Access

There is some interesting data in New Mexico and Arizona, where less than 50% of the households have Internet access. About 700 out of 40,000 households answered that specific question in the northwestern part of New Mexico, which is quite accurate. More than 1,000 households answered that question in the yellow colored part of Arizona.

First of all we need the data which is stored inside this file:

ZipFile Census 2013

There are several columns which are explained here:

Data dictionary

The CSV file is quite huge and it will take a lot of time to always open the full CSV file. Therefore we will create an extra CSV file for the Internet access column. The script can be used to extract every other column as well.

getCSVByColumn.py 1/3

# import several libs
import io
import zipfile
import csv
import sys

# column numbers for state and puma code
pumaColNr = 3
stColNr   = 5

w = csv.writer(open("output-{0}.csv".format(sys.argv[1]), "w"))
w.writerow(["STATE", "PUMA", sys.argv[1]])

row_counts = [756065,720248]

First we need to import some libs and define the column numbers for the US State and the PUMA (Public Use Microdata Areas). A PUMA contains at least 100.000 people (more information). The complete dataset contains these columns:

['RT', 'SERIALNO', 'DIVISION', 'PUMA', 'REGION', 'ST', 'ADJHSG', 'ADJINC', 'WGTP', 'NP', 'TYPE', 'ACCESS', 'ACR', 'AGS', 'BATH', 'BDSP', 'BLD', 'BROADBND', 'BUS', 'COMPOTHX', 'CONP', 'DIALUP', 'DSL', 'ELEP', 'FIBEROP', 'FS', 'FULP', 'GASP', 'HANDHELD', 'HFL', 'INSP', 'LAPTOP', 'MHP', 'MODEM', 'MRGI', 'MRGP', 'MRGT', 'MRGX', 'OTHSVCEX', 'REFR', 'RMSP', 'RNTM', 'RNTP', 'RWAT', 'RWATPR', 'SATELLITE', 'SINK', 'SMP', 'STOV', 'TEL', 'TEN', 'TOIL', 'VACS', 'VALP', 'VEH', 'WATP', 'YBL', 'FES', 'FFINCP', 'FGRNTP', 'FHINCP', 'FINCP', 'FPARC', 'FSMOCP', 'GRNTP', 'GRPIP', 'HHL', 'HHT', 'HINCP', 'HUGCL', 'HUPAC', 'HUPAOC', 'HUPARC', 'KIT', 'LNGI', 'MULTG', 'MV', 'NOC', 'NPF', 'NPP', 'NR', 'NRC', 'OCPIP', 'PARTNER', 'PLM', 'PSF', 'R18', 'R60', 'R65', 'RESMODE', 'SMOCP', 'SMX', 'SRNT', 'SSMC', 'SVAL', 'TAXP', 'WIF', 'WKEXREL', 'WORKSTAT', 'FACCESSP', 'FACRP', 'FAGSP', 'FBATHP', 'FBDSP', 'FBLDP', 'FBROADBNDP', 'FBUSP', 'FCOMPOTHXP', 'FCONP', 'FDIALUPP', 'FDSLP', 'FELEP', 'FFIBEROPP', 'FFSP', 'FFULP', 'FGASP', 'FHANDHELDP', 'FHFLP', 'FINSP', 'FKITP', 'FLAPTOPP', 'FMHP', 'FMODEMP', 'FMRGIP', 'FMRGP', 'FMRGTP', 'FMRGXP', 'FMVP', 'FOTHSVCEXP', 'FPLMP', 'FREFRP', 'FRMSP', 'FRNTMP', 'FRNTP', 'FRWATP', 'FRWATPRP', 'FSATELLITEP', 'FSINKP', 'FSMP', 'FSMXHP', 'FSMXSP', 'FSTOVP', 'FTAXP', 'FTELP', 'FTENP', 'FTOILP', 'FVACSP', 'FVALP', 'FVEHP', 'FWATP', 'FYBLP', 'wgtp1', 'wgtp2', 'wgtp3', 'wgtp4', 'wgtp5', 'wgtp6', 'wgtp7', 'wgtp8', 'wgtp9', 'wgtp10', 'wgtp11', 'wgtp12', 'wgtp13', 'wgtp14', 'wgtp15', 'wgtp16', 'wgtp17', 'wgtp18', 'wgtp19', 'wgtp20', 'wgtp21', 'wgtp22', 'wgtp23', 'wgtp24', 'wgtp25', 'wgtp26', 'wgtp27', 'wgtp28', 'wgtp29', 'wgtp30', 'wgtp31', 'wgtp32', 'wgtp33', 'wgtp34', 'wgtp35', 'wgtp36', 'wgtp37', 'wgtp38', 'wgtp39', 'wgtp40', 'wgtp41', 'wgtp42', 'wgtp43', 'wgtp44', 'wgtp45', 'wgtp46', 'wgtp47', 'wgtp48', 'wgtp49', 'wgtp50', 'wgtp51', 'wgtp52', 'wgtp53', 'wgtp54', 'wgtp55', 'wgtp56', 'wgtp57', 'wgtp58', 'wgtp59', 'wgtp60', 'wgtp61', 'wgtp62', 'wgtp63', 'wgtp64', 'wgtp65', 'wgtp66', 'wgtp67', 'wgtp68', 'wgtp69', 'wgtp70', 'wgtp71', 'wgtp72', 'wgtp73', 'wgtp74', 'wgtp75', 'wgtp76', 'wgtp77', 'wgtp78', 'wgtp79', 'wgtp80']

We want to use our script to save only the columns ST,PUMA and a specified column, which will be the first command line argument, in our case the ACCESS column.

The dataset is divided in two pieces for the first states 1-25 (ss13husa.csv.zip) and the rest in ss13husb.csv.zip.

Info: The link contains the CSV files directly. I zipped them to save some disk space.

Now we need to know how many lines are in each dataset (without counting the header). It isn't easy to see the whole dataset inside LibreOffice so you can use this line: row_count = sum(1 for row in csvf) for each dataset. In the script I hard-coded the number of rows row_counts = [756065,720248].

getCSVByColumn.py 2/3

for fNr in range(2):
    if (fNr == 0):
        alpha = 'a'
    else:
        alpha = 'b'

    zf = zipfile.ZipFile('ss13hus{0}.csv.zip'.format(alpha))
    f = io.TextIOWrapper(zf.open("ss13hus{0}.csv".format(alpha), "rU"))
    csvf = csv.reader(f)
    header = csvf.next()

    colNr = header.index(sys.argv[1])

These lines will be used twice, because the dataset is splitted. We open the ZIP file and read the first line of the CSV (header). As mentioned above the new CSV file should contain the state, puma and the specified one. The colNr is the number of the specified column.

The last part of the script is inside the "for-loop" as shown above and writes the three columns.

getCSVByColumn.py 3/3

for i in range(row_counts[fNr]):
    row=csvf.next()
    puma=row[pumaColNr]
    state=row[stColNr]
    col=row[colNr]
    if (col == ''):
        continue

    col=int(col)
    w.writerow([state,puma,col])

puma is the puma code for a row, state the number of the state and col in our case the ACCESS column.

The state code is specified in the Data dictionary

ST         2
    State Code
           01 .Alabama/AL
           02 .Alaska/AK
           04 .Arizona/AZ
          ...

The ACCESS column has these values:

ACCESS     1
    Access to the Internet
           b .N/A (GQ)
           1 .Yes, with subscription to an Internet service
           2 .Yes, without a subscription to an Internet service
           3 .No Internet access at this house, apartment, or mobile home

We want to ignore the N/A answers which is achieved by

if (col == ''):
    continue

The entry is an integer now and we can use int(col). The last line w.writerow([state,puma,col]) writes the three columns into output-ACCESS.csv.

To start the script you can use python2 getCSVByColumn.py ACCESS.

Visualization

Now let's have a look at the visualization script basic.py. This time I'll not include the whole script inside the blog post, but as always you can download the scripts here: GitHub

We need some imports and the states with name and number code:

state_codes = {'01': 'Alabama',
               '02': 'Alaska',...

Again we will use the command line argument to specifiy the column which we want to visualize.

colArg = sys.argv[1]

csvf = csv.reader(open('output-{0}.csv'.format(colArg), 'rb'))
header = csvf.next()

The output file, which was generated in the first script will be used to write the data into a Python dictionary.

row_count = 1211264

data = {}
for i in range(row_count):
    row=csvf.next()
    state=row[0]
    puma=row[1]
    col=int(row[2])
    if (state not in data):
        data.update({state: {puma: np.array([col])}})
    elif (puma not in data[state]):
        data[state].update({puma: np.array([col])})
    else:
        data[state][puma] = np.append(data[state][puma],col)

The row count can be calculated again. The dictionary structure will be {stateNr: {pumaNr: np.array([values])}

Because it isn't possible to visualize a US map including Hawaii and Alaska in a geographically realistic way, we use the common practice to add them in the bottom left corner. Therefore we need a pyplot figure and three subplots.

fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111)
rect = [0.08,0.05,0.35,0.35]
axAlaska = add_subplot_axes(ax,rect)
rect = [0.3,0.02,0.2,0.2]
axHawaii = add_subplot_axes(ax,rect)

fig.suptitle('Census 2013: Internet access', fontsize=20)

For Alaska and Hawaii I am using the add_subplot_axes function by Pablo which he published on StackOverflow.

# part of http://stackoverflow.com/a/17479417/2501747
def add_subplot_axes(ax,rect):
    fig = plt.gcf()
    box = ax.get_position()
    width = box.width
    height = box.height
    inax_position  = ax.transAxes.transform(rect[0:2])
    transFigure = fig.transFigure.inverted()
    infig_position = transFigure.transform(inax_position)
    x = infig_position[0]
    y = infig_position[1]
    width *= rect[2]
    height *= rect[3]
    # we don't need a frame
    subax = fig.add_axes([x,y,width,height],frameon=False)
    return subax

We need three different Basemaps where we need to specify the axes ax we used before.

# create a map object with the Albers Equal Areas projection.
# This projection tends to look nice for the contiguous us.
mNormal = Basemap(width=5000000,height=3500000,
            resolution='l',projection='aea',\
            ax=ax, \
            lon_0=-96,lat_0=38)

mAlaska = Basemap(width=5000000,height=3500000,
            resolution='l',projection='aea',\
            ax=axAlaska, \
            lon_0=-155,lat_0=65)

mHawaii = Basemap(width=1000000,height=700000,
            resolution='l',projection='aea',\
            ax=axHawaii, \
            lon_0=-157,lat_0=20)

To generate the several different colors for the percentage values, I used the get_cmap function, which is a good choice for this.

num_colors = 21
cm = plt.get_cmap('RdYlGn')
colorGradient = [cm(1.*i/num_colors) for i in range(num_colors)]

The following part will fill the shapefiles with the correct color:

# read each states shapefile
for key in state_codes.keys():
    print(state_codes[key])
    if (state_codes[key] == "Alaska"):
        mAlaska.readshapefile('shapefiles/pums/tl_2013_{0}_puma10'.format(key),name='state', drawbounds=True)
        m = mAlaska
    elif (state_codes[key] == "Hawaii"):
        mHawaii.readshapefile('shapefiles/pums/tl_2013_{0}_puma10'.format(key),name='state', drawbounds=True)
        m = mHawaii
    else:
        mNormal.readshapefile('shapefiles/pums/tl_2013_{0}_puma10'.format(key),name='state', drawbounds=True)
        m = mNormal

    # loop through each PUMA and assign the correct color to its shape
    for info, shape in zip(m.state_info, m.state):
        dataForStPuma = data[key][info['PUMACE10']]

        # get the percentage of households with Internet access
        access = (dataForStPuma == 3)
        accessPerc = 1-(sum(access)/(1.0*len(dataForStPuma)))
        colorInd = int(round(accessPerc*num_colors))

        patches = [Polygon(np.array(shape), True)]
        pc = PatchCollection(patches, edgecolor='k', linewidths=1., zorder=2)
        pc.set_color(colorGradient[colorInd])
        if (state_codes[key] == "Alaska"):
            axAlaska.add_collection(pc)
        elif (state_codes[key] == "Hawaii"):
            axHawaii.add_collection(pc)
        else:
            ax.add_collection(pc)

At first we need to use our different Basemaps to read the shapefiles.

Info: The shapefiles can be downloaded here: PUMA shapes

Each state has several PUMA and the shapefile contains the PUMA code inside the info array with the key PUMACE10 so we can get our data for the specified state and PUMA using: dataForStPuma = data[key][info['PUMACE10']].

As described above the ACCESS column has the following values:

ACCESS     1
    Access to the Internet
           b .N/A (GQ)
           1 .Yes, with subscription to an Internet service
           2 .Yes, without a subscription to an Internet service
           3 .No Internet access at this house, apartment, or mobile home

and we want to visualize the number of households with Internet access, therefore we will subtract the percentage of households without Internet from 100%.

# get the percentage of households with Internet access
woAccess = (dataForStPuma == 3)
accessPerc = 1-(sum(woAccess)/(1.0*len(dataForStPuma)))
colorInd = int(round(accessPerc*num_colors))

Inside the last part we assign the color with the color index colorInd to the shape:

patches = [Polygon(np.array(shape), True)]
pc = PatchCollection(patches, edgecolor='k', linewidths=1., zorder=2)
pc.set_color(colorGradient[colorInd])
if (state_codes[key] == "Alaska"):
    axAlaska.add_collection(pc)
elif (state_codes[key] == "Hawaii"):
    axHawaii.add_collection(pc)
else:
    ax.add_collection(pc)

At the moment the visualization will look like this: Internet Access without legend

The last step is to add the legend:

# add colorbar legend
cmap = mpl.colors.ListedColormap(colorGradient)
# define the bounds
bounds = np.linspace(0,100,num_colors)

# create a second axes for the colorbar at the right of the figure
ax2 = fig.add_axes([0.82, 0.1, 0.03, 0.8])
cb = mpl.colorbar.ColorbarBase(ax2, cmap=cmap, ticks=bounds, boundaries=bounds, format='%1i')
cb.ax.set_yticklabels([str(int(i))+"%" for i in bounds])# vertically oriented colorbar

... and to save the map.

plt.savefig('map-{0}.png'.format(colArg))

Some more visualizations:

Some upvotes to win that challenge would be awesome ;) My scripts @kaggle

Would be great to see your visualizations as well!



Want to be updated? Consider subscribing and receiving a mail whenever a new post comes out.

Powered by Buttondown.


Subscribe to RSS