Scraping Rock Climbing Data

I love to rock climb. The website mountainproject.com has data on 160,316 routes and 51,371 users who record their climbing activity. I was curious about whether any patterns emerge, so I scraped all this data and then made some charts that help visualize the data.

Mountain project has information about just over 160,000 climbing routes, organized by climbing area. Each route has its own page, and is in many nested climbing areas. There is a page for each climbing area that either lists all the sub-areas within it (if it has sub-areas), or all the climbing routes for that area.

For example, the classic climb "High Exposure" is listed under New York > The Gunks > The Trapps > i. High E. Each route is given a unique route id that forms part of the url for that climb; for High Exposure, its route id is 105798994, and the corresponding url is https://www.mountainproject.com/v/105798994.

I wanted to get data about every climb on mountainproject.com, so I started by writing a python script that recursively searched through every climbing area, found each sub-area, and so on, and once it reached the bottom level areas recording the route ids for each climbing route.

To do this, I used the following modules:

requests, to load the web pages.
beautiful_soup, to easily search within those pages.
pickle, store and retrieve python lists.
time, to time our progress.

So, here goes.

import requests, bs4, pickle, time

I found that I could select the the links for sub-areas or routes by searching for a certain style class, so let's define variables for that and the base url:

base_url = 'https://www.mountainproject.com/v/'
classname = " background-color: #e5e5e5; ;  padding:6px 6px "

The following function, get_routes(), takes an area id and returns ids for all sub-areas or sub-routes for that area.

The route/area links are stored in a script, and didn't have DOM elements, so extracting them was a bit hacky:

def get_routes(area_id):
    # given an area_id, returns ids for all sub-areas or sub-routes.
    url = base_url + area_id
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text,'html5lib')
    t = soup.find("div", style = classname)
    script = str(t.find('script'))

    areas_i = script.index('Areas = [') + len('Areas = [')
    areas_j = script.index('];')
    routes_i = script.index('Routes = [') + len('Routes = [')
    routes_j = script.index('$') -3
    areas = script[areas_i:areas_j].strip('\n').split('\n')
    routes = script[routes_i:routes_j].strip('\n').split('\n')

    area_ids = []
    route_ids =[]
    for area in areas:
        area_ids.append(area.split(',')[0][1:])
    for route in routes:
        route_ids.append(route.split(',')[0][1:])

    area_ids = area_ids if not area_ids == [''] else []
    route_ids = route_ids if not route_ids == [''] else []

    return [area_ids, route_ids]

Let's test that. Remember that High Exposure is in the sub-area <a href= "https://www.mountainproject.com/v/107059022">i. High Exposure</a>, with area id 107059022. Let's try passing that to get_routes():

print(get_routes('107059022'))

[[], ['106562228', '106731336', '105803260', '105801433', '109700137', '105799563', '105799077', '107328856', '106232452', '107381972', '105828999', '105798994', '106792623', '109076690', '106598659', '110980901', '106579880', '113066452', '106576675', '106148500', '105888125', '106792907', '106010504', '105841801', '106733193', '109526250', '106077032', '105840288']]

We can see that the first element of the list is an empty list: there are no sub-areas for <a href= "https://www.mountainproject.com/v/107059022">i. High Exposure</a>. And we can confirm that the climb we've been looking at, High Exposure, with id 105798994, is in the list.

In fact, it turns out that mountain project areas only have routes if they don't have further sub-areas. We can use that to define a simple recursive function, fetch_children(), that is guaranteed to find every route in a given area:

def fetch_children(route_id):
    areas, routes = get_routes(route_id)
    if areas:
        child_areas = []
        for area in areas:
            child_areas.append([area,fetch_children(area)])
        return child_areas
    if routes:
        return routes
    else:
        return [None]

Great. Now we just need to get all the master climbing areas on the site. They are listed on a 'Destinations' page: https://www.mountainproject.com/destinations/. Again, using beautiful_soup makes getting all the area ids easy:

def find_master_areas():
    r = requests.get('https://www.mountainproject.com/destinations/')
    soup = bs4.BeautifulSoup(r.text,'html5lib')
    results = soup.find_all('tbody')[3]
    areas = results.find_all('span',{'class', 'destArea'})
    area_links = []
    for area in areas:
        area_links.append(area.find_all('a')[0].attrs['href'])
    area_links = [area[3:] for area in area_links]
    return area_links

areas = find_master_areas()

print (areas[:5]) #Let's look at the first 5 areas:
print (len(areas))

['alabama/105905173', 'alaska/105909311', 'arizona/105708962', 'arkansas/105901027', 'california/105708959']
48

Great. Now we just need to get the routes for each area. Wait! How long is this going to take? Let's test the recursive fetch_children() on a smaller area than New York State. The main cliff at my local climbing area, the Gunks, is called the Trapps (id = 105798818), and it has 453 climbs in it.

start = time.time()
trapps_routes = fetch_children('105798818')
end = time.time()
delta = end - start
print ('It took %s seconds to find 453 routes' %delta)

It took 4.654216289520264 seconds to find 453 routes

If it takes 4.6 seconds for 453 climbs, that's about 0.01 seconds per climb. If that scales up, we should expect it to take a little more than a half hour to get all 160,316 climbs.

In the next portion of the scraping task, I used the python module multiprocessing to speed things up. But I'm not sure how to implement that given the recursive nature of fetch_children(). So let's just commit to working on something else for a bit.

all_routes = []
for area in areas:
    all_routes.append(fetch_children(area))

with open('nested_route_ids.p', 'wb') as f:
    pickle.dump(all_routes)

To preserve the area-route structure fetch_children() returns a nested list: if area A has sub-areas B1 and B2, and B1 contains climbs C1 and C1, and B2 contains C3 and C4, then fetch_children(A) would return:

[ [B1, [C1,C2] ], [B2, [C3,C4] ] ]

But I also want a list of just the climbs (C1-C4 in the example above). So we need to flatten the list to get only the climbs; the recursive functions extract_routes() and flatten() do that.

def extract_routes(nest):
    if isinstance(nest,list):
        if len(nest)>1:
            if not isinstance(nest[0],list) and isinstance(nest[1],list):
                return [extract_routes(l) for l in nest[1]]
            else:
                return [extract_routes(l) for l in nest]
        else:
            return nest
    else:
        if nest:
            return nest

def flatten(nest):
    for i in nest:
        if isinstance(i, list):
            for j in flatten(i):
                return j
        else:
            return i

all_route_ids = flatten(extract_routes(all_routes))

with open('all_route_ids.p', 'wb') as f:
    pickle.dump(all_route_ids)

with open('all_route_ids.p', 'rb') as f: # Here's one I made earlier
    all_route_ids = pickle.load(f)

print(len(all_route_ids))

160316

Ok, so we have a list of 160316 route ids. We're going to use that to systematially retrieve information about each climb.

When a climber ascends a route, mountain project allows users to record that fact (the climber has 'ticked' the route). Each route page contains a link to a table listing all the recorded ticks. We're specifically interested in the mountain project user id of the climber and the date of the tick. (Later, we'll use the user ids to look up demographic information about the climbers: gender, age, and location.

Again, we'll use beautiful_soup to parse the route pages. fetch_ticks(), given a route id, returns a list of the form [route_id,[list of ticks]]

route_url = 'https://www.mountainproject.com/scripts/ShowObjectStats.php?id='

def fetch_ticks(route_id):
    url = route_url + route_id
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text,'html5lib')
    tables = soup.find_all('span', {'class':'dkorange bold'})

    if not tables:
        return [route_id,None]

    ticks_table = tables[-1].find_next('table')
    if not ticks_table.tbody:
        return [route_id,None]

    ticks = ticks_table.tbody.find_all('tr')
    tick_table = []

    for tick in ticks:
        tick_tds = tick.find_all('td')
        user, date = tick_tds[:2]
        user = user.a.attrs['href']
        date =  date.contents[0].replace(u'\xa0', u'')
        tick_table.append([user,date])

    return [route_id,tick_table]

Let's test fetch_ticks() on a random climb:

start = time.time()
ticks_for_route_5 = fetch_ticks(all_route_ids[5])
end = time.time()
print ('Fetching Ticks took %s seconds' %(end-start))
print (ticks_for_route_5)

Fetching Ticks took 0.17746806144714355 seconds
['106651430', [['/u/ashley-holmes//106651241', 'Apr 30, 2012'], ['/u/ashley-holmes//106651241', 'Mar 23, 2012'], ['/u/shaun-atkins//106904857', 'Oct 19, 2010'], ['/u/matthew-hollimon//106928891', 'Oct 19, 2010']]]

Above, we see fetch_ticks() at work: for the route "Cutting Teeth" in Alabama, it has been ticked four times.

Now, 0.17 seconds isn't all that long. But going through all 160k routes one by one at that rate would take 7.5 hours. I'm not that patient.

We can speed things up tremendously by using python's multiprocessing module.

from multiprocessing import Pool

## Without multiprocessing
t0 = time.time()
for route in all_route_ids[:100]:
    fetch_ticks(route)
t1 =  time.time()
print('Time without multiprocessing: %s' %(t1-t0))

## With multiprocessing
p = Pool(10)
route_ticks = p.map(fetch_ticks,all_route_ids[:100])
p.terminate()
p.join()
t2 = time.time()
print ('Time with multiprocessing: %s' %(t2-t1))

Time without multiprocessing: 18.75671100616455
Time with multiprocessing: 5.771172046661377

So it should be around three times faster using multiprocessing.

Just to be safe, we're also going break up the 160316 routes into small batches of 1000 routes each, and save our progress each time, and also allow a resume option in case something goes wrong.

If we just try to pass each route id to fetch_ticks(), we risk an exception crashing the whole shebang. So just in case, we'll use a wrapper function to try each route few times before giving up if we keep getting exceptions:

def fetch_ticks_wrap(route):
    count = 0
    while True:
        count +=1
        if count>4:
            return [route, None]
        try:
            ticks = fetch_ticks(route)
            return ticks
        except:
            time.sleep(2)
            pass
    return ticks

def batches(l, n, resume = 0):
    # yeild batches of size n from l
    for i in range(0, len(l), n):
        yield l[i+resume:i + n]

def get_batch_ticks(batch):
    p = Pool(10)
    route_ticks = p.map(fetch_ticks_wrap,batch)
    p.terminate()
    p.join()
    return route_ticks

route_ticks = []
for i, batch in enumerate(batches(all_route_ids[:1000], 1000)):
    print('batch %s' %i)
    batch_ticks = get_batch_ticks(batch)
    route_ticks.extend(batch_ticks)
    with open('route_ticks_.p', 'wb') as f:
        pickle.dump(route_ticks,f)

Now we have a list of ticks for each climb. In order to look up user info, we need to extract just the user ids:

def get_users(routes):
    users = []
    for route in routes:
        ticks = route[1]
        if ticks:
            for tick in ticks:
                user = tick[0]
                users.append(user)
    return list(set(users))       # we want just one instance of each id

users_with_ticks = get_users(route_ticks)

with open('users_w_ticks', 'rb') as f:
    user_ids = pickle.load(f)

print (len(user_ids))

51371

Ok, we have 51,371 users that have recorded climbs. Each user has a public page on the site that lists simple demographic information like gender, age, and location. Let's get that data!

def get_user_info(user):
    base_url = 'https://www.mountainproject.com'
    url = base_url + user
    try:
        r = requests.get(url)
    except:
        time.sleep(2)
        r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html5lib')
    info = soup.find('div', {'class':'personalData'})
    personal = info.find_all('div')[1].get_text()
    if personal:
        return personal

Let's test this on myself: my id is 108535796:

print (get_user_info('/u/108535796'))

Personal: Lives in New Paltz, 33 years old, Male

Again, for speed we'll use multiprocessing. This took about an hour:

p = Pool(10)
user_details = p.map(get_user_info, user_ids)
p.terminate()
p.join()

with open('user_details.p', 'wb') as f:
    pickle.dump(user_details,f)

Ok, we nearly have all the information we need. Mountain project actually has a limited data API; it couldn't help me find the user or route data I was curious about, but now that I have all the route ids, it can do precisely what I want.

Given up to 100 route ids, it will return a dictionary with interersting facts like the route difficulty grade and its location in lat/lon coordinates.

get_locations() takes a list of up to 100 routes and returns a list of dictionaries with all this information. To use the mountain project API, I needed to use the API key associated with my account. I've set that as an environmental variable:

import os, json
MP_API_KEY = os.environ['MP_API_KEY']

def get_locations(route_list):
    route_list = [str(r) for r in route_list]
    base = "https://www.mountainproject.com/data/get-routes?routeIds="
    end = "&key=" + MP_API_KEY
    routes = ','.join(route_list)
    url = base + routes + end
    r = requests.get(url)
    data = json.loads(r.text)['routes']
    return data

We'll break up the routes into chunks of 100 and pass those to get_locations(). But first, we needn't do that for all the routes -- only those that actually have recorded ticks:

# Just look at routes with ticks
routes_with_ticks = [route for route in route_ticks if route[1]]

# Make a big list 
batch_data = []
for route_batch in chunks(routes_with_ticks,100):
    batch_data.append(get_locations(route_batch))
route_data = {}        # Make a dictionary for easy look up
for batch in batch_data:
    for route in route_data:
        route_data[route['id']: route]

with open('route_data.p', 'wb') as f:
    pickle.dump(route_data,f)

Phew! Now we have all the information we need. Now just to analyze it. I'll put that in a separate post.