<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Scripted Company Blog &#124; The Scripted Company Blog</title>
	<atom:link href="http://blog.scripted.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.scripted.com</link>
	<description></description>
	<lastBuildDate>Tue, 21 May 2013 17:22:33 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Hacking for Sales, Part 5</title>
		<link>http://blog.scripted.com/staff/hacking-for-sales-part-5/</link>
		<comments>http://blog.scripted.com/staff/hacking-for-sales-part-5/#comments</comments>
		<pubDate>Tue, 21 May 2013 17:19:36 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Staff]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=2123</guid>
		<description><![CDATA[Until now, we&#8217;ve played mostly with APIs and page scrapes. In this lesson, we dive a little bit deeper into the latter in order to solve a simple sales problem: knowing when your customer changes jobs. I love JobChangeNotifier. It’s a simple web application that sends me an email each &#8230;]]></description>
				<content:encoded><![CDATA[<p>Until now, we&#8217;ve played mostly with APIs and page scrapes. In this lesson, we dive a little bit deeper into the latter in order to solve a simple sales problem: knowing when your customer changes jobs.</p>
<p>I love <a href="http://jobchangenotifier.com/">JobChangeNotifier</a>. It’s a simple web application that sends me an email each week showing which of my LinkedIn connections have new jobs, and where they are now. It’s fun to watch my friends&#8217; careers evolve.</p>
<p>But more importantly, when a customer changes jobs, I suddenly have a sales opportunity.  If they liked Scripted at their old company, they’ll probably promote us at their new one. We have a dozen new deals this year alone from customer job changes.</p>
<p>However, JobChangeNotifier can’t help me if I’m not connected to my customer on LinkedIn. Some friends and I decided to hack a fix.</p>
<h3>Step 1. Start with a CSV.</h3>
<p>Every legit CRM will have an export tool. I’m going to start with a CSV rather than explain how to use the Salesforce API to pull contacts. If you’ve read my previous posts on using APIs, you should be able to figure that out. Getting your contacts into a spreadsheet is a trivial step.</p>
<p>All we need is first name, last name, and company name in this three-column CSV. It should look something like this:</p>
<table>
<tbody>
<tr>
<td>Jen</td>
<td>Brian</td>
<td>Band Digitial</td>
</tr>
<tr>
<td>David</td>
<td>Skinner</td>
<td>Band Digitial</td>
</tr>
<tr>
<td>Russel</td>
<td>Evans</td>
<td>Experian</td>
</tr>
<tr>
<td>Tim</td>
<td>Titus</td>
<td>Experian</td>
</tr>
</tbody>
</table>
<p>Easy enough, right?</p>
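<p>Here&#8217;s a minimal sketch of reading that three-column file, written for Python 3 (the script later in this post is Python 2). The read_contacts helper and the in-memory sample are assumptions made for illustration; in practice you would open your own exported CSV instead.</p>

```python
import csv
import io

# Stand-in for the exported file; in practice you would open your own CSV path.
sample = io.StringIO(
    "Jen,Brian,Band Digitial\n"
    "David,Skinner,Band Digitial\n"
    "Russel,Evans,Experian\n"
    "Tim,Titus,Experian\n"
)

def read_contacts(handle):
    """Return a list of (first, last, company) tuples from a three-column CSV."""
    contacts = []
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        first, last, company = (cell.strip() for cell in row)
        contacts.append((first, last, company))
    return contacts

contacts = read_contacts(sample)
for first, last, company in contacts:
    print(first, last, company)
```

<p>Skipping rows that don&#8217;t have exactly three cells is a design choice for the sketch; you might prefer to log them instead.</p>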
<h3>Step 2. Get the LinkedIn profile.</h3>
<p>There are at least two ways to get your contact’s LinkedIn profile. The first and most direct (and perhaps most terms-compliant) approach is to use LinkedIn’s People Search API. It requires complex authentication, and even when you’re authenticated, building the right query is not easy.</p>
<p>I decided to skip the API approach in favor of a less direct but still very effective one: the Internet search.</p>
<p>The code that follows is admittedly on the sketchier side of hacky. It uses a Python library to <a href="http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/">emulate a browser</a>, so search engines treat your page loads as if you were clicking through them with a mouse. If you’re not comfortable with this approach, you can use urllib or urllib2 instead, which do nothing to disguise that a Python script is making the requests. But for those who want to see this hack in action, here you go.</p><pre class="crayon-plain-tag">from bs4 import BeautifulSoup
import urllib2, re, csv, sys, urllib, json, os, time, random
import mechanize
import cookielib
#http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
# Browser
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh &gt; 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17')]</pre><p>The Python library I use here is called Mechanize, and once the options are set, it’s very easy to use. I create a browser instance called br with this line:</p><pre class="crayon-plain-tag">br = mechanize.Browser()</pre><p>And you’ll see it’s used later in the script to load Bing queries and open public LinkedIn profile pages.</p>
<p>Now, let’s build a Bing query to find the LinkedIn profile.<a href="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-9.38.54-PM.png"><img class="alignright size-medium wp-image-2126" alt="Screen Shot 2013-05-20 at 9.38.54 PM" src="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-9.38.54-PM-300x122.png" width="300" height="122" /></a></p><pre class="crayon-plain-tag">burl = 'http://www.bing.com/search?q=site:linkedin.com %s %s %s' % (first, last, company)</pre><p>I found that querying for “site:linkedin.com ryan buckley scripted” yields the target LinkedIn profile as the top search result about 80% of the time. That’s good enough for me, and it makes it easy to parse the results page and grab that first result.</p>
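<p>As a side note on building that URL, here&#8217;s a hedged sketch that lets the standard library do the escaping. It is written for Python 3, where urlencode lives in urllib.parse (in Python 2 the equivalent is urllib.urlencode). The build_search_url name is made up for this example.</p>

```python
from urllib.parse import urlencode

def build_search_url(first, last, company):
    """Build a Bing search URL restricted to linkedin.com for one contact."""
    query = "site:linkedin.com %s %s %s" % (first, last, company)
    # urlencode escapes spaces, colons, and any other unsafe characters
    return "http://www.bing.com/search?" + urlencode({"q": query})

url = build_search_url("ryan", "buckley", "scripted")
print(url)
```

<p>The payoff is that company names containing characters like an ampersand or a plus sign are escaped too, not just spaces.</p>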
<h3>Step 3. Parse the Bing results page.</h3>
<p>Here&#8217;s some code you should understand from my previous posts.</p><pre class="crayon-plain-tag">r = br.open(burl.replace(' ', '%20'))
html = r.read()
soup = BeautifulSoup(html)
result = soup.find('div', { "class" : "sb_tlst" } )
if not hasattr(result, 'a'):
	print "No link?", result
	continue
if not result.a.has_key('href'):
	print "No href?", result
	continue
link = result.a['href']</pre><p>The first line replaces all spaces with their URL-encoded form, &#8220;%20&#8221;. This helps mechanize open the page. With the page loaded, I read it into a variable called html, which I pass to BeautifulSoup for parsing. Again, all of this should be familiar (well, except the %20 thing, but you&#8217;d have discovered the need for it if you ran the code without it).</p>
<p><a href="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-9.39.10-PM.png"><img class="alignright size-medium wp-image-2125" alt="Screen Shot 2013-05-20 at 9.39.10 PM" src="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-9.39.10-PM-300x45.png" width="300" height="45" /></a>The &#8220;soup.find&#8221; command looks for the first div with class="sb_tlst". If I had used find_all, it would have returned a list of every match, but find returns only the first one.</p>
<p>The next lines are some error testing. I need to make sure of two things:</p>
<p>1. The result object has an &#8220;a&#8221; attribute (meaning it found the link within this div), and<br />
2. The anchor tag has an &#8220;href&#8221; attribute (this has the LinkedIn URL that I&#8217;m looking for).</p>
<p>If these tests pass, then I store the link into a variable. And I&#8217;m off to the next step.</p>
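<p>If you want to try the same idea without installing BeautifulSoup, here&#8217;s a rough stand-in built on Python 3&#8217;s standard html.parser module. To be clear, this is a swapped-in technique and not what the script in this post does. The FirstResultLink class, the first_link helper, and the sample HTML are all made up for illustration, and the sketch returns None when no link is found, which plays the role of the two checks above.</p>

```python
from html.parser import HTMLParser

class FirstResultLink(HTMLParser):
    """Capture the href of the first anchor inside a div with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # how many divs deep we are inside a matching div
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth or self.css_class in classes:
                self.depth += 1
        elif tag == "a" and self.depth and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

def first_link(html, css_class="sb_tlst"):
    """Return the first link found inside a matching div, or None."""
    parser = FirstResultLink(css_class)
    parser.feed(html)
    return parser.href

# Made-up results page with two hits
sample_html = """
<div class="sb_tlst"><h3><a href="http://www.linkedin.com/in/example">Example</a></h3></div>
<div class="sb_tlst"><h3><a href="http://www.linkedin.com/in/other">Other</a></h3></div>
"""

link = first_link(sample_html)
if link is None:
    print("No link?")
else:
    print(link)
```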
<h3>Step 4. Parse the LinkedIn profile.</h3>
<p>Alright! We&#8217;re almost there, but LinkedIn makes this last part tricky.</p>
<p>There are two types of public profiles, shown here:</p>
<div id="attachment_2130" class="wp-caption alignnone" style="width: 310px"><a href="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-10.40.04-PM.png"><img class="size-medium wp-image-2130" alt="Does not have a headline" src="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-10.40.04-PM-300x269.png" width="300" height="269" /></a><p class="wp-caption-text">Does not have a headline</p></div>
<div id="attachment_2129" class="wp-caption alignnone" style="width: 310px"><a href="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-10.40.21-PM.png"><img class="size-medium wp-image-2129" alt="Has a headline" src="http://blog.scripted.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-20-at-10.40.21-PM-300x285.png" width="300" height="285" /></a><p class="wp-caption-text">Has a headline</p></div>
<p>One has a headline and the other doesn&#8217;t. We have to account for this difference in our code:</p><pre class="crayon-plain-tag">r = br.open(link)
html = r.read()
soup = BeautifulSoup(html)
result = soup.find('p', { "class" : "headline-title title" }) if soup.find('p', { "class" : "headline-title title" }) else soup.find('ul', { "class" : "current" })
if not result:
	continue
if result.string:
	if ' at ' in result.string:
		# rsplit with a limit of 1 always yields two parts, even if ' at ' appears twice
		new_title, new_company = result.string.strip().rsplit(' at ', 1)
elif result.li:
	if ' at ' in result.find("li").get_text():
		new_title, new_company = result.find("li").get_text().strip().rsplit(' at ', 1)
if not new_company:
	continue</pre><p>This little bit of code opens the LinkedIn profile URL we just saved from Bing, finds the profile headline using BeautifulSoup, and then parses the headline based on the word &#8220;at&#8221; into the title and company and stores them into two new variables. With some error handling, all that is just 15 lines of code. Python is amazing!</p>
<h3>Step 5. Print changes to screen and/or write the CSV.</h3>
<p>To recap, we read the contents of a three-column CSV including first name, last name, and company name. We used Bing to get the URL for the public LinkedIn profile. We loaded that page, parsed out the current job info and stored it.</p>
<p>Now, we compare the LinkedIn data to our spreadsheet and print people whose company names don&#8217;t match.</p><pre class="crayon-plain-tag">if company not in new_company.strip():
	ins['First'], ins['Last'], ins['Company'], ins['New Title'], ins['New Company'], ins['Link'] = first, last, company, new_title.strip(), new_company.strip(), link
	f = open(outpath,'a')
	dw = csv.DictWriter(f, fieldnames=fields)
	dw.writerow(ins)
	f.close()
	print "%s - %s %s: was at %s, now at %s (%s)" % (i, ins['First'], ins['Last'], ins['Company'], ins['New Company'], ins['Link'])</pre><p>You won&#8217;t be able to run this snippet on its own, so let&#8217;s just read it. The dictionary &#8220;ins&#8221; is not defined in it (I did that earlier in the script but haven&#8217;t shown it yet in this post). Before I can insert data into the dictionary, I have to define it (which also empties it at the top of each loop iteration) like this:</p><pre class="crayon-plain-tag">ins = {}</pre><p>The lines that follow the creation of the dictionary are the standard process you&#8217;ll copy every time you need to write a CSV. I&#8217;m a huge fan of the DictWriter now. The code is cleaner, it makes sense, and it&#8217;s quick. The print command at the end shows me the data I&#8217;m writing to the CSV.</p>
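<p>To see the DictWriter pattern in isolation, here&#8217;s a runnable Python 3 sketch. It writes to an in-memory buffer instead of a file, and the contact values are placeholders invented for the example. The writeheader call is an optional extra that the script in this post does not use.</p>

```python
import csv
import io

fields = ['First', 'Last', 'Company', 'New Title', 'New Company', 'Link']

# io.StringIO stands in for the output file so the sketch leaves nothing on disk
out = io.StringIO()
dw = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
dw.writeheader()  # optional: puts the column names on the first line

ins = {}  # start with an empty dictionary for each record
ins['First'], ins['Last'], ins['Company'] = 'Pat', 'Example', 'Old Co'  # placeholders
ins['New Title'], ins['New Company'] = 'Editor', 'New Co'               # placeholders
ins['Link'] = 'http://www.linkedin.com/in/example'                      # placeholder URL
dw.writerow(ins)

print(out.getvalue())
```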
<p>Here&#8217;s the full script in all of its glory. Note that you should set inpath and outpath to match your file structure, and you&#8217;ll need that three-column CSV to start with. For your convenience, I&#8217;m <a href="https://docs.google.com/file/d/0B22bAVquiy3KVmFmanB5SXQxRXc/edit?usp=sharing">sharing one here</a>.</p><pre class="crayon-plain-tag">from bs4 import BeautifulSoup
import urllib2, re, csv, sys, urllib, json, os, time, random
import mechanize
import cookielib
#http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
# Browser
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh &gt; 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17')]

# Assumes the list is in ~/data/starting/linkedin_test.csv
inpath = os.getenv("HOME")+"/data/starting/linkedin_test.csv"
outpath = os.getenv("HOME")+"/data/finished/linkedin_test.csv"
# Fields for the output CSV
fields = ['First', 'Last', 'Company', 'New Title', 'New Company', 'Link']

cos = csv.reader(open(inpath, 'rU'))
i = 0
for co in cos:
	try: 
		i += 1
		# Skips the first five rows; adjust or remove this check for your own CSV
		if i &lt; 6:
			continue
		ins, new_title, new_company = {}, '', ''
		first, last, company = co[0].strip(), co[1].strip(), co[2].strip()
		burl = 'http://www.bing.com/search?q=%s %s %s site:linkedin.com' % (first, last, company)
		r = br.open(burl.replace(' ', '%20'))
		html = r.read()
		soup = BeautifulSoup(html)
		result = soup.find('div', { "class" : "sb_tlst" } )
		if not hasattr(result, 'a'):
			print "No link?", result
			continue
		if not result.a.has_key('href'):
			print "No href?", result
			continue
		link = result.a['href']
		r = br.open(link)
		html = r.read()
		soup = BeautifulSoup(html)
		result = soup.find('p', { "class" : "headline-title title" }) if soup.find('p', { "class" : "headline-title title" }) else soup.find('ul', { "class" : "current" })
		if not result:
			continue
		if result.string:
			if ' at ' in result.string:
				# rsplit with a limit of 1 always yields two parts, even if ' at ' appears twice
				new_title, new_company = result.string.strip().rsplit(' at ', 1)
		elif result.li:
			if ' at ' in result.find("li").get_text():
				new_title, new_company = result.find("li").get_text().strip().rsplit(' at ', 1)
		if not new_company:
			continue
		if company not in new_company.strip():
			ins['First'], ins['Last'], ins['Company'], ins['New Title'], ins['New Company'], ins['Link'] = first, last, company, new_title.strip(), new_company.strip(), link
			f = open(outpath,'a')
			dw = csv.DictWriter(f, fieldnames=fields)
			dw.writerow(ins)
			f.close()
			print "%s - %s %s: was at %s, now at %s (%s)" % (i, ins['First'], ins['Last'], ins['Company'], ins['New Company'], ins['Link'])
	except Exception as detail:
		print "Failed on record %s with error %s" % (i, detail)</pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/hacking-for-sales-part-5/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hacking for Sales, Part 4</title>
		<link>http://blog.scripted.com/staff/hacking-for-sales-part-4/</link>
		<comments>http://blog.scripted.com/staff/hacking-for-sales-part-4/#comments</comments>
		<pubDate>Mon, 04 Feb 2013 04:53:19 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Staff]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=2059</guid>
		<description><![CDATA[Recently, I got curious about which other companies received funding, and when the funding event happened. There&#8217;s no better resource for this data than Crunchbase, and it&#8217;s another great little Python exercise to get that data out and into a spreadsheet. First, let&#8217;s start with the imports and data sources. &#8230;]]></description>
				<content:encoded><![CDATA[<p>Recently, I got curious about which other companies received funding, and when the funding event happened. There&#8217;s no better resource for this data than Crunchbase, and it&#8217;s another great little Python exercise to get that data out and into a spreadsheet.</p>
<p><strong>First, let&#8217;s start with the imports and data sources.</strong></p><pre class="crayon-plain-tag">import json, urllib, os, csv
inpath = os.getenv("HOME")+"/data/starting/permalinks.csv"
outpath = os.getenv("HOME")+"/data/finished/funding_data.csv"</pre><p>I like to define the source of data and its resting place upfront. If you keep following these posts, you&#8217;ll see that I always call them inpath and outpath. I usually run my scripts on my Mac, but if it&#8217;s a big file I&#8217;ll sometimes shoot it up to my server and let it run there. The os library lets me build paths from the home directory instead of hard-coding absolute folders, so paths work the same on my Mac as they do on the server.</p>
<p>To make this clearer, my home directory on my MacBook is /Users/rbucks. I keep all of my scripts in /Users/rbucks/scripts/, and my data is all in /Users/rbucks/data/starting/ and /Users/rbucks/data/finished/.</p>
<p>I keep this same file structure on an Ubuntu server, but of course the home directory is different. On Ubuntu, it&#8217;s /home/rbucks/. In order to save myself time, and make the same script run locally and on my server, I like to stick os.getenv("HOME") in all of my script file paths.</p>
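<p>For comparison, here&#8217;s a hedged Python 3 sketch of the same idea using os.path.expanduser and os.path.join. This is an alternative to the string concatenation used in this post, not a description of it; the folder and file names are the ones from the script.</p>

```python
import os

# os.path.expanduser("~") resolves to the current user's home directory
home = os.path.expanduser("~")

# os.path.join inserts the right separator between each piece
inpath = os.path.join(home, "data", "starting", "permalinks.csv")
outpath = os.path.join(home, "data", "finished", "funding_data.csv")

print(inpath)
print(outpath)
```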
<p><strong>Next, let&#8217;s make a function to write a csv.</strong></p><pre class="crayon-plain-tag">def write(sales_data):
	f = open(outpath,'a')
	fields = ['permalink', 'total_money_raised', 'dates_raised']
	dw = csv.DictWriter(f, fieldnames=fields)
	dw.writerow(sales_data)
	f.close()</pre><p>We covered functions in an earlier post. This one takes a single dictionary and appends it as one row to a csv. All you need to do is tell it which keys to look for (that&#8217;s the fields list).</p>
<p>There&#8217;s another function called <a href="http://docs.python.org/2/library/csv.html">csv.writer</a> but I&#8217;m sold now on dictionary writing. Without the keys, you need to offset empty fields when you create the list to write the csv, and the code is much messier. But csv.writer works well enough, and it was my first csv writing tool.</p>
<p>But don&#8217;t bother with it. DictWriter is way cooler.</p>
<p><strong>Finally, loop and write.</strong></p>
<p>Here&#8217;s the code for the rest of it.</p><pre class="crayon-plain-tag">key = "get your key from crunchbase"
cos = csv.reader(open(inpath, 'rU'))
for co in cos:
	try:
		sales_data = {}
		sales_data['permalink'] = co[0]
		qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?api_key=%s' % (co[0], key)
		qry_response = urllib.urlopen(qry_url).read()
		qry_result = json.loads(qry_response)
		if qry_result.has_key('error'):
			continue
		if qry_result.has_key('total_money_raised'):
			sales_data['total_money_raised'] = qry_result['total_money_raised']
		if qry_result.has_key('funding_rounds'):
			dates = []
			for r in qry_result['funding_rounds']:
				funded_date = "%s/%s/%s" % (r['funded_month'], r['funded_day'], r['funded_year'])
				dates.append(funded_date)
			sales_data['dates_raised'] = ", ".join(dates)
		write(sales_data)
		print "Ran "+co[0]
	except Exception as detail:
		print "Gah", detail
		print qry_url
		continue</pre><p>I&#8217;m going to breeze through this, and I&#8217;m not trying to be lazy. If you followed the first three of these, the Python code here should make sense, and if it doesn&#8217;t, you should know how to use the Python console to make it make sense.</p>
<p>I load a list of permalinks into a variable called cos via the csv.reader function that we imported in the first step. This csv is little more than a single field with a bunch of permalinks I pulled in Step 3, like this:</p>
<p>55<br />
158<br />
lover-ly<br />
netconstructor-com<br />
212-media<br />
sales-marketing<br />
netconstructor<br />
1800pharmacy<br />
1-800-therapist-llc<br />
10-20-media<br />
1000jobboersen-de<br />
10th-degree<br />
10th-degree<br />
128b<br />
140-proof<br />
140fire</p>
<p>If you want to play along, take that list and save it into a file called &#8220;permalinks.csv&#8221; in your home&#8217;s data/starting folder. The csv.reader function loads this into an iterable csv object that I&#8217;m calling cos. At each iteration of the for loop, it&#8217;s pulling a row into a variable I&#8217;m calling &#8220;co,&#8221; which is a list of row values. To get the first value, I use co[0] (also covered earlier).</p>
<p>Now watch this closely. sales_data is a dictionary, defined by the empty curly brackets. Dictionaries take keys and values, unlike lists which just take values. So in the third line of the loop, I&#8217;m adding a key-value pair for &#8216;permalink&#8217; in my dictionary. I&#8217;m defining permalink as that row&#8217;s value in the csv. I want to record this permalink in my new csv.</p>
<p>Also note that all of the fields in my write function get defined in this loop. So I create a key-value pair also for &#8216;total_money_raised&#8217; and &#8216;dates_raised.&#8217; This is very important! The write function will break if it sees a key in sales_data that&#8217;s not defined in the fields list.</p>
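<p>You can watch that failure happen in a few lines. Here&#8217;s a Python 3 sketch; the &#8216;unexpected&#8217; key is invented to trigger the error, and the in-memory buffers stand in for the output file.</p>

```python
import csv
import io

fields = ['permalink', 'total_money_raised', 'dates_raised']
dw = csv.DictWriter(io.StringIO(), fieldnames=fields)

# A row with a key that is not in fields raises ValueError
try:
    dw.writerow({'permalink': 'scripted-com', 'unexpected': 'x'})
    raised = False
except ValueError as detail:
    raised = True
    print("Rejected:", detail)

# A row that is merely missing some fields is fine; the gaps are written as blanks
out = io.StringIO()
dw = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
dw.writerow({'permalink': 'scripted-com'})
print(repr(out.getvalue()))
```

<p>So extra keys are the problem, not missing ones. If you would rather drop unknown keys silently, DictWriter accepts extrasaction='ignore'.</p>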
<p>The rest of this is the same as last time. To learn how this works, take it line by line in the Python console. Go ahead and start with &#8220;scripted-com&#8221; as your permalink, and play with the JSON response.</p><pre class="crayon-plain-tag">qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?api_key=%s' % ('scripted-com', key)
qry_response = urllib.urlopen(qry_url).read()
qry_result = json.loads(qry_response)</pre><p>(Note: you need to have key defined with your own Crunchbase API key.)</p>
<p>Type qry_result in the Python console and look at the data. There&#8217;s a lot. Try iterating through the first level like this:</p><pre class="crayon-plain-tag">for r in qry_result:
	print r</pre><p>That loop prints the top-level keys of the response. Then try looking up individual values: type qry_result['total_money_raised'], or qry_result['funding_rounds']. You should start to see how this script works, and how awesome JSON data is for salespeople to work with.</p>
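<p>If you don&#8217;t have an API key handy, here&#8217;s a Python 3 sketch you can run offline. The JSON below is made up; it only borrows the key names the script reads and is not real Crunchbase data. The null day is an assumption included to show what a missing value would look like.</p>

```python
import json

# Made-up response with the same keys the script reads; not real Crunchbase data
qry_response = '''
{"permalink": "example-co",
 "total_money_raised": "$1M",
 "funding_rounds": [
   {"funded_year": 2012, "funded_month": 3, "funded_day": 15},
   {"funded_year": 2013, "funded_month": 1, "funded_day": null}
 ]}
'''
qry_result = json.loads(qry_response)

for key in qry_result:  # iterating a dictionary yields its keys
    print(key)

dates = []
for r in qry_result.get('funding_rounds', []):
    funded_date = "%s/%s/%s" % (r['funded_month'], r['funded_day'], r['funded_year'])
    dates.append(funded_date)
dates_raised = ", ".join(dates)
print(dates_raised)
```

<p>Notice that a null in the JSON becomes None in Python, so it ends up in the date string as the word &#8220;None&#8221; unless you handle it.</p>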
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/hacking-for-sales-part-4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NLP Hacking in Python</title>
		<link>http://blog.scripted.com/staff/computerz/</link>
		<comments>http://blog.scripted.com/staff/computerz/#comments</comments>
		<pubDate>Fri, 25 Jan 2013 01:36:18 +0000</pubDate>
		<dc:creator>Murad</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Staff]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1983</guid>
		<description><![CDATA[Scripted recently released a new feature called Teams, which allows us to efficiently and confidently select expert writers in a given subject to be on a team, the idea being that a business looking for experts in that field can easily find writers who are highly qualified (as both a writer and &#8230;]]></description>
				<content:encoded><![CDATA[
<p>Scripted recently released a new feature called <b>Teams</b>, which allows us to efficiently and confidently select expert writers in a given subject to be on a team, the idea being that a business looking for experts in that field can easily find writers who are highly qualified (as both writers and domain experts) to write about it. Part of what determines whether a writer is a good fit for a team is knowing how many pieces they&#8217;ve written about that team&#8217;s subject matter. This entailed an interesting machine learning problem&#8230;</p>
<h2>The Problem</h2>
<blockquote><p><em>How can I get a computer to tell me what an article is about (provided methods such as bribery and asking politely do not work)?</em></p></blockquote>
<p>To formally explain the problem as well as the proposed solution to it, I&#8217;m going to stay fairly high-level and use a toy example, with links to resources for further reading and a disclaimer that in reality you would need a dataset consisting of more than four samples to actually make any of this work (<em>like a lot more&#8230;</em>).</p>
<p>For the sake of this example, imagine we are engineers at a much tinier version of Scripted, with only four pieces of writing in our system, the titles of which are:</p>
<p style="text-align: center;"><em style="font-size: 15px; text-align: center;">&#8220;The Perfect Panini&#8221;</em><br />
<em style="font-size: 15px; text-align: center;">&#8220;Sandwich Sorcery&#8221;</em><br />
<em style="font-size: 15px; text-align: center;">&#8220;Boston Terriers: Friend or Foe?&#8221;</em><br />
<em style="font-size: 15px; text-align: center;">&#8220;How to Tell if your Dog is a Dog and not a Cat&#8221;</em></p>
<p><em>Note: These four documents, in the language of natural language processing, are known as a &#8216;corpus&#8217;. You can think of this as a collection of written documents that we want a computer to learn from. </em></p>
<p>As imaginary employees of tiny Scripted, we&#8217;ve noticed that many clients are looking for writers who are experts in the fields of dogs and sandwiches to produce content for them. Accordingly, we make a &#8220;Dogs&#8221; team and a &#8220;Sandwiches&#8221; team. The question now is, which documents belong on which team?</p>
<p style="text-align: center;"><a href="http://blog.scripted.com/uncategorized/computerz/attachment/ds/" rel="attachment wp-att-2049"><img class="size-medium wp-image-2049 aligncenter" alt="Dog vs. Sandwich" src="http://blog.scripted.com/wp-content/uploads/2013/01/ds-300x107.png" width="300" height="107" /></a></p>
<h2>Describing Documents</h2>
<p>Imagine you&#8217;re playing a guessing game with someone. You provide your partner with a list of animals, and a list of attributes about each animal. You describe a domesticated animal that is furry, says &#8216;meow,&#8217; and is prominently featured on the internet. Based on these clues, and the information you initially supplied, the person you&#8217;re playing with guesses &#8220;cat.&#8221;</p>
<p>This is pretty much the same process our algorithm will use (<i>this task is actually a very familiar one to anyone who&#8217;s dabbled in machine learning, and is known formally as </i><a href="http://en.wikipedia.org/wiki/Statistical_classification"><i>classification</i></a>). To put it a bit more formally and relate it to our specific problem, we want the process of putting a document on a team to eventually look like:</p>
<ol>
<li>Extracting features from the document which are known to us</li>
<li>Using these features to describe the document to the system</li>
<li>Having the system guess which team that document belongs to based on that description.</li>
</ol>
<p>First things first. We need a way to numerically describe documents to a computer so that we can tell how similar two documents are to each other. You can think of this as a mapping from a document to a point in space, such that the closer two points are to each other, the more similar their corresponding documents are. Thinking of documents as vectors in this way is formally known as the <a href="http://en.wikipedia.org/wiki/Vector_space_model">Vector Space Model</a>.</p>
<div id="attachment_2046" class="wp-caption aligncenter" style="width: 310px"><a href="http://blog.scripted.com/uncategorized/computerz/attachment/cap/" rel="attachment wp-att-2046"><img class="size-medium wp-image-2046     " alt="Fig. 1. A two-dimensional vector space illustrating the concept of considering document similarity in terms of Euclidean Distance. All the dots represent " src="http://blog.scripted.com/wp-content/uploads/2013/01/cap-300x176.png" width="300" height="176" /></a><p class="wp-caption-text">Fig. 1. A two-dimensional vector space illustrating the concept of considering document similarity in terms of Euclidean Distance. All the dots represent documents. The small red dots represent the documents most similar to the large red dot in the middle.</p></div>
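<p>To make &#8220;closer points mean more similar documents&#8221; concrete, here&#8217;s a minimal Python 3 sketch of Euclidean distance. The euclidean_distance name and the three sample points are made up for illustration.</p>

```python
from math import sqrt

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

center = (2, 2)
near = (3, 2)
far = (6, 5)

print(euclidean_distance(center, near))  # 1.0
print(euclidean_distance(center, far))   # 5.0
```

<p>Under this measure, the document at &#8220;near&#8221; would count as more similar to &#8220;center&#8221; than the one at &#8220;far.&#8221;</p>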
<p>A common way to represent a written document as a vector is to think about it in terms of <a href="http://en.wikipedia.org/wiki/Bag-of-words_model">&#8220;Bag of Words&#8221;</a> (BoW) vectors. These are vectors of word counts where each slot in the vector represents the number of times a certain word was used. The list of all the words we keep track of in these vectors is known as our <i>vocabulary</i>.</p>
<p>Let&#8217;s say that our entire vocabulary consists of only the words &#8220;dog&#8221; and &#8220;cat.&#8221; Then a BoW vector of a document with this vocabulary would be:</p>
<p style="text-align: center;"><strong>&lt;# times &#8216;dog&#8217; was used in doc, # times &#8216;cat&#8217; was used in doc&gt;</strong></p>
<p>For example, the text <code>"dog cat cat dog dog"</code> would be represented as the vector <code>&lt;3,2&gt;</code>. This pattern holds for larger documents and vocabularies. Now that we have some understanding of the vector space model, we can get cracking on some code.</p>
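<p>That counting step can be sketched in a few lines of plain Python 3, before any libraries get involved. The bow_vector name is made up for this example; it simply counts each vocabulary word in a text split on whitespace.</p>

```python
def bow_vector(text, vocabulary):
    """Count how often each vocabulary word appears in text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocabulary = ["dog", "cat"]
print(bow_vector("dog cat cat dog dog", vocabulary))  # [3, 2]
```

<p>Words outside the vocabulary are simply ignored, which is why the choice of vocabulary matters so much.</p>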
<h2>Setup</h2>
<p>The entirety of the code covered in this tutorial can be found <a href="https://github.com/Scripted/NLP-Tutorial">here</a>. We&#8217;re going to be working in Python, and will use the following modules to make our lives way easier:</p>
<ul>
<li><a href="http://www.numpy.org/">Numpy</a> - The go-to Python library for fast numerical computing.</li>
<li><a href="http://www.scipy.org/">Scipy</a> - A scientific computing module with loads of functionality built on top of Numpy.</li>
<li><a href="http://radimrehurek.com/gensim/">Gensim</a> - &#8220;Topic Modeling for Humans&#8221;</li>
<li><a href="http://scikit-learn.org/stable/">Scikit-Learn</a> - Machine learning library for Python also built on top of Numpy.</li>
</ul>
<p>I&#8217;m assuming you have <b><a href="http://pypi.python.org/pypi/pip">pip</a> </b>and<b> </b><a href="http://git-scm.com/">git</a> installed. If this isn&#8217;t the case, you should definitely start there, as these tools are super useful for developers. First thing we want to do with those installed is grab a local copy of the tutorial&#8217;s source code:</p><pre class="crayon-plain-tag">git clone https://github.com/Scripted/NLP-Tutorial</pre><p>Now, we need to install the dependencies. In a perfect world, running <code>pip install -r requirements.txt</code> in the <b>NLP-Tutorial</b> directory would take care of all of this for you. Unfortunately, Numpy and Scipy can be tricky to install. You might want to try installing each individually in the following order:</p><pre class="crayon-plain-tag">pip install numpy
pip install scipy
pip install gensim
pip install scikit-learn</pre><p>To test whether your installation was successful, run <code>python classifier.py</code>. If it doesn&#8217;t give you any errors, then you&#8217;re all set! If not, just follow the links I provided above next to each dependency and follow their installation instructions.</p>
<h2>The Code</h2>
<p>Now that you&#8217;re all set up, let&#8217;s switch gears back to the algorithm. Recall that the first step of the process we described was to turn documents into vectors. Let&#8217;s look at how to actually do that in code. We&#8217;ll start by pulling in all our dependencies.</p><pre class="crayon-plain-tag">import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
from math import sqrt
import gensim
from sklearn.svm import SVC
import os</pre><p>Now that we&#8217;ve got all the resources we need, we start by loading in our corpus. If you recall, a <i>corpus</i> is a collection of documents we wish to learn from and work with in the context of an NLP algorithm. I&#8217;ve provided a toy corpus in the repository which is aptly named&#8230; &#8220;<em>corpus.</em>&#8221; This directory contains the following four documents:</p><pre class="crayon-plain-tag">dog1.txt - &quot;dog runs and barks at dog&quot;
dog2.txt - &quot;the dog runs and barks at the apatosaurus&quot;
sandwich1.txt - &quot;a sandwich of cheese and meat and bread and cheese is&nbsp;
                   supercalifragilisticexpialidocious&quot;
sandwich2.txt - &quot;the sandwich of meat and meat and cheese and meat&quot;</pre><p>You might astutely observe that these documents are unrealistically simple, contain a very limited vocabulary, and no punctuation. Yup, that&#8217;s by design. This toy corpus was <i>meticulously</i> designed by yours truly to demonstrate certain aspects of the process, while not getting mired in the gritty details of practical application.</p>
<p>Cleaning punctuation, HTML, and URLs out of your text, along with <a href="http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">lemmatization</a>, are absolutely things you should look into for a &#8220;real world&#8221; application, but they are low-level details that distract us from the high-level process. <b>What we ultimately want all of these things to do is split a block of text into words.</b></p><pre class="crayon-plain-tag">if __name__ == '__main__':
&nbsp;&nbsp;&nbsp;&nbsp;#Load in corpus, remove newlines, make strings lower-case
&nbsp;&nbsp;&nbsp;&nbsp;docs = {}
&nbsp;&nbsp;&nbsp;&nbsp;corpus_dir = 'corpus'
&nbsp;&nbsp;&nbsp;&nbsp;for filename in os.listdir(corpus_dir):
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;path = os.path.join(corpus_dir, filename)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;with open(path) as handle:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;doc = handle.read().strip().lower()
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;docs[filename] = doc
&nbsp;&nbsp;&nbsp;&nbsp;names = docs.keys()

&nbsp;&nbsp;&nbsp;&nbsp;#Remove stopwords and split on spaces
&nbsp;&nbsp;&nbsp;&nbsp;print &quot;\n---Corpus with Stopwords Removed---&quot;
&nbsp;&nbsp;&nbsp;&nbsp;stop = ['the', 'of', 'a', 'at', 'is']
&nbsp;&nbsp;&nbsp;&nbsp;preprocessed_docs = {}
&nbsp;&nbsp;&nbsp;&nbsp;for name in names:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;text = docs[name].split()
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;preprocessed = [word for word in text if word not in stop]
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;preprocessed_docs[name] = preprocessed
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print name, &quot;:&quot;, preprocessed</pre><p>The output so far&#8230;</p><pre class="crayon-plain-tag">---Corpus with Stopwords Removed---
sandwich2.txt : ['sandwich', 'meat', 'and', 'meat', 'and', 'cheese', 'and', 'meat']
dog2.txt : ['dog', 'runs', 'and', 'barks', 'apatosaurus']
dog1.txt : ['dog', 'runs', 'and', 'barks', 'dog']
sandwich1.txt : ['sandwich', 'cheese', 'and', 'meat', 'and', 'bread', 'and', 'cheese', 'supercalifragilisticexpialidocious']</pre><p>What we&#8217;ve done so far is just some pretty basic Python to read in the documents in our corpus and store them in a <b>dict</b>. All we did to split the blocks of text into lists of words was call the string&#8217;s builtin <b>split </b>method, which for our purposes pretty much just splits on spaces.</p>
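<p>Splitting on whitespace only works here because the toy corpus has no punctuation. As a rough sketch of what a slightly sturdier tokenizer might look like (Python 3, standard library only), consider the following. The <code>tokenize</code> helper and its regular expression are assumptions for illustration and are not part of the tutorial repository.</p>

```python
import re

# Hypothetical helper, not part of the tutorial repository.
STOPWORDS = {'the', 'of', 'a', 'at', 'is'}


def tokenize(text, stopwords=STOPWORDS):
    """Lower-case the text, keep runs of letters/apostrophes, drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in stopwords]


print(tokenize("The dog runs, and barks at the Apatosaurus!"))
# ['dog', 'runs', 'and', 'barks', 'apatosaurus']
```

<p>On the toy corpus this gives the same word lists as the plain <b>split</b> call, since there is nothing extra to strip.</p>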
<h2><b>Preprocessing</b></h2>
<p>The code also hints at another aspect of NLP called <i>preprocessing</i>. You can think of preprocessing as filtering out words which ultimately don&#8217;t tell us a whole lot about the text. Words such as ['the', 'of', 'a', 'at', 'is'] which appear with great frequency in the vast majority of documents we&#8217;ll ever analyze are called <i>stop words</i> and should definitely be filtered out. Similarly, you might also want to filter out words which appear either extremely frequently, or extremely rarely. This is where <b>gensim</b> comes in handy.</p><pre class="crayon-plain-tag">#Build the dictionary and filter out common/rare terms
&nbsp;&nbsp;&nbsp;&nbsp;dct = gensim.corpora.Dictionary(preprocessed_docs.values())
&nbsp;&nbsp;&nbsp;&nbsp;unfiltered = dct.token2id.keys()
&nbsp;&nbsp;&nbsp;&nbsp;dct.filter_extremes(no_below=2)
&nbsp;&nbsp;&nbsp;&nbsp;filtered = dct.token2id.keys()
&nbsp;&nbsp;&nbsp;&nbsp;filtered_out = set(unfiltered) - set(filtered)
&nbsp;&nbsp;&nbsp;&nbsp;print &quot;\nThe following super common/rare words were filtered out...&quot;
&nbsp;&nbsp;&nbsp;&nbsp;print list(filtered_out), '\n'
&nbsp;&nbsp;&nbsp;&nbsp;print &quot;Vocabulary after filtering...&quot;
&nbsp;&nbsp;&nbsp;&nbsp;print dct.token2id.keys(), '\n'</pre><p>Which outputs&#8230;</p><pre class="crayon-plain-tag">The following super common/rare words were filtered out...
['and', 'apatosaurus', 'bread', 'supercalifragilisticexpialidocious'] 

Vocabulary after filtering...
['cheese', 'runs', 'sandwich', 'meat', 'barks', 'dog']</pre><p>To recap, the above code filters out words from our corpus which are too common or too rare to be useful. It does so by leveraging <b>gensim&#8217;s Dictionary</b> class, which keeps track of how many documents each term appears in. After we feed it all the documents, we tell it to filter out words which appear in more than half of the documents (<b>gensim&#8217;s</b> default <code>no_above=0.5</code> threshold) as well as words which appear in fewer than two documents (<code>no_below=2</code>). This leaves us with only six words in our vocabulary: ['cheese', 'runs', 'sandwich', 'meat', 'barks', 'dog']. All words not in the vocabulary are ignored from here on out.</p>
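<p>To make the filtering rule concrete, here is a minimal pure-Python sketch (Python 3, standard library only) of document-frequency filtering on the toy corpus. The function name is hypothetical and this is only an approximation of what <b>gensim</b> does; the real <code>filter_extremes</code> also caps the vocabulary size, which is ignored here.</p>

```python
from collections import Counter

# Toy corpus after stop-word removal (copied from the output above).
docs = {
    'dog1.txt': ['dog', 'runs', 'and', 'barks', 'dog'],
    'dog2.txt': ['dog', 'runs', 'and', 'barks', 'apatosaurus'],
    'sandwich1.txt': ['sandwich', 'cheese', 'and', 'meat', 'and', 'bread',
                      'and', 'cheese', 'supercalifragilisticexpialidocious'],
    'sandwich2.txt': ['sandwich', 'meat', 'and', 'meat', 'and', 'cheese',
                      'and', 'meat'],
}


def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Keep words found in at least no_below docs and at most no_above of all docs."""
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each word once per document
    max_docs = no_above * len(docs)
    return sorted(word for word, df in doc_freq.items()
                  if no_below <= df <= max_docs)


print(filter_by_document_frequency(docs))
# ['barks', 'cheese', 'dog', 'meat', 'runs', 'sandwich']
```

<p>Note that the count is per document, not per occurrence: &#8216;and&#8217; is dropped because it shows up in all four documents, not because it appears many times within any one of them.</p>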
<p>At this point, we&#8217;ve got all we need to start bringing bag of words vectors into the picture.</p>
<h2>From Texts to Vectors</h2>
<p>From this point on, we&#8217;ll be thinking about documents as vectors. That handy <b>Dictionary</b> class we used above for preprocessing also contains functionality to take a list of words, compute word counts for words in our vocabulary, and return the corresponding bag of words vectors.</p><pre class="crayon-plain-tag">#Build Bag of Words Vectors out of preprocessed corpus
    print "---Bag of Words Corpus---"

    bow_docs = {}
    for name in names:

        sparse = dct.doc2bow(preprocessed_docs[name])
        bow_docs[name] = sparse
        dense = vec2dense(sparse, num_terms=len(dct))
        print name, ":", dense</pre><p><em>Note: vec2dense is just a helper function I wrote to convert vectors to a more familiar format for display purposes</em></p>
<p>Here are the resulting bag of words vectors:</p><pre class="crayon-plain-tag">---Bag of Words Corpus---
sandwich2.txt : [1.0, 0.0, 1.0, 3.0, 0.0, 0.0]
dog2.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
dog1.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 2.0]
sandwich1.txt : [2.0, 0.0, 1.0, 1.0, 0.0, 0.0]</pre><p>There we go &#8212; our documents expressed as points in 6-dimensional space. For a practical application, working with 6-dimensional data is pretty damn reasonable. However, one can imagine a scenario where our corpus is composed of hundreds of thousands of documents which sample from a much more extensive vocabulary (like say the entire English language). Even after removing stop words, and exceedingly common/rare terms we are left with bag of words vectors with upwards of 50,000 unique word-counts/dimensions to keep track of. This should make the data nerd in you uncomfortable.</p>
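<p>If you&#8217;re curious what happens under the hood, the counting step and the sparse-to-dense conversion can be sketched in a few lines of Python 3. The helper names below are hypothetical stand-ins (the repository&#8217;s own <b>vec2dense</b> may well be written differently), and the vocabulary order is an assumption chosen to match the vectors printed above.</p>

```python
from collections import Counter

# Assumed vocabulary order, chosen to match the printed vectors above.
vocabulary = ['cheese', 'runs', 'sandwich', 'meat', 'barks', 'dog']
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}


def to_sparse_bow(words):
    """Return sorted (word_id, count) pairs, skipping out-of-vocabulary words."""
    counts = Counter(word_to_id[w] for w in words if w in word_to_id)
    return sorted(counts.items())


def to_dense(sparse, num_terms):
    """Expand (index, value) pairs into a plain list of floats."""
    dense = [0.0] * num_terms
    for idx, value in sparse:
        dense[idx] = float(value)
    return dense


dog1 = ['dog', 'runs', 'and', 'barks', 'dog']
sparse = to_sparse_bow(dog1)
print(sparse)                             # [(1, 1), (4, 1), (5, 2)]
print(to_dense(sparse, len(vocabulary)))  # [0.0, 1.0, 0.0, 0.0, 1.0, 2.0]
```

<p>The sparse form only stores the non-zero counts, which is what makes 50,000-dimensional vectors bearable to store in the first place.</p>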
<h2>The Curse of Dimensionality</h2>
<p>There are all kinds of terrible things that happen as the dimensionality of your descriptor vectors rises. The most obvious is that the time and space needed to store and process those vectors rise with it, and for algorithms that have to search or partition the space, the cost can grow exponentially.</p>
<p>Another issue is that as dimensionality rises, the number of samples needed to draw useful conclusions from that data also rises steeply. Put another way, with a fixed number of samples, the usefulness of each additional dimension diminishes. Finally, as the dimensionality rises, your points all tend to become nearly equidistant from each other, making it difficult to draw solid conclusions from the distances between them. The umbrella term that covers all these adverse effects of high dimensionality is &#8220;<a href="http://en.wikipedia.org/wiki/Curse_of_dimensionality">the curse of dimensionality</a>.&#8221;</p>
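<p>The &#8220;everything becomes equidistant&#8221; effect is easy to see for yourself. The following is a small experiment sketch (Python 3, standard library only) using uniformly random points, which is an assumption made purely for illustration; real document vectors are not uniformly random. The <code>relative_contrast</code> name is hypothetical.</p>

```python
import math
import random


def relative_contrast(dimensions, num_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(
            math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(distances) - min(distances)) / min(distances)


# The contrast between the nearest and farthest neighbor shrinks as the
# number of dimensions grows.
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(dims), 3))
```

<p>In two dimensions the farthest point is many times farther away than the nearest one. In a thousand dimensions the two distances differ by only a small fraction, so &#8220;nearest&#8221; stops meaning much.</p>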
<h2>The Remedy</h2>
<p>Fortunately, there&#8217;s a whole family of techniques called <i>dimensionality reduction techniques </i>which are entirely geared toward bringing down the number of dimensions in our descriptor vectors to something more reasonable. While the low level details often entail fairly advanced mathematics, the high level ideas behind the techniques are quite intuitive (also, the low-level details are often implemented for you, like in <b>gensim</b>).</p>
<p>Let&#8217;s look at those bag of words vectors again.</p><pre class="crayon-plain-tag">---Bag of Words Corpus---
sandwich2.txt : [1.0, 0.0, 1.0, 3.0, 0.0, 0.0]
dog2.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
dog1.txt : [0.0, 1.0, 0.0, 0.0, 1.0, 2.0]
sandwich1.txt : [2.0, 0.0, 1.0, 1.0, 0.0, 0.0]</pre><p>Notice that certain dimensions are highly correlated with one another. For example, in both <b>dog1.txt</b> and <b>dog2.txt</b>, the second, fifth, and sixth values are all nonzero together, while the remaining values are zero. In terms of the text, this means that the terms &#8216;dog,&#8217; &#8216;runs,&#8217; and &#8216;barks&#8217; frequently seem to occur together. In that case, maybe rather than thinking of our documents in terms of individual word counts, we should be thinking about them in terms of topics (or groups of words that occur together).</p>
<p>Dimensionality reduction techniques help us do exactly that. They math-magically (it&#8217;s a technical term, I promise) express our high dimensional data in a lower dimensional space. The only manual bit that many of them feature is that you need to specify how many dimensions the lower-dimensional space should have. Unfortunately, there&#8217;s no algorithm that I know of that can look at your data and auto-detect the perfect dimensionality to reduce to. In most cases, there&#8217;s not even a clear-cut solution to that problem were you to ask a human. <b>Fortunately</b>, this silly toy corpus isn&#8217;t most cases; it&#8217;s an especially trivial case! Arguably the best kind of case.</p>
<p>We look at the corpus, and we intuitively say that these documents are about either &#8216;dogs&#8217; or &#8216;sandwiches,&#8217; and thus, the number of dimensions in our lower dimensional space should be two. The algorithm that we use to do the dimensionality reduction in this case is called &#8220;<i><a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing">Latent Semantic Indexing</a>,</i>&#8221; generally abbreviated to LSI.</p>
<p>Going into the math that makes LSI work is way beyond the scope of this article, so I&#8217;ll just summarize by saying it uses a technique from linear algebra called <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition</a> to reduce the input matrix (your document vectors stacked on top of each other) to one of a lower rank (the number of dimensions you specified). If you didn&#8217;t understand any of what just happened in that last sentence, it&#8217;s totally fine, because <b>gensim</b>&#8217;s done the heavy lifting for us.</p><pre class="crayon-plain-tag">#Dimensionality reduction using LSI. Go from 6D to 2D.
&nbsp;&nbsp;&nbsp;&nbsp;print &quot;\n---LSI Model---&quot;
&nbsp;&nbsp;&nbsp;&nbsp;lsi_docs = {}
&nbsp;&nbsp;&nbsp;&nbsp;num_topics = 2
&nbsp;&nbsp;&nbsp;&nbsp;lsi_model = gensim.models.LsiModel(bow_docs.values(),
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;num_topics=num_topics)
&nbsp;&nbsp;&nbsp;&nbsp;for name in names:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;vec = bow_docs[name]
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sparse = lsi_model[vec]
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;dense = vec2dense(sparse, num_topics)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;lsi_docs[name] = sparse
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print name, ':', dense</pre><p>And here are our simplified 2D vectors&#8230;</p><pre class="crayon-plain-tag">---LSI Model---
sandwich2.txt : [3.222517, 0.0]
dog2.txt : [0.0, 1.6870012]
dog1.txt : [0.0, 2.4343436]
sandwich1.txt : [2.1483445, 0.0]</pre><p>Look at that. Two-dimensional vectors which represent all the information we care about without having to consider our data in terms of high-dimensional word counts. The actual numbers in the vectors don&#8217;t matter so much as the fact that the first slot is clearly our sandwich topic while the second is our dog topic.</p>
<p>This is probably a good time to reiterate that training a topic model like LSI on four samples would never, ever, ever work for non-trivial data. You&#8217;re also not likely to see outcomes as cut and dried as these, where one topic contributes 100% of a given vector&#8217;s magnitude while the other contributes nothing.</p>
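<p>For the curious, the toy corpus is small enough that the two topics can be recovered without any libraries. The following is a rough sketch (Python 3, standard library only) that finds the two strongest term-space directions using power iteration with deflation. This is a swapped-in textbook technique, not the algorithm <b>gensim</b> actually runs, and the <code>top_topics</code> helper is hypothetical.</p>

```python
import math

# Bag-of-words vectors from the output above (rows = documents).
bow = {
    'sandwich2.txt': [1.0, 0.0, 1.0, 3.0, 0.0, 0.0],
    'dog2.txt':      [0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'dog1.txt':      [0.0, 1.0, 0.0, 0.0, 1.0, 2.0],
    'sandwich1.txt': [2.0, 0.0, 1.0, 1.0, 0.0, 0.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_topics(rows, num_topics, iterations=200):
    """Find the leading term-space directions by power iteration with deflation."""
    size = len(rows[0])
    # Term-by-term co-occurrence matrix (document matrix transposed times itself).
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
            for i in range(size)]
    topics = []
    for _topic in range(num_topics):
        vec = [1.0] * size
        for _step in range(iterations):
            vec = [dot(row, vec) for row in gram]
            norm = math.sqrt(dot(vec, vec))
            vec = [x / norm for x in vec]
        strength = dot(vec, [dot(row, vec) for row in gram])
        topics.append(vec)
        # Deflate: remove the direction just found before looking for the next.
        gram = [[gram[i][j] - strength * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return topics


topics = top_topics(list(bow.values()), num_topics=2)
for name, vector in bow.items():
    # Project each document onto the two topic directions.
    print(name, [round(dot(vector, topic), 4) for topic in topics])
```

<p>Up to rounding, the projections should line up with the LSI vectors printed above: sandwich documents load only on the first direction and dog documents only on the second.</p>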
<h2>Document Similarity</h2>
<p>I&#8217;ve alluded above to the fact that once documents are represented as points in space, we can tell how similar they are by how close they are to each other. Now that we&#8217;re at that point in the process, let&#8217;s go over what it means for points to be &#8220;close&#8221; to each other.</p>
<p>It turns out that there are many different ways of considering how close two points in space are to each other. The most natural one to us is called <i>Euclidean distance</i>. If I were to draw a straight line between two points and measure it, the length of that line is the Euclidean distance. To see an example of Euclidean distance in action refer to <em>Fig. 1.</em> The big red dot represents the document whose closest matches we&#8217;re trying to find. The smaller red dots around it are the most similar documents by the Euclidean distance metric.</p>
<p>There are, however, issues with Euclidean distance. In many text processing applications, we care more about the direction of the vector (or the angle from the origin to the point) than its actual location. To demonstrate why we might want to consider the vector in terms of direction rather than magnitude, let&#8217;s consider another toy bag of words example. We consider the following three documents.</p><pre class="crayon-plain-tag">doc1 - ['dog','dog','dog','dog','dog','dog']
doc2 - ['dog','dog']
doc3 - ['cat']</pre><p>In their bag of words form:</p><pre class="crayon-plain-tag">doc1 - [6,0]
doc2 - [2,0]
doc3 - [0,1]</pre><p>Intuitively, we know that in terms of subject matter, <b>doc1</b> and <b>doc2</b> should be more similar to each other than they are to <b>doc3</b>. They are both clearly 100% about dogs, while <b>doc3</b> is 100% about cats. However, when we take the Euclidean distances between them, we find that the distance between <b>doc2</b> and <b>doc1</b> is 4, while the distance between <b>doc2</b> and <b>doc3</b> is ~2.24. The length of the first document led us to believe that <b>doc2</b> was more like <b>doc3</b> than it was like <b>doc1</b>.</p>
<p>The fact that document length has nothing to do with what the document is actually about is exactly why we want to downplay the importance of vector magnitude and instead focus on direction. There are a few ways we can accomplish this.</p>
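<p>The two distances quoted above are easy to check by hand, or with a few lines of Python 3. This is just a sketch; the <code>euclidean_distance</code> helper is written here for illustration.</p>

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


doc1, doc2, doc3 = [6, 0], [2, 0], [0, 1]

print(euclidean_distance(doc2, doc1))            # 4.0
print(round(euclidean_distance(doc2, doc3), 2))  # 2.24
```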
<p>First, we can modify the vectors themselves by dividing each number in each vector by that vector&#8217;s magnitude. In doing so, all our vectors end up with a magnitude of 1. This process is called <i>unit vectorization</i> because the output vectors are unit vectors.</p>
<p>The unit vectors make it so that the dog documents are now closest to each other:</p><pre class="crayon-plain-tag">doc1 - [1,0]
doc2 - [1,0]
doc3 - [0,1]</pre><p>Another technique is to leave the vectors alone and just take the angle between them. Measuring similarity based on the angle between vectors is known as <a href="http://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, and the corresponding measure of dissimilarity is <i>cosine distance</i>. In our example up there, you can see the angle between <b>doc1</b> and <b>doc2</b> is 0° because they are pointed in exactly the same direction. On the other hand, the angle between either dog document and <b>doc3</b> is 90°. 0° is less than 90°, therefore the dog articles are once again more similar to each other than they are to the one about cats.</p>
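<p>Both ideas, unit vectorization and measuring the angle, fit in a short Python 3 sketch. The helper names are hypothetical and only the standard library is used.</p>

```python
import math


def unit_vector(vec):
    """Scale a vector so that its magnitude is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = perpendicular)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angle_in_degrees(a, b):
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cosine))


doc1, doc2, doc3 = [6, 0], [2, 0], [0, 1]
print(unit_vector(doc1), unit_vector(doc2), unit_vector(doc3))
# [1.0, 0.0] [1.0, 0.0] [0.0, 1.0]
print(round(angle_in_degrees(doc1, doc2), 6))  # 0.0
print(round(angle_in_degrees(doc2, doc3), 6))  # 90.0
```

<p>Notice that neither function cares how long the original documents were; only the proportions between the word counts matter.</p>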
<p>Both <i>Euclidean </i>and <i>cosine</i> distance are called <i>distance metrics</i>. A distance metric is simply the formula we provide to an algorithm that will dictate how close we consider two vectors to be to each other. With that bit of theory under our belts, let&#8217;s go back to the code.</p>
<h2>Cosine Distance in Action</h2>
<p>We last left off at transforming our bag of words corpus into two-dimensional topic vectors. At this point, we want to unit vectorize those topic vectors because, again, we care more about the angle of the vector than the magnitude. The reason I&#8217;m using both unit vectorization and cosine distance in this code is twofold. First, it couldn&#8217;t hurt. Second, when we move on to classification, we will be using a model that can only use <i>Euclidean</i> distance (as far as I know).</p>
<p>By unit vectorizing our corpus, we make <i>Euclidean</i> and <i>cosine</i> distance equivalent in terms of ordering. They will not return exactly the same numbers (one measures the straight-line distance between two points while the other measures the angle between them). However, what they will do is say that if <i>cosine_distance(A,B) &lt; cosine_distance(B,C)</i> then <i>euclidean_distance(A,B) &lt; euclidean_distance(B,C)</i> for all points <i>A,B,C</i> in our corpus. Intuitively, you can think of this as saying &#8220;if two points are closer to each other on the unit circle/sphere/hypersphere, then the angle between them is smaller.&#8221;</p>
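<p>If you&#8217;d rather see that equivalence than take it on faith, here is a small check (Python 3, standard library only) on randomly generated unit vectors. It assumes cosine distance is defined as one minus cosine similarity; for unit vectors the squared Euclidean distance then works out to exactly twice the cosine distance, which is why the two orderings agree. The random data and helper names are purely illustrative.</p>

```python
import math
import random


def unit_vector(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes a and b are already unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


rng = random.Random(42)
points = [unit_vector([rng.uniform(-1, 1) for _ in range(5)])
          for _ in range(20)]
query, others = points[0], points[1:]

# Rank the other points by each distance and compare the two rankings.
by_euclidean = sorted(others, key=lambda p: euclidean_distance(query, p))
by_cosine = sorted(others, key=lambda p: cosine_distance(query, p))
print(by_euclidean == by_cosine)

# For unit vectors: euclidean_distance ** 2 == 2 * cosine_distance
gap = max(abs(euclidean_distance(query, p) ** 2 - 2 * cosine_distance(query, p))
          for p in others)
print(gap)  # tiny, limited only by floating-point rounding
```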
<p>Here&#8217;s where the magic happens:</p><pre class="crayon-plain-tag">#Normalize LSI vectors by setting each vector to unit length
    print "\n---Unit Vectorization---"

    unit_vecs = {}
    for name in names:

        vec = vec2dense(lsi_docs[name], num_topics)
        norm = sqrt(sum(num ** 2 for num in vec))
        unit_vec = [num / norm for num in vec]
        unit_vecs[name] = unit_vec
        print name, ':', unit_vec</pre><p>and here&#8217;s the output:</p><pre class="crayon-plain-tag">---Unit Vectorization---
sandwich2.txt : [1.0, 0.0]
dog2.txt : [0.0, 1.0]
dog1.txt : [0.0, 1.0]
sandwich1.txt : [1.0, 0.0]</pre><p>Now what we want to do is illustrate cosine distance correctly matching up documents that should be similar. Without further delay, here&#8217;s the code to do it.</p><pre class="crayon-plain-tag">#Take cosine distances between docs and show best matches
    print "\n---Document Similarities---"

    #Index the vectors in the same order as names so match indices line up
    index = gensim.similarities.MatrixSimilarity([lsi_docs[name] for name in names])
    for i, name in enumerate(names):

        vec = lsi_docs[name]
        sims = index[vec]
        sims = sorted(enumerate(sims), key=lambda item: -item[1])

        #Similarities are a list of tuples of the form (doc #, score)
        #In order to extract the doc # we take first value in the tuple
        #Doc # is stored in tuple as numpy format, must cast to int

        if int(sims[0][0]) != i:
            match = int(sims[0][0])
        else:
            match = int(sims[1][0])

        match = names[match]
        print name, "is most similar to...", match</pre><p>You might be asking yourself where the cosine distance computation happened in that code. <b>gensim</b> features a class called <b>MatrixSimilarity</b> which is a type of <i>index</i>, or a data structure used to store vectors so that similarity queries can be answered efficiently. This particular index holds all of the vectors in memory as a single matrix and scores a query against every stored document using cosine similarity.</p>
<p>The actual query is made when we pick a point from our corpus (called <b>vec</b> in the code) and say <code>sims = index[vec]</code>. <b>sims</b> holds one similarity score per indexed document, where a higher score means a closer match to the query point (<b>vec</b>). We proceed to sort those scores from highest to lowest, pull out the best match that isn&#8217;t the query document itself, and print it out, which yields the following:</p><pre class="crayon-plain-tag">---Document Similarities---
sandwich2.txt is most similar to... sandwich1.txt
dog2.txt is most similar to... dog1.txt
dog1.txt is most similar to... dog2.txt
sandwich1.txt is most similar to... sandwich2.txt</pre><p>Seems reasonable to me. At this point we&#8217;re very close to solving the problem we set out to solve.</p>
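<p>With only four documents, the same lookup can also be done by brute force. Here is a minimal sketch (Python 3, standard library only) that recomputes the matches from the LSI vectors printed earlier. The <code>most_similar</code> helper is hypothetical and is not how <b>gensim</b> organizes the work internally.</p>

```python
import math

# LSI vectors copied from the output above.
lsi_vectors = {
    'sandwich2.txt': [3.222517, 0.0],
    'dog2.txt': [0.0, 1.6870012],
    'dog1.txt': [0.0, 2.4343436],
    'sandwich1.txt': [2.1483445, 0.0],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def most_similar(name, vectors):
    """Return the other document whose vector has the highest cosine similarity."""
    candidates = [other for other in vectors if other != name]
    return max(candidates,
               key=lambda other: cosine_similarity(vectors[name], vectors[other]))


for name in lsi_vectors:
    print(name, 'is most similar to...', most_similar(name, lsi_vectors))
```

<p>Comparing every document against every other document is fine for four of them, but the number of comparisons grows with the square of the corpus size, which is why a proper index matters on real data.</p>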
<h2>Classification</h2>
<p>Our progress so far answers the question &#8220;<i>which document is this document most similar to?</i>&#8221; The question we ultimately want to answer is &#8220;<i>what is this document about?</i>&#8221; There are different ways to answer that question, including keyword extraction, clustering, and all sorts of other techniques. The one that I like best involves <i>supervised learning</i>, where you train the algorithm on samples which have the &#8220;correct&#8221; answer provided with them.</p>
<p>The specific supervised learning problem we&#8217;re addressing here is called <i>classification</i>. You train an algorithm on labelled descriptor vectors, then ask it to label a previously unseen descriptor vector based on conclusions drawn from the training set. The way we are going to accomplish this in our case is to make use of <i><a href="http://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a></i>, a family of algorithms which define decision boundaries between classes based on labelled training data.</p>
<p>To give a high-level view of what exactly support vector machines (or SVMs for short) do, I&#8217;ll refer back to our points in space. For our &#8216;dog&#8217; vs. &#8216;sandwich&#8217; classification problem, we provide the algorithm with some training samples. These samples are documents which have gone through our whole process (BoW vector -&gt; topic vector -&gt; unit vector) and carry with them either a &#8216;dog&#8217; label or a &#8216;sandwich&#8217; label. As you provide the SVM model with these samples, it looks at these points in space and essentially draws a line between the &#8216;sandwich&#8217; documents and the &#8216;dog&#8217; documents. This border between &#8220;dog&#8221;-land and &#8220;sandwich&#8221;-land is known as a <i>decision boundary</i>. Whichever side of the line the query point falls on determines what the algorithm labels it.</p>
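<p>To make the idea of a decision boundary concrete, here is a deliberately tiny sketch in Python 3. It covers only the special case of a straight-line boundary with exactly one training point per class, where the widest-margin line is simply the perpendicular bisector of the two points. A real SVM (including scikit-learn&#8217;s <b>SVC</b> with its default settings) solves a more general problem, so treat the helpers below as hypothetical illustrations rather than as what the library does.</p>

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_two_point_boundary(positive, negative):
    """Max-margin line between two points: their perpendicular bisector.

    Returns (weights, bias) such that dot(weights, x) + bias is positive on
    the side of `positive` and negative on the side of `negative`.
    """
    weights = [p - n for p, n in zip(positive, negative)]
    midpoint = [(p + n) / 2.0 for p, n in zip(positive, negative)]
    bias = -dot(weights, midpoint)
    return weights, bias


def classify(vec, weights, bias):
    return 'dogs' if dot(weights, vec) + bias > 0 else 'sandwiches'


# Unit vectors in the form [sandwich topic, dog topic].
dog1, sandwich1 = [0.0, 1.0], [1.0, 0.0]
weights, bias = fit_two_point_boundary(dog1, sandwich1)

print(classify([0.0, 1.0], weights, bias))   # a pure dog document
print(classify([1.0, 0.0], weights, bias))   # a pure sandwich document
print(classify([0.3, 0.95], weights, bias))  # a hypothetical mostly-dog document
```

<p>Here the boundary is the diagonal line where the two topic weights are equal: anything with more &#8216;dog&#8217; than &#8216;sandwich&#8217; in it lands on the dog side.</p>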
<h2>The Final Step</h2>
<p>Time to build our SVM, train it, and test it.</p><pre class="crayon-plain-tag">#We add classes to the mix by labelling dog1.txt and sandwich1.txt
    #We use these as our training set, and test on all documents.
    print "\n---Classification---"

    dog1 = unit_vecs['dog1.txt']
    sandwich1 = unit_vecs['sandwich1.txt']

    train = [dog1, sandwich1]

    # The label '1' represents the 'dog' category
    # The label '2' represents the 'sandwich' category

    label_to_name = dict([(1, 'dogs'), (2, 'sandwiches')])
    labels = [1, 2]
    classifier = SVC()
    classifier.fit(train, labels)

    for name in names:

        vec = unit_vecs[name]
        label = classifier.predict([vec])[0]
        cls = label_to_name[label]
        print name, 'is a document about', cls

    print '\n'</pre><p>And the end result of all of this code is&#8230;</p><pre class="crayon-plain-tag">---Classification---
sandwich2.txt is a document about sandwiches
dog2.txt is a document about dogs
dog1.txt is a document about dogs
sandwich1.txt is a document about sandwiches</pre><p>Voila, we&#8217;ve successfully answered the question we set out to answer! This part of the code is where we take advantage of <b>scikit-learn</b> to do all the SVM-related heavy lifting for us. I constructed a training set out of two of the documents by manually labeling them &#8216;dog&#8217; and &#8216;sandwich,&#8217; and the classifier then correctly labeled all four documents.</p>
<h2>Cross-Validation</h2>
<p>As a result of the simplicity of our example, we have done a big no-no in &#8220;testing&#8221; our algorithm, which is to train and test on what essentially ended up becoming the same data. In reality, your dataset would be much larger and more varied than our toy example, and you would test your model by partitioning the data into a <i>training set</i> and a <i>test set</i>.</p>
<p>All samples in both training and test sets are labeled. However, in practice, you would build the model on the labeled training set, ignore the labels on the test set, feed the test samples into the model, have the model guess what those labels are, and finally check whether or not the algorithm guessed correctly. Holding out a test set like this is the simplest form of validation; repeating the process over several different partitions of the data and averaging the results is called <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)">cross-validation</a>.</p>
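<p>The partitioning itself is simple bookkeeping. The following sketch (Python 3, standard library only) shows one way a single hold-out split and a k-fold split might be written. Both helpers are hypothetical; in practice you would more likely reach for the equivalents that your machine learning library provides.</p>

```python
import random


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split off a held-out test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, int(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def k_fold_splits(samples, k):
    """Yield (train, test) pairs so every sample is tested exactly once."""
    folds = [samples[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


samples = list(range(12))  # stand-ins for labeled documents
train, test = train_test_split(samples)
print(len(train), len(test))  # 9 3

for train, test in k_fold_splits(samples, k=3):
    print(sorted(test), len(train))
```

<p>The important property in both cases is that no sample ever appears in the training set and the test set of the same round.</p>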
<h2>A Quick Reality Check</h2>
<p>So you&#8217;ve read through the article, understand everything that&#8217;s going on, but sense there&#8217;s something missing. I&#8217;ve pointed out several times that this example has been simplified to the point of triviality for the sake of demonstrating the most important aspects of the algorithm which might otherwise get lost in the details. How do you take this framework we&#8217;ve run through and apply it to something practical?</p>
<p>As usual, the devil&#8217;s in the details. I can&#8217;t spell the entire process out for you (otherwise this article would be the size of a textbook), but I can list for you the questions you&#8217;ll need to answer in order to make any such machine learning application work. Hopefully research, experimentation, and intuition will take you the rest of the way.</p>
<h3>First questions you should ask yourself</h3>
<ul>
<li>What is the problem I&#8217;m trying to solve?</li>
<li>What do I want to learn from my data? How will my findings be actionable?</li>
<li>What tools exist that can help me solve this problem?</li>
</ul>
<p><em>I can&#8217;t stress enough how important these first few questions are. The answers to these questions should drive every design decision you make. It&#8217;s all too easy to get caught up in all of the awesome algorithms and techniques out there in this field, and get completely distracted from your end goal. That being said, there is obviously a place for playing around, exploring, and experimenting. Just be sure you always keep the problem you&#8217;re trying to solve in mind. Anyhow&#8230; back to the list!</em></p>
<h3>Preprocessing</h3>
<ul>
<li>Where do I find a corpus?</li>
<li>How do I extract text from web pages in a scalable/generalizable way? <i>(if using a corpus from the web)</i></li>
<li>How do I split my text into individual words and extract the roots of those words?</li>
<li>How do I build my vocabulary?</li>
<li>Which words do I filter out?</li>
</ul>
<h3>Vector Space Model</h3>
<ul>
<li>What sort of normalization does the problem call for?</li>
<li>Which distance metric makes sense for the problem at hand?</li>
</ul>
<h3>Dimensionality Reduction</h3>
<ul>
<li>How do I pick how many dimensions I ultimately want to be working with?</li>
</ul>
<h3>Machine Learning</h3>
<ul>
<li>Am I solving a supervised or an unsupervised learning problem?</li>
<li>Which model/algorithm makes sense for the problem at hand?</li>
</ul>
<h3>Testing</h3>
<ul>
<li>How do I gauge how well my algorithm is doing?</li>
<li>Is it enough just to look at accuracy?</li>
<li>Should I consider precision/recall?</li>
<li>Are my classes of equal sizes, or does one dominate the other? How do I remedy this?</li>
<li>Are my training/test datasets actually representative of what my algorithm will encounter in everyday use?</li>
</ul>
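<p>To see why accuracy alone can mislead when one class dominates, here is a small sketch in Python 3 with made-up labels. The data and the <code>precision_recall</code> helper are invented purely for illustration.</p>

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Precision and recall for one class, treating `positive` as the target."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels: 8 'sandwich' documents and only 2 'dog' documents.
truth = ['sandwich'] * 8 + ['dog'] * 2
guess = ['sandwich'] * 10  # a lazy classifier that always says 'sandwich'

accuracy = sum(t == g for t, g in zip(truth, guess)) / len(truth)
print(accuracy)                               # 0.8
print(precision_recall(truth, guess, 'dog'))  # (0.0, 0.0)
```

<p>An accuracy of 80% sounds respectable until you notice the classifier never once found a dog document.</p>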
<p>If you take anything out of this article, it should be that useful applications of machine learning are all about the decisions you make. There are countless algorithms and techniques to choose from, but they are only as useful as your application of them. Identify the problem you&#8217;re solving, experiment with solutions, understand where there might be shortcomings in those solutions and remedy them when possible.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/computerz/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Wonders of Vim</title>
		<link>http://blog.scripted.com/staff/the-wonders-of-vim/</link>
		<comments>http://blog.scripted.com/staff/the-wonders-of-vim/#comments</comments>
		<pubDate>Fri, 18 Jan 2013 00:00:03 +0000</pubDate>
		<dc:creator>Murad</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Staff]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[flake8]]></category>
		<category><![CDATA[pathogen]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[syntastic]]></category>
		<category><![CDATA[vim]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1818</guid>
		<description><![CDATA[As a software engineer, your choice of development environment will affect your productivity eight hours a day, five days a week. Different people have different tastes as far as text editors/IDE&#8217;s go, and for good reason. Personally, I always tend to gravitate toward more minimalist tools and greater transparency, and &#8230;]]></description>
				<content:encoded><![CDATA[<div>As a software engineer, your choice of development environment will affect your productivity eight hours a day, five days a week. Different people have different tastes as far as text editors/IDEs go, and for good reason. Personally, I always tend to gravitate toward more minimalist tools and greater transparency, and as I&#8217;m sure many would agree, you can&#8217;t get much more minimalist than Vim.</div>
<div> </div>
<div>Vim is an open-source text editor which comes by default with many Unix-based systems. Although it has a graphical, stand-alone interface as well, I use Vim for the command-line functionality. Let&#8217;s face it, being able to quickly SSH into a server, open up Vim, write some code, and test it all without changing windows is pretty sweet. Additionally, Vim won&#8217;t waste your time with too many menus (and submenus, and sub-submenus, etc.). When you&#8217;re using Vim, the only thing you have to worry about is the code itself; you&#8217;re free of distractions, menus, unnecessary boxes, and well-meaning yet tragically misguided paperclips who offer you advice.</div>
<div> </div>
<div>Despite the welcome absence of menus, Vim is still highly configurable and functional. All you need to do to change Vim&#8217;s appearance/behavior is make modifications to your <strong><a href="http://vim.wikia.com/wiki/Vimrc">.vimrc</a></strong> file (which you will, unfortunately, most likely have to do, as the default behaviors can be silly at times). Some people go crazy with configuration, but I haven&#8217;t really found the need to yet. For example, my .vimrc file is 5 lines long at the moment and my world has yet to come crumbling down as a result of it.</div>
<h3>Plug and Play with Pathogen</h3>
<div>Outside of Vim&#8217;s built-in configurations, there are also countless plugins out there for various specialized purposes. Thanks to <a href="https://github.com/tpope/vim-pathogen">Pathogen</a>, installing and managing plugins is as easy as downloading them, putting them in the right directory, and adding &#8220;<strong>call pathogen#infect()</strong>&#8221; to your .vimrc file. Most plugin installations are as simple as going into your bundle directory and cloning the GitHub repository of the plugin you&#8217;re after.</div>
<h3>For Python Programmers</h3>
<div>One plugin that I would recommend to any Python programmer using Vim is <a href="https://github.com/scrooloose/syntastic">Syntastic</a> paired with the Python package <a href="http://pypi.python.org/pypi/flake8">Flake8</a>. Syntastic is a syntax checking tool for Vim that notifies you of static/syntax errors. Flake8, on the other hand, is a Python package which combines similar syntax checking with a static analyzer that makes sure your code is consistent with <a href="http://www.python.org/dev/peps/pep-0008/">PEP 8</a>, a style guide for writing beautiful Python code. The quality of your code increases immediately, and you learn about good Python style without having to religiously study <em>PEP 8</em>. As an example, I&#8217;ve included a screenshot below of some terrible Python code, so you can note how angry Syntastic is at said code:</div>
<div> </div>
<div><a href="http://blog.scripted.com/wp-content/uploads/2012/10/blah.png"><img class="alignnone size-full wp-image-1819" title="blah" alt="" src="http://blog.scripted.com/wp-content/uploads/2012/10/blah.png" width="450" height="119" /></a></div>
<div> </div>
<div>These are only the first few reasons that come to mind for why I&#8217;m a big fan of Vim, and they only scratch the surface. It&#8217;s incredibly well supported (in that a simple Google search will generally remedy any issue you might have) and has a community of super skillful, enthusiastic developers backing it up. Getting used to Vim is certainly an undertaking if you&#8217;re completely new to it, but becoming a Vim wizard will make you a better programmer and is well worth it.</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/the-wonders-of-vim/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>This, Jen, is the Internet</title>
		<link>http://blog.scripted.com/staff/this-jen-is-the-internet/</link>
		<comments>http://blog.scripted.com/staff/this-jen-is-the-internet/#comments</comments>
		<pubDate>Wed, 19 Dec 2012 23:15:11 +0000</pubDate>
		<dc:creator>Sara</dc:creator>
				<category><![CDATA[Staff]]></category>
		<category><![CDATA[education]]></category>
		<category><![CDATA[IT Crowd]]></category>
		<category><![CDATA[The I.T. Crowd]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1810</guid>
		<description><![CDATA[There’s this great show called The I.T. Crowd about three people working in an I.T. department. One of them, in classic sitcom fashion, has no idea what she’s doing. She knows nothing about I.T., including what the letters I and T stand for. When she’s voted employee of the month and gets &#8230;]]></description>
				<content:encoded><![CDATA[<p>There’s this great show called <em>The I.T. Crowd</em> about three people working in an I.T. department. One of them, in classic sitcom fashion, has no idea what she’s doing. She knows nothing about I.T., including what the letters I and T stand for.</p>
<p>When she’s voted employee of the month and gets to give a speech to the shareholders, the other members of the department are furious. Long story short, they end up giving her a box with a red light on it and tell her it’s the Internet. The whoooole Internet. They recommend that she should present it to the shareholders, and that it would get quite the reaction.</p>
<p>Her coworkers’ attempt to humiliate her backfires, however, when the shareholders <em>also believe it is the Internet.</em> Like Jen herself, the shareholders have no idea what she does for a living.</p>
<p>In a way, I can sympathize with Jen (although I know quite well what I&#8217;m doing for a living). When I first moved to Silicon Valley, my tech knowledge ended at some basic HTML. When the head of the engineering department told us about the unicorn sending out workers to keep the system running, I accepted it without question (as I should have – he is <a href="http://www.youtube.com/watch?v=zrzMhU_4m-g">wise in the ways of science</a>. Er, coding). But the phrase “a unicorn sends out workers to retrieve it” struck chords of “this, Jen, is the Internet” in me. I wanted to learn more about what happened on the back end, both so I had a better understanding of how our system works and so I could help keep workflow smooth between the departments.</p>
<p>In particular, there were three quotes from <em>The I.T. Crowd </em>that actually helped me accomplish this, by showing me the opposite of how to act:</p>
<ol>
<li>“I like you, Jen. You don’t ask questions.” First, I started keeping lists of the questions that came up for me during the week. This helped me find common themes in what I didn’t understand. Knowing where to start was a challenge, and keeping track of questions that I (or others in my office) had helped me to identify the best place to begin.</li>
<li>“I have it on good authority that if you type ‘Google’ into Google, you’ll break the Internet.” Instead of listening to random hearsay, I went to Meetup events where people were discussing some of these same themes – search engine functionality became a popular topic, and it actually led me to groups and events about SEO.</li>
<li>“I have a lot of experience with the whole computer…thing. You know, emails, sending emails, receiving emails, deleting emails. I could go on.” I brought some friends with me to these panels, and together we made sure to take a more hands-on approach to getting better acquainted with code by signing up for <a href="http://www.codecademy.com/#!/exercises/0">Codecademy</a>, <a href="http://teamtreehouse.com/">Team Treehouse</a>, or a few other free online intro classes through several universities.</li>
</ol>
<p>As painful as it is to admit, this was also a great example of the benefits (albeit small) of T.V. in everyday life. Who knows, maybe next Liz Lemon will teach me the best way to leave a meeting.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/this-jen-is-the-internet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Startup lessons learned in the 2012 NYC Marathon</title>
		<link>http://blog.scripted.com/staff/startup-lessons-learned-in-the-2012-nyc-marathon/</link>
		<comments>http://blog.scripted.com/staff/startup-lessons-learned-in-the-2012-nyc-marathon/#comments</comments>
		<pubDate>Tue, 11 Dec 2012 23:15:00 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Business Strategy]]></category>
		<category><![CDATA[Staff]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1951</guid>
		<description><![CDATA[Before Hurricane Sandy hit New York, I expected to run 26.2 miles through all five boroughs of New York City, starting in Staten Island and ending in Manhattan. I was volunteering as a guide for a disabled athlete with Achilles, an international organization that helps athletes with disabilities complete long &#8230;]]></description>
				<content:encoded><![CDATA[<p>Before Hurricane Sandy hit New York, I expected to run 26.2 miles through all five boroughs of New York City, starting in Staten Island and ending in Manhattan. I was volunteering as a guide for a disabled athlete with Achilles, an international organization that helps athletes with disabilities complete long races.</p>
<p>The day before I arrived in New York, Mayor Bloomberg cancelled the race, reversing his vocal support of the marathon only two days earlier. I was relieved. I did not train nearly as much as I did last year, and I knew that I wasn’t in the best shape to wake up at 5am and suffer through six hours of road running.</p>
<p>When I arrived in New York on Saturday afternoon, my friend and fellow guide casually asked me how I felt about running the marathon anyway. “Sure,” I replied, not taking him seriously. When he paid for lunch and said, “Cool, now I can guilt-trip you even more,” a brief moment of dread passed over me. Oh man, I thought. He wasn’t joking.</p>
<p>The next day at 10am I was standing at the original marathon’s finish line in Central Park, where several thousand marathon refugees decided to run 26.2 miles anyway in 4.25 loops around Central Park. I met my friend’s athlete there, a tall man named EJ Scott, who suffers from a degenerative eye disease. He was going to run the course blind, guided only by a small rolled-up towel about the size you’d pick up at a gym, held between EJ and his guide.</p>
<p>I was to flank on his right; the towel and guide were on his left. Over the next 6 hours, in order to help the time pass and preserve my sanity (it’s a real mental challenge to run a 6-mile loop four times), I thought about how this experience applies to startups. Here’s my list.</p>
<p>WHAT STARTUPS CAN LEARN FROM THE 2012 NYC MARATHON</p>
<ol>
<li><strong>The people in charge can screw up</strong>. Whether it’s your investors, advisors, or largest clients, just because they have money and/or influence doesn’t mean they’re always right. Mayor Bloomberg screwed up by saying the marathon was still on two days before he canceled it.</li>
<li><strong>Consumers will find a way to get what they want</strong>. There are about 50,000 customers of the New York City marathon every year, and you can’t tell that many people yes and then no when thousands of dollars and hundreds of hours were invested in the product. It took only a few hours for the alternate route around Central Park to get distributed on Facebook and Twitter.</li>
<li><strong>Breaking up the monotony makes it easier</strong>. Finally, a note about running. Our athlete preferred to run a 10:1 split, meaning that for every 10 minutes of running, he takes a minute to walk. When he said this, I did the math in my head: 6 hours, 6 times per hour, that means starting and stopping a timer 36 times. More repetition! Ugh! In fact, this technique made the marathon bearable. Instead of thinking about how we had two laps to go, I thought, alright, only six more splits until our last lap. Thinking in ten-minute chunks actually made the six hours go by faster.</li>
</ol>
<p>They say that building a company is a marathon and not a sprint. And in running they say a marathon is a 6-mile race with a 20-mile warm-up. Combine the two and I think you have a very compelling picture of what it’s like to build a company.</p>
<p>It starts slow. You may be working like a dog, but user growth is slower than your projections, customers don’t come as fast as you’d hoped, and fundraising’s not easy either. But you slog through the lows and relish the highs. You’re joined by competitors and partners, some of whom you like and others you don’t. You keep running. Finally you reach a critical point right around mile 20. Runners call it “the wall.” Here you have a choice, either summon all your guts and break through it, or crumble and sit on the sidelines.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/startup-lessons-learned-in-the-2012-nyc-marathon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scripted App for HootSuite</title>
		<link>http://blog.scripted.com/staff/scripted-app-for-hootsuite/</link>
		<comments>http://blog.scripted.com/staff/scripted-app-for-hootsuite/#comments</comments>
		<pubDate>Thu, 06 Dec 2012 19:00:09 +0000</pubDate>
		<dc:creator>Sunil</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Staff]]></category>
		<category><![CDATA[content management]]></category>
		<category><![CDATA[content strategy]]></category>
		<category><![CDATA[hootsuite]]></category>
		<category><![CDATA[scheduled content]]></category>
		<category><![CDATA[social media strategy]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1934</guid>
		<description><![CDATA[Content automation is made even simpler when scalable content meets a social media management system. This is why we created the Scripted app for HootSuite, a tool that effectively closes the loop from content production to distribution and management. HootSuite is a social media management system that allows businesses and &#8230;]]></description>
				<content:encoded><![CDATA[<p>Content automation is made even simpler when scalable content meets a social media management system. This is why we created the Scripted app for HootSuite, a tool that effectively closes the loop from content production to distribution and management.</p>
<p>HootSuite is a social media management system that allows businesses and organizations to manage their social media outlets in bulk from one simple dashboard. You can view your scheduled posts for the day or month, make amendments to the posting schedule, and even re-tweet, message, and respond to posts directly from the HootSuite dashboard.</p>
<p><a href="http://blog.scripted.com/wp-content/uploads/2012/12/image01.png"><img class="alignnone size-medium wp-image-1939" title="image01" alt="" src="http://blog.scripted.com/wp-content/uploads/2012/12/image01-269x300.png" width="269" height="300" /></a></p>
<p>With the Scripted app for HootSuite, you can now view and publish Scripted content seamlessly from the HootSuite dashboard. Posts are visible in batches all on one page, allowing for automated streaming and simple distribution. No more wrestling with Excel sheets and fixing formatting.</p>
<p>For real-time posts, click “Send Now”, and your post will be published immediately to one or more of your social media profiles. Or, select the calendar button to schedule posts in advance.</p>
<p><a href="http://blog.scripted.com/wp-content/uploads/2012/12/image02.png"><img class="alignnone size-medium wp-image-1940" title="image02" alt="" src="http://blog.scripted.com/wp-content/uploads/2012/12/image02-300x300.png" width="300" height="300" /></a></p>
<p><strong>Additional Features in HootSuite:</strong></p>
<ul>
<li>Receive custom analytics on the performance of your posts. See how Scripted content stacks up to your other content.</li>
<li>Manage multiple Twitter and Facebook accounts. Use Scripted content to promote your day business, volunteer organizations, or for personal branding.</li>
<li>View your mentions, inbox, and Tweets/Facebook posts from one tool, designed for simple social media management.</li>
</ul>
<p>The Scripted app fuses the high-quality content-production capacity of Scripted with the content distribution and scheduling capacity of HootSuite. To install the Scripted app and see first-hand how it works, log in to your HootSuite account today and visit the <a href="http://hootsuite.com/app-directory">App Directory</a>.</p>
<p><a href="http://blog.scripted.com/wp-content/uploads/2012/12/image03.png"><img class="alignnone size-full wp-image-1938" title="image03" alt="" src="http://blog.scripted.com/wp-content/uploads/2012/12/image03.png" width="262" height="279" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/scripted-app-for-hootsuite/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hacking for Sales, Part 3</title>
		<link>http://blog.scripted.com/staff/hacking-for-sales-part-3/</link>
		<comments>http://blog.scripted.com/staff/hacking-for-sales-part-3/#comments</comments>
		<pubDate>Tue, 04 Dec 2012 19:58:45 +0000</pubDate>
		<dc:creator>Ryan</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Staff]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[crunchbase]]></category>
		<category><![CDATA[hacking for sales]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1851</guid>
		<description><![CDATA[In Part 1 and Part 2, we covered techniques for how to scrape data from pages on Crunchbase. Well, I have news for you, and I don&#8217;t want you to get upset. There&#8217;s a much easier way to do it, one that requires far less time. Instead of loading pages &#8230;]]></description>
				<content:encoded><![CDATA[<p>In <a title="Hacking for Sales, Part 1" href="http://blog.scripted.com/staff/hacking-for-sales-part-1/">Part 1</a> and <a title="Hacking for Sales, Part 2" href="http://blog.scripted.com/staff/hacking-for-sales-part-2/">Part 2</a>, we covered techniques for how to scrape data from pages on Crunchbase. Well, I have news for you, and I don&#8217;t want you to get upset. There&#8217;s a much easier way to do it, one that requires far less time. Instead of loading pages and scraping them, we&#8217;ll instead use the <a href="http://developer.crunchbase.com/">Crunchbase API</a>.</p>
<p>What&#8217;s an API? It stands for Application Programming Interface, but you should think of it more as a drive-thru window. If the inside of a McDonald&#8217;s is the website or web app itself, then the drive-thru with its (sometimes) limited menu and streamlined experience is the API. Since we&#8217;re bypassing the user interface altogether, everything becomes much easier to program.</p>
<h3>Step 1.</h3>
<p><a href="http://developer.crunchbase.com/member/register">Register</a> for the API. If you don&#8217;t already have a Mashery account, you may need to create one. This is easy and free.</p>
<h3>Step 2.</h3>
<p>Save your key into a local .py file. Again, you can use any text editor, but there are many free text editors for writing code. Save this file (call it crunchbase_api.py or something like that) in a folder on your computer. We&#8217;ll write the rest of the code in this file too. And while you&#8217;re at it, let&#8217;s import the libraries you&#8217;ll need: json and urllib.</p><pre class="crayon-plain-tag">import json, urllib
key = 'not_gonna_show_you_my_key'</pre><p></p>
<h3>Step 3.</h3>
<p>Prepare to be amazed! This is going to be so much easier than the last two.</p>
<p>First, let&#8217;s noodle through the <a href="http://developer.crunchbase.com/docs">documentation</a> a little bit. You should actually read through this because it&#8217;s written in English and will give you a sense of how APIs work. You might notice that working with the API still involves URLs. Indeed, most APIs are little more than http calls to websites, which return data in plain structured text.</p>
<p>You might also have noticed a new acronym: JSON. This sounds scarier than it is (for its length, not the <a href="http://sports.gunaxin.com/13-goalie-masks-in-pop-culture/57668">hockey-masked killer</a>). You should rejoice whenever your API returns JSON data, because it&#8217;s super easy to use and Python ships with a built-in library to interpret it. I&#8217;ll explain more when we get our first response.</p>
<p>Let&#8217;s start by listing some advertising companies.</p><pre class="crayon-plain-tag">url = 'http://api.crunchbase.com/v/1/search.js?query=advertising'
response = urllib.urlopen(url).read()
result = json.loads(response)</pre><p>If you view the result, you&#8217;ll just see a lot of text. But buried within here are signs that the text is structured and iterable. You should note the locations of the commas, colons, square brackets, and curly brackets. They all play an important role in parsing this data. I&#8217;ll show you what I mean:</p><pre class="crayon-plain-tag">for r in result:
	print r</pre><p>You should see this output:</p>
<p>total<br />
crunchbase_url<br />
page<br />
results</p>
<p>Now go back and print the result (type &#8220;result&#8221; and hit enter). Scroll to the top of that block of text. Do you see &#8220;{u&#8217;total&#8217;: 7646, u&#8217;crunchbase_url&#8217;: u&#8217;http://www&#8230;&#8221;? Once parsed, JSON is basically a Python dictionary, or a hash of key-value pairs. Here, &#8216;total&#8217; is a key and 7646 is the value. To get only the total out of result, simply type result['total'] and press enter. Pretty intuitive, right? You get the Crunchbase URL the same way.</p>
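<p>If you want to try the idea without calling the API, here is a minimal offline sketch written for Python 3 (the post itself uses Python 2). The JSON text is a made-up stand-in for a real response, trimmed to two keys, so treat the values as illustrative only:</p>

```python
import json

# Hypothetical sample standing in for the raw text an API call returns.
sample_response = '{"total": 7646, "page": 1}'

# json.loads turns the JSON text into an ordinary Python dictionary.
sample_result = json.loads(sample_response)

# Iterating over a dictionary yields its keys.
for k in sample_result:
    print(k)

# Look up a value by its key, exactly like any other dictionary.
print(sample_result['total'])
```

<p>The names sample_response and sample_result are invented here so they do not clash with the variables used in the walkthrough.</p>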
<p>JSON results can be deeply nested, as seen in result['results']. The value of the key &#8216;results&#8217; is a list of exactly 10 companies and their data (I calculated this with len(result['results'])). This is, of course, iterable:</p><pre class="crayon-plain-tag">for r in result['results']:
	print r</pre><p>Let&#8217;s just take one of them and store the name so we can look up the juicy data we really want.</p><pre class="crayon-plain-tag">company = result['results'][0]
name = company['name']</pre><p>Here I took the first company in the list of results and then stored the value of &#8216;name&#8217; in a variable called name. If you print &#8220;company&#8221;, you&#8217;ll see that company is itself a JSON-type dictionary object. That&#8217;s why I can use the ['name'] index to pull the name data I need for the next step.</p>
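<p>To make the nesting concrete, here is a small Python 3 sketch on invented data shaped like the search result: a dictionary whose 'results' key holds a list of dictionaries. The two company names are only examples, and the variable names are assumptions chosen for illustration:</p>

```python
# Hypothetical data shaped like the search response: a dict holding a list of dicts.
sample = {
    'total': 2,
    'results': [
        {'name': 'Commuter Advertising'},
        {'name': 'qubed advertising'},
    ],
}

# Number of companies in the list.
count = len(sample['results'])

# First company in the list, then its name.
first_name = sample['results'][0]['name']

# Collect every name in one pass with a list comprehension.
names = [c['name'] for c in sample['results']]

print(count, first_name, names)
```

<p>Each extra pair of square brackets steps one level deeper: first into the dictionary, then into the list, then into the company's own dictionary.</p>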
<h3>Step 4.</h3>
<p>Finally, the good stuff.  You might have noticed we haven&#8217;t used the API key yet. Now we will, and I&#8217;m going to show you another cool thing about Python.</p><pre class="crayon-plain-tag">qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?key=%s' % (urllib.quote(name), key)
qry_response = urllib.urlopen(qry_url).read()
qry_result = json.loads(qry_response)</pre><p>Alright. What&#8217;s up with the % signs? These are placeholders for inserting variables into a string. The first %s is where our URL-friendly company name goes, and the second %s takes the key. This structure is common in modern languages, so take note. The % between the string with the %s&#8217;s and the variables is the operator that tells Python to perform the substitution.</p>
<p>The urllib.quote function replaces the space in company['name'] with safe %20 characters. Try typing urllib.quote(name) in the Python prompt and you&#8217;ll see what I mean.</p>
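<p>In Python 3 the same function lives at urllib.parse.quote. Here is a short sketch of the quoting and the %s substitution together, using a placeholder key and the URL pattern from above (nothing is actually requested):</p>

```python
from urllib.parse import quote

# Placeholder values for illustration; substitute your own key.
name = 'Commuter Advertising'
key = 'YOUR_API_KEY'

# quote() percent-encodes characters that are unsafe in a URL path, such as spaces.
safe_name = quote(name)

# Each %s is filled, in order, by the values in the tuple after the % operator.
qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?key=%s' % (safe_name, key)

print(safe_name)
print(qry_url)
```

<p>Quoting the name before substituting it keeps spaces and other special characters from breaking the URL.</p>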
<p>And now.. drumroll&#8230; go ahead and print qry_result. There&#8217;s our data! More specifically:</p><pre class="crayon-plain-tag">qry_result['email_address']
qry_result['blog_url']
qry_result['phone_number']</pre><p>Just like that, juicy, actionable sales data.</p>
<h3>Step 5.</h3>
<p>Here&#8217;s how we&#8217;d iterate through all the advertising companies and store the data we want into a list of our own.</p><pre class="crayon-plain-tag">my_results = []
for r in result['results']:
	sales_data = {}
	name = r['name']
	qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?key=%s' % (urllib.quote(name), key)
	qry_response = urllib.urlopen(qry_url).read()
	qry_result = json.loads(qry_response)
	sales_data['email'] = qry_result['email_address']
	sales_data['blog'] = qry_result['blog_url']
	sales_data['phone'] = qry_result['phone_number']
	my_results.append(sales_data)</pre><p>First, create the empty list we&#8217;ll fill up with data. Then, when we iterate through the results, we&#8217;re going to create a small dictionary for each company. sales_data starts each loop empty and fills up with the email, blog, and phone number of each business in result['results'].</p>
<p>But darn, if you run this, you&#8217;ll get this error: KeyError: &#8216;email_address&#8217;. That means one of my results had no email_address key. So, we&#8217;ll have to check for it before we pull it into our sales_data dictionary.</p><pre class="crayon-plain-tag">for r in result['results']:
	sales_data = {}
	name = r['name']
	print "Running", name
	qry_url = 'http://api.crunchbase.com/v/1/company/%s.js?key=%s' % (urllib.quote(name), key)
	qry_response = urllib.urlopen(qry_url).read()
	qry_result = json.loads(qry_response)
	sales_data['company'] = name
	if 'email_address' in qry_result:
		sales_data['email'] = qry_result['email_address']
	if 'blog_url' in qry_result:
		sales_data['blog'] = qry_result['blog_url']
	if 'phone_number' in qry_result:
		sales_data['phone'] = qry_result['phone_number']
	my_results.append(sales_data)</pre><p>There we go. Instead of having to scrape using BeautifulSoup, we can use urllib and json to make API calls and interpret the responses. If all goes well, you should see something like this:</p>
<p>[{'blog': '', 'phone': u'937.531.6631', 'company': u'Commuter Advertising', 'email': u'sparker@commuter-advertising.com'}, {'blog': u'http://www.qubed.us', 'phone': '', 'company': u'qubed advertising', 'email': u'oz@qubed.ro'}, {'blog': u'http://blog.mpression.net/', 'phone': u'+44 (0) 870 235 4042 ', 'company': u'4th Screen Advertising', 'email': u'info@4th-screen.com'}, {'blog': u'http://prova.com/blog/', 'phone': '', 'company': u'Prova | Advertising', 'email': u'support@prova.com'}, {'blog': u'http://hiliteadvertising.com/blog/', 'phone': u'877-457-5837', 'company': u'HiLite Advertising', 'email': u'info@hiliteadvertising.com'}, {'blog': '', 'phone': u'+91-011-26197623', 'company': u'WorldWide Advertising Network Private Ltd', 'email': u'info@worldwideadvertisingnetwork.com'}, {'blog': '', 'phone': u'401-272-1122', 'company': u'Creative Circle Advertising Solutions', 'email': u'bill@creativecirclemedia.com'}, {'company': u'VLG Advertising'}, {'blog': u'http://www.17stories.com/', 'phone': u'(512) 532-2907', 'company': u'Tocquigny Advertising &amp; Interactive', 'email': u'awinsett@tocquigny.com'}, {'blog': u'http://www.blackdogadvertising.com/miami-advertising-blog/', 'phone': '', 'company': u'BlackDog Advertising', 'email': ''}]</p>
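<p>The per-key checks can also be folded into dict.get, which returns a default instead of raising KeyError when a key is missing. Below is a Python 3 sketch on made-up records. The field names follow the ones used above, while the values and the extract helper are invented for illustration:</p>

```python
# Hypothetical company records; the second one is missing most fields.
records = [
    {'name': 'Example Co', 'email_address': 'info@example.com', 'phone_number': '555-0100'},
    {'name': 'Sparse Co'},
]

def extract(record):
    """Pull the sales fields out of one record, using '' when a field is absent."""
    return {
        'company': record.get('name', ''),
        'email': record.get('email_address', ''),
        'blog': record.get('blog_url', ''),
        'phone': record.get('phone_number', ''),
    }

my_results = [extract(r) for r in records]
print(my_results)
```

<p>One trade-off of this approach: every entry ends up with the same four keys, which tends to make the list easier to export later, at the cost of storing empty strings for missing data.</p>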
<p>There are hundreds of great APIs to explore for sales, including Yelp, LinkedIn, and Twitter. I&#8217;ll follow up with additional posts on each of these!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/hacking-for-sales-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Your Blog and Social Media Channels Can Work Together</title>
		<link>http://blog.scripted.com/staff/how-your-blog-and-social-media-channels-can-work-together/</link>
		<comments>http://blog.scripted.com/staff/how-your-blog-and-social-media-channels-can-work-together/#comments</comments>
		<pubDate>Mon, 26 Nov 2012 22:00:25 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Business Strategy]]></category>
		<category><![CDATA[Staff]]></category>
		<category><![CDATA[business blog]]></category>
		<category><![CDATA[content strategy]]></category>
		<category><![CDATA[social media]]></category>
		<category><![CDATA[social media strategy]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1796</guid>
		<description><![CDATA[A common problem that arises with potential Scripted clients is they claim that they need a social media strategy.  Their competitors are more present in social media and they aren’t doing anything. In order to have success with social media, it&#8217;s often important to maintain an active company blog. Blogging &#8230;]]></description>
				<content:encoded><![CDATA[<p>A common problem that arises with potential Scripted clients is they claim that they need a social media strategy.  Their competitors are more present in social media and they aren’t doing anything. In order to have success with social media, it&#8217;s often important to maintain an active company blog.</p>
<p><strong>Blogging is Still Core</strong></p>
<p>It can be very difficult for a company to have a social media strategy without having a blog. After all, social media posts are too short form and don’t provide enough content to keep visitors engaged. With a blog, you can create a base camp for all of your content that is always connected to the core of your content strategy.  It&#8217;s the perfect place to demonstrate your company’s thought leadership while allowing for readers to register for white papers or download ebooks. Then, by utilizing Facebook, Twitter, and LinkedIn you can easily leverage your content by promoting new content.</p>
<p><strong>Don’t Be So Self-Serving on Twitter </strong></p>
<p>Once companies have a steady amount of content they assume that they only need to use their social media channels to announce new blog posts, promos, or services. While this is a good strategy, if social media followers only see self-promotion posts they will get bored and not click on your links. A great solution to this is to use a variation of the <a href="http://tippingpointlabs.com/2009/07/01/twitter-is-dead-long-live-twitter/">4-1-1 strategy</a> where 4 posts are tweets on other original content, 1 post is a re-tweet, and 1 post is self-serving.</p>
<p><strong>Post Content Outside Your Niche on Facebook</strong></p>
<p>With Facebook, the same strategy applies; however, it&#8217;s important to realize that not every post needs to be relevant to your company’s niche. Post photos, current events, funny videos, and other intriguing posts that people want to click on and view. This is important because Facebook’s algorithm for showing a company’s posts to its followers takes into account the click-through rate (CTR) of that company’s posts. With this in mind, a company does not need to post only serious subject matter or call-to-action promos. A balance of each will lead to more views of the posts that actually need to be seen.</p>
<p><strong>Publish Consistently</strong></p>
<p>Lastly, it’s important to publish consistently. Some experts may even say that publishing consistently is more important than volume. When you publish a new blog post once a day or even once a week, loyal visitors can depend on finding new content on a regular schedule. This increases repeat visits and grows organic traffic and pageviews. With Twitter, companies can schedule their tweets to be published at a certain time every day.</p>
<p>A company’s blog is the hub for all content and information while a company’s social media channels are the hub for promotional announcements and brand engagement. By balancing self-serving posts with the posting of other relevant and entertaining posts a company’s messages will have a higher chance of being seen, read, and clicked on. Furthermore, by publishing content consistently, a brand will gain loyal repeat visitors and increase organic traffic.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/staff/how-your-blog-and-social-media-channels-can-work-together/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Top Six Tips to Meet Writing Deadlines</title>
		<link>http://blog.scripted.com/writers/top-six-tips-to-meet-writing-deadlines-2/</link>
		<comments>http://blog.scripted.com/writers/top-six-tips-to-meet-writing-deadlines-2/#comments</comments>
		<pubDate>Mon, 19 Nov 2012 23:07:26 +0000</pubDate>
		<dc:creator>Scripted Writers</dc:creator>
				<category><![CDATA[Writers]]></category>
		<category><![CDATA[freelance writing]]></category>
		<category><![CDATA[freelance writing tips]]></category>
		<category><![CDATA[meeting deadlines]]></category>

		<guid isPermaLink="false">http://blog.scripted.com/?p=1628</guid>
		<description><![CDATA[The cultivation of your freelance writing career can be fun and profitable. However, it is a tasking challenge that requires your ability to meet a client&#8217;s demands such as delivery of work on time. Meeting deadlines is an imperative part of writing and is vital for successful writing careers. For &#8230;]]></description>
				<content:encoded><![CDATA[<p>The cultivation of your freelance writing career can be fun and profitable. However, it is a tasking challenge that requires your ability to meet a client&#8217;s demands such as delivery of work on time. Meeting deadlines is an imperative part of writing and is vital for successful writing careers. For many writers, meeting deadlines is a growth process that involves improving writing skills and speed. It is thus essential that you follow these useful tips on how to meet work deadlines effectively:</p>
<ol>
<li><strong>Commit to Meeting Deadlines: </strong>Writing deadlines should be your priority, and you should always work towards delivering your work on time. The commitment to meet deadlines requires early preparation and starting to work on the project early enough. As a writer, get your fingers on the keyboard and start typing immediately after you understand the work content. The more time you waste before starting to work, the more workload piles up and may eventually seem like a monster you want to avoid.</li>
<li><strong>Show Discipline: </strong>Without discipline, focus, and a way to shut out the intrusions that come with working from home, meeting deadlines is impossible. Practicing rigorous self-discipline helps you plan your work first and everything else afterward. For instance, you can set specific work-only hours during which you avoid interruptions such as phone calls or snack breaks. This will improve your concentration and help you maximize your writing potential.</li>
<li><strong>Clear Deadline Communication and Outcomes: </strong>Agreeing with the client on the deadline is enormously useful for the writer. Confirm the deadline&#8217;s time zone and whether extensions are possible. In emergencies, such as a power outage or an illness, you will be expected to contact your client and request a deadline extension.</li>
<li><strong>Break Down Large Projects: </strong>Some projects are too large to tackle in one piece. Break such a project into subsections and allocate a set amount of time to each. Breaking a project down makes the work go faster: as you piece the subsections together, you will find that you have finished sooner than you would have by working on the whole project unbroken.</li>
<li><strong>Research: </strong>Accurate, reliable work comes from thorough research and solid writing skills. However creative your writing, you still have to deliver on time and give readers confidence in the work. Research supplies source material and facts, giving the writer confidence and accurate information, which significantly improves writing speed.</li>
<li><strong>Be Realistic: </strong>Be as realistic as you can about what it takes to meet writing deadlines. Avoid overcommitment, as it undermines your ability to meet deadlines. Do not repeat mistakes such as late submissions or grammatical errors. Do not stay up late when you need to work the next day. Make sure you get a good night&#8217;s sleep so that you are fresh and alert whenever you work.</li>
</ol>
<p>Meeting deadlines in your work can be very rewarding. By continually meeting deadlines, you make it much easier to win future work from past clients and to make the most of your working time.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.scripted.com/writers/top-six-tips-to-meet-writing-deadlines-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
