Quick hack for dealing with EC2

Recently, I needed to run a load test on a set of 100 nodes in EC2. I wanted a really quick way of sending commands to all 100 nodes without spending more than 30 minutes on setup. Sure, I could have used Python & boto to write some helper scripts, but as I mentioned, it was a one-off thing and I wanted to get it done quickly. Since my boss is a big fan of iTerm, I gave it a shot: it supports sending keyboard input to all open tabs. However, I didn't want to copy-paste 100 instance URLs, so I learned a tiny bit of AppleScript to help me out. The resulting script asks the user for an input file that contains the public DNS name of each instance, one per line. You can launch it from your terminal with 'osascript launch_instances' - the code is here: 

Keyword extraction using Lexical Chains

Today I'm posting about keyword extraction with Lexical Chains - it's something I first looked into during college, but it has resurfaced recently and I've used it for a couple of projects. The original paper I read is a couple of years old: it's called "Efficient text summarization using lexical chains" and was written by Gregory Silber and Kathleen McCoy in 2000. A link to the text can be found here.

You may wonder why you would want to generate keywords from text in the first place. Some sample use cases are listed below: 
* generate tag-clouds (for richer UI experience)
* relate content to keyword-based advertising
* identify key-search terms for indexing

Traditionally, people use Term Frequency / Inverse Document Frequency (TF/IDF) for this task, which works fairly well for many applications. TF/IDF gives a weight to each term within a given document based on how frequent the term is in that document and on how many documents overall contain it (e.g. how often does the term "dog" appear in article X, and how many articles contain the word "dog"?). However, this isn't really the best solution (arguably, LCs aren't the best solution either). My reasoning is that you'll often have noisy data - different documents might use different vocabulary (e.g. people using slang) - so you might end up with really useless data. Also, very expressive authors will rarely use the same word twice but will often use a myriad of synonyms. That isn't really picked up by TF/IDF, and this is exactly the strength of Lexical Chains. 
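To make the synonym problem concrete, here is a minimal TF/IDF sketch in Python. It assumes the simplest textbook weighting (term frequency times the log of the inverse document frequency) and whitespace tokenization; the function name tf_idf and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term in the document / terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Count in how many documents each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(len(docs) / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample documents: three ways of talking about the same animal.
docs = [
    "the dog chased the ball",
    "the hound chased the stick",
    "the canine sleeps",
]
weights = tf_idf(docs)
```

In this sketch "the" gets a weight of zero because it appears in every document, while "dog", "hound" and "canine" are each weighted as unrelated rare terms. Nothing in the weighting knows that they refer to the same concept.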

Lexical chains are chains of words that share some relationship and co-occur, mostly limited to a document or paragraph. Let's say for example you have the following text: 

"Today we drove our new car, a convertible, to the lake. We just added another vehicle to the collection at our vacation home - we bought a new boat!" (random rich person)

The nouns of this text fragment are "car", "convertible", "lake", "vehicle", "collection", "vacation_home" and "boat". My mental parser has identified that this text is about vehicles, but how can this be extracted algorithmically? The answer is with a taxonomy. WordNet is one such example: it organizes words or concepts according to their relationship with one another. The included graphic shows what the 'type-of' relationship in WordNet looks like. You can see that "vehicle" has two children, "car" & "boat" (siblings), and "car" has a child (hyponym), "convertible". This is the basis for Lexical Chains. 

Taxonomy

Lexical Chains are usually constructed of nouns and possibly of verbs - though let's focus on nouns. This means you'll need some sort of part-of-speech tagger to determine the likelihood of a given word being a noun, which is computationally expensive in comparison with TF/IDF. The tagger will assign each word a tag (often Penn Treebank tags are used, http://www.cis.upenn.edu/~treebank/), and we'll disregard everything that's not a noun. 
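As a small sketch of the filtering step in Python: assuming the tagger has already produced (word, tag) pairs with Penn Treebank tags, the noun tags are NN, NNS, NNP and NNPS, so keeping everything whose tag starts with "NN" is enough. The helper name nouns_only and the hand-tagged sample are made up for illustration; no real tagger is run here.

```python
def nouns_only(tagged_tokens):
    """Keep the words whose Penn Treebank tag is a noun tag (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Hand-tagged sample standing in for real tagger output (hypothetical).
tagged = [
    ("We", "PRP"), ("drove", "VBD"), ("our", "PRP$"), ("new", "JJ"),
    ("car", "NN"), ("to", "TO"), ("the", "DT"), ("lake", "NN"),
]
nouns = nouns_only(tagged)
```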

Here is the algorithm in pseudo-code: 

Allocate a hashtable HT
For each noun n in document D
  get all P (Synonyms, Hypernyms, Hyponyms, Siblings) of n from WordNet
  found = false
  For each p in P
    If HT contains LC(p)
      append n to LC(p) and increment LC(p).score by
        (1 if rel(n,p) == synonym, 0.7 if rel(n,p) is hypernym or hyponym, 0.5 if rel(n,p) is sibling)
      found = true
  If ! found
    create new LC(n) in HT and set LC(n).score = 1
For HT compute MEAN and STDEV based on each LC(x).score
return all LC in HT where LC.score > (MEAN + X * STDEV)
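Here is one way the pseudo-code above could look in Python. It is a sketch under several assumptions, not the implementation I used: a tiny hand-written 'type-of' table (PARENT) stands in for WordNet, a word counts as its own synonym, a noun is appended to a chain only once even if several of its relatives hit that chain, the population standard deviation is used, and X defaults to 1. All names here are made up for illustration.

```python
from statistics import mean, pstdev

# Toy 'type-of' taxonomy standing in for WordNet: child -> parent (hypothetical data).
PARENT = {
    "car": "vehicle",
    "boat": "vehicle",
    "convertible": "car",
    "lake": "body_of_water",
    "vacation_home": "house",
    "collection": "group",
}

# Score increments from the pseudo-code.
WEIGHTS = {"synonym": 1.0, "hypernym": 0.7, "hyponym": 0.7, "sibling": 0.5}

def related(noun):
    """Map each word related to `noun` to the kind of relationship."""
    rel = {noun: "synonym"}  # assumption: a repeated word is its own synonym
    parent = PARENT.get(noun)
    if parent is not None:
        rel[parent] = "hypernym"
    for child, child_parent in PARENT.items():
        if child_parent == noun:
            rel[child] = "hyponym"
        elif parent is not None and child_parent == parent and child != noun:
            rel[child] = "sibling"
    return rel

def lexical_chains(nouns, x=1.0):
    """Return the chains whose score exceeds MEAN + x * STDEV."""
    chain_of = {}  # the hashtable HT: word -> chain that contains it
    chains = []
    for n in nouns:
        found = False
        for p, relation in related(n).items():
            chain = chain_of.get(p)
            if chain is None:
                continue
            if n not in chain["words"]:
                chain["words"].append(n)
            chain["score"] += WEIGHTS[relation]
            chain_of.setdefault(n, chain)
            found = True
        if not found:
            chain = {"words": [n], "score": 1.0}
            chains.append(chain)
            chain_of[n] = chain
    if not chains:
        return []
    scores = [c["score"] for c in chains]
    threshold = mean(scores) + x * pstdev(scores)
    return [c for c in chains if c["score"] > threshold]

# The nouns from the sample text, in order of appearance.
nouns = ["car", "convertible", "lake", "vehicle", "collection", "vacation_home", "boat"]
top = lexical_chains(nouns)
```

With this toy taxonomy, "car", "convertible", "vehicle" and "boat" end up in one chain that clears the threshold, while "lake", "collection" and "vacation_home" each stay alone in a chain with score 1 and are filtered out.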

As you can see, once you have set up a taxonomy (e.g. WordNet) and the preprocessing, this algorithm is very simple to implement. I've done it with WordNet, but I've also done experiments using Wikipedia as a taxonomy, with a simplified scoring that was just based on a radial search (since Wikipedia doesn't give you the nature of the relationship). 

At a previous startup I worked at (HeadCase), we used a similar approach to generate tag-clouds. In my opinion, it works best on conversational text such as IM, comment chains on websites, etc.

Cross-site scripting with hidden iframes and jQuery

Recently, I've been doing a little bit of JavaScript and have enjoyed playing with jQuery. I don't write JS very often, so maybe what I'm describing in this blog entry is common knowledge to all of you (I beg your pardon...), but it took me the better part of an afternoon, so I figured I'd share my findings... 

For one application, I had to get around the cross-site scripting restrictions of Mozilla and co. I essentially wanted to append my own elements and overlays to arbitrary HTML pages, and then send data to one of my servers. While JSONP lets me url-encode parameters and would thus allow me to fake an HTTP POST, I really didn't want to deal with additional transformers within my REST endpoint - I just wanted to continue sending straight JSON data. 

Sure, I could put up an HTML form and just POST the data... but then again, the browser didn't play nice and I couldn't override the encoding of the form to "application/json". Thus, I went for a hidden iframe that was loaded from my server, but I needed to inject data into the iframe - commonly that's done by url-encoding parameters into the src of the iframe. I didn't want to do this (the URL has browser-specific length restrictions and besides, it's ugly; well, I guess my approach is ugly too :) ) and saw that I could just 'abuse' the "name" attribute of the iframe and serialize my data into it. The attributes of an iframe that document-x (from server x) creates and that is filled with content from document-y (from server y) can be accessed from either document. At first, I thought that I needed to base64-encode the data for this to work, but it seems the browser does most of the work for me, and I can just set the name to my JavaScript object serialized to JSON (thanks jQuery!). 
As soon as the iframe loads, it reads the data out of the name attribute and submits it via POST using the jQuery .ajax(...) method. By setting the content type to application/json, I can lazily rely on Jackson to map my JSON object into the corresponding Java data object - thanks to Guice, Jackson, Jetty and Jersey, this is really not much work.

The attached picture outlines the described method... 

Iframe

Cheers

ESTA.us - a scam site

Recently, someone I know was fooled into believing that http://esta.us is a site affiliated with the government. It is actually a scam site that tries to trick people into paying money for "information" about the new ESTA system, which is the registration required for foreign nationals who can enter the US under the visa waiver program. esta.us looks like an official page but is just another web scam. 

Warning: esta.us is a trap, the site has nothing to do with the US government!

Stocks

Looking at my portfolio today, I think I've done pretty well sticking with Ben Graham's strategy of value investing. Given that I started building out my portfolio in March 2008 and invested 60% of my liquid assets prior to October/November 2008, the results look pretty good. Overall, I'm up from March 2008. While I usually stick to long-term investments, I unloaded YHOO at $14.50, along with a couple of other companies that I felt wouldn't be worth investing in given either bad business decisions or warning signals in their fundamentals, such as a high debt-to-equity ratio - which for me is a no-go.

I am looking at a couple of companies that make up the "risky" part of my portfolio - these include the golf club shaft manufacturer "ALDA", a communications firm "ATGN" and the mobile service provider "LVWR". The financials of these companies might not be super impressive but given liquid asset / market cap ratio and the growth opportunities in their respective sectors these companies seemed like a bargain.

I still firmly believe in two companies - pharma empire "MRK", which seems undervalued compared to its sector, and "MKL", an insurance company that has very savvy management and seems to be fairly cheap as well.

Let's talk about the "bad" investments I've made. I had bet strongly on BRK-B and I'm down quite a bit there, but I will continue to do so: I think Buffett got a good deal with GS and this will be reflected when he sells the warrants. I'm also down with BAC & GE. UAHC (health care) is down too, though I believe they know how to deal with the government (Medicare partner), so should Obama's plan (healthcare reform) come through, they'll have a huge advantage...


I'm up with SAP (software) and FMS. FMS is a company that "owns" the dialysis market - it still seems to be very cheap. That's my biggest bet - I figure that people will always have health problems associated with the wrong diet, and there seems to be a correlation between obesity (which as we know is steadily increasing) and kidney failure. This is backed by a lot of research (#1, #2). What gives me hope for FMS is that people who undergo dialysis want to trust the equipment and facilities - it's not like you're eating at a new restaurant and worst case you'll get some old french fries... A friend who is on dialysis has also confirmed that FMS is a really great company with awesome customer service and highly skilled staff. FMS often owns the entire "supply chain", from manufacturing to "sales" and "maintenance" - they manufacture the equipment such as dialysis machines, they operate the dialysis centers and they train the staff - so people feel safe and return. 

I think the current lift in the stock market will soon end in another crash - but there is no such thing as the "right" time to buy.

--

DISCLAIMER: I am NOT giving any financial advice!

References:
#1 American Society of Nephrology. "Obesity Triples The Risk Of Chronic Kidney Failure." ScienceDaily, 13 May 2006. Accessed 22 September 2009. <http://www.sciencedaily.com/releases/2006/05/060513122553.htm>

#2 Journal of the American Society of Nephrology. "Obesity: What Does It Have to Do with Kidney Disease?" 2004. Accessed 22 September 2009. <http://jasn.asnjournals.org/cgi/content/full/15/11/2768>

R quick start

I have recently gotten more into R - I'm loving it.

R is really cool - apart from being an awesome free tool for all sorts of calculations, it allows rapid analysis / visualization of small to mid-sized data sets (anything with sizeof(data) < mem, which is roughly in the millions, depending on the type of data). It takes a bit to get used to the mathy way of mapping names to vectors - but it's really powerful!

While R can do almost anything (there are so many libraries available!), I do my preprocessing via the unix command line on my MBP. "tr", "awk", "cut" and "sed" are invaluable. 

Let's assume we're hosting a web-app primarily used by mobile users and magically, we know the connection speeds at which content is delivered to our customers. It's weird, because some of our customers complain about page loading and others don't. We don't have much data but we know the average download speed and the maximum download speed of our clients. Also we manually added the zip-code for each client. 

user1,765.3,1498.2,66333
user2,882.9,1200.0,66342
user3,901.2,980.8,77878
user4,587.2,640,77879
user5,1327.5,1924.4,77878
user6,45.2,55.3,23923
user7,22.2,58.3,99993
user8,29.3,44.9,92399
user9,13.3,19.4,23923
user10,12.4,45.3,99992
user11,12.2,23.2,99994
user13,11.4,22.9,99992
user14,66.1,69.9,99972

If you copy this file to your disk (e.g. to "/tmp/conn_speeds.csv"), you can do the following to load the data into R. 

>x <- read.table(file='/tmp/conn_speeds.csv',sep=',')

Then try to run some basic statistical analysis over the data:

>summary(x)
       V1          V2               V3               V4        
 user1  :1   Min.   :  11.4   Min.   :  19.4   Min.   : 23923  
 user10 :1   1st Qu.:  13.3   1st Qu.:  44.9   1st Qu.: 66342  
 user11 :1   Median :  45.2   Median :  58.3   Median : 77879  
 user13 :1   Mean   : 359.7   Mean   : 506.4   Mean   : 77423  
 user14 :1   3rd Qu.: 765.3   3rd Qu.: 980.8   3rd Qu.: 99992  
 user2  :1   Max.   :1327.5   Max.   :1924.4   Max.   : 99994  
 (Other):7

et voilà, there is some basic analysis.

Maybe we come up with the hypothesis that certain areas don't have 3G access yet, and hence there should be clusters around each zipcode with similar speeds. To verify this, we'll run kmeans over the average connection speed and then plot the zip & avg_speed where the color is determined by which cluster each data point lies in.

 

>zip <- x$V4
>avg_speed <- x$V2
>km <- kmeans(avg_speed,2)
>plot(zip,avg_speed,col=km$cluster,lwd=4)

 

 

This is just one example of how useful R is!