12.9.14

Hadoop/Spark and RoR Summer Adventures

I worked at the Pittsburgh Supercomputing Center (PSC) over the summer during an internship sponsored by XSEDE. I also juggled helping my 67-272 (Application Design and Development) professor fix/polish an existing Ruby on Rails (RoR) project.

Whew! That's a lot of acronyms. Expect more coming.

PSC XSEDE: A dip into BigData


My goal for the internship was to utilize log data about a large webserver named the 'Archiver.' Like most log data, it wasn't being analyzed, just being collected.

This powerpoint I created for my presentation about this project @ the XSEDE'14 Conference summarizes things nicely.

But to go more into detail...I used Hadoop to store cleaned logs. Hadoop allows multiple machines/nodes to act as if they were a single machine, in a sense. I took advantage of the HDFS (Hadoop Distributed File System). Instead of writing a MapReduce job, I ran Spark on top of the Hadoop cluster. Spark allowed me to read data straight from the HDFS and conveniently create large in-memory data structures called RDDs (Resilient Distributed Datasets), which I could run repeated tasks on. Spark also came with a machine learning library (MLlib) that could be applied to RDDs.
Data flow @ the PSC for my project
The challenges I bumped into are common for anything 'BigData.' One was getting access to the data, and subsequently having to learn my way around the OpenTSDB (Open Time Series Database) API. Another was cleaning said log data. I wrote a Python script that would query OpenTSDB, create a CSV, and then place it into the HDFS. I ran this script once a day via crontab. Then there was the problem of incorrectly set up collectors - those had to be fixed as well. Of course, bugs were to be expected too.
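
As a rough illustration of the query-to-CSV step (not the original script), here is a minimal sketch. It assumes the JSON shape returned by OpenTSDB 2.x's /api/query endpoint: a list of series, each with a `metric` name and a `dps` map of timestamp to value. The `dps_to_csv` helper and the metric names in the sample are hypothetical.

```python
import csv
import io
import json


def dps_to_csv(response_text):
    """Flatten an OpenTSDB-style query response into CSV text.

    Assumes the response is a JSON list of series, each with a "metric"
    name and a "dps" object mapping timestamps to values.
    """
    series_list = json.loads(response_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp", "metric", "value"])
    for series in series_list:
        # JSON object keys are strings, so sort the timestamps numerically.
        for ts in sorted(series["dps"], key=int):
            writer.writerow([int(ts), series["metric"], series["dps"][ts]])
    return out.getvalue()


# Hypothetical sample response; the metric names are made up for illustration.
sample = json.dumps([
    {"metric": "archiver.cpu", "tags": {},
     "dps": {"1407000060": 0.5, "1407000000": 0.25}},
    {"metric": "archiver.filereads", "tags": {},
     "dps": {"1407000000": 12}},
])

print(dps_to_csv(sample))
```

A file written this way could then be copied into the HDFS with something like `hdfs dfs -put`, and the whole script scheduled from crontab.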

Collection and cleaning took a substantial chunk of my time. When I had enough data in the HDFS, I moved on to using K-means clustering to examine the data. K-means is a commonly used machine learning algorithm which locates k centroids and assigns each of your datapoints to its 'closest' centroid. It allows you to find relationships between dimensions (e.g. filereads vs. CPU) and more! My project's data looked at 5 dimensions about the Archiver: filereads, filewrites, net IO, CPU, and disk IO.
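
To make the idea concrete, here is a small plain-Python sketch of K-means (Lloyd's algorithm). It is only an illustration: the project itself used Spark's machine learning library, not this code, and the sample points (two made-up dimensions standing in for filereads and CPU) are invented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, assignments) after a fixed number of iterations."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its closest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave a centroid in place if it has no points
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Invented 2-dimensional sample data forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

With real data you would normally scale each dimension first, since a dimension with large raw values (say, net IO in bytes) would otherwise dominate the distance.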

The final step was visualizing the results. For this, I decided to use D3 (Data-Driven Documents), a JavaScript library for building data visualizations in the browser. The user could choose which dimensions to view at a time. Here's what came out of early analysis (shown only on 2 dimensions):
White points are centroids. Colors show each centroid's 'area of influence.' If it looks like some centroids have more than others, it's because other dimensions that aren't being displayed are influencing the result.
For the last few steps, I was working on making it possible to call the Spark K-means script from the web page and update what was viewed.

My experience at the PSC was overwhelmingly positive. I got great working experience using Hadoop, Spark, Python, D3, JavaScript, OpenTSDB, and machine learning algorithms. Definitely challenging but greatly rewarding - and I can tell what I've learned will be useful in the years to come.

>> GitHub repo with most of the code

A poster made for XSEDE'14 and Duquesne Presentations. Take a look!


FamilyTyes: A RoR rollercoaster

This is a project still in motion. It started at the beginning of summer. A previous (and now disbanded) team worked to create a web application for the organization, FamilyTyes. This site was to record attendance, keep track of quiz data, and visualize said data. It was to help the organization prove that it had a STEM impact on those enrolled in its classes.

When I came, the site was already mostly built...in a sense. The backend was quite solid. The frontend, maybe not so much.

BEFORE:
The site as I saw it for the first time. It was already deployed, too.
There was a plan to give every student enrolled with FamilyTyes an ID card, and scan said cards to access the system. Cool, but maybe not quickly applicable. So instead, the professor and I decided to make the site mobile friendly (teens have smartphones, right?) and modify the way the whole quizzing process occurred.
That's not good.
Several issues. The current site was not mobile friendly, at all. Also, at some point the site was using Bootstrap, but then switched to Foundation. Foundation was not installed as a gem, but rather was being loaded multiple times in many places. TL;DR: this project's views needed a lot of love.

I decided to make the site resemble FamilyTyes' home site, otherwise no one would know that the two sites were closely associated. To make it mobile friendly, I created two views - one for mobile, one for desktop - using the tools Foundation provided. I also added some AJAX here and there to make taking roll, etc. faster and more logical.

The new mobile homepage, nav in top left.
You can actually look at the site now @ http://ftdev.info/

Since this is a WIP and for FamilyTyes, I don't want to go any further. At least, until the site is officially released! We want to use the system with kids at Baldwin High School starting in October. Let's hope that goes well! ; )

