Whew! That's a lot of acronyms. Expect more coming.
My goal for the internship was to utilize log data about a large webserver named the 'Archiver.' Like most log data, it wasn't being analyzed, just being collected.
The PowerPoint I created for my presentation on this project at the XSEDE'14 Conference summarizes things nicely.
But to go into more detail...I used Hadoop to store the cleaned logs. Hadoop allows multiple machines/nodes to act as if they were a single machine, in a sense. I took advantage of HDFS (the Hadoop Distributed File System). Instead of writing a MapReduce job, I ran Spark on top of the Hadoop cluster. Spark let me read data straight out of HDFS and conveniently build large in-memory data structures called RDDs (Resilient Distributed Datasets), which I could run repeated tasks on. Spark also comes with a machine learning library that can be applied to RDDs.
Data flow @ the PSC for my project |
Collection and cleaning took a substantial chunk of my time. Once I had enough data in HDFS, I moved on to using K-means clustering to examine it. K-means is a commonly used machine learning algorithm that locates k centroids and assigns each of your datapoints to its 'closest' centroid. It allows you to find relationships between dimensions (e.g. filereads vs. CPU) and more! My project's data looked at 5 dimensions of the Archiver: filereads, filewrites, net IO, CPU, and disk IO.
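To make the idea concrete, here's a minimal pure-Python sketch of K-means. It is not the code I ran at the PSC (that went through Spark's machine learning library); the function names and the sample numbers below are made up for illustration, and the only thing taken from the project is the list of five dimensions.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance between two points of equal dimension.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_centroid(point, centroids):
    # Index of the centroid nearest to the point.
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def mean_point(points):
    # Component-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))

def kmeans(points, k, iterations=20, seed=0):
    # Start from k datapoints picked at random (seeded for repeatability).
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group every point under its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its old centroid.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has settled
        centroids = new_centroids
    labels = [closest_centroid(p, centroids) for p in points]
    return centroids, labels

# Hypothetical samples: (filereads, filewrites, net IO, CPU, disk IO).
samples = [
    (10.0, 2.0, 5.0, 0.10, 3.0),
    (12.0, 3.0, 6.0, 0.12, 4.0),
    (11.0, 2.5, 5.5, 0.11, 3.5),
    (200.0, 150.0, 90.0, 0.85, 120.0),
    (210.0, 160.0, 95.0, 0.90, 130.0),
    (205.0, 155.0, 92.0, 0.88, 125.0),
]
centroids, labels = kmeans(samples, k=2)
```

With these made-up numbers the three quiet samples should end up in one cluster and the three busy ones in the other. On real data you would probably want to scale each dimension first, since a dimension with large raw values (like filereads here) otherwise dominates the distance.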
The final step was visualizing the results. For this, I decided to use D3, a JavaScript library for visualizing data in web documents. The user could change which dimensions they wanted to view at a time. Here's what came out of early analysis (shown in only 2 dimensions):
White points are centroids. Colors show each centroid's 'area of influence.' If it looks like some centroids have more area than others, it's because other dimensions that aren't being displayed are influencing the result. |
My experience at the PSC was overwhelmingly positive. I got great working experience using Hadoop, Spark, Python, D3, JavaScript, OpenTSDB, and machine learning algorithms. Definitely challenging but greatly rewarding - and I can tell what I've learned will be useful in the years to come.
>> GitHub repo with most of the code
A poster made for XSEDE'14 and Duquesne Presentations. Take a look! |
FamilyTyes: A RoR rollercoaster
This is a project still in motion. It started at the beginning of summer. A previous (and now disbanded) team worked to create a web application for the organization FamilyTyes. The site was meant to record attendance, keep track of quiz data, and visualize said data. It was to help the organization prove that it had a STEM impact on those enrolled in its classes.
When I came, the site was already largely built...in a sense. The backend was quite solid. The frontend, maybe not so much.
BEFORE:
The site as I saw it for the first time. Deployed too |
That's not good. |
I decided to make the site resemble FamilyTyes' home site; otherwise no one would know that the two sites were closely associated. To make it mobile friendly, I created two views - one for mobile, one for web - using the tools Foundation provided me. I also added some AJAX here and there to make taking roll, etc. faster and more logical.
The new mobile homepage, nav in top left. |
Since this is a WIP and for FamilyTyes, I don't want to go into any more detail. At least, not until the site is officially released! We want to use the system with kids at Baldwin High School starting in October. Let's hope that goes well! ; )