This repository serves as a public homepage for MSBA 6330 - Big Data Analytics at Carlson School, University of Minnesota. It also hosts the syllabus and FAQs.
- Install Spark Updates on Cloudera VM: if you run into issues, here are suggestions for diagnosis
- How to edit Linux text/script files
- How to convert Windows line endings to Linux ones
- How to debug Hadoop streaming programs outside of Hadoop
- How to Debug MapReduce Jobs
- Install Apache Spark on your own computer: Install an Apache Spark virtual machine with many features on your own computer (but without Hadoop or Hive). It may take a while (e.g., 1 hour) to have everything ready.
- Use Databricks Community Edition for Spark: Databricks provides a single-node Spark cluster for free. It is quite easy to get started with its Jupyter-like notebook environment.
- Mount S3 folder in Databricks
- Common Issues with Running PySpark: Addresses a few common issues with running PySpark on Cloudera VM.
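
As a quick illustration of the line-ending topic in the list above, here is a minimal Python sketch. It is not taken from the linked guide, and the function names `to_unix_line_endings` and `convert_file` are hypothetical. It assumes the script file is small enough to read into memory at once.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with Linux (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with LF line endings."""
    p = Path(path)
    # Work on raw bytes so Python does not translate newlines on its own.
    p.write_bytes(to_unix_line_endings(p.read_bytes()))
```

For example, `convert_file("mapper.py")` (a hypothetical file name) would rewrite a script edited on Windows so that its line endings are LF only. Files that already use LF endings are left unchanged.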