This course focuses on processing data at large scale using a distributed platform. First, students will learn the functional approach to processing large data sets. Along the way, we will encounter many of the techniques employed in large distributed data-processing systems, such as using common higher-order functions, employing lazy evaluation, and relying on immutable data structures. By the end of the course, students will be familiar with processing large amounts of data in one or more high-level languages (e.g., Python and/or Scala) and working with a number of frameworks for distributed computation (e.g., Hadoop, MapReduce, Spark).
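The functional techniques named above can be sketched in a few lines of plain Python. This is a toy, single-machine illustration of the style, not a distributed job; the data and names are invented for the example:

```python
from functools import reduce

# A lazy pipeline: generator expressions evaluate one element at a
# time, so no intermediate list is ever materialized.
def pipeline(records):
    cleaned = (r.strip().lower() for r in records)   # lazy map
    nonempty = (r for r in cleaned if r)             # lazy filter
    return nonempty

# The input tuple is immutable, so the pipeline cannot mutate its source.
words = pipeline(("  Spark ", "", "HADOOP", "  "))

# Higher-order functions consume the lazy stream only here.
lengths = map(len, words)
total = reduce(lambda a, b: a + b, lengths, 0)
print(total)  # 11 (len("spark") + len("hadoop"))
```

The same map/filter/reduce shape is what Spark generalizes across a cluster, with lazy evaluation deferring work until an action forces it.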
Big Data Technologies: Students learn about technologies and platforms designed to handle large volumes of data efficiently. This may include distributed computing frameworks such as Apache Hadoop and Apache Spark, as well as cloud-based solutions like Amazon Web Services (AWS) Elastic MapReduce (EMR), Google Cloud Dataproc, and Microsoft Azure HDInsight. They learn how to set up and configure these platforms and use them to process, analyze, and store big data.
Data Streaming and Real-Time Analytics: Students learn about techniques for processing and analyzing streaming data in real time. This may involve technologies such as Apache Kafka for data ingestion and messaging, Apache Flink for stream processing, and Apache Storm for real-time analytics. They learn how to build data pipelines that can handle continuous streams of data and perform real-time analytics and monitoring.
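The core idea behind stream processing, grouping an unbounded event sequence into windows and emitting an aggregate per window, can be simulated without any streaming framework. The sketch below implements a tumbling (non-overlapping) count window over an in-memory event list; the event names are invented for the example:

```python
def tumbling_window_counts(stream, window_size):
    """Group an event stream into fixed-size windows and emit an
    aggregate (event count per key) for each completed window."""
    window, results = [], []
    for event in stream:
        window.append(event)
        if len(window) == window_size:
            counts = {}
            for key in window:
                counts[key] = counts.get(key, 0) + 1
            results.append(counts)
            window = []        # start the next window
    return results

events = ["click", "view", "click", "view", "view", "click"]
windows = tumbling_window_counts(events, 3)
print(windows)
# [{'click': 2, 'view': 1}, {'view': 2, 'click': 1}]
```

Flink and Storm provide the same windowed-aggregation pattern, but over distributed, fault-tolerant, genuinely unbounded streams, with time-based as well as count-based windows.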
Distributed Computing and Parallelism: Students learn about distributed computing principles and techniques for parallel processing of large datasets. They study concepts such as MapReduce, parallel computing models, and distributed file systems. They also learn how to design and implement parallel algorithms that can take advantage of distributed computing resources to process data in parallel.
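The MapReduce model mentioned above can be illustrated with the classic word-count example, written here as a single-process sketch in which the map, shuffle, and reduce phases are explicit functions (a real framework would run mappers and reducers on separate machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does before handing each key to a reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all counts for one key.
    return key, sum(values)

docs = ["big data big ideas", "data pipelines"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

Because mappers run independently on their splits and reducers run independently per key, both phases parallelize naturally across a cluster.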
Scalable Machine Learning: Students learn about techniques for scaling machine learning algorithms to handle large datasets efficiently. This may involve distributed machine learning frameworks such as Apache Mahout, MLlib (part of Apache Spark), and TensorFlow Extended (TFX). They learn how to train and deploy machine learning models at scale and optimize their performance for large-scale data processing.
Data Visualization and Exploration: Students learn about techniques for visualizing and exploring large-scale datasets. They study tools and libraries for interactive data visualization, such as Plotly, Bokeh, and D3.js. They also learn about exploratory data analysis (EDA) techniques for summarizing, visualizing, and understanding large datasets to extract meaningful insights and patterns.
This course will give students an overview of the issues related to the management of structured data. Topics to be covered in this course include: data warehousing, data integrity and quality, data cleansing, basic programming concepts, the construction of simple algorithms, and the appropriate descriptive and graphical summaries of data. Commonly used software packages for the analysis and management of data will be emphasized.
Relational Database Management Systems (RDBMS): Students learn about relational databases and the principles of database management systems (DBMS). They study concepts such as tables, rows, columns, keys, relationships, and constraints. They also learn about popular RDBMS platforms such as MySQL, PostgreSQL, SQL Server, and Oracle, and how to perform basic database operations such as creating, querying, updating, and deleting data.
SQL (Structured Query Language): SQL is the standard language used to interact with relational databases. Students learn how to write SQL queries to retrieve, manipulate, and analyze data stored in a database. They study SQL syntax, data manipulation language (DML) statements (e.g., SELECT, INSERT, UPDATE, DELETE), data definition language (DDL) statements (e.g., CREATE TABLE, ALTER TABLE), and data control language (DCL) statements (e.g., GRANT, REVOKE).
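The DDL and DML statements above can be tried without installing a database server, using Python's built-in sqlite3 module. The schema and rows are invented for the example; note that SQLite has no user accounts, so the DCL statements (GRANT, REVOKE) cannot be demonstrated with it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# DDL: define the schema.
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")

# DML: insert and update rows (parameterized to avoid SQL injection).
cur.executemany("INSERT INTO students (name, gpa) VALUES (?, ?)",
                [("Ada", 3.9), ("Grace", 3.7), ("Alan", 3.2)])
cur.execute("UPDATE students SET gpa = 3.5 WHERE name = ?", ("Alan",))

# SELECT: retrieve rows matching a predicate, in a chosen order.
cur.execute("SELECT name, gpa FROM students WHERE gpa >= 3.5 ORDER BY gpa DESC")
rows = cur.fetchall()
print(rows)  # [('Ada', 3.9), ('Grace', 3.7), ('Alan', 3.5)]
conn.close()
```

The same SQL runs essentially unchanged on MySQL, PostgreSQL, SQL Server, or Oracle, modulo minor dialect differences.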
Database Design and Normalization: Students learn about principles of database design and normalization to ensure data integrity and efficiency. They study concepts such as normalization forms (e.g., first normal form, second normal form, third normal form), functional dependencies, and entity-relationship (ER) modeling. They also learn how to design and implement relational database schemas that minimize redundancy and improve data consistency and integrity.
Indexing and Optimization: Students learn about techniques for optimizing database performance, such as indexing, query optimization, and database tuning. They study different types of indexes (e.g., B-tree, hash, bitmap) and their effects on query performance. They also learn about query execution plans, database statistics, and techniques for improving the efficiency of database queries and operations.
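The effect of an index on a query plan can be observed directly with SQLite's EXPLAIN QUERY PLAN. In this sketch (invented table and data; the plan's exact wording varies across SQLite versions), the same query switches from a full table scan to an index search once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
cur.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
                [("cust%d" % (i % 100), float(i)) for i in range(1000)])

def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row describes the step.
    return " ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT total FROM orders WHERE customer = 'cust7'"
before = plan(query)   # full table scan (e.g. "SCAN orders")
cur.execute("CREATE INDEX idx_customer ON orders (customer)")
after = plan(query)    # index lookup mentioning idx_customer
print(before)
print(after)
conn.close()
```

Production systems expose richer equivalents (EXPLAIN ANALYZE in PostgreSQL, execution plans in SQL Server) that also report estimated costs and row counts.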
Data Warehousing and Business Intelligence: Students learn about data warehousing concepts and architectures for storing and analyzing large volumes of structured data. They study techniques for data extraction, transformation, and loading (ETL), as well as tools and platforms for building data warehouses and implementing business intelligence solutions. They also learn about data modeling techniques for multidimensional analysis (e.g., star schema, snowflake schema) and online analytical processing (OLAP).
This course will focus on methods, procedures, and application tools used to summarize and visualize data. Students will design and create summaries and visualizations to transform data into information in a variety of contexts. Students will complete a visualization project.
Tableau Interface: Students become familiar with the Tableau interface, including the various tools, menus, and panes available for building visualizations. They learn how to navigate the workspace, connect to data sources, and organize worksheets and dashboards.
Data Connection and Preparation: Students learn how to connect Tableau to different data sources, such as Excel spreadsheets, databases, and online data repositories. They also learn how to prepare data for visualization within Tableau, including cleaning, filtering, and aggregating data as needed.
Basic Chart Types: Students learn how to create a variety of basic chart types in Tableau, such as bar charts, line charts, scatter plots, pie charts, and histograms. They learn how to customize these charts by adjusting colors, labels, axes, and other visual properties to effectively represent their data.
Advanced Visualization Techniques: Students explore advanced visualization techniques in Tableau, such as dual-axis charts, combined charts, geographic maps, treemaps, heatmaps, and small multiples. They learn how to use parameters, filters, sets, and calculated fields to create dynamic and interactive visualizations that allow users to explore data in depth.
Dashboard Design: Students learn how to design interactive dashboards in Tableau by combining multiple visualizations into a single layout. They learn principles of dashboard layout, design hierarchy, and interactivity to create dashboards that are intuitive, informative, and engaging for end users.
An introduction to the major concepts of algorithm design and problem solving. Emphasis is on algorithm development, analysis, and refinement. Programming strategies and elements of programming also are covered. Various practical applications of problem solving are demonstrated. Includes formal labs.
Algorithm Analysis: Students learn how to analyze the efficiency and performance of algorithms in terms of time complexity (Big O notation) and space complexity. They study techniques for measuring and comparing the efficiency of algorithms, such as asymptotic analysis, worst-case analysis, average-case analysis, and amortized analysis.
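The difference between complexity classes can be made concrete by counting operations rather than measuring wall-clock time. This sketch compares worst-case comparison counts for linear search (O(n)) and binary search (O(log n)) on a sorted list; the instrumentation is invented for the example:

```python
def linear_search_steps(n, target):
    # Worst case: scan every element before finding the target.
    data = list(range(n))
    steps = 0
    for x in data:
        steps += 1
        if x == target:
            break
    return steps

def binary_search_steps(n, target):
    # Each comparison halves the remaining interval.
    data = list(range(n))
    lo, hi, steps = 0, n - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if data[mid] == target:
            break
        if data[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

for n in (1_000, 1_000_000):
    print(n, linear_search_steps(n, n - 1), binary_search_steps(n, n - 1))
```

Growing n a thousandfold multiplies the linear count a thousandfold but adds only about ten comparisons to the binary count, which is exactly what the asymptotic notation predicts.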
Data Structures: Data structures are fundamental building blocks for organizing and manipulating data in computer programs. Students learn about various data structures, including arrays, linked lists, stacks, queues, trees (e.g., binary trees, binary search trees, AVL trees), heaps, hash tables, and graphs. They study their properties, operations, implementation, and applications.
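A few of these structures can be shown in miniature: a stack and a queue via built-in types, and a binary search tree built from scratch. This is a minimal teaching sketch, omitting deletion and balancing (which AVL trees add):

```python
from collections import deque

# Stack (LIFO) via list; queue (FIFO) via deque.
stack = []
stack.append(1); stack.append(2)
top = stack.pop()            # 2: last in, first out

queue = deque()
queue.append(1); queue.append(2)
front = queue.popleft()      # 1: first in, first out

# Binary search tree: smaller keys left, larger keys right,
# so search costs O(height).
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def bst_insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root

def bst_contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

root = None
for k in [5, 3, 8, 1]:
    root = bst_insert(root, k)
print(top, front, bst_contains(root, 8), bst_contains(root, 7))
# 2 1 True False
```

Without balancing, inserting sorted keys degrades the tree to a linked list; AVL rotations exist precisely to keep the height logarithmic.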
Sorting and Searching Algorithms: Sorting and searching algorithms are fundamental algorithms used in many applications. Students learn about popular sorting algorithms such as bubble sort, selection sort, insertion sort, merge sort, quicksort, and heap sort. They also study searching algorithms such as linear search, binary search, and hash-based searching techniques.
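Two of the algorithms listed above, merge sort and binary search, fit in a short sketch. Merge sort is the standard divide-and-conquer O(n log n) sort; binary search requires its input to already be sorted:

```python
def merge_sort(items):
    # Split, recursively sort each half, then merge the sorted halves.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]   # append the leftover tail

def binary_search(sorted_items, target):
    # Halve the search interval on each comparison: O(log n).
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1   # not found

data = merge_sort([7, 2, 9, 4, 2])
print(data)                    # [2, 2, 4, 7, 9]
print(binary_search(data, 9))  # 4
print(binary_search(data, 5))  # -1
```

Using `<=` in the merge keeps equal elements in their original order, making this merge sort stable, a property quicksort and heap sort lack.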
Graph Algorithms: Graphs are versatile data structures used to model relationships between objects. Students learn about graph representations (e.g., adjacency matrix, adjacency list), graph traversal algorithms (e.g., depth-first search, breadth-first search), shortest path algorithms (e.g., Dijkstra's algorithm, Bellman-Ford algorithm), and minimum spanning tree algorithms (e.g., Kruskal's algorithm, Prim's algorithm).
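An adjacency-list representation, breadth-first traversal, and Dijkstra's algorithm can be sketched together on a small invented graph (Dijkstra's requires non-negative edge weights):

```python
import heapq
from collections import deque

# Adjacency list: node -> list of (neighbor, weight) pairs.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}

def bfs_order(start):
    # Breadth-first search: visit nodes in order of hop distance.
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr, _ in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

def dijkstra(start):
    # Repeatedly settle the node with the smallest tentative
    # distance, relaxing its outgoing edges.
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, already settled cheaper
        for nbr, w in graph[node]:
            if d + w < dist.get(nbr, float("inf")):
                dist[nbr] = d + w
                heapq.heappush(heap, (d + w, nbr))
    return dist

print(bfs_order("A"))  # ['A', 'B', 'C', 'D']
print(dijkstra("A"))   # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

Note that the cheapest route A to D goes through B and C (cost 4), not the direct-looking A-C-D path (cost 5), which is exactly the kind of result greedy hop-counting misses and Dijkstra's algorithm guarantees.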
Dynamic Programming: Dynamic programming is a technique for solving optimization problems by breaking them down into smaller subproblems and solving each subproblem only once, storing the solutions to subproblems to avoid redundant computations. Students learn about the principles of dynamic programming, memoization, and bottom-up and top-down approaches. They also study examples of dynamic programming problems, such as the knapsack problem, longest common subsequence problem, and Fibonacci sequence computation.
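Both flavors of dynamic programming mentioned above fit in a short sketch: top-down memoized Fibonacci, and a bottom-up table for longest common subsequence (LCS):

```python
from functools import lru_cache

# Top-down: memoized recursion. Each subproblem is computed once,
# turning the naive exponential recursion into O(n).
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Bottom-up: dp[i][j] holds the LCS length of a[:i] and b[:j];
# each cell depends only on smaller, already-filled subproblems.
def lcs(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(fib(30))                   # 832040
print(lcs("ABCBDAB", "BDCABA"))  # 4
```

The two approaches compute the same answers; memoization is often easier to write, while the explicit table makes the O(len(a) x len(b)) space cost visible and avoids recursion-depth limits.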
An introduction to methods and techniques commonly used in data science. The management, preparation, analysis, visualization, and modeling of data will be discussed in this class. Students will complete a data science project.
Data Acquisition and Cleaning: Data science begins with obtaining data from various sources, which may include databases, APIs, web scraping, or manual collection. Students learn techniques for collecting and accessing data, as well as methods for cleaning and preprocessing data to ensure its quality and suitability for analysis.
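Typical cleaning steps, trimming whitespace, normalizing case, and dropping records with missing values, can be sketched with the stdlib csv module. The records here are invented to exhibit the problems being cleaned:

```python
import csv
import io

# A messy CSV extract: stray whitespace, a missing age, inconsistent case.
raw = io.StringIO(
    "name,age,city\n"
    "Ada, 36 ,London\n"
    "Grace,,Arlington\n"
    "alan,41,london\n"
)

cleaned = []
for row in csv.DictReader(raw):
    age = row["age"].strip()
    if not age:                 # drop records with a missing age
        continue
    cleaned.append({
        "name": row["name"].strip().title(),  # normalize whitespace/case
        "age": int(age),                      # cast to the proper type
        "city": row["city"].strip().title(),
    })
print(cleaned)
# [{'name': 'Ada', 'age': 36, 'city': 'London'},
#  {'name': 'Alan', 'age': 41, 'city': 'London'}]
```

At scale the same steps are usually expressed with pandas or Spark, but the decisions, what counts as missing, whether to drop or impute, how to normalize, are identical.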
Exploratory Data Analysis (EDA): EDA involves examining and visualizing datasets to understand their structure, patterns, and relationships. Students learn descriptive statistics, data visualization techniques (e.g., histograms, scatter plots, box plots), and exploratory methods (e.g., clustering, dimensionality reduction) to gain insights into the data before performing more advanced analyses.
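A first EDA pass often starts with descriptive statistics before any plotting. This sketch, using the stdlib statistics module on an invented sample, shows how comparing the mean to the median already reveals skew:

```python
import statistics

# A small numeric sample standing in for one column of a dataset.
incomes = [31, 35, 36, 38, 40, 41, 44, 48, 52, 190]  # note the outlier

summary = {
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "stdev": round(statistics.stdev(incomes), 1),
    "min": min(incomes),
    "max": max(incomes),
}
print(summary)
# The mean (55.5) sits far above the median (40.5): a first hint,
# even before drawing a histogram, that the data are right-skewed.
```

A box plot or histogram of the same column would make the single large outlier visible immediately, which is why summaries and visualizations are used together.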
Statistical Modeling and Machine Learning: Students are introduced to statistical modeling and machine learning techniques for predictive modeling, classification, regression, clustering, and other tasks. They learn about popular algorithms such as linear regression, logistic regression, decision trees, random forests, support vector machines, and k-nearest neighbors, as well as techniques for model evaluation and validation.
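Of the algorithms listed, k-nearest neighbors is simple enough to implement from scratch, which makes the core idea, classify a point by majority vote among its nearest labeled neighbors, concrete. The two-cluster training set is invented for the example:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k training
    points nearest to it under Euclidean distance."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D training set: two well-separated clusters.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

print(knn_predict(train, (0.5, 0.5)))  # 'a'
print(knn_predict(train, (5.5, 5.5)))  # 'b'
```

In practice one would use scikit-learn's implementation, which adds efficient neighbor indexing and distance options, and would choose k by cross-validation rather than fixing it by hand.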
Data Visualization and Communication: Effective communication of results is essential in data science. Students learn how to create informative and visually appealing data visualizations using tools such as matplotlib, seaborn, ggplot2, and D3.js. They also learn best practices for presenting and interpreting data visualizations to communicate insights and findings to stakeholders.
Ethical and Legal Considerations: Data science raises important ethical and legal issues related to privacy, security, bias, fairness, and accountability. Students learn about ethical guidelines, regulations (e.g., GDPR), and best practices for responsible data handling and analysis. They also discuss case studies and real-world examples to understand the implications of ethical and legal decisions in data science projects.