Tutorial 1: Turning Dream into Reality: Big Data Mining & Analytics by Developing Super Algorithms for Commodity Computers
- Konstantinos Xylogiannopoulos, University of Calgary, Canada
The past three decades have witnessed rapid growth in computing technology, leading to powerful personal computers capable of solving most traditional processing problems. The expansion of hardware and software platforms at minimal cost allowed the creation of tools useful for science and business, which could easily analyze significant amounts of data. Yet the transition from the analog to the digital era, following the introduction of thousands of new micro devices such as smartphones, cameras, watches, building controllers, and sensors, and the interconnection of all these devices over the internet, created a new barrier for data science. Enormous amounts of data are produced daily by millions of devices, and data science faces a new challenge as researchers and practitioners have realized the need to deal with, and maximize the benefit from, big data, which continues to be a moving target: whatever was considered huge a few years ago has already been surpassed by current technology. However, how is big data defined? What challenges do we expect to face in the coming years in the field of big data and analytics? Can cloud computing help with the analysis of big data? This tutorial will address these questions by discussing various aspects associated with big data, how it is defined, and what limitations are faced by computer and data science. We will explore how it is possible to live with limited hardware resources by concentrating on new data structures that offer promising hope for overcoming the hardware and software limitations currently faced in big data mining. We will show how innovative algorithms based on novel data structures allow the detection of every possible pattern, remarkably fast, using ordinary computer systems.
We will see how these data structures and algorithms can be applied in many different scientific and commercial fields, providing solutions to some of the most important, yet common, problems in various domains, including mathematics, bioinformatics, network security, oil and gas, finance, traffic, and energy. Some case studies will be covered to show how one solution may be mapped to address problems in new domains.
Presenter:
Konstantinos Xylogiannopoulos is a PhD Candidate in Computer Science at the University of Calgary, with a background in Mathematics, IT, and Finance. His research focuses on Big Data Mining and Analytics, centered on the design and development of innovative, advanced data structures and algorithms.
More particularly, his research focuses on the detection of single, multiple, and all repeated patterns existing in a sequence while optimizing space and time complexity simultaneously. For this purpose, new data structures have been created, and the mathematical foundation that guarantees their correctness and validity has been built and proven. Innovative algorithms, which take advantage of the unique characteristics of the introduced data structures and allow big data mining with space and time optimization, have also been created. The combination of the innovative data structures and algorithms permits the analysis of any sequence of enormous size, greater than a trillion, in realistic time on a conventional hardware configuration.
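To give intuition for the problem this research addresses, the sketch below enumerates every repeated substring in a short sequence together with its occurrence positions. It is a naive O(n²) Python illustration only, not the presenter's optimized data structure, which is what makes the same task feasible at trillion-element scale:

```python
from collections import defaultdict

def all_repeated_substrings(seq, min_len=1):
    """Return every substring of seq (length >= min_len) that occurs
    more than once, mapped to its starting positions.
    Naive quadratic illustration, not an optimized structure."""
    positions = defaultdict(list)
    n = len(seq)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            positions[seq[i:j]].append(i)
    # Keep only substrings seen at two or more positions
    return {s: p for s, p in positions.items() if len(p) > 1}

repeats = all_repeated_substrings("ABABA")
print(repeats["ABA"])  # [0, 2]
```

Even this toy version shows why space is the bottleneck: the number of candidate substrings grows quadratically with sequence length, which is exactly the blow-up the tutorial's specialized data structures are designed to avoid.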
Methodologies developed during his research have found application in many diverse scientific and commercial fields, such as Mathematics, Bioinformatics, Finance, Business, Network Security, Weather and Seismic Data Analysis, Sequential Frequent Itemset Detection, and Social Network Analysis.
He has published extensively in reputable venues, including journals with high impact factors and A-class conferences. He has been a keynote speaker and lecturer at international workshops and schools. He works closely with industry and has successfully adapted his techniques to solve industrial problems.
Tutorial 2: Analysis of Large-Scale Data Using Hadoop and Spark
- Emad A. Mohammed and Seyed Mohammad Pakdaman, University of Calgary, Canada
This tutorial is delivered in two parts:
Part One:
This presentation is for programmers or business people with a fair background in R who would like to understand how to use R to process big data. In this presentation, you will be walked through the basic approaches to exploring data using R packages built on top of a Hadoop platform. After this presentation, you will be able to identify the kinds of analysis you can perform on big data using Hadoop and R, and how to interpret the results.
Presentation content:
- What is big data, and what tools are available to process it?
- What is the MapReduce framework and the Hadoop Platform?
- How to set up a Hadoop cluster (natively)
- Third-party Hadoop distributions
- Basic R commands
- Descriptive statistics example: Flight arrival delay statistics
- Predictive statistics example: Flight arrival delay prediction for a specific carrier using random forest
To follow this presentation, a fair knowledge of basic R commands is required; however, these commands will also be discussed during the presentation.
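The MapReduce model underlying Hadoop can be illustrated independently of R. The pure-Python sketch below (an illustration of the programming model only, not the RHadoop API used in the presentation) runs a word count through explicit map, shuffle, and reduce phases, mirroring what Hadoop does across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input record
    return chain.from_iterable(((w, 1) for w in rec.split()) for rec in records)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as Hadoop does
    # between the map and reduce stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big analytics"])))
print(counts)  # {'big': 2, 'data': 1, 'analytics': 1}
```

The same three-stage shape applies to the flight-delay examples in the outline: the mapper emits (carrier, delay) pairs, and the reducer aggregates per-carrier statistics.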
Part Two:
This talk will provide a jump start into Apache Spark, focusing on its internal architecture and explaining the runtime behavior of a Spark application. The content is geared toward those already familiar with the basics of Python who want to start using Apache Spark. Previous experience with Spark or distributed computing is NOT required.
This talk will cover the following topics:
- Big Data, In-memory Computing and Apache Spark
- Spark Runtime Architecture
- Spark APIs (R, Python and Scala)
- Tuning and debugging Spark applications
- Graph Processing with GraphX libraries
- Machine Learning with MLlib libraries
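A central idea behind Spark's runtime behavior is that transformations such as map and filter are lazy: they only record a lineage of operations, which executes when an action such as collect is called. The toy class below is a pure-Python sketch of that contract (not the actual pyspark API, which requires a Spark installation):

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are recorded
    lazily; nothing runs until an action (collect) is invoked."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []  # pending transformations (the lineage)

    def map(self, fn):
        # Transformation: returns a new RDD, computes nothing yet
        return MiniRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: also deferred
        return MiniRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded pipeline over the data
        out = list(self.data)
        for kind, fn in self.ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark this laziness is what lets the scheduler fuse transformations into stages and recompute lost partitions from lineage, which is the core of the runtime architecture the talk covers.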
Presenters:
Emad A. Mohammed: Received the B.Sc. degree in System and Biomedical Engineering from Cairo University, Egypt, in 1999 and the M.Sc. degree in Software Engineering from the University of Calgary, Canada, in 2013. He is currently a Ph.D. candidate at the University of Calgary. His research interests are biomedical image analysis, data mining, data fusion, distributed software systems, and big data analytics. He is currently a Business Analyst at the Utilization office, Calgary Laboratory Services (CLS), University of Calgary. He has published several papers on biomedical image analysis and big data analytics. In recognition of his work, he has received many awards and scholarships, including the Alberta Innovates Technology Futures (AITF) award, MITACS Accelerate, and the Advanced Energy Analytics Competition sponsored by the University of Calgary and IBM-Canada.
Seyed Mohammad Pakdaman: Received his B.Sc. degree in Electrical Engineering from Azad University (IAUCTB), Iran, in 2011 and his M.Sc. degree in Telecommunications Engineering from the University of Buenos Aires, Argentina, in 2014. He is currently a Ph.D. student at the University of Calgary. His research interests include cloud computing, distributed systems, and networked computer systems. Mohammad also has more than five years of experience as head of IT at the second-largest automobile manufacturer in Iran.