After taking the relevant big data online courses, earning credentials, and gaining some experience through internships or voluntary work, it’s time to step into the job market. Perhaps you have upskilled or advanced your studies and are ready to take the next big leap to another position or an employer with better prospects. Being shortlisted is only a small step towards getting that dream job.
Even though big data skills are in high demand now and will remain so in the future, you are not guaranteed a position unless your skills and experience match the requirements of your prospective employer. It is not just about answering questions in an interview; it is about demonstrating that, given the opportunity, you will add value.
How do you prepare yourself for the big day?
Top 15 Big Data Interview Questions
Here are 15 questions you are likely to encounter in an interview for a big data-related position.
1. Explain the different types of big data
There are three types of big data. These are:
Structured data – This type of data conforms to a predefined, usually tabular, format with fixed rows and columns. Structured data can be easily stored, processed, and retrieved.
Unstructured data – This type of data comes in a range of formats, for instance videos, audio, text, emails, social media posts, and images. Unstructured data does not conform to any specific schema.
Semi-structured data – Semi-structured data does not conform to the rigid tabular schema of structured data but still has some structure. It contains tags or markers that define fields and nested records, as in JSON or XML files.
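For illustration, here is a small semi-structured record in JSON, parsed with Python; the field names are purely hypothetical:

```python
import json

# A semi-structured record: no rigid table schema, but fields are still labeled
record = json.loads("""
{
  "user": "alice",
  "posts": [
    {"text": "hello", "likes": 3},
    {"text": "big data!", "tags": ["hadoop", "spark"]}
  ]
}
""")

print(record["user"])               # alice
print(record["posts"][1]["tags"])   # ['hadoop', 'spark']
```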
2. What are the five V’s of big data?
The five V’s of big data are:
Volume – the amount of data.
Velocity – the speed at which data is generated and analyzed.
Variety – the different forms data takes, including text, videos, images, and audio.
Value – the worth of data in terms of the useful insights extracted from it.
Veracity – the degree of accuracy of the data.
3. How do you carry out data preparation?
Data preparation is the process of cleaning, transforming, and organizing raw data so that it can be used to build models for data analysis. Different types of models can be built from prepared data, including physical, conceptual, and logical data models, star schemas, enterprise models, and object-oriented database models.
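A minimal data preparation sketch in Python using pandas; the file name and column names are hypothetical, and the exact cleaning and transformation steps depend on the dataset:

```python
import pandas as pd

# Load raw data (hypothetical file and column names)
raw = pd.read_csv("customers_raw.csv")

# Cleaning: drop duplicates and fill missing values
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Transforming: standardize a numeric column and encode a categorical one
clean["income_scaled"] = (clean["income"] - clean["income"].mean()) / clean["income"].std()
clean = pd.get_dummies(clean, columns=["region"])

# Storing: write the prepared data for downstream modeling
clean.to_csv("customers_prepared.csv", index=False)
```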
4. What is feature selection?
Feature selection is the process whereby only the required features are extracted from a particular dataset. It is expected that not all data collected will be useful for a particular purpose as different business needs require different insights.
Feature selection makes it possible to create simple but targeted machine learning models for accurate prediction and interpretation of data. There are three main methods of feature selection including:
Filter method – Features are selected based on statistical measures of their correlation with the dependent (target) variable, independently of any learning algorithm.
Wrapper method – Feature selection is based on a specific machine learning algorithm: candidate feature subsets are evaluated by actually training the model on them and measuring its performance.
Embedded method – As the name suggests, the feature selection is integrated (embedded) into the learning algorithm so that feature selection is done during model training.
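As a concrete illustration of the filter method, here is a minimal sketch using scikit-learn's SelectKBest on a synthetic dataset; the parameter choices are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 20 features, only a few of which are informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter method: score each feature against the target with an ANOVA F-test
# and keep the 5 highest-scoring features
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)  # (500, 5)
```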
5. Which platforms are available for big data?
Big data platforms fall under two categories: open-source platforms and licensed (commercial) platforms.
Open-source big data platforms include Hadoop and HPCC (High-Performance Computing Cluster).
Popular licensed big data platforms include Cloudera (CDH) and MapR (MDP).
Big data platforms can also be categorized based on functions for instance Cassandra and MongoDB for storage, DataCleaner for data cleaning, IBM SPSS and Teradata for data mining, and Tableau and SAS for visualization.
6. What is cluster sampling?
This is a sampling technique that divides a population into subgroups known as clusters, from which a few clusters are randomly selected for analysis. In single-stage cluster sampling, all members of the selected clusters are included in the study. In two-stage cluster sampling, a few individuals are randomly selected from each chosen cluster. In multi-stage sampling, the random selection of elements within clusters is repeated until the required sample size is reached.
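A minimal Python sketch of single-stage and two-stage cluster sampling; the population, cluster sizes, and sample sizes are hypothetical:

```python
import random

random.seed(42)

# Hypothetical population: 10 clusters (e.g. schools), each with 50 members
population = {f"cluster_{i}": [f"member_{i}_{j}" for j in range(50)] for i in range(10)}

# Stage 1: randomly select a few whole clusters
chosen_clusters = random.sample(list(population), k=3)

# Single-stage cluster sampling: take every member of the chosen clusters
single_stage_sample = [m for c in chosen_clusters for m in population[c]]

# Two-stage cluster sampling: randomly select a few members within each chosen cluster
two_stage_sample = [m for c in chosen_clusters for m in random.sample(population[c], k=10)]

print(len(single_stage_sample))  # 150
print(len(two_stage_sample))     # 30
```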
7. Explain 4 types of biases that can happen during sampling.
Sampling bias occurs when some members of a population are systematically more likely to be selected for a sample than others. The types of biases that can happen during sampling are:
Non-response bias in which subjects are unwilling or unable to be part of the study.
Survivorship bias in which there is a tendency to concentrate more on successful observations, objects, or people than the unsuccessful ones mostly due to lack of visibility.
Undercoverage bias in which some members of the population are inadequately represented in the sample.
Healthy user bias in which the study population is likely to be healthier than the general population.
8. What is the correlation between Big Data and Hadoop?
While big data represents the large volume of a variety of structured and unstructured data generated at a high velocity, Hadoop is an open-source framework used to process, store, and analyze big data to extract actionable insights.
9. Why is Hadoop used in Big Data analytics?
Hadoop is an open-source framework written in Java that runs on clusters of commodity hardware to process large volumes of data. Hadoop provides data collection, storage, and analysis functions, and it uses a distributed computing model for fast data processing.
10. What are the key components of the Hadoop framework?
The key components of the Hadoop framework are:
HDFS – The Hadoop Distributed File System, designed to run on low-cost commodity hardware. It is fault-tolerant and delivers high throughput for application data, which makes it suitable for large data sets.
MapReduce – The programming-based data processing layer, consisting of a map phase and a reduce phase. MapReduce lets you process data in parallel across multiple nodes in the Hadoop cluster (a word-count sketch follows this list).
YARN (Yet Another Resource Negotiator) – Introduced in Hadoop 2.0 to perform resource management and job scheduling.
Hadoop Common – The collection of Hadoop’s libraries and utilities that support the other modules.
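To illustrate the MapReduce model, here is a minimal word-count sketch in plain Python that mimics the map, shuffle, and reduce phases; it is not Hadoop code itself, just a sketch of the idea:

```python
from collections import defaultdict

documents = ["big data needs hadoop", "hadoop processes big data"]

# Map phase: emit (key, value) pairs for each word
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/sort: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key
counts = {word: sum(values) for word, values in grouped.items()}

print(counts)  # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'processes': 1}
```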
11. What are the key differences between NAS (network-attached storage) and HDFS (Hadoop Distributed File System)?
NAS is a file-level storage system in which data is stored on a dedicated device or server connected to a network of computers. Because it requires dedicated, high-end hardware, NAS is the more expensive option. NAS stores data on individual machines and does not replicate it, so there is no data redundancy.
HDFS is the primary storage system for Hadoop that runs on a cluster of commodity hardware. Data blocks are distributed across multiple local drives of machines in the cluster. HDFS is a cost-effective system as it runs on cheaper commodity hardware. HDFS, because it uses replication to enhance fault tolerance, will create data redundancy.
12. Name various daemons in Hadoop and YARN
Hadoop daemons
- Namenode is the master node that contains metadata information for all the files in the HDFS.
- Datanode is the daemon that stores and manages the actual data blocks in a Hadoop cluster.
- Secondary Namenode is a dedicated node in the HDFS cluster that takes checkpoints of the file system metadata on the name node.
YARN daemons
- ResourceManager manages resources for applications run on a Hadoop cluster.
- NodeManager is the node agent that manages memory and disk resources in a node.
- JobHistoryServer maintains information about completed MapReduce applications and serves job history-related requests in YARN.
13. What is the ‘jps’ command and what is its function in Hadoop?
The jps (JVM Process Status) command is used to check whether the Hadoop daemons, such as the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, are up and running on the machine.
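As a sketch, here is a small Python script that runs jps and checks for the usual daemons; it assumes the JDK’s jps tool is on the PATH of a machine running Hadoop:

```python
import subprocess

# Run jps and capture its output (assumes the JDK's jps is on the PATH)
output = subprocess.run(["jps"], capture_output=True, text=True).stdout

# Typical daemons to look for on a single-node Hadoop/YARN installation
expected = ["NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager"]

for daemon in expected:
    status = "running" if daemon in output else "NOT running"
    print(f"{daemon}: {status}")
```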
14. List common input formats in Hadoop
Three common input formats in Hadoop are:
- Text input format, which is Hadoop’s default input format for plain text files
- Key-value text input format, which splits each line of plain text into a key and a value
- Sequence file input format, used for reading files in sequence (Hadoop’s binary file format)
15. Name the modes in which Hadoop can run
Hadoop can run in three modes, which are:
Standalone mode. This is Hadoop’s default mode that runs on the local file system (single node) for both input and output operations. The standalone mode does not support HDFS and so is mainly used for debugging.
Pseudo-Distributed mode. This mode is also known as the single-node cluster. The daemons run on a single node which includes both the NameNode and the DataNode.
Fully distributed mode. This mode is also known as the Multi-node cluster. In this mode, daemons run on several individual nodes to form a multiple-node cluster. The Master nodes and the Slave nodes run on separate nodes.
Conclusion
The demand for big data skills will continue to rise. However, competition in the job market remains stiff, so preparing well for your interview is essential.