# Top 30 Data Science Interview Questions

**What is Data Science?**

Data science blends three things: statistics, technical skills, and business vision. It is used to analyze available data and make better predictions about the future.

**Compare Data Science vs. Machine Learning**

Data science is a broad field that blends statistics, technical skills, and business vision to extract insight from data. Machine learning is one set of techniques used within data science, in which algorithms learn patterns from data in order to make predictions.

**Which language is more suitable for text analytics, R or Python?**

Python is generally considered the more suitable choice for text analytics. Its pandas library gives analysts high-level data analysis tools along with convenient data structures, and R is usually seen as less convenient for this kind of work.

**What do you understand by the term Normal Distribution?**

Data can be distributed in different ways: it may be skewed to the left or to the right, or it may be jumbled with no clear pattern. When data is distributed around a central value without any bias to either side, it follows a normal distribution, which takes the form of a bell-shaped curve.
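As a rough illustration of the bell shape, the sketch below uses Python's standard `statistics.NormalDist` to compute how much probability lies within one, two, and three standard deviations of the central value. The mean of 0 and standard deviation of 1 are arbitrary values chosen for the example.

```python
from statistics import NormalDist

# Standard normal distribution; mu=0 and sigma=1 are illustrative choices.
dist = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k):
    """Probability that a value falls within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

# Probability mass within 1, 2, and 3 standard deviations.
within = {k: mass_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```

Most of the mass sits close to the center and falls off symmetrically on both sides, which is what gives the curve its bell shape.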

**What is root cause analysis?**

Root cause analysis was initially developed to analyze industrial accidents, but it is now used in other areas as well. As the name suggests, it is a problem-solving technique for identifying the root causes of faults or problems. A factor is a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring.

**What is logistic regression?**

Logistic regression, also referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables.
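A minimal sketch of the prediction step is shown below: the linear combination of predictors is passed through the logistic (sigmoid) function to obtain a probability, which is then thresholded into a binary outcome. The weights, bias, and 0.5 threshold are hypothetical values chosen for illustration, not fitted coefficients.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class from a linear combination of predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients; in practice these are estimated from training data.
weights = [0.8, -1.2]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is an assumed decision threshold
print(f"probability={p:.4f}, predicted class={label}")
```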

**What is selection bias, and how can you avoid it?**

Selection bias is an experimental error that occurs when the participant pool or the resulting data is not representative of the target population. It cannot be overcome by statistical analysis of the existing data alone, though the Heckman correction can be used in special cases. The most reliable way to avoid it is to select participants at random from the target population.

**What are Recommender Systems?**

Recommender systems are a subclass of information filtering systems that are meant to predict the preference or rating that a user would give to a product.

**What is Collaborative Filtering?**

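Collaborative filtering is the process most recommender systems use to find patterns and information by combining the perspectives of many users, data sources, and other agents. In practice, it predicts what a user will like from the ratings of users with similar tastes.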
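One common variant, user-based collaborative filtering, can be sketched as follows. The ratings table, user names, and the choice of cosine similarity are all assumptions made for this example; real systems use far larger data and often different similarity measures.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 3, "C": 5, "D": 4},
    "cat": {"A": 1, "B": 5, "D": 2},
}

def cosine(u, v):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else None

estimate = predict("ann", "D")
print(f"predicted rating of D for ann: {estimate:.2f}")
```

Because ann's ratings resemble bob's more than cat's, bob's rating of item D carries more weight in the estimate.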

**What are the drawbacks of the linear model?**

A linear model cannot be used for count or binary outcomes. It is also subject to some overfitting problems that it cannot solve on its own.

**Explain the various benefits of the R language.**

R is well suited to performing calculations on matrices and arrays. It is a well-developed yet simple and effective programming language with extensive support for machine learning applications. It also acts as a connecting link between various software packages, tools, and datasets, and it is useful when you have to solve a data-oriented problem.

**What are the two main components of the Hadoop Framework?**

HDFS and YARN are the two major components of the Hadoop framework.

**What is the difference between data science and big data?**

Big data refers to amounts of data too large to be analyzed by traditional methods, whereas data science is a field applicable to data of any size.

**Which technique is used to predict categorical responses?**

The classification technique is used for categorical responses.

**What is the goal of A/B Testing?**

A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. Its goal is to identify which change to a web page maximizes or improves an outcome of interest.
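One way to run the hypothesis test is a two-proportion z-test on conversion rates, sketched below. The visitor and conversion counts are invented for illustration, and the z-test is one assumed choice of test among several that could be used.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 200 of 2000 visitors converted on page A, 250 of 2000 on page B.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z={z:.3f}, p-value={p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone.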

**What are feature vectors?**

A feature vector is an n-dimensional vector of numerical features that represents some object.
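As a small sketch, the function below turns an object (here a house described by a dictionary) into a 3-dimensional feature vector. The field names and the encoding of the boolean as 0.0 or 1.0 are assumptions made for this example.

```python
def to_feature_vector(house):
    """Encode a house record as a list of numerical features."""
    return [
        float(house["area_sqm"]),               # numeric feature kept as-is
        float(house["bedrooms"]),               # count converted to float
        1.0 if house["has_garden"] else 0.0,    # boolean encoded as 0/1
    ]

# Hypothetical object to be represented.
house = {"area_sqm": 85, "bedrooms": 3, "has_garden": True}
vector = to_feature_vector(house)
print(vector)
```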

**Name some of the prominent resampling methods in data science**

The Bootstrap, Permutation Tests, Cross-validation and Jackknife are some of the methods.
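To make one of these concrete, here is a minimal sketch of the bootstrap: the data is resampled with replacement many times, and the spread of the recomputed statistic gives a percentile confidence interval. The sample values, the number of resamples, and the fixed seed are arbitrary choices for illustration.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    estimates = sorted(
        stat(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical sample of measurements.
sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2, 2.7, 2.6]
low, high = bootstrap_ci(sample)
print(f"sample mean={mean(sample):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```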

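**What is a Gaussian distribution and how is it used in data science?**

The Gaussian distribution, also known as the normal distribution or bell curve, is a common probability distribution. In data science it serves as a modeling assumption in many statistical methods.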

**What is an Eigenvalue and Eigenvector?**

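Eigenvectors are used for understanding linear transformations: they are the directions along which a transformation only stretches, compresses, or flips a vector. The eigenvalue is the strength of the transformation in the direction of its eigenvector, that is, the factor by which the stretching or compression occurs.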
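The defining relation, that applying the matrix to an eigenvector gives the same vector scaled by the eigenvalue, can be checked by hand on a small case. The 2x2 matrix below is an arbitrary example chosen because its eigenvectors are easy to read off.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

# Example matrix, chosen for illustration.
A = [[2.0, 1.0],
     [1.0, 2.0]]

v1 = [1.0, 1.0]    # eigenvector with eigenvalue 3: A v1 = 3 * v1
v2 = [1.0, -1.0]   # eigenvector with eigenvalue 1: A v2 = 1 * v2
w = [1.0, 0.0]     # not an eigenvector: its direction changes

print(mat_vec(A, v1))  # [3.0, 3.0]
print(mat_vec(A, v2))  # [1.0, -1.0]
print(mat_vec(A, w))   # [2.0, 1.0]
```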

**What is association analysis? Where is it used?**

Association analysis is the task of uncovering relationships among data. It is used to understand how data items are associated with each other.
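Two standard measures of how strongly items are associated are support and confidence. The sketch below computes them on a made-up set of shopping baskets; the items and transactions are purely illustrative.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(f"support(bread, milk) = {support({'bread', 'milk'}):.2f}")
print(f"confidence(bread -> milk) = {confidence({'bread'}, {'milk'}):.2f}")
```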

**How do you check for data quality?**

Some dimensions used to check data quality are completeness, consistency, uniqueness, integrity, conformity, and accuracy.
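A few of these checks can be sketched in plain Python as below. The records, field names, and the age rule are hypothetical, and only completeness, uniqueness, and conformity are covered; the other dimensions usually need reference data or business rules.

```python
# Hypothetical records with a missing email, a duplicate id, and an invalid age.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "c@example.com", "age": -5},
]

def completeness(rows, field):
    """Share of rows in which the field is present and not None."""
    return sum(r.get(field) is not None for r in rows) / len(rows)

def is_unique(rows, field):
    """True if no two rows share the same value for the field."""
    values = [r[field] for r in rows]
    return len(values) == len(set(values))

def conforms(rows, field, rule):
    """True if every non-missing value of the field satisfies the rule."""
    return all(rule(r[field]) for r in rows if r.get(field) is not None)

print(completeness(records, "email"))
print(is_unique(records, "id"))
print(conforms(records, "age", lambda a: 0 <= a <= 120))
```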

**What is power analysis?**

Power analysis determines the sample size required to detect an effect of a given size with a given degree of confidence.
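As a rough sketch, the sample size per group for comparing two means can be approximated from the standardized effect size, the significance level, and the desired power. The formula below uses the normal approximation, which is an assumption of this example; an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (difference in means divided
    by the common standard deviation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Assumed inputs: medium effect (0.5), 5% significance level, 80% power.
n = sample_size_per_group(0.5)
print(n)
```

Smaller effects need larger samples: halving the effect size roughly quadruples the required sample size.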

**Can you use machine learning for time series analysis?**

Yes, it can be used, but its suitability depends on the type of application.

**Explain what resampling methods are**

Resampling methods are used to estimate the precision of sample statistics, to exchange labels on data points when performing significance tests, and to validate models.

**What is an RDBMS? Name some examples for RDBMS?**

An RDBMS (relational database management system) is a database management system based on the relational model. Some examples are Microsoft SQL Server, IBM Db2, Oracle, MySQL, and Microsoft Access.

**What is the difference between squared error and absolute error?**

Squared error is the square of the difference between a predicted value and the actual value; averaging the squares of the errors gives the mean squared error. Absolute error is the magnitude of the difference between the measured or inferred value of a quantity and its actual value.
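The difference is easy to see on a small example. The actual and predicted values below are made up; the sketch computes the mean squared error and the mean absolute error over the same errors.

```python
# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# Per-point errors: predicted minus actual.
errors = [p - a for a, p in zip(actual, predicted)]

# Mean squared error: average of the squared errors.
mse = sum(e ** 2 for e in errors) / len(errors)

# Mean absolute error: average of the error magnitudes.
mae = sum(abs(e) for e in errors) / len(errors)

print(f"errors={errors}, MSE={mse}, MAE={mae}")
```

Squaring makes the single large error (2.0) dominate the mean squared error, whereas the mean absolute error weights every error in proportion to its size.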

**Why do data scientists use combinatorics or discrete probability?**

Combinatorics and discrete probability are useful in studying any predictive model, which is why data scientists rely on them.

**What is an API?**

API stands for application programming interface. It is a set of routines, tools, and protocols for building software applications.

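**Differentiate between wide and long data formats.**

In the wide format, each subject occupies a single row and each repeated measurement or variable has its own column, so columns generally represent groups. In the long format, each row holds a single observation, so a subject appears in several rows.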
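The sketch below converts a small table from wide to long form, assuming the usual convention that wide data has one row per subject with one column per measurement, while long data has one row per measurement. The subjects, column names, and the `variable`/`value` output fields are illustrative.

```python
# Hypothetical wide-format table: one row per subject, one column per week.
wide = [
    {"subject": "s1", "week1": 5.1, "week2": 5.4},
    {"subject": "s2", "week1": 4.8, "week2": 5.0},
]

def wide_to_long(rows, id_field, value_fields):
    """Emit one row per (subject, variable) pair."""
    return [
        {id_field: row[id_field], "variable": field, "value": row[field]}
        for row in rows
        for field in value_fields
    ]

long_rows = wide_to_long(wide, "subject", ["week1", "week2"])
for row in long_rows:
    print(row)
```

Two subjects with two measurements each become four rows in the long format.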

**Is it possible to perform logistic regression with Microsoft Excel?**

Yes, it is possible.