How to Prepare for SQL Interviews as a Data Scientist: Top 25+ Company Questions
Liam Patel – Principal Data Scientist Mentor | Former Netflix Data Talent Manager
Jun 5, 2025
Master key SQL concepts—from aggregations and joins to window functions—and practice 25+ real interview questions to excel in data science SQL interviews for 2025 roles.
Structured Query Language (SQL) is the backbone of data manipulation and analysis for most data science roles. Roughly 70% of companies assess SQL during their data science interviews, expecting you to write queries that mirror real-world tasks—pulling metrics, aggregating key figures, and uncovering insights from raw tables.
Being well-prepared for SQL questions means more than memorizing syntax – you must understand why each command exists, how relational tables fit together, and when to apply complex functions like window functions or advanced subqueries. In this guide, we’ll first cover the key SQL concepts that frequently appear in data science interviews. Then, we’ll present 25+ actual interview questions drawn from top tech companies, consulting firms, and other data-driven organizations. By internalizing these concepts and practicing a wide variety of questions, you’ll walk into any SQL interview for a data science role with confidence.
SQL Concepts for Data Science Interviews
In data science interviews, evaluators focus on intermediate-to-advanced SQL skills, since day-to-day work as a data scientist often involves combining multiple tables, manipulating dates, and computing rolling metrics. Based on industry surveys, the five most frequently tested SQL areas are:
Aggregations and GROUP BY
JOINs (including complex JOIN types)
Date and Time Manipulations
Subqueries and Common Table Expressions (CTEs)
Window Functions (analytic functions)
Below, we explain each concept in detail, illustrating why it matters and how you might see it tested during an interview.
Aggregations and GROUP BY
Aggregations using functions like SUM(), AVG(), COUNT(), and MAX()/MIN() are fundamental for summarizing large datasets. As a data scientist, you often need to compute monthly revenue, average session length, or count the number of unique users who performed a certain action. In many interviews, you’ll be asked not only to apply these functions but also to build conditional aggregations using CASE statements—effectively pivoting or categorizing data on the fly.
Example Interview Prompt
“Given a table transactions(transaction_id, user_id, amount, created_at), write a query to find the total amount spent and average amount per transaction for each user_id during Q1 of 2024. Exclude users whose total spending is below $500.”
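One way this might be answered, sketched in PostgreSQL-style syntax (the Q1 date boundaries and the reading of “below $500” as strictly less than 500 are assumptions):

SELECT
    user_id,
    SUM(amount) AS total_spent,
    AVG(amount) AS avg_transaction_amount
FROM transactions
WHERE created_at >= DATE '2024-01-01'   -- start of Q1 2024
  AND created_at <  DATE '2024-04-01'   -- up to, but not including, April 1
GROUP BY user_id
HAVING SUM(amount) >= 500;              -- drop users whose total spend is below $500

The HAVING clause is the part many candidates miss: the $500 cutoff applies to the aggregated total per user, so it cannot live in the WHERE clause.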
JOINs (Inner, Left, Right, Full, Self-JOIN)
Data usually resides across multiple tables (e.g., users, orders, product catalog). As a data scientist, you must combine these tables to build a unified view for analysis. In interviews, questions often involve more complex join logic: joining three or four tables together, performing self-joins to compare rows within the same table, or using RIGHT/LEFT/FULL OUTER JOINs to identify missing or orphaned records.
Example Interview Question
“You have two tables: users(user_id, name, signup_date) and orders(order_id, user_id, product_id, order_value, order_date). Write a query that returns each user who signed up before 2023 along with the total value of orders placed in 2023. Include users who did not place any orders by showing a zero total.”
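A sketch of one possible answer (PostgreSQL-style; placing the 2023 date filter in the ON clause rather than the WHERE clause is a deliberate assumption so that users without 2023 orders are kept):

SELECT
    u.user_id,
    u.name,
    COALESCE(SUM(o.order_value), 0) AS total_order_value_2023
FROM users u
LEFT JOIN orders o
       ON o.user_id = u.user_id
      AND o.order_date >= DATE '2023-01-01'
      AND o.order_date <  DATE '2024-01-01'
WHERE u.signup_date < DATE '2023-01-01'   -- signed up before 2023
GROUP BY u.user_id, u.name;

Moving the order_date filter into the WHERE clause is a classic trap: it would discard users who placed no 2023 orders, effectively turning the LEFT JOIN into an INNER JOIN.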
Date and Time Manipulations
Time series analysis is a staple for data scientists—whether you’re forecasting sales, analyzing churn rates, or monitoring campaign performance. Interviewers often test your ability to extract parts of a date (year, month, week, day), compute differences between dates, and group data into custom buckets (e.g., “first week of each month,” “quarterly intervals”).
Example Interview Question
“Given a table web_logs(log_id, user_id, event_time) that records every pageview, write a query to compute the number of unique users per week for the last 12 weeks. The output should include columns week_start_date and active_users.”
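A sketch in PostgreSQL (DATE_TRUNC and the ::date cast are PostgreSQL-specific; other dialects use different week-truncation functions):

SELECT
    DATE_TRUNC('week', event_time)::date AS week_start_date,   -- Monday of each week
    COUNT(DISTINCT user_id)              AS active_users
FROM web_logs
WHERE event_time >= DATE_TRUNC('week', CURRENT_DATE) - INTERVAL '12 weeks'
GROUP BY 1          -- 1 refers to the first selected column (week_start_date)
ORDER BY 1;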
Subqueries and Common Table Expressions (CTEs)
Subqueries (nested queries) and CTEs allow you to structure complex logic step-by-step, improving readability and maintainability. In an interview setting, a single query might require multiple passes—for example, first filtering a table based on certain conditions, then performing aggregations, and finally joining the result set back to another table. CTEs (with the WITH clause) help you break that logic into digestible pieces. Interviewers will expect you to use CTEs to avoid overly nested queries.
Example Interview Question
“You have users(user_id, signup_date) and orders(order_id, user_id, order_date, amount). First, find users who have placed at least five orders. Then, among those users, calculate the average order value for orders placed in March 2025. Use CTEs to structure your solution in two steps.”
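A two-step CTE sketch (it assumes “at least five orders” counts orders from any date, not just March 2025):

WITH frequent_buyers AS (          -- step 1: users with at least five orders overall
    SELECT user_id
    FROM orders
    GROUP BY user_id
    HAVING COUNT(*) >= 5
),
march_orders AS (                  -- step 2: those users' orders placed in March 2025
    SELECT o.amount
    FROM orders o
    JOIN frequent_buyers fb ON fb.user_id = o.user_id
    WHERE o.order_date >= DATE '2025-03-01'
      AND o.order_date <  DATE '2025-04-01'
)
SELECT AVG(amount) AS avg_order_value
FROM march_orders;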
Window Functions
Window functions (also called analytic functions) let you compute aggregated values across partitions of a result set without collapsing rows. Data scientists rely on window functions for advanced metrics like running totals, ranking, and lag/lead comparisons. In an interview, you might be asked to calculate a user’s churn risk by comparing their current-month spend to the previous-month spend, which requires window functions like LAG() or LEAD(). Because window functions produce results for each row, they help preserve granularity while still delivering summary information.
Example Interview Question
“Given a table transactions(user_id, transaction_date, amount), write a query to compute each user’s rolling 30-day spending total for every transaction date in 2024. The output should include user_id, transaction_date, amount, and rolling_30_day_total.”
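A sketch using a RANGE window frame (supported in PostgreSQL 11+; it also assumes the rolling window only needs to look back within 2024, since pre-2024 rows are filtered out before the window is applied):

SELECT
    user_id,
    transaction_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY transaction_date
        RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW   -- 30-day window, inclusive
    ) AS rolling_30_day_total
FROM transactions
WHERE transaction_date >= DATE '2024-01-01'
  AND transaction_date <  DATE '2025-01-01';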
Common SQL Interview Questions for Data Scientists
For SQL Beginners
1. What is a JOIN in SQL?
Explain how a JOIN merges two or more tables based on matching key columns. Describe potential pitfalls when joining on non-unique or nullable columns.
2. Which SQL command is used to add rows to a table?
Identify the INSERT statement, including syntax variations for inserting one row at a time versus batch inserts.
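For instance, using the users(user_id, name, signup_date) table that appears in the JOIN example earlier in this guide (the specific values are illustrative):

-- single-row insert
INSERT INTO users (user_id, name, signup_date)
VALUES (101, 'Ada', '2025-01-15');

-- multi-row (batch) insert in one statement
INSERT INTO users (user_id, name, signup_date)
VALUES
    (102, 'Grace',  '2025-01-16'),
    (103, 'Edsger', '2025-01-17');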
3. What is OLAP and OLTP? When do you denormalize data?
Distinguish between Online Analytical Processing (optimized for reads/aggregations) and Online Transaction Processing (optimized for inserts/updates). Explain why OLAP systems often denormalize (to reduce JOINs and speed up aggregations) whereas OLTP systems remain normalized for consistency.
4. What is the difference between WHERE and HAVING?
WHERE filters individual rows before any grouping or aggregation takes place, while HAVING filters groups after GROUP BY has been applied (see question 9 for a worked contrast).
5. When do you use the CASE WHEN expression?
Describe how CASE WHEN can produce conditional columns—e.g., converting numeric scores into labels—or pivot data from long to wide format by categorizing values within a single query.
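Two small illustrations against a hypothetical quiz_scores(student_id, subject, score) table: the first labels values, and the second uses conditional aggregation to pivot rows into columns.

-- quiz_scores is a hypothetical table used only for illustration
-- convert numeric scores into letter grades
SELECT
    student_id,
    CASE
        WHEN score >= 90 THEN 'A'
        WHEN score >= 80 THEN 'B'
        WHEN score >= 70 THEN 'C'
        ELSE 'F'
    END AS letter_grade
FROM quiz_scores;

-- pivot long-format rows into one column per subject
SELECT
    student_id,
    MAX(CASE WHEN subject = 'math'    THEN score END) AS math_score,
    MAX(CASE WHEN subject = 'science' THEN score END) AS science_score
FROM quiz_scores
GROUP BY student_id;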
6. What is the difference between a LEFT JOIN and an INNER JOIN?
INNER JOIN returns only matching rows in both tables. LEFT JOIN returns all rows from the left table, matching rows from the right table, and fills with NULL when no match exists.
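Using the users and orders tables referenced throughout this guide, the contrast looks like this:

-- INNER JOIN: only users with at least one matching order appear
SELECT u.user_id, o.order_id
FROM users u
INNER JOIN orders o ON o.user_id = u.user_id;

-- LEFT JOIN: every user appears; order columns are NULL when there is no match
SELECT u.user_id, o.order_id
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id;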
7. When would you use UNION vs. UNION ALL?
Explain that UNION eliminates duplicate rows between result sets, whereas UNION ALL includes all rows (including duplicates). If you know the two queries being combined have no overlapping rows, UNION ALL is faster since it does not perform a deduplication pass.
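A quick contrast using two hypothetical mailing-list tables:

-- newsletter_subscribers and webinar_attendees are hypothetical tables
-- UNION: duplicate rows in the combined result are removed
SELECT user_id FROM newsletter_subscribers
UNION
SELECT user_id FROM webinar_attendees;

-- UNION ALL: every row is kept, and the deduplication pass is skipped
SELECT user_id FROM newsletter_subscribers
UNION ALL
SELECT user_id FROM webinar_attendees;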
8. What is the difference between COUNT(*) and COUNT(DISTINCT column_name)?
COUNT(*) counts all rows in the result set (including duplicates). COUNT(DISTINCT column_name) counts unique, non-null values in the specified column. Note that COUNT(DISTINCT) may incur a performance penalty on large data sets.
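For example, against the logins(user_id, login_timestamp) table used in question 23 below:

SELECT
    COUNT(*)                AS total_logins,   -- every row, duplicates included
    COUNT(DISTINCT user_id) AS unique_users    -- distinct, non-null user_ids only
FROM logins;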
9. When would you use a HAVING clause versus a WHERE clause?
Restating the key difference: use WHERE to filter rows before aggregation; use HAVING to filter groups after aggregation. For example, filtering for employees hired before 2020 uses WHERE—but filtering for departments with more than 10 employees requires HAVING.
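Concretely (assuming the employees table also carries a hire_date column, which is not part of the schema given in question 13):

SELECT
    department_id,
    COUNT(*) AS num_employees
FROM employees
WHERE hire_date < DATE '2020-01-01'   -- row-level filter, applied before grouping (hire_date is assumed)
GROUP BY department_id
HAVING COUNT(*) > 10;                 -- group-level filter, applied after aggregation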
Intermediate SQL Questions
10. Retrieve All Flight Records
You have a table flights(flight_id, airline, departure_date, arrival_date, source, destination). Write a query to select every column from this table.
11. Find Neighborhoods with No Users
Two tables exist: users(user_id, name, neighborhood_id, signup_timestamp) and neighborhoods(neighborhood_id, neighborhood_name, city_id). Return all neighborhoods that have zero associated users.
12. Search Result Quality Metric
Table search_events(term, result_id, position, relevance_score) records user searches. The position column indicates rank (1 is top), and relevance_score ranges from 1 (low) to 5 (high). For each search term, produce a metric that compares how well top positions align with high relevance scores.
13. Second-Highest Salary in Engineering
You have employees(emp_id, first_name, last_name, salary, department_id) and departments(dept_id, dept_name). Write a query to find the second-highest salary among those in the “Engineering” department.
14. Annual Retention Calculation
Given a subscriptions(sub_id, user_id, payment_date, status) table, write SQL to compute the annual retention rate at the end of each year. In other words, for each year (e.g., 2023, 2024), calculate the percentage of customers who were active at year-end compared to the previous year’s active customers, excluding those who joined that same year.
15. Count Upsold Customers
Consider a transactions(user_id, transaction_date) table. Identify the number of customers who purchased more than once on different dates (i.e., upsold).
16. Swipes and Variant Performance
Two tables exist: swipes(user_id, swipe_id, action, timestamp) (where action is either ‘right’ or ‘left’) and variants(user_id, experiment_variant) (indicating which A/B test variant each user belongs to). Write a query to find, for users who swiped at least ten times, the average rate of right swipes per variant.
17. Current Salary After ETL Error
You have salaries(entry_id, emp_id, salary, updated_at), where each entry_id is unique and higher entry_id means a more recent update. Write a query to return each employee’s latest salary (the most current salary) after an ETL error created duplicate rows.
18. Third Purchase per User
Table purchases(user_id, purchase_id, purchase_date) records every purchase event. For each user, return the details (including purchase_id and purchase_date) of their third purchase.
19. Top Five Product Pairs
Assume you have order_items(order_id, product_id) and products(product_id, product_name). Write a query to find the top five most frequent pairs of products purchased together in the same order. Return the product names alongside the count of co-occurrences. Ensure each pair is listed only once (e.g., “A,B” is the same as “B,A”).
20. Project Budget Forecast
Tables: projects(project_id, project_name, budget) and employee_assignments(emp_id, project_id, start_date, end_date, prorated_salary). For each project, forecast if it will be “overbudget” or “within_budget.” A project is “overbudget” if the sum of prorated salaries of assigned employees exceeds the project’s budget. Return project_id, project_name, total_prorated_salary, and a label indicating budget status.
21. First Touch Attribution for Converting Users
You have sessions(session_id, user_id, channel, session_start, conversion_flag) and user_conversions(user_id, conversion_date). Identify the channel through which each converting user had their first session that eventually led to conversion. Return user_id, first_channel, and conversion_date.
22. Users Posting Duplicate Jobs
Two tables: job_posts(post_id, user_id, job_title, post_date) and users(user_id, name). Determine how many users posted the same job title more than once and how many users posted only unique job titles. Return two columns: users_with_multiple_posts and users_with_unique_posts.
23. Daily Login Counts
Table logins(user_id, login_timestamp) records every user login. How many users logged in exactly N times on January 1, 2022? Return a list of (login_count, user_count), where login_count is the number of times a user logged in that day, and user_count is how many users had that login frequency.
24. Average Downloads by Account Type
Two tables: accounts(account_id, is_paying_customer, signup_date) and downloads(download_id, account_id, download_date). Return the average number of downloads per day for paying and non-paying accounts. Output columns: (download_date, avg_downloads_free, avg_downloads_paying).
25. Longest Continuous Login Streak
Table events(user_id, event_date) logs each day a user logged into a platform. Find the top five users with the longest consecutive-day login streak and return (user_id, streak_length, streak_start_date, streak_end_date).
26. Identify Users Without Any Orders in Past Year
Two tables: users(user_id, signup_date) and orders(order_id, user_id, order_date). Return users who have not placed any order during the past 12 months (from today’s date).
✨AMA Career Tip: Our AI-powered interview platform offers hundreds of real SQL questions from top companies and lets you customize a learning path tailored to your weak areas—so you can practice the exact types of problems you’ll see in interviews.
With this guide and the curated list of 25+ questions, you can structure a rigorous study plan. By focusing on core SQL concepts, practicing realistic problems, and honing your explanation skills, you’ll be well-equipped to tackle any SQL interview for a data science role—whether at a FAANG company, a fast-growing startup, or a consulting firm.