Skip to content

Exploratory data analysis of a fictional tech business using PostgreSQL, focusing on revenue performance, customer segmentation, seasonality, churn, and pricing anomalies.

License

Notifications You must be signed in to change notification settings

SeanW-Data/sql-exploratory-data-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📊 SQL Exploratory Data Analysis (EDA) – Tech Business

PostgreSQL | Data Analytics Portfolio Project


📌 Project Overview

This project presents an end-to-end SQL exploratory data analysis (EDA) of a fictional tech-based business across a two-year period (2024–2025). The dataset is modelled using a star schema (fact + dimension tables) to reflect real-world analytics workflows.

The analysis focuses on revenue performance, customer behaviour, seasonality, churn/reactivation, pricing behaviour, and operational anomalies. The dataset is clean enough for immediate exploration but intentionally includes a small number of missing values and pricing outliers to support anomaly investigation and validation of assumptions.


🎯 Project Objectives

  • Understand overall revenue performance and identify key products driving sales
  • Analyse customer behaviour by segment and location to support targeting strategies
  • Identify seasonal demand patterns and product-level seasonality
  • Assess churn and reactivation trends over time
  • Detect operational anomalies (outliers + missing data), validate assumptions, and assess impact on conclusions
  • Demonstrate SQL capability in PostgreSQL (CTEs, window functions, time analysis) and business-focused interpretation

❓ Business Questions Answered

  • Which products generate the most sales?
  • Which customer segments (SMB, Consumer, Enterprise) are most valuable?
  • Are revenues growing over time (month/quarter/year)?
  • Which cities or segments are purchasing the most?
  • Are any customer segments paying more per transaction? If so, for which products?
  • Do certain customer segments buy certain product types?
  • Which periods spike in sales?
  • Are some products more seasonal?
  • Do customers churn or reappear based on time?
  • Do outliers represent bulk deals, pricing errors, or special contracts? Is missing data affecting visibility?

🗂️ Data Model (Star Schema)

The dataset follows a star schema design with one central fact table (fact_sales) and supporting dimensions (dim_customer, dim_product, dim_date). This structure enables efficient aggregation, segmentation, and time-series analysis.


🔍 Key Insights Discovered (Q&A)

1️⃣ Understand Revenue Performance

Q: Which products generate the most sales?

A: Products 37 and 38 generated the most sales by a significant margin, outperforming the third-best product by approximately £1.5–£2.5 million. These are clear revenue drivers and should remain a strategic focus.

Q: Which customer segments (SMB, Consumer, Enterprise) are most valuable?

A: The Consumer segment is the most valuable, contributing roughly total sales compared to the next highest segment (SMB).

Q: Are revenues growing over time (month/quarter/year)?

A: Revenue decreased slightly year-on-year, which is worth monitoring to ensure it does not become a sustained trend. 2024 showed stronger seasonality (notable peaks in March and December). 2025 was more consistent across the year, although Q2 underperformed relative to other quarters and would benefit from deeper investigation into potential causes.

Business impact:
➡️ Identify where the business should invest, promote, or expand.

Relevant departments:

  • Finance
  • Commercial / Strategy
  • Marketing

These insights support revenue forecasting, strategic investment decisions, and prioritisation of high-performing products and customer segments.


2️⃣ Analyse Customer Behaviour & Segments

Q: Which cities or segments are purchasing the most?

A: Bristol recorded the highest volume of orders (633). The customer segment placing the most orders was Consumer with 2,295 total orders.

Q: Are any customer segments paying more per transaction? If so, for which products?

A: Initial analysis suggested Enterprise customers paid significantly more than other segments. However, after excluding extreme pricing anomalies, segment-level differences narrowed considerably. The refined analysis shows that pricing premiums are small and largely consistent across segments, with Consumers paying a slightly higher average premium. Importantly, price deviations from list are driven by specific products rather than customer segments, highlighting product-level pricing opportunities over broad segment-based pricing strategies.

Note: This highlights the importance of validating assumptions against outliers, as a small number of extreme transactions can materially distort pricing conclusions.

Q: Do certain customer segments buy certain product types?

A: All three segments purchase more SaaS Subscription products than any other category, with the Consumer segment ordering a significantly higher number of orders and quantity. This suggests an opportunity to explore ways to increase profitability and monetisation within the SaaS subscription portfolio.

Business impact:
➡️ Marketing targeting, tailored product offers, and regional strategies.

Relevant departments:

  • Marketing
  • Sales
  • Product Management

This analysis helps guide marketing, sales priorities, and product positioning by examining how different customer segments purchase and generate value.


3️⃣ Seasonal or Sales Cycle Trends

Q: Which periods spike in sales?

A: Sales show recurring spikes in Q4, particularly October and December, which are the highest-volume months in both years. Outside of Q4, demand also lifts in mid-year (Q2–Q3)—notably June 2024 and July/September 2025—with a smaller but consistent bump in early Q1 (January/March). Overall, the pattern suggests a strong year-end peak plus secondary peaks in mid-year and early-year periods.

Q: Are some products more seasonal?

A: Several products show clear seasonal demand, with peak quarterly sales often 1.5–2.2× higher than their average quarterly sales. The strongest seasonality is observed in Product 3 (Q3 2025), where sales were 2.21× the typical quarterly level, followed by Product 39 (Q4 2025) and Product 9 (Q3 2024), both approaching average demand. Seasonal spikes are most commonly concentrated in Q2–Q4.

Q: Do customers churn or re-appear based on time?

A: Customer churn is largely balanced by reactivations, indicating stable demand driven by returning customers rather than sustained new customer growth.

Business impact:
➡️ Inventory planning, subscription renewal focus, and demand forecasting.

Relevant departments:

  • Operations / Supply Chain
  • Finance
  • Marketing
  • Customer Success

Understanding seasonal demand patterns enables better inventory planning, cash-flow forecasting, campaign timing, and proactive customer engagement.


4️⃣ Identify Operational Anomalies

Q: Do outliers represent bulk deals, pricing errors, or special contracts? Is missing data affecting visibility?

A: Further anomaly analysis identified a small number of extreme pricing events that occurred only once per product. These transactions were deemed likely data quality issues rather than legitimate pricing behaviour and were excluded from pricing analysis.

Quantifying the impact of these exclusions showed that removing just three anomalous transactions reduced the maximum unit price from £120,000 to £18,901, while the average unit price changed only marginally (£1,442 → £1,402). At the product level, each affected product contained a single extreme transaction, inflating both average and maximum prices, while normal pricing behaviour remained stable once removed. This confirms the outliers materially distorted pricing conclusions without representing genuine commercial patterns.

In contrast, repeated high-value transactions for other products were retained as they likely reflect legitimate bundled or contract-level pricing.

Records with missing customer city or segment information were excluded from segmentation and geographic analyses to ensure accurate comparisons. These transactions were retained in the dataset for overall performance metrics but omitted where attribution was required.

Missing data summary:

  • fact_sales: 10 missing sales_amount values
  • dim_product: 0 missing values
  • dim_date: 0 missing values
  • dim_customer: 3 missing city values and 3 missing segment values

Business impact:
➡️ Improve data capture systems and increase confidence in reporting and decision-making.

Relevant departments:

  • Data / Analytics
  • Finance
  • Sales Operations

These findings improve data reliability, protect reporting accuracy, and support better governance around pricing, contracts, and operational reporting.


⚠️ Assumptions & Limitations

  • Pricing analysis excludes a small number of one-off extreme transactions identified as data anomalies.
  • Segment and city analyses exclude records with missing customer attributes to avoid misclassification.
  • Some transactions may represent bundled or contract-level pricing, meaning quantity does not always reflect discrete units sold.

🛠 Tools & Skills Demonstrated

  • PostgreSQL
  • CTEs and window functions (RANK, ROW_NUMBER)
  • Time-based analysis (month, quarter, year)
  • Customer segmentation, geographic analysis, and pricing analysis
  • Anomaly detection, validation of assumptions, and data-quality interpretation
  • Business-focused insight generation and stakeholder relevance mapping

This project demonstrates how SQL-based exploratory analysis can support multiple business functions, from strategic decision-making and pricing optimisation to operational planning and data quality improvement.


Project created by Sean Worrall as part of a SQL analytics portfolio.

About

Exploratory data analysis of a fictional tech business using PostgreSQL, focusing on revenue performance, customer segmentation, seasonality, churn, and pricing anomalies.

Topics

Resources

License

Stars

Watchers

Forks