Introduction

This project focuses on integrating, cleaning, and analyzing data from Cyclistic, a bike-sharing service, to uncover insights into bike usage patterns and disparities between annual members and casual riders. By merging multiple CSV files and conducting SQL-based analysis, the project aims to provide actionable insights for marketing strategies, service enhancements, and operational decisions.


Objective

  • Merge 12 CSV files (5.7M+ rows) containing bike sharing data into a single DataFrame.
  • Perform data cleaning to handle missing values, duplicates, and standardize formats.
  • Analyze bike usage patterns between annual members and casual riders, focusing on ride frequency and durations.
  • Identify preferences for rideable types and temporal variations in bike usage.
  • Generate insights to inform marketing strategies, service improvements, and resource allocation.

Methodology

  1. Data Collection

    The CSV files containing bike sharing data has been made available by Motivate International Inc.

  2. Data Integration

    The os and Pandas library in Python is utilized to merge all CSV files into a single DataFrame using the pd.concat() function, ensuring that no data is lost during the merging process.


  3. Importing Data

  4. Data Cleaning

    Null Value Handling: Missing values in the dataset are identified and addressed using the fillna() method. Object-type columns such as station names and IDs are filled with "not specified", while float-type columns such as geographical coordinates are filled with a placeholder value (-999).

    Duplicate Detection: Duplicate rows in the dataset are identified using the duplicated() method.


  5. Importing Data

  6. Data Transformation

    Date and time columns are converted to datetime format using the pd.to_datetime() function for consistency and ease of analysis.

  7. Data Export

    The cleaned and integrated dataset is saved as a new CSV file named "merged_data.csv" using the to_csv() method.

  8. Data Analysis in SQL

    Data Import

    The dataset is imported into a PostgreSQL database table named "cyclist" using the COPY command.

    Data Transformation

    Created a temporary table "riders_analysis" to improve the performance of query. Additional columns are added to this temporary table to facilitate analysis, including extracting month, weekday, ride length, and time of day from the timestamp columns using SQL functions such as TO_CHAR() and EXTRACT(). Another table "riders_cyclist" is created with releveant columns, necessary for analysis.




    Data Cleaning

    Null values are checked and consistent values are ensured for categorical variables such as rideable type and rider type.

  9. Analysis

    General Overview:

    Breakdown of ride counts by rider type and rideable type.

    Average ride length for each rider type and rideable type.



    Total Number of Rides:

    Analysis of ride counts by months, weekdays, and time of day for each rider type and rideable type.



    Average Ride Lengths:

    Comparison of average ride lengths across months, weekdays, and time of day for each rider type and rideable type.




Result

  1. The integration process successfully merged 12 CSV files, containing over 5.7M+ rows, into a single DataFrame.
  2. Data cleaning operations addressed missing values and removed duplicate rows, ensuring the quality and integrity of the dataset.
  3. The cleaned dataset is ready for exploratory data analysis, modeling, and generating insights into bike sharing patterns and trends.
  4. General Overview:
    • Electric bikes are the most popular among both rider types, followed by classic bikes. Docked bikes have the least usage.
    • Casual riders tend to have longer average ride lengths compared to annual members.
  5. Total Number of Rides:
    • Annual members predominantly ride on weekdays, particularly during mornings and afternoons, suggesting commuting patterns.
    • Casual riders show a preference for weekend rides, with longer durations, especially in the afternoons.
  6. Average Ride Lengths:
    • Annual members tend to take shorter rides, suggesting they use the bikes for practical purposes like commuting. On the other hand, casual riders prefer longer rides, especially during weekends and evenings, indicating they use the service more for leisurely activities.

Summary

  • Tools Python & SQL (PostgreSQL)
  • Objective To Analyze Riding Patterns Among Casual and Member riders.
  • Data Stages Data Collection Data Integration Data Cleaning Data Transformation Data Export Data Analysis in SQL Extracting Insights

  • Go to Github!

  • Go to Dashboard!