Data Scientist
Technical Skills
Programming: Python, R, SQL, Git/GitHub
Machine Learning: Regression Analysis, Boosted Trees, PCA, Clustering
Data Visualization: Tableau, Matplotlib, Seaborn, ggplot2
MS Office: Excel, PowerPoint, Word
Education
M.S., Statistics and Data Science | Yale University (December 2023)
- Finalist at US Olympic & Paralympic Committee Data Challenge 2024
- Poster Presenter at UConn Sports Analytics Symposium 2024
- Teaching Fellow for S&DS 220 Intro Statistics (Intensive) and S&DS 230 Data Exploration & Analysis
B.S., Mathematics | Hans Raj College, University of Delhi (May 2022)
- Awarded merit scholarship for a full tuition fee waiver by the Department of Science and Technology, Government of India
Work Experience
Data Scientist @ Yale Sports Analytics Lab (May 2023 - Present)
- Modeled the impact of age and rest on WNBA players’ performance and projected a 12% increase in team points scored; extracted insights to aid franchise decision-making on player load management, playing position, and strategy
- Developed comprehensive dashboards using the player metrics identified as vulnerable to performance decline and offered actionable insights for maximizing team performance by strategically rotating players and managing player contract lengths to save costs
- Streamlined data collection, cleaning, and standardization for various sports leagues by developing an R package, “tidysports”, to scrape player, game, and season data, automating these tasks and saving the team’s researchers 6 hours weekly
Data Scientist @ TransOrg Analytics (August 2021 - September 2021)
- Improved checkout page conversion rate by 8% and increased monthly revenue by $24,000 for a clothing website by analyzing stage-by-stage purchase funnel drop-off rates to find friction points; implemented one-click checkout, autofill feature, and abandoned cart email reminders
- Predicted visitor purchase decisions with 87% accuracy by training multiple machine-learning models, improving the understanding of consumer behavior and enabling targeted marketing, personalized recommendations, and optimized inventory management to save costs and drive profit
- Resolved class imbalance in the data using the synthetic minority oversampling technique (SMOTE), resulting in more representative datasets for model training; facilitated informed decision-making through advanced statistical approaches for data exploration and SQL for data filtering
Business Analyst @ General Electric (July 2021 - September 2021)
- Reduced the GE plant’s yearly raw water consumption by 17.4% (1.6 million gallons), exceeding the target by 7.4%, by analyzing site data, remapping the water flow systems, and implementing rainwater harvesting and industry-leading water treatment solutions
- Prevented 87.9 metric tons of CO2 emissions yearly, generated yearly savings of INR 375,000, and achieved zero liquid discharge outside the plant by enhancing the water management system through various engineering controls while working with plant managers
- Projected the payback period to be 1.4 years for suggested changes and presented solutions to GE India heads
Research Analyst @ Intellify (July 2021 - September 2021)
- Identified emerging issues caused by the COVID-19 pandemic in the Indian education sector, such as lack of technical knowledge and low internet penetration in rural India, by conducting primary research and collating data from 500+ teachers across the country
- Examined and visualized sectoral trends in schools, colleges, tutoring centers, etc., highlighting significant challenges and the adaptations made by educators to inform interventions and decision-making that address these challenges
Projects
Team USA Gymnast Selection Optimization for Paris 2024 Olympics
Poster, App, GitHub Repository
Designed an interactive R Shiny app for the USOPC to select the Team USA gymnasts that maximize the expected medal count at the 2024 Paris Olympics by developing a model to simulate 10,000+ team combinations and compare their expected medal counts

Machine Learning for Breast Cancer Detection: Unveiling Diagnostic Potentials
GitHub Repository
Analyzed tabular data of cancer cell features and trained supervised ML algorithms such as XGBoost, Logistic Regression, Naive Bayes, K-Nearest Neighbors, and Random Forests to classify tumors as malignant or benign with 98% accuracy

To Swing or Not to Swing: Baseball Swing Probability Modeling
GitHub Repository
Developed predictive models to estimate batting swing probability for pitches thrown during a baseball game, with the best-performing model reaching an accuracy of 86%. Based on an analysis of around 2,000,000 pitches, the models provide swing probability estimates that aid strategic decision-making in gameplay and player analysis.

Spatial Trends in New York City Parking Traffic
GitHub Repository
Derived geospatial insights into parking difficulty in New York City by applying spatial autocorrelation and clustering. Used R and the Google Maps API to create visualizations. Employed Universal Kriging to predict and map the time needed to find parking.
