Available for opportunities · May 2026

Omkar
Pathare

Data Scientist & ML Engineer

Currently at

Stevens Institute of Technology

M.S. Data Science · 2026

+20% Forecast Accuracy · +17% Recommendation Precision · 0.88 AUC Score · 98.7% Classification Accuracy · 80% Spark Runtime Reduction · 94.2% Face Recognition Accuracy · 8+ Projects Shipped · 3 Certifications · 1 Publication · IJCA 2023

Redefining what's possible with data — one model, one insight, one decision at a time.

I'm Omkar Shashank Pathare — a Data Scientist and ML Engineer pursuing my Master's in Data Science at Stevens Institute of Technology. I engineer scalable ML solutions across finance, healthcare, and supply chain — from fine-tuning LLMs with QLoRA to production-grade Azure MLOps pipelines and recommendation engines that genuinely move the needle.

My toolkit: Python, R, SQL, Scikit-learn, TensorFlow, Hugging Face, Apache Spark, Azure, and everything in between. When I'm not in the data, I'm an F1 fan who occasionally takes it too far.

Omkar Pathare · Jersey City, NJ
+20%
Forecast accuracy uplift via BI dashboards
0.88
AUC — credit default prediction via PCA
98.7%
Classification accuracy in behavioral analytics
80%
Spark runtime reduction on cluster scale-up

The
Work

Nine end-to-end ML projects spanning NLP, computer vision, recommendation systems, big data, and time-series forecasting.

01
LLM Fine-Tuning · Kaggle Competition

Drawing with LLM

Fine-tuned the Qwen-7B large language model using QLoRA (Quantized Low-Rank Adaptation) on a curated dataset of 10,000 cleaned SVG–natural-language prompt pairs sourced from Kaggle competition CSV data. The pipeline included SVG canonicalization — normalizing path commands, stripping metadata, and enforcing consistent coordinate systems — alongside structural validation to discard malformed or non-renderable SVGs before training.

The fine-tuned model learned to generate structured SVG markup from free-form English descriptions of geometric shapes and compositions, significantly outperforming zero-shot Qwen-7B on prompt-to-shape alignment. The training regime used 4-bit quantization to fit the 7B parameter model on a single A100 GPU, with LoRA applied to all attention projection layers. Inference benchmarking showed improved shape accuracy and path coherence over the base model across blind test prompts.
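As a sketch of the canonicalization stage: the helper below parses an SVG, strips non-rendering metadata tags, and rejects malformed markup. It is a simplified stand-in for the actual pipeline; the tag list and namespace handling are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Tags that carry no rendering information (illustrative subset)
METADATA_TAGS = {"metadata", "title", "desc"}

def canonicalize_svg(svg_text):
    """Return a cleaned SVG string, or None if the markup is malformed."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return None  # discard non-parseable SVGs outright
    # Drop namespace prefixes so tag names compare cleanly
    for el in root.iter():
        el.tag = el.tag.split("}")[-1]
    if root.tag != "svg":
        return None  # not an SVG document at all
    # Remove metadata-only children before training
    for child in list(root):
        if child.tag in METADATA_TAGS:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")
```

A validator of roughly this shape is what lets malformed or non-renderable samples be filtered out before they ever reach the fine-tuning loop.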

QLoRA 4-bit · 10K training pairs · Better alignment over baseline
Qwen-7B · QLoRA · Hugging Face · Python · Prompt Engineering
02
Recommender Systems · NLP

Dual Model Job Recommendation System

Engineered a hybrid recommendation engine that fuses content-based and collaborative filtering signals to match job seekers with relevant roles. The content-based component uses TF-IDF vectorization on job descriptions and résumé text with cosine similarity scoring, capturing lexical overlap between candidate profiles and postings. The collaborative component applies Singular Value Decomposition (SVD) on a user–item interaction matrix built from 1,000+ applicant profiles and application histories.

A dynamic weighting mechanism blends both signals at inference time, shifting emphasis toward collaborative filtering for users with sufficient interaction history and toward content-based scoring for cold-start users. This cold-start mitigation was the core design challenge. Final evaluation against a held-out test set showed +17% precision, +14% recall, and +19% F1-score compared to a TF-IDF-only baseline. The system was implemented in Python using Scikit-learn with a preprocessing pipeline that handles noisy text, inconsistent formatting, and skill-level normalization.
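The dynamic weighting can be sketched as a simple interaction-count ramp; the ramp constant below is an illustrative assumption, not the system's tuned value.

```python
def blend_score(content_score, cf_score, n_interactions, ramp=5):
    """Blend content-based and collaborative scores.

    The collaborative weight grows with the user's interaction count,
    so cold-start users (n_interactions == 0) fall back entirely on
    content-based similarity.
    """
    w_cf = n_interactions / (n_interactions + ramp)  # 0.0 at cold start
    return w_cf * cf_score + (1.0 - w_cf) * content_score
```

A brand-new user is scored purely on lexical profile overlap, while a heavy applicant's score is dominated by the SVD signal, with a smooth transition in between.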

+17% Precision · +14% Recall · +19% F1-Score · Cold-start mitigated
Python · Scikit-learn · TF-IDF · SVD · NLP
03
Classification · Clustering · Healthcare

Predicting Problematic Internet Usage in Adolescents

Developed a multi-class classification pipeline to identify levels of problematic internet usage severity in adolescents using a dataset of 1,500+ records containing behavioral, psychological, and physical health features from the Healthy Brain Network study. Two primary classifiers were evaluated: a Decision Tree optimized via CART criterion with max-depth tuning, and a Softmax Logistic Regression with L2 regularization.

A key methodological contribution was the use of K-Means clustering (k=4, selected via elbow method and silhouette score) as a feature engineering step — cluster assignments were appended as a categorical feature to the training data, introducing unsupervised structure that improved model performance by 7.5% over raw-feature baselines. The final ensemble achieved 98.7% accuracy on the held-out test set. Class imbalance was handled using SMOTE oversampling on the minority classes. Model interpretability was examined via feature importance plots, revealing that parental monitoring, sleep duration, and time spent on social media were the strongest predictors across classes.
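The cluster-as-feature idea reduces to: fit K-Means on the features, then append each record's cluster label as a new column. A toy, dependency-free sketch (1-D data and k=2 for brevity; the project used k=4 on the full feature set):

```python
import random

def kmeans_1d(values, k=2, iters=50, seed=0):
    """Tiny 1-D K-Means; returns a cluster label per value."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: each center moves to its cluster mean
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def append_cluster_feature(rows, feature_index, k=2):
    """Append each row's cluster label as an extra categorical column."""
    labels = kmeans_1d([r[feature_index] for r in rows], k=k)
    return [r + [lab] for r, lab in zip(rows, labels)]
```

Downstream classifiers then see the unsupervised structure as one more input feature, which is the mechanism behind the reported lift over raw-feature baselines.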

98.7% Accuracy · +7.5% vs baseline · 1,500+ records
Decision Trees · Logistic Regression · K-Means · SMOTE · Scikit-learn
04
Dimensionality Reduction · Finance

Credit Card Default Prediction with PCA

Applied Principal Component Analysis to 23 financial and behavioral features from a 30,000+ record Taiwanese credit dataset to reduce dimensionality for binary default classification. PCA extracted 8 principal components that collectively explained 96.3% of the total variance — compressing the feature space by 65% while retaining nearly all predictive signal. This reduction cut logistic regression training time by 41% and reduced susceptibility to multicollinearity common in correlated financial features (e.g., payment history across consecutive months).

The PCA decomposition was computed via Singular Value Decomposition on standardized features (zero mean, unit variance). The final logistic regression classifier achieved 81.4% overall accuracy, an AUC of 0.88 on the ROC curve, and 15% improved recall specifically for default cases — the minority class — which is the critical outcome in credit risk modeling. The project also explored the interpretability of principal components by examining their loadings against original features, linking PC1 predominantly to payment delay history and PC2 to credit utilization ratio.
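For intuition on the standardize-then-decompose step, a two-feature toy makes the variance concentration visible: for standardized features the covariance matrix is [[1, r], [r, 1]], with eigenvalues 1 + r and 1 - r. A didactic sketch, not the project's implementation (which ran SVD on all 23 features):

```python
import math

def standardize(col):
    """Zero mean, unit variance, as in the project's preprocessing."""
    mean = sum(col) / len(col)
    var = sum((x - mean) ** 2 for x in col) / len(col)
    return [(x - mean) / math.sqrt(var) for x in col]

def pca_2d_explained_variance(x, y):
    """Explained-variance ratios of the two principal components.

    For standardized features the 2x2 covariance matrix is
    [[1, r], [r, 1]]; its eigenvalues are 1 + r and 1 - r.
    """
    xs, ys = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(xs, ys)) / len(xs)
    lam1, lam2 = 1 + abs(r), 1 - abs(r)
    total = lam1 + lam2  # equals 2, the number of features
    return lam1 / total, lam2 / total
```

With highly correlated inputs (as with consecutive-month payment history), the first component absorbs nearly all the variance, which is exactly why 8 components could stand in for 23 features.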

0.88 AUC · 41% Faster Training · 96.3% Variance Retained · 30K+ records
PCA · SVD · Logistic Regression · Python · Scikit-learn
05
Computer Vision · Eigenfaces

Yale Face Recognition via PCA & SVM

Implemented the classical Eigenfaces approach for facial recognition on the Yale and ORL benchmark datasets, combining PCA-based dimensionality reduction with an SVM classifier. Raw face images (128×128 grayscale pixels) were vectorized into 16,384-dimensional feature vectors, then PCA was applied to compute the eigenvectors of the covariance matrix — reducing dimensionality by 99%, retaining only the top-k eigenvectors (Eigenfaces) that captured the maximum variance across the training faces.

The reduced-dimension representations were used as inputs to a multi-class SVM with an RBF kernel, tuned via grid search over C and gamma. The Eigenfaces approach achieved 94.2% recognition accuracy, outperforming the raw-pixel SVM baseline by approximately 10 percentage points — demonstrating the power of feature extraction in high-dimensional image spaces with limited training data. The project included thorough visualization of top Eigenfaces and a confusion matrix analysis showing which individuals were most frequently confused under various lighting conditions.
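The core Eigenfaces computation, the leading eigenvector of the covariance of mean-centered image vectors, can be sketched with power iteration, which avoids ever materializing the full covariance matrix. A toy illustration (real inputs were 16,384-dimensional):

```python
def top_eigenvector(rows, iters=200):
    """Power iteration for the leading eigenvector of the covariance
    of mean-centered row vectors: the first 'Eigenface'."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(iters):
        # Covariance-vector product as X^T (X v) / n, so the
        # d x d covariance matrix is never built explicitly
        Xv = [sum(x[j] * v[j] for j in range(d)) for x in X]
        w = [sum(X[i][j] * Xv[i] for i in range(n)) / n for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v, means

def project(row, v, means):
    """1-D PCA coordinate of an image vector along the top Eigenface."""
    return sum((a - m) * c for a, m, c in zip(row, means, v))
```

The projections produced this way are the low-dimensional coordinates that the SVM classifier consumes in place of raw pixels.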

94.2% Accuracy · 99% Dim. Reduction · +10% over baseline
PCA · SVM · Eigenfaces · OpenCV · Scikit-learn
06
Big Data · ETL Pipeline · Economics

Pricing & Regional Trends in Airbnb Markets

Designed and deployed a scalable Apache Spark ETL pipeline to ingest, clean, and normalize millions of daily listing and calendar records from the Inside Airbnb open dataset across multiple U.S. cities. The ingestion stage handled messy real-world issues including inconsistently formatted price strings ($1,200/night vs 1200.0), missing amenity fields, irregular date formats, and duplicate listing entries from host re-submissions.

After cleaning, log transformations were applied to right-skewed price and occupancy distributions, and engineered features were created for borough/region indicators, room type dummies, bedroom/bathroom counts, and amenity richness scores. Log-log regression models were fitted to estimate price elasticity: location explained 11–25% of listing price variability (R² by borough), with East Coast markets showing higher mean prices but tighter variance. Demand elasticities were consistently negative across all markets, confirming standard economic theory. Scaling the pipeline from a single-node local Spark instance to a 4-worker cluster delivered approximately 80% wall-clock runtime reduction on the full dataset, achieved through partition optimization and broadcast joins for small lookup tables.
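The price-string normalization from the ingestion stage reduces to a small parsing rule. In the pipeline this logic would live in a Spark UDF or column expression; here is a plain-Python sketch with illustrative edge cases:

```python
import re

def parse_price(raw):
    """Normalize messy listing price strings to a float (USD/night).

    Handles forms like '$1,200/night', '1200.0', and ' $95 ';
    returns None when no numeric value can be recovered.
    """
    if raw is None:
        return None
    m = re.search(r"\d[\d,]*(?:\.\d+)?", str(raw))
    if not m:
        return None
    return float(m.group(0).replace(",", ""))
```

Rows that come back as None are candidates for imputation or exclusion rather than silently corrupting downstream regressions.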

80% Runtime Reduction · Millions of records · Multi-city analysis
Apache Spark · Python · SQL · Regression · Big Data
07
Cloud Architecture · MLOps · Azure

Cloud-Based Image Sharing Platform

Built a cloud-native, production-grade image-sharing platform using Python and Docker, orchestrated entirely on Microsoft Azure. The backend architecture follows a microservices pattern: Azure SQL / SQL Server handles user authentication and session management; Cosmos DB stores image metadata and user-generated tags; Azure Blob Storage persists raw image binaries with CDN delivery; Table Storage logs audit trails and access events for compliance.

Serverless Azure Functions run statelessly on demand and integrate with Cosmos DB change-feed triggers to implement a fully event-driven ML pipeline: when a new photo is uploaded, a queue-based microservice automatically sanitizes the image (removing EXIF metadata), resizes it to standardized resolutions, and runs an image classification model to auto-tag content before persisting results back to Cosmos DB. Secrets and API keys are managed via Azure Key Vault with managed identities — no secrets in code or environment variables. The full application is containerized with Docker and deployed to Azure App Service with auto-scaling rules, maintaining sub-200ms latency under simulated load of 500 concurrent users.

Event-driven MLOps · <200ms latency · Zero-secret architecture
Azure · Docker · Serverless · Cosmos DB · MLOps · Python
08
Time-Series · Forecasting · R

Ferrari F1 Pit-Stop & Lap-Time Forecasting

Developed a time-series forecasting framework in R to model Ferrari Formula 1 pit-stop durations and lap times across multiple race weekends. The project began with exploratory data analysis: visualizing pit-stop duration distributions by circuit and year, checking for trend, seasonality, and calendar effects (e.g., safety car periods, weather-induced outliers).

Stationarity was tested using Augmented Dickey-Fuller and KPSS tests; non-stationary series were corrected via first-order differencing and log transformation to stabilize variance. Autocorrelation (ACF) and partial autocorrelation (PACF) plots informed initial ARIMA order selection, which was then formalized using auto.arima — selecting optimal (p,d,q)(P,D,Q)[s] parameters by minimizing AIC/BIC. Models were validated through time-series k-fold cross-validation with expanding windows, and residual diagnostics confirmed no remaining autocorrelation (Ljung-Box test). The resulting SARIMA models captured seasonal race patterns and delivered accurate lap-time forecasts evaluated via RMSE and MAE across held-out races. Strategic implications for pit-stop timing windows were extracted from the forecast intervals.
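The stationarity correction reduces to: log-transform to stabilize variance, then first-difference to remove trend. A minimal Python sketch of that step (the project itself ran in R):

```python
import math

def log_diff(series):
    """Log-transform to stabilize variance, then first-order
    difference to remove trend; the result approximates a
    period-over-period growth-rate series."""
    logged = [math.log(x) for x in series]
    return [b - a for a, b in zip(logged, logged[1:])]
```

A series growing at a constant multiplicative rate becomes flat after this transform, which is what the ADF and KPSS tests are then re-run against.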

SARIMA · auto.arima · k-fold CV validated · Race strategy insights
R · SARIMA · ACF/PACF · Time-Series · Forecasting
09
Supply Chain Analytics · BI · SQL

Supply Chain Sales Analytics & Forecasting

Engineered an end-to-end supply chain analytics pipeline to process and analyze over 1 million sales records using SQL, surfacing a 15% revenue decline across 5 key regions tied to seasonal demand misalignment and distribution bottlenecks. The analysis combined regional segmentation, product-level trend decomposition, and inventory turnover metrics to pinpoint root causes at granular geographic and SKU levels.

Built 10+ Tableau dashboards covering inventory velocity, sales trend heatmaps, demand forecasting overlays, and KPI tracking — adopted by operations leadership for weekly strategic reviews. Integrated historical trend analysis into supply chain planning models, improving forecast accuracy by 20% compared to prior manual estimation methods. The project demonstrated how structured BI tooling and SQL-driven analytics can directly translate into operational decision-making improvements in logistics and inventory management.
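The year-over-year decline analysis pattern can be sketched as a self-join in SQLite; the schema and toy numbers below are invented for illustration (chosen to mirror the reported 15% figure), not the actual data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
# Toy data: 'East' declines year-over-year, 'West' grows
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2022, 100.0), ("East", 2023, 85.0),
     ("West", 2022, 100.0), ("West", 2023, 120.0)],
)

# Year-over-year revenue change per region via a self-join
query = """
SELECT cur.region,
       ROUND(100.0 * (cur.rev - prev.rev) / prev.rev, 1) AS pct_change
FROM (SELECT region, SUM(revenue) AS rev FROM sales
      WHERE year = 2023 GROUP BY region) AS cur
JOIN (SELECT region, SUM(revenue) AS rev FROM sales
      WHERE year = 2022 GROUP BY region) AS prev
  ON cur.region = prev.region
ORDER BY pct_change
"""
changes = {region: pct for region, pct in conn.execute(query)}
```

On real data the same shape of query, grouped further by SKU or month, is what localizes a decline to specific regions and seasons.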

+20% Forecast Accuracy · 1M+ records analyzed · 15% decline uncovered
SQL · Tableau · Python · Supply Chain · Forecasting
"It doesn't matter where you start —
it's about the impact you leave in the data."

— Omkar Pathare

Skills &
Tools

A full-stack ML toolkit spanning languages, frameworks, cloud platforms, and domain expertise.

Languages & Core
Python · 95%
SQL · 88%
R · 80%
ML / AI Frameworks
Scikit-learn · 92%
TensorFlow / Keras · 78%
Hugging Face · 75%
Data Engineering
Apache Spark · 72%
Statistical Modeling · 88%
Time-Series Forecasting · 82%
BI & Visualization
Tableau · 85%
Power BI · 82%
Cloud & DevOps
Microsoft Azure · 80%
AWS · 65%
Docker · 70%
Domain Expertise
NLP & LLMs · 80%
MLOps · 70%
Computer Vision · 68%

Certifications

Industry-recognized credentials validating expertise across big data, cloud, and analytics.

Databricks
Apache Spark Developer
Apache Spark · Big Data · Data Engineering
Verified Credential
Amazon Web Services
AWS Cloud Practitioner
Cloud · AWS · Infrastructure
Certified
Google
Data Analytics Professional
Analytics · SQL · Tableau
Certified

Experience

Two industry internships spanning data analytics and applied machine learning.

May – Jun 2023
Maxgen Technologies Pvt. Ltd.
Mumbai, India
Data Analytics Intern

Analyzed 1M+ sales records using SQL, uncovering a 15% decline across 5 key regions and surfacing root causes tied to seasonal demand shifts and distribution gaps.

Improved forecast accuracy by 20% by integrating historical trend analysis into supply chain planning models used by the operations team.

Engineered 10+ Tableau dashboards visualizing inventory velocity, regional sales trends, and KPI performance — adopted by leadership for weekly strategic reviews.

Jan – Jun 2022
Robust Results Pvt. Ltd.
Kanpur, India
Machine Learning Intern

Built breast cancer classification models using Python and Scikit-learn, boosting accuracy by 15% through SVM kernel selection and hyperparameter optimization via grid search.

Processed and analyzed RNA-Seq expression data from 800+ tumor samples — handling normalization, batch correction, and feature selection for high-dimensional genomic inputs.

Enhanced model evaluation workflows by implementing stratified k-fold cross-validation and ROC-AUC metrics, improving reporting reliability for clinical research stakeholders.

Education

Master of Science
in Data Science
Stevens Institute of Technology · Hoboken, NJ
Expected May 2026
B.E. in Artificial Intelligence
& Data Science
Thadomal Shahani Engineering College · Mumbai University
June 2024 · Distinction 🏅
Research
Enhancing Credit Card Fraud Detection via Generative Adversarial Networks
Omkar S. Pathare · IJCA LLMUC Proceedings, 2023
Always Building · Always Learning · Always Shipping

Let's
Talk

Email
opathare@stevens.edu
📞
Phone
+1 (201) 589-3135
📍
Location
Jersey City, NJ, USA

Open to full-time Data Scientist and ML Engineer roles starting May 2026. Let's connect.