Data Scientist & ML Engineer
Redefining what's possible with data — one model, one insight, one decision at a time.
I'm Omkar Shashank Pathare — a Data Scientist and ML Engineer pursuing my Master's in Data Science
at Stevens Institute of Technology. I engineer scalable ML solutions across finance, healthcare, and
supply chain — from fine-tuning LLMs with QLoRA to production-grade Azure MLOps pipelines and
recommendation engines that genuinely move the needle.
My toolkit: Python, R, SQL, Scikit-learn, TensorFlow, Hugging Face, Apache Spark, Azure, and
everything in between. When I'm not in the data, I'm an F1 fan who occasionally takes it too far.
Eight end-to-end ML projects spanning NLP, computer vision, recommendation systems, big data, and time-series forecasting.
Fine-tuned a Qwen-7B large language model using QLoRA (Quantized Low-Rank Adaptation) on a curated dataset of 10,000 cleaned SVG–natural language prompt pairs sourced from Kaggle competition data in CSV format. The pipeline included SVG canonicalization — normalizing path commands, stripping metadata, and enforcing consistent coordinate systems — alongside structural validation to discard malformed or non-renderable SVGs before training.
The fine-tuned model learned to generate structured SVG markup from free-form English descriptions of geometric shapes and compositions, significantly outperforming zero-shot Qwen-7B on prompt-to-shape alignment. The training regime used 4-bit quantization to fit the 7B parameter model on a single A100 GPU, with LoRA applied to all attention projection layers. Inference benchmarking showed improved shape accuracy and path coherence over the base model across blind test prompts.
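The core of the LoRA update can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions and no 4-bit quantization — not the project's actual training code — but it shows why the adapter adds so few trainable parameters: only the small A and B matrices are learned, while W stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank, alpha = 8, 2, 16              # toy sizes; real runs use rank 8-64
W = rng.normal(size=(d_model, d_model))      # frozen base projection (stand-in)
A = rng.normal(size=(rank, d_model)) * 0.01  # trainable down-projection
B = np.zeros((d_model, rank))                # trainable up-projection, zero-init

def lora_forward(x):
    """Base output plus scaled low-rank correction: x @ (W + (alpha/rank) * B @ A).T"""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_model))
# With B zero-initialized, the adapted model starts out identical to the base.
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero-initialized B is the standard LoRA trick: training begins from the frozen model's exact behavior, and the adapter only gradually learns a correction.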
Engineered a hybrid recommendation engine that fuses content-based and collaborative filtering signals to match job seekers with relevant roles. The content-based component uses TF-IDF vectorization on job descriptions and résumé text with cosine similarity scoring, capturing lexical overlap between candidate profiles and postings. The collaborative component applies Singular Value Decomposition (SVD) on a user–item interaction matrix built from 1,000+ applicant profiles and application histories.
A dynamic weighting mechanism blends both signals at inference time, shifting emphasis toward collaborative filtering for users with sufficient interaction history and toward content-based scoring for cold-start users. This cold-start mitigation was the core design challenge. Final evaluation against a held-out test set showed +17% precision, +14% recall, and +19% F1-score compared to a TF-IDF-only baseline. The system was implemented in Python using Scikit-learn with a preprocessing pipeline that handles noisy text, inconsistent formatting, and skill-level normalization.
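The cold-start weighting idea can be sketched as a single blending function. The linear ramp and the `full_weight_at` threshold below are illustrative assumptions, not the project's actual weighting rule:

```python
def blend_scores(content_score, collab_score, n_interactions,
                 full_weight_at=20):
    """Blend hybrid recommender signals: the weight shifts toward the
    collaborative score as the user's interaction history grows.
    With zero history (cold start), only the content score is used."""
    w = min(n_interactions / full_weight_at, 1.0)  # collaborative weight in [0, 1]
    return w * collab_score + (1.0 - w) * content_score

# Cold-start user: pure content-based score.
assert blend_scores(0.8, 0.3, n_interactions=0) == 0.8
# Heavy user: pure collaborative score.
assert blend_scores(0.8, 0.3, n_interactions=50) == 0.3
```

In practice the blend could also be learned (e.g., a logistic function of history length), but a capped linear ramp is a common, easily tuned starting point.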
Developed a multi-class classification pipeline to identify severity levels of problematic internet usage in adolescents using a dataset of 1,500+ records containing behavioral, psychological, and physical health features from the Healthy Brain Network study. Two primary classifiers were evaluated: a Decision Tree optimized via CART criterion with max-depth tuning, and a Softmax Logistic Regression with L2 regularization.
A key methodological contribution was the use of K-Means clustering (k=4, selected via elbow method and silhouette score) as a feature engineering step — cluster assignments were appended as a categorical feature to the training data, introducing unsupervised structure that improved model performance by 7.5% over raw-feature baselines. The final ensemble achieved 98.7% accuracy on the held-out test set. Class imbalance was handled using SMOTE oversampling on the minority classes. Model interpretability was examined via feature importance plots, revealing that parental monitoring, sleep duration, and time spent on social media were the strongest predictors across classes.
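The cluster-as-feature step is simple to reproduce. The sketch below uses random stand-in data rather than the study's features, but shows the mechanic: fit K-Means, then append each row's cluster assignment as an extra column before supervised training.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))  # stand-in for the behavioral/health features

# Fit k=4 clusters (the project selected k via elbow method + silhouette
# score) and append each row's assignment as a categorical feature.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, clusters])

assert X_aug.shape == (200, 7)
assert set(np.unique(clusters)) <= {0, 1, 2, 3}
```

For tree-based models the raw integer label works as-is; for linear models like the softmax regression, one-hot encoding the cluster column avoids implying an ordinal relationship between clusters.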
Applied Principal Component Analysis to 23 financial and behavioral features from a 30,000+ record Taiwanese credit dataset to reduce dimensionality for binary default classification. PCA extracted 8 principal components that collectively explained 96.3% of the total variance — compressing the feature space by 65% while retaining nearly all predictive signal. This reduction cut logistic regression training time by 41% and reduced susceptibility to multicollinearity common in correlated financial features (e.g., payment history across consecutive months).
The PCA decomposition was computed via Singular Value Decomposition on standardized features (zero mean, unit variance). The final logistic regression classifier achieved 81.4% overall accuracy, an AUC of 0.88 on the ROC curve, and 15% improved recall specifically for default cases — the minority class — which is the critical outcome in credit risk modeling. The project also explored the interpretability of principal components by examining their loadings against original features, linking PC1 predominantly to payment delay history and PC2 to credit utilization ratio.
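PCA via SVD on standardized features takes only a few lines of NumPy. This sketch uses random stand-in data (so the explained-variance numbers won't match the 96.3% reported above), but the mechanics are the same ones the project used:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 23))           # stand-in for the 23 credit features

# Standardize to zero mean, unit variance, as the project does before PCA.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal axes, ordered by variance.
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = S**2 / np.sum(S**2)          # per-component explained variance ratio

k = 8                                    # the project retained 8 components
X_reduced = Xs @ Vt[:k].T

assert X_reduced.shape == (500, 8)
assert np.isclose(explained.sum(), 1.0)
```

Inspecting `Vt[:k]` row by row is exactly the loadings analysis described above: each row gives the weights linking a principal component back to the original features.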
Implemented the classical Eigenfaces approach for facial recognition on the Yale and ORL benchmark datasets, combining PCA-based dimensionality reduction with an SVM classifier. Raw face images (128×128 grayscale pixels) were vectorized into 16,384-dimensional feature vectors, then PCA was applied to compute the eigenvectors of the covariance matrix — reducing dimensionality by 99%, retaining only the top-k eigenvectors (Eigenfaces) that capture the most variance across the training images.
The reduced-dimension representations were used as inputs to a multi-class SVM with an RBF kernel, tuned via grid search over C and gamma. The Eigenfaces approach achieved 94.2% recognition accuracy, outperforming the raw-pixel SVM baseline by approximately 10 percentage points — demonstrating the power of feature extraction in high-dimensional image spaces with limited training data. The project included thorough visualization of top Eigenfaces and a confusion matrix analysis showing which individuals were most frequently confused under various lighting conditions.
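With 16,384 pixels per image but far fewer training faces, the classic implementation trick is to eigendecompose the small n×n Gram matrix instead of the huge pixel covariance matrix. The sketch below uses random stand-in "faces" and toy counts, but the snapshot-method algebra is the standard one:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 40, 16384                  # 40 training faces, 128x128 = 16,384 pixels
faces = rng.normal(size=(n, d))   # stand-in for vectorized grayscale images

X = faces - faces.mean(axis=0)    # center around the mean face

# Snapshot trick: with n << d, eigendecompose the n x n Gram matrix X X^T
# instead of the d x d covariance X^T X -- they share nonzero eigenvalues.
gram = X @ X.T
eigvals, eigvecs = np.linalg.eigh(gram)         # ascending eigenvalue order
order = np.argsort(eigvals)[::-1][:20]          # top-20 eigenfaces

# Map Gram-space eigenvectors back to pixel space and unit-normalize.
eigenfaces = X.T @ eigvecs[:, order]
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)

projections = X @ eigenfaces                    # SVM inputs: 40 x 20
assert projections.shape == (40, 20)
```

The `projections` matrix is what would feed the RBF-kernel SVM; each column of `eigenfaces`, reshaped back to 128×128, is one of the ghostly eigenface images used in the visualizations.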
Designed and deployed a scalable Apache Spark ETL pipeline to ingest, clean, and normalize millions of daily listing and calendar records from the Inside Airbnb open dataset across multiple U.S. cities. The ingestion stage handled messy real-world issues including inconsistently formatted price strings ($1,200/night vs 1200.0), missing amenity fields, irregular date formats, and duplicate listing entries from host re-submissions.
After cleaning, log transformations were applied to right-skewed price and occupancy distributions, and engineered features were created for borough/region indicators, room type dummies, bedroom/bathroom counts, and amenity richness scores. Log-log regression models were fitted to estimate price elasticity: location explained 11–25% of listing price variability (R² by borough), with East Coast markets showing higher mean prices but tighter variance. Demand elasticities were consistently negative across all markets, confirming standard economic theory. Scaling the pipeline from a single-node local Spark instance to a 4-worker cluster delivered approximately 80% wall-clock runtime reduction on the full dataset, achieved through partition optimization and broadcast joins for small lookup tables.
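The price-string normalization in the ingestion stage boils down to a small parsing function. In the actual pipeline this logic would run inside a Spark UDF or column expression; the regex details below are illustrative assumptions:

```python
import re

def parse_price(raw):
    """Normalize messy listing price strings like '$1,200/night',
    '1200.0', or ' $95 ' to a float; return None for unparseable input."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(raw))   # drop $, commas, '/night', spaces
    try:
        return float(cleaned) if cleaned else None
    except ValueError:                          # e.g. '1.2.3' after stripping
        return None

assert parse_price("$1,200/night") == 1200.0
assert parse_price("1200.0") == 1200.0
assert parse_price(None) is None
```

Keeping the parser pure and side-effect-free is what makes it cheap to apply across millions of rows as a vectorized or distributed operation.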
Built a cloud-native, production-grade image-sharing platform using Python and Docker, orchestrated entirely on Microsoft Azure. The backend architecture follows a microservices pattern: Azure SQL / SQL Server handles user authentication and session management; Cosmos DB stores image metadata and user-generated tags; Azure Blob Storage persists raw image binaries with CDN delivery; Table Storage logs audit trails and access events for compliance.
Serverless Azure Functions run statelessly on demand and integrate with Cosmos DB change-feed triggers to implement a fully event-driven ML pipeline: when a new photo is uploaded, a queue-based microservice automatically sanitizes the image (removing EXIF metadata), resizes it to standardized resolutions, and runs an image classification model to auto-tag content before persisting results back to Cosmos DB. Secrets and API keys are managed via Azure Key Vault with managed identities — no secrets in code or environment variables. The full application is containerized with Docker and deployed to Azure App Service with auto-scaling rules, maintaining sub-200ms latency under simulated load of 500 concurrent users.
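The event-driven processing chain can be outlined as plain functions. This is a schematic stand-in, not the deployed Azure code: each step below is a placeholder for an Azure Function fired by the Cosmos DB change feed or a storage queue, and `classify` stands in for the real image-classification model.

```python
def classify(image):
    # Placeholder for the image-classification model call (auto-tagging).
    return ["untagged"]

def process_upload(image):
    """Schematic upload pipeline: sanitize (drop EXIF metadata), resize
    to a standardized resolution, then auto-tag before persisting the
    document back to the metadata store."""
    sanitized = {k: v for k, v in image.items() if k != "exif"}
    sanitized["size"] = (1024, 1024)            # standardized resolution
    sanitized["tags"] = classify(sanitized)     # ML auto-tagging step
    return sanitized

photo = {"blob_url": "https://example.blob/img.jpg",
         "size": (4032, 3024),
         "exif": {"gps": "40.7,-74.0"}}
result = process_upload(photo)
assert "exif" not in result and result["size"] == (1024, 1024)
```

The value of the pattern is that each stage is stateless and independently retryable, which is what lets serverless functions scale the pipeline on demand.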
Developed a time-series forecasting framework in R to model Ferrari Formula 1 pit-stop durations and lap times across multiple race weekends. The project began with exploratory data analysis: visualizing pit-stop duration distributions by circuit and year, checking for trend, seasonality, and calendar effects (e.g., safety car periods, weather-induced outliers).
Stationarity was tested using Augmented Dickey-Fuller and KPSS tests; non-stationary series were corrected via first-order differencing and log transformation to stabilize variance. Autocorrelation (ACF) and partial autocorrelation (PACF) plots informed initial ARIMA order selection, which was then formalized using auto.arima — selecting optimal (p,d,q)(P,D,Q)[s] parameters by minimizing AIC/BIC. Models were validated through time-series k-fold cross-validation with expanding windows, and residual diagnostics confirmed no remaining autocorrelation (Ljung-Box test). The resulting SARIMA models captured seasonal race patterns and delivered accurate lap-time forecasts evaluated via RMSE and MAE across held-out races. Strategic implications for pit-stop timing windows were extracted from the forecast intervals.
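The log-then-difference correction is easy to demonstrate on a toy series. The project used R (`auto.arima` and ADF/KPSS tests); the NumPy sketch below only illustrates the transformation step on synthetic data with an exponential trend:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy non-stationary series: exponential trend plus multiplicative noise,
# standing in for a trending lap-time series.
t = np.arange(100)
series = np.exp(0.02 * t) * (1 + 0.01 * rng.normal(size=100))

# Log transform stabilizes the multiplicative variance; first-order
# differencing then removes the (now linear) trend.
stationary = np.diff(np.log(series))

assert stationary.shape == (99,)
# After the transform, values hover around the growth rate (0.02)
# instead of trending upward.
assert abs(stationary.mean() - 0.02) < 0.01
```

On the transformed series, ACF/PACF plots become meaningful for ARIMA order selection, which is exactly the workflow described above.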
Engineered an end-to-end supply chain analytics pipeline to process and analyze over 1 million sales records using SQL, surfacing a 15% revenue decline across 5 key regions tied to seasonal demand misalignment and distribution bottlenecks. The analysis combined regional segmentation, product-level trend decomposition, and inventory turnover metrics to pinpoint root causes at granular geographic and SKU levels.
Built 10+ Tableau dashboards covering inventory velocity, sales trend heatmaps, demand forecasting overlays, and KPI tracking — adopted by operations leadership for weekly strategic reviews. Integrated historical trend analysis into supply chain planning models, improving forecast accuracy by 20% compared to prior manual estimation methods. The project demonstrated how structured BI tooling and SQL-driven analytics can directly translate into operational decision-making improvements in logistics and inventory management.
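The regional decline analysis reduces to a year-over-year aggregation in SQL. The sketch below runs the query against an in-memory SQLite table with made-up numbers; the schema and figures are illustrative, not the client's data:

```python
import sqlite3

# In-memory stand-in for the sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2022, 120.0), ("East", 2023, 100.0),
     ("West", 2022, 200.0), ("West", 2023, 210.0)],
)

# Year-over-year revenue change per region -- the kind of aggregation
# used to surface declines across key regions.
rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN year = 2023 THEN revenue END) * 1.0 /
           SUM(CASE WHEN year = 2022 THEN revenue END) - 1 AS yoy_change
    FROM sales
    GROUP BY region
    ORDER BY yoy_change
""").fetchall()

assert rows[0][0] == "East"                     # steepest decline first
assert abs(rows[0][1] - (100.0 / 120.0 - 1)) < 1e-9
```

At production scale the same pattern extends with SKU-level grouping and window functions for trend decomposition; the dashboards then read from the aggregated views.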
"It doesn't matter where you start —
it's about the impact you leave in the data."
— Omkar Pathare
A full-stack ML toolkit spanning languages, frameworks, cloud platforms, and domain expertise.
Industry-recognized credentials validating expertise across big data, cloud, and analytics.
Two industry internships spanning data analytics and applied machine learning.
Analyzed 1M+ sales records using SQL, uncovering a 15% decline across 5 key regions and surfacing root causes tied to seasonal demand shifts and distribution gaps.
Improved forecast accuracy by 20% by integrating historical trend analysis into supply chain planning models used by the operations team.
Engineered 10+ Tableau dashboards visualizing inventory velocity, regional sales trends, and KPI performance — adopted by leadership for weekly strategic reviews.
Built breast cancer classification models using Python and Scikit-learn, boosting accuracy by 15% through SVM kernel selection and hyperparameter optimization via grid search.
Processed and analyzed RNA-Seq expression data from 800+ tumor samples — handling normalization, batch correction, and feature selection for high-dimensional genomic inputs.
Enhanced model evaluation workflows by implementing stratified k-fold cross-validation and ROC-AUC metrics, improving reporting reliability for clinical research stakeholders.
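The evaluation workflow above can be sketched with Scikit-learn on synthetic data. The feature matrix below is a random stand-in (the real inputs were RNA-Seq expression profiles), but the stratified cross-validation and ROC-AUC scoring are the same pattern:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic imbalanced binary problem standing in for the tumor data.
X, y = make_classification(n_samples=200, n_features=20, weights=[0.7, 0.3],
                           random_state=0)

# Stratified folds preserve the class ratio in every split, keeping
# ROC-AUC estimates stable on imbalanced clinical data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma="scale"),
                         X, y, cv=cv, scoring="roc_auc")

assert scores.shape == (5,)
assert np.all((scores >= 0.0) & (scores <= 1.0))
```

Reporting the per-fold scores (mean ± standard deviation) rather than a single split is what made the results more reliable for clinical stakeholders.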
Open to full-time Data Scientist and ML Engineer roles starting May 2026. Let's connect.