

Computer Vision Enabled Basketball Analytics

Author(s): Derek Austin

Originally published on Towards AI.

Image generated via DALL-E 3 by author


Over the last decade, analytics have revolutionized basketball, highlighted by the rising dominance of the three-pointer. The shift is evident in the jump from 18.0 three-point attempts per game in the 2010–2011 NBA season to 34.2 in the 2022–2023 season.¹ The transformation is rooted in an increased emphasis on quantifying and playing “efficient” basketball: a 50% success rate on two-pointers is equivalent in expected value to a 33% rate on three-pointers, leading NBA teams to phase out mid-range shots, which are typically converted at a ~40% clip. Beyond these basic stats, firms like Second Spectrum² have advanced the analytics revolution, deploying sophisticated tracking technology to gather detailed player and game data; they can provide extremely detailed stats and clips to players and teams on specific defensive coverages, shot quality, and much more. However, this technological advancement has widened the gap in the broader basketball landscape between teams with and without the resources to adopt such systems. That disparity inspired this project, a proof of concept demonstrating that even basic open-source computer vision tools can help democratize analytics for the masses. I provide a quick demo that lets users navigate through two games, showing clips where the model anticipates a shot occurred along with its predictions for the four key tasks discussed below. Give it a look!
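The efficiency arithmetic above can be made concrete with a tiny back-of-the-envelope sketch (the percentages are the illustrative numbers from the paragraph; `expected_points` is a hypothetical helper, not part of the project):

```python
# Expected points per shot (PPS) for the shot profiles mentioned above.
def expected_points(make_pct: float, shot_value: int) -> float:
    """Expected points of a single shot attempt."""
    return make_pct * shot_value

two_pointer = expected_points(0.50, 2)     # 1.0 PPS
three_pointer = expected_points(1 / 3, 3)  # ~1.0 PPS, same expected value
mid_range = expected_points(0.40, 2)       # 0.8 PPS — the least efficient
```

This is exactly why the mid-range shot has fallen out of favor: at typical conversion rates it trails both alternatives in expected value.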

Data Curation & Tasks

I initiated the data-gathering process by manually collecting publicly available NCAA basketball data. I developed a minimalist Flask application to mark the precise location and timing of shots on a basketball court, amassing data from approximately 60 complete games viewed on YouTube, totaling around 8,000 shot instances. For each shot, I noted the exact moment the ball was released, the shot type (three-pointer or two-pointer), the outcome (made or missed), and the shot’s coordinates on the court.

Clip showcasing the various properties of the demo website. Source: Author

Initially, I labeled about 20 videos, then applied a semi-supervised learning approach to review clips where my model suggested a greater than 5% likelihood of a shot for an additional 40 videos. This method significantly expedited the review process, as I did not have to watch long dead periods where shots clearly did not occur (side note: automatically removing dead time from game film, essentially condensing it, would be extremely valuable to any middle or high school team).
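That review step can be sketched as a simple filter over per-clip shot probabilities (`clips_to_review` is a hypothetical helper; the 5% threshold is the one mentioned above):

```python
# Semi-supervised triage: surface only clips whose predicted shot
# probability exceeds the threshold, skipping obvious dead periods.
def clips_to_review(clip_probs, threshold=0.05):
    """Return (clip_index, probability) pairs worth a human look."""
    return [(i, p) for i, p in enumerate(clip_probs) if p > threshold]

# e.g. model probabilities for six consecutive clips
review = clips_to_review([0.01, 0.62, 0.03, 0.94, 0.02, 0.11])
# clips 1, 3, and 5 get reviewed; the rest are skipped entirely
```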

I then sought to automate a basic form of shot analytics, covering the following four use cases:

  1. Detection of a shot occurrence within a clip.
  2. Identification of the precise frame the shot occurred.
  3. Localization of the shot on the court.
  4. Assessment of the shot’s relative quality, expressed as the probability of the shot going in.

Methods & Results

Each task above presented unique challenges. The primary initial hurdle was slow video file loading, which resulted in minimal GPU utilization and extremely long training runs. Although various libraries are available, I found that Nvidia Dali³, which performs video decoding on the GPU, enhanced loading speed by 3–5x (if anyone knows of other accelerated loading libraries or techniques for loading clips quickly, please reach out!). For all experiments, I used the VideoMAE⁴ base model and fine-tuned all parameters, as most tasks showed few signs of overfitting. Future plans include exploring V-JEPA⁵ to see whether the change in pretraining yields any meaningful improvements. I adapted the VideoMAE authors’ code to use flash attention, which reduced training time by 30% and halved GPU memory usage per clip. Clips were sampled with 16 frames at a 0.3-second stride between frames, totaling 4.5 seconds per clip. All results are reported on a validation set of 10 complete videos (i.e., no video in the val set was ever trained on). I trained with the 8-bit bitsandbytes⁶ Adam optimizer, saving GPU memory as well.
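The clip-sampling scheme is easy to sketch: 16 frames spaced 0.3 seconds apart span 15 × 0.3 = 4.5 seconds of video. Below is a minimal sketch of that index computation (`sample_frame_indices` is a hypothetical helper; the actual Dali decoding pipeline is omitted, and `fps` is assumed known from the source file):

```python
# Compute which frame indices to decode for one training clip:
# 16 frames, 0.3 s apart, i.e. a 4.5 s window.
def sample_frame_indices(start_frame: int, fps: float,
                         num_frames: int = 16, stride_s: float = 0.3):
    step = round(fps * stride_s)  # frames between consecutive samples
    return [start_frame + i * step for i in range(num_frames)]

idx = sample_frame_indices(start_frame=0, fps=30)
# 16 indices, 9 frames apart at 30 fps, ending at frame 135 (4.5 s in)
```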


Predicting Precisely When Shots Take Place

The predominance of negative examples over positive ones can significantly bias the model without adjustments to training. A popular technique is the focal loss⁷; however, I found that simple sample balancing worked quite well. I used a 2:1 ratio of negatives to positives and resampled the dataset each epoch to allow for a greater diversity of negatives. Resampling each epoch also provided a form of temporal augmentation, as the model would see the same shot from different starting points on each iteration.
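The per-epoch resampling described above can be sketched as follows (`resample_epoch` is a hypothetical helper; the 2:1 ratio is the one from the text):

```python
import random

# Per-epoch sample balancing: keep every positive clip, and draw a
# fresh 2x-sized sample of negatives each epoch so the model sees a
# different mix of dead-time clips every pass through the data.
def resample_epoch(positives, negatives, ratio=2, rng=random):
    n_negs = min(ratio * len(positives), len(negatives))
    epoch = positives + rng.sample(negatives, n_negs)
    rng.shuffle(epoch)
    return epoch
```

Because the negatives are redrawn every epoch, the model effectively sees a much larger pool of negatives over training than any single balanced dataset would contain.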

I was able to use the same backbone with two different classifiers attached to the embedding layer for the tasks of predicting whether a shot occurred and precisely which frame it occurred in, saving cost at both inference and training time. In fact, I found that fine-tuning for both tasks simultaneously improved performance on each, compared with fine-tuning a separate model per task. While the tasks are obviously closely related, this was somewhat of a surprise, and it led me to believe that, given the limited data, joint training produced the kind of regularization/ensembling effect known to improve model performance.
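The shared-backbone arrangement amounts to one clip embedding feeding two small linear heads. Here is a toy sketch of that structure (all dimensions and weights are illustrative stand-ins, not the real VideoMAE model):

```python
import random

# One embedding from the video backbone feeds two linear heads:
# a shot/no-shot head (1 logit) and a frame-index head (16 logits,
# one per sampled frame in the clip).
random.seed(0)
EMBED_DIM, NUM_FRAMES = 8, 16

def linear_head(embedding, weights):
    """weights: one row of EMBED_DIM values per output logit."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

embedding = [random.gauss(0, 1) for _ in range(EMBED_DIM)]   # from the backbone
shot_head = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]]
frame_head = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]
              for _ in range(NUM_FRAMES)]

shot_logit = linear_head(embedding, shot_head)      # length-1 list
frame_logits = linear_head(embedding, frame_head)   # length-16 list
```

At training time, both heads' losses are summed and backpropagated through the single backbone, which is where the regularization effect plausibly comes from.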

I evaluate the model over all possible 4.5-second clips in the validation set, starting from 0 seconds and ending with the last possible full-length clip, essentially chunking the video into non-overlapping 4.5-second clips. The model achieves a ROC-AUC of .96, a recall of .86, a precision of .61, and an overall accuracy of 92%. Most errors occur along the edges of a clip (the first and last couple of frames), which led to an optimization: I stagger the start times in my data loader and only take ‘positives’ from clips where the shot prediction falls in the middle 8 frames, trading additional inference time for accuracy (I also used a form of non-max suppression to ensure no duplicates were shown). This improved precision from .61 to .7 and overall F1 by .04. Furthermore, in the demo website, I use a threshold of 0.8 to predict whether a shot occurred, boosting precision (at the cost of recall) so users can iterate quickly through the clips. The model also achieved an L1 error in frame prediction of .34 frames (about .1 of a second), so we can be highly confident in predicting the frame in which a shot occurred.
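The duplicate suppression mentioned above can be sketched as a greedy temporal non-max suppression over shot detections (`temporal_nms` is a hypothetical helper, and the 16-frame window is an illustrative assumption, not the value the project actually used):

```python
# Greedy temporal NMS: keep the highest-scoring detection, then drop
# any other detection within `window` frames of something already kept.
def temporal_nms(detections, window=16):
    """detections: list of (frame_index, score). Returns kept detections,
    sorted by frame index."""
    kept = []
    for frame, score in sorted(detections, key=lambda d: -d[1]):
        if all(abs(frame - kf) >= window for kf, _ in kept):
            kept.append((frame, score))
    return sorted(kept)

dets = [(100, 0.95), (104, 0.90), (300, 0.85)]
# the detection at frame 104 is within 16 frames of the stronger one
# at frame 100, so only two detections survive
```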

Predicting Where a Shot Took Place

I initially approached the shot location task as a regression problem, targeting x and y coordinates to allow for exact shot location. However, the model struggled to distinguish between three-pointers and mid-range shots, a significant issue since mid-range shots are less efficient and the differentiation is crucial for coaching strategies. To alleviate this problem, I changed the task from regression to classification. Specifically, each shot fell into one of 7 classes:

  1. Paint-extended
  2. Left Mid-Range
  3. Middle Mid-Range
  4. Right Mid-Range
  5. Left Wing Three Pointer
  6. Middle Three Pointer
  7. Right Wing Three Pointer

This approach significantly enhanced performance by substantially narrowing the model’s search space. I mirrored all shots onto one half of the court, eliminating the need to double the class count for each side of the court. To keep the classifier unbiased, I also balanced the dataset so each cluster had an equal number of examples every epoch, boosting average per-cluster accuracy substantially. This strategy led to a 78% accuracy rate, with most errors occurring when the shot was taken near the border of two clusters. Take a look at the website to judge the relative quality of the predictions for yourself!

Predicting The Shot Quality

Finally, I had to predict the relative quality of the shot, which I framed as the percentage chance the shot would go in. This was a tough task to quantify, as deep learning models are notoriously poorly calibrated. From a quick eyeball test, the model does not perform well in terms of overall accuracy or calibration; the loss and accuracy barely budge after the first epoch or so, sitting at 60% accuracy. Future work on this project could dive into the extensive literature on model calibration to achieve far better results. As shown in the demo, the confidence scores are often quite low and do not align well with human judgment (at least my human judgment).
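One simple diagnostic for that calibration problem is a reliability check: bucket the predictions by confidence and compare each bucket's mean confidence to its actual make rate. A well-calibrated model has near-zero gaps. The sketch below (`calibration_gaps` is a hypothetical helper, not code from the project) computes those per-bin gaps:

```python
# Per-bin calibration gaps: (bin_low, bin_high, mean_confidence - make_rate).
# A positive gap means the model is overconfident in that bin.
def calibration_gaps(probs, outcomes, n_bins=5):
    gaps = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [(p, o) for p, o in zip(probs, outcomes) if lo <= p < hi]
        if bucket:
            mean_conf = sum(p for p, _ in bucket) / len(bucket)
            make_rate = sum(o for _, o in bucket) / len(bucket)
            gaps.append((lo, hi, round(mean_conf - make_rate, 3)))
    return gaps
```

A standard next step would be temperature scaling on a held-out set, which rescales the logits to shrink these gaps without changing the model's ranking of shots.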

Maybe the shot quality metric does work sometimes… Source: Author


Limitations

The primary constraint was data quantity, as just 8,000 examples is quite small, especially given the diversity of possible basketball shots. However, I did not perform any targeted scaling experiments to assess potential gains from more training data. Automation with large-scale multi-modal models would make the labeling process far more efficient; however, when I evaluated advanced multi-modal models like Gemini-1.0 and GPT4-V on identifying shots within a ten-frame window, both performed no better than chance, leaving them essentially useless for the current labeling process. Furthermore, the memory-hungry nature of attention limits the input to 16 frames. A future area I am quite excited about is State Space Models, which scale linearly instead of quadratically and show quite promising results in the video domain.⁸ Image size likely also plays a large role: each image is resized to 224×224 during training, which likely makes the ball far harder to pick out than at larger image sizes. The demo also shows some key failure cases, such as cluster-accuracy mistakes and various localization errors, which highlight the limitations of this system in its current form.


Conclusion

Computer vision analytics offers an affordable way for sports teams across various levels to enhance their understanding and optimization of gameplay. A standout example is SwingVision⁹, which is pioneering analytics in tennis and pickleball with just an iPhone camera. Similarly, in basketball, a model trained on games from middle school to college levels democratizes access to detailed statistics, once exclusive to wealthier teams. An exciting idea for me personally is a world in which any high school or middle school coach could submit game footage and swiftly receive comprehensive game analyses with advanced shot tracking and efficiency metrics. By harnessing the power of computer vision, the field of sports analytics could undergo a transformative shift, making advanced data insights accessible to all and thereby reshaping the competitive landscape in sports.


References

  4. Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., & Qiao, Y. (2023). VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14549–14560.
  5. Bardes, A., Garrido, Q., Ponce, J., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv preprint.
  7. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2018). Focal Loss for Dense Object Detection.
  8. Li, K., Li, X., Wang, Y., He, Y., Wang, Y., Wang, L., & Qiao, Y. (2024). VideoMamba: State Space Model for Efficient Video Understanding.


Published via Towards AI
