Using NLP in Disaster Response
Last Updated on July 25, 2023 by Editorial Team
Author(s): Abhishek Jana
Originally published on Towards AI.
In this project, we'll apply ETL, NLP, and ML pipelines to analyze disaster data from Figure Eight and build a model for an API that classifies disaster messages.
This is one of the most critical problems in data science and machine learning. During a disaster, millions of messages arrive, either directly or via social media, and perhaps only 1 in 1,000 is relevant. A few critical keywords, such as water, blocked roads, and medical supplies, tend to appear in the messages that matter for disaster response. We have a categorized dataset on which we can train an ML model to identify which messages are relevant to disaster response.
Main Web Interface
In this project, three main components of a data science workflow have been utilized:
- Data Engineering: In this section, I worked on how to Extract, Transform, and Load the data, then prepared it for model training. For preparation, I cleaned the data by removing bad records (ETL pipeline), then used NLTK to tokenize and lemmatize the text (NLP pipeline). Finally, I used custom features like StartingVerbExtractor and StartingNounExtractor to add new features to the main dataset (see the sketch after this list).
- Model Training: I used an XGBoost classifier to create the ML pipeline for model training.
- Model Deployment: For model deployment, I used a Flask API.
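As a rough sketch (not the exact code in the repository), the tokenizer and a custom transformer like StartingVerbExtractor can be combined with TF-IDF features and an XGBoost classifier in a scikit-learn pipeline; StartingNounExtractor would be analogous, keying on noun POS tags:

import re

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from xgboost import XGBClassifier

nltk.download(["punkt", "wordnet", "averaged_perceptron_tagger"], quiet=True)

lemmatizer = WordNetLemmatizer()

def tokenize(text):
    """Normalize, tokenize, and lemmatize a raw message."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)]

class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    """Binary feature: does the message open with a verb (or 'RT')?"""

    def starting_verb(self, text):
        for sentence in nltk.sent_tokenize(text):
            pos_tags = nltk.pos_tag(word_tokenize(sentence))
            if pos_tags:
                first_word, first_tag = pos_tags[0]
                if first_tag in ("VB", "VBP") or first_word == "RT":
                    return 1
        return 0

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame([self.starting_verb(text) for text in X])

# Text features plus the custom flag feed a multi-output XGBoost classifier.
pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", Pipeline([
            ("vect", CountVectorizer(tokenizer=tokenize)),
            ("tfidf", TfidfTransformer()),
        ])),
        ("starting_verb", StartingVerbExtractor()),
    ])),
    ("clf", MultiOutputClassifier(XGBClassifier())),
])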
This project was developed on the Anaconda platform using Jupyter Notebook. Detailed instructions on how to install Anaconda can be found here. To create a virtual environment, see here.
In the virtual environment, clone the repository:
git clone https://github.com/abhishek-jana/Disaster-Response-Pipelines.git
The Python packages used for this project are:
NumPy
pandas
scikit-learn
NLTK
re
SQLAlchemy
pickle
Flask
Plotly
1. Run the following commands in the project's root directory to set up your database and model.
- To run the ETL pipeline that cleans the data and stores it in the database (a condensed sketch of these steps follows this list):
python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
- To run the ML pipeline that trains the classifier and saves it:
python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
2. Run the following command in the app's directory to run your web app:
python run.py
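For reference, the core of process_data.py might look like the following minimal sketch. It assumes the Figure Eight category strings have the form "related-1;request-0;..." and uses "DisasterResponse" as a hypothetical table name; the actual script may differ in its details:

import sys

import pandas as pd
from sqlalchemy import create_engine

def load_and_clean(messages_filepath, categories_filepath):
    """Merge messages with categories and expand categories into binary columns."""
    messages = pd.read_csv(messages_filepath)
    categories = pd.read_csv(categories_filepath)
    df = messages.merge(categories, on="id")

    # Split "related-1;request-0;..." into one column per category.
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [value.split("-")[0] for value in cats.iloc[0]]
    for col in cats:
        # Keep the trailing digit; cap stray values (e.g., a 2 in 'related') at 1.
        cats[col] = cats[col].str[-1].astype(int).clip(upper=1)

    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    return df.drop_duplicates()

if __name__ == "__main__":
    messages_csv, categories_csv, db_path = sys.argv[1:4]
    engine = create_engine(f"sqlite:///{db_path}")
    load_and_clean(messages_csv, categories_csv).to_sql(
        "DisasterResponse", engine, index=False, if_exists="replace"
    )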
The project is structured as follows:
The data folder contains the data files "disaster_categories.csv" and "disaster_messages.csv", from which the messages and categories are extracted. "DisasterResponse.db" is a cleaned version of the dataset saved as a SQLite database. "ETL Pipeline Preparation.ipynb" is the Jupyter notebook explaining the data preparation method, and "process_data.py" is the Python script version of that notebook.
"ML Pipeline Preparation.ipynb" is the Jupyter notebook explaining the model training method. The corresponding Python script, "train_classifier.py", can be found in the "models" folder, where the final trained model is also saved as "classifier.pkl".
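Condensed, and assuming the pipeline object from the earlier sketch plus the hypothetical "DisasterResponse" table name, the training flow in train_classifier.py might look like this:

import pickle

import pandas as pd
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data/DisasterResponse.db")
df = pd.read_sql_table("DisasterResponse", engine)

X = df["message"]
Y = df.iloc[:, 4:]  # assumes the first four columns are id, message, original, genre

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
pipeline.fit(X_train, Y_train)  # 'pipeline' is the scikit-learn Pipeline sketched above

with open("models/classifier.pkl", "wb") as f:
    pickle.dump(pipeline, f)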
The app folder contains the "run.py" script that renders the visualizations and results on the web. The templates folder contains the .html files for the web interface.
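A minimal sketch of what run.py could look like is below. The template names (master.html, go.html), the table name, and the column layout are assumptions, and note that tokenize and the custom transformer classes must be importable when the pickled model is loaded:

import json
import pickle

import pandas as pd
import plotly
from flask import Flask, render_template, request
from sqlalchemy import create_engine

app = Flask(__name__)

engine = create_engine("sqlite:///data/DisasterResponse.db")
df = pd.read_sql_table("DisasterResponse", engine)
model = pickle.load(open("models/classifier.pkl", "rb"))

@app.route("/")
def index():
    # Plotly figure of message-genre counts for the landing page.
    genre_counts = df["genre"].value_counts()
    graphs = [{
        "data": [{"type": "bar",
                  "x": genre_counts.index.tolist(),
                  "y": genre_counts.values.tolist()}],
        "layout": {"title": "Distribution of Message Genres"},
    }]
    graph_json = json.dumps(graphs, cls=plotly.utils.PlotlyJSONEncoder)
    return render_template("master.html", graphJSON=graph_json)

@app.route("/go")
def go():
    # Classify the query message and pass per-category labels to the template.
    query = request.args.get("query", "")
    labels = model.predict([query])[0]
    results = dict(zip(df.columns[4:], labels))
    return render_template("go.html", query=query, classification_result=results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3001, debug=True)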
The accuracy, precision, and recall of the model are shown in the figures below (one for accuracy, one for precision and recall).
Some of the model's predictions on sample messages are shown as well (figures: messages 1-3).
In the future, I plan to work on the following areas of the project:
- Test different estimators and add new features to the data to improve model accuracy.
- Add more visualizations to understand the data.
- Improve the web interface.
- Suggest organizations to connect with, based on the categories the ML algorithm assigns to a message.
- Address the dataset's imbalance (some labels, like water, have few examples): discuss how this imbalance affects model training, and whether precision or recall should be emphasized for each category (see the sketch after this list).
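For that last point, per-category metrics are more informative than a single accuracy number on an imbalanced dataset; something like the sketch below (names assumed) reports precision and recall for each output category. For scarce, life-critical labels such as water, recall is usually worth emphasizing, since a missed request costs more than a false alarm:

import numpy as np
from sklearn.metrics import classification_report

def report_per_category(Y_true, Y_pred, category_names):
    """Print precision, recall, and F1 for each output category separately."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    for i, name in enumerate(category_names):
        print(f"--- {name} ---")
        print(classification_report(Y_true[:, i], Y_pred[:, i], zero_division=0))

# Example: report_per_category(Y_test, pipeline.predict(X_test), Y.columns)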
The GitHub repository for the project can be found here.
Acknowledgment
I am thankful to the Udacity Data Science Nanodegree program for motivating this project, and to Figure Eight for making the data publicly available.