NL4DV requires a 64-bit Python 3 environment. Windows users must ensure that Microsoft C++ Build Tools is installed; Mac OS X users must ensure that Xcode is installed.
Install using one of the below methods:

- PyPI. Run:

  pip install nl4dv==3.0.0

- A local distributable. Download nl4dv-3.0.0.tar.gz or nl4dv-3.0.0.zip, then run pip install nl4dv-3.0.0.tar.gz or pip install nl4dv-3.0.0.zip accordingly.
Note: We recommend installing NL4DV in a virtual environment to avoid version conflicts with globally installed packages.
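For example, using the same virtualenv commands that the example applications later in this document use:

virtualenv --python=python3 venv
source venv/bin/activate (MacOSX/Linux)
venv\Scripts\activate.bat (Windows)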
NL4DV installs nltk by default, but a few nltk datasets/models/corpora must be installed separately. Download the popular nltk artifacts using:

python -m nltk.downloader popular
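Alternatively, the same artifacts can be downloaded from within a Python session using the standard nltk API:

import nltk
nltk.download("popular")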
NL4DV requires a third-party Dependency Parser module to infer tasks. Download and install one of:
Stanford CoreNLP (recommended):
Download the English model of Stanford CoreNLP version 3.9.2 and copy it to `examples/assets/jars/` or a known location.
Download the Stanford Parser version 3.9.2, unzip it, and copy the `stanford-parser.jar` file to `examples/assets/jars/` or a known location.
Note: This requires JAVA installed and the JAVA_HOME / JAVAHOME environment variables to be set.
Stanford CoreNLPServer:
Download the Stanford CoreNLPServer, unzip it in a known location, and cd into it.
Start the server using the below command. It will run on http://localhost:9000.

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000
Note: This requires JAVA installed and the JAVA_HOME / JAVAHOME environment variables to be set.
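To verify the server is reachable, here is a minimal sketch (assuming the third-party requests package is installed; the sample sentence is arbitrary):

import json
import requests

# The CoreNLP server takes the text as the POST body and the annotator
# configuration as a JSON-encoded "properties" query parameter.
properties = {"annotators": "tokenize,pos", "outputFormat": "json"}
response = requests.post(
    "http://localhost:9000",
    params={"properties": json.dumps(properties)},
    data="NL4DV is a toolkit.".encode("utf-8"),
)
print(response.status_code)  # 200 indicates the server is up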
Spacy:
NL4DV installs Spacy by default but requires a model to be separately installed. Install the sample English model using:
python -m spacy download en_core_web_sm
If you face the "urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with LibreSSL 2.8.3" error while installing Spacy models, then, as suggested in this Stack Overflow post, perform the following operation until there is a better fix:
pip uninstall urllib3 && pip install 'urllib3<2.0'
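To confirm the model installed correctly, load it in a Python session using the standard Spacy API:

import spacy

nlp = spacy.load("en_core_web_sm")
print([token.pos_ for token in nlp("NL4DV generates visualizations.")])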
If everything went well in the Installation and Post-Installation steps above, you are all set. Use the below Sample Code to get started with the LLM-based (GPT) mode. A more detailed sample file can be found here.
from nl4dv import NL4DV
# Your dataset must be hosted on GitHub for the LLM-based mode to function.
data_url = "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/movies-w-year.csv"  # paste your data URL

# Choose your processing mode: "gpt" for the LLM-based mode or "semantic-parsing" for the rules-based mode.
processing_mode = "gpt"

# Enter your OpenAI key
gpt_api_key = "[OpenAI KEY HERE]"

# Initialize an instance of NL4DV
nl4dv_instance = NL4DV(data_url=data_url, processing_mode=processing_mode, gpt_api_key=gpt_api_key)

# Define a query
query = "create a barchart showing average gross across genres"

# Execute the query
output = nl4dv_instance.analyze_query(query)

# Print the output
print(output)
The printed output has the below top-level structure (sub-objects abridged):

{
"query": "create a barchart showing average gross across genres",
"dataset": "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/cars-w-year.csv",
"attributeMap": {"..."},
"taskMap": {"..."},
"visList": ["..."],
"followUpQuery": false,
"contextObj": null
}
The attributeMap for this query:

{
"Worldwide Gross": {
"name": "Worldwide Gross",
"queryPhrase": ["gross"],
"inferenceType": "explicit",
"isAmbiguous": false,
"ambiguity": []
},
"Genre": {
"name": "Genre",
"queryPhrase": ["genres"],
"inferenceType": "explicit",
"isAmbiguous": false,
"ambiguity": []
}
}
The taskMap for this query:
{
"derived_value": [
{
"task": "derived_value",
"queryPhrase": "average",
"operator": "AVG",
"values": [],
"attributes": [
"Worldwide Gross"
],
"inferenceType": "explicit"
}
]
}
The visList, each entry carrying a Vega-Lite specification (vlSpec):

[
{
"attributes": [
"Worldwide Gross",
"Genre"
],
"queryPhrase": "barchart",
"visType": "barchart",
"tasks": [
"derived_value"
],
"inferenceType": "explicit",
"vlSpec": {
"$schema": "https://vega.github.io/schema/vega-lite/v4.json",
"mark": {
"type": "bar",
"tooltip": true
},
"encoding": {
"y": {
"field": "Worldwide Gross",
"type": "quantitative",
"aggregate": "mean",
"axis": {
"format": "s"
}
},
"x": {
"field": "Genre",
"type": "nominal",
"aggregate": null
}
},
"transform": [],
"data": {
"url": "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/cars-w-year.csv",
"format": {
"type": "csv"
}
}
}
}
]
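To render the recommended chart, pass the top-ranked vlSpec to any Vega-Lite renderer. A minimal sketch for a Jupyter notebook cell, assuming the vega package (the same one used by the notebook examples later in this document) is installed:

from vega import VegaLite

# visList is ordered by relevance; take the top-ranked visualization.
best_vis = output["visList"][0]
VegaLite(best_vis["vlSpec"])  # displays the chart in the notebook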
If you are instead using the rules-based (semantic-parsing) mode, use the below Sample Code to get started. A more detailed sample file can be found here.
from nl4dv import NL4DV
import os
# Initialize an instance of NL4DV
# TODO: verify the path to the source data file; modify accordingly.
nl4dv_instance = NL4DV(data_url=os.path.join(".", "examples", "assets", "data", "movies-w-year.csv"))

# Using Stanford CoreNLP
# TODO: verify the paths to the jars; modify accordingly.
dependency_parser_config = {
    "name": "corenlp",
    "model": os.path.join(".", "examples", "assets", "jars", "stanford-english-corenlp-2018-10-05-models.jar"),
    "parser": os.path.join(".", "examples", "assets", "jars", "stanford-parser.jar"),
}

# Using Stanford CoreNLPServer
# TODO: verify the URL to the CoreNLPServer; modify accordingly.
# dependency_parser_config = {"name": "corenlp-server", "url": "http://localhost:9000"}

# Using Spacy
# TODO: ensure that the below Spacy model is installed; if using another model, modify accordingly.
# dependency_parser_config = {"name": "spacy", "model": "en_core_web_sm", "parser": None}
# Set the Dependency Parser
nl4dv_instance.set_dependency_parser(config=dependency_parser_config)
# Define a query
query = "create a barchart showing average gross across genres"
# Execute the query
output = nl4dv_instance.analyze_query(query)
Use the below Sample Code to get started with Conversational Interaction. A more detailed sample file can be found here.
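A minimal sketch of a follow-up (multi-turn) interaction, based on the analyze_query() parameters documented in the API Reference below; the follow-up query and the dialog flag usage here are illustrative:

from nl4dv import NL4DV
import os

nl4dv_instance = NL4DV(data_url=os.path.join(".", "examples", "assets", "data", "movies-w-year.csv"))
nl4dv_instance.set_dependency_parser(config={"name": "spacy", "model": "en_core_web_sm", "parser": None})

# First (standalone) query starts a new dialog.
output = nl4dv_instance.analyze_query("create a barchart showing average gross across genres")

# Follow-up query: dialog=True asks NL4DV to interpret it in the context of
# the previous query (see the "followUpQuery" and "contextObj" fields in the
# output structure above).
followup_output = nl4dv_instance.analyze_query("add rating to it", dialog=True)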
NL4DV is containerized into a Docker image. The image comes pre-installed with NL4DV, Spacy, Stanford CoreNLP, and a few datasets, along with a web application as a demo. Install it using:
docker pull arpitnarechania/nl4dv
Note: This mode of installation does not require the Post-Installation steps. For more information, follow the detailed instructions in the GitHub repository (nl4dv-docker).
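A typical invocation to run the bundled demo might look like the below; the published port is an assumption here (the demo web application in this document serves on port 7001), so refer to the nl4dv-docker repository for the authoritative command:

docker run -it -p 7001:7001 arpitnarechania/nl4dv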
Follow these steps to run the example applications:
Download or clone the repository using:

git clone https://github.com/nl4dv/nl4dv.git
cd into the examples directory and create a new virtual environment:

virtualenv --python=python3 venv

Activate it using:

source venv/bin/activate (MacOSX/Linux)
venv\Scripts\activate.bat (Windows)
Install dependencies.
python -m pip install -r requirements.txt
Manually install nl4dv in this virtual environment using one of the above instructions.
Run python app.py.
Open your favorite browser and go to http://localhost:7001 to see the demo application.
cd into the examples directory.
Install and enable the Vega extension in the notebook using:

jupyter nbextension install --sys-prefix --py vega
jupyter nbextension enable vega --py --sys-prefix
Launch the notebook using jupyter notebook.
Make sure your Jupyter notebook uses a (virtual) environment that has NL4DV installed. Go to examples/applications/notebook and launch Single-Turn-Conversational-Interaction.ipynb to run the demo that showcases NL4DV's single-turn (standalone) conversational capabilities, or Multi-Turn-Conversational-Interaction.ipynb to view NL4DV's follow-up capabilities.
NL4DV exposes a simple, intuitive API for developers to consume. The below methods can be called on the nl4dv_instance object that is created after initializing NL4DV, e.g., nl4dv_instance = NL4DV().
Method | Params | Description |
---|---|---|
NL4DV(*) | – | NL4DV constructor. Its parameters can also be set via the separate function calls below. Returns: nl4dv_instance. |
analyze_query(query=None, debug=False, verbose=False, dialog=None, dialog_id=None, query_id=None) | – | Analyzes the input query. Returns: a JSON specification of attributes, tasks, and visualizations. Note: If the dataset was input via the "data_value" parameter, then to minimize the storage footprint of this output JSON, the Vega-Lite spec (vlSpec) does NOT include the dataset values (under the "data" > "values" property); the developer is expected to supply these to render the visualization. However, if the dataset was input via the "data_url" parameter, the vlSpec will have this data configuration by default (under the "data" > "url" property). |
update_query(ambiguity_obj=None) | – | Resolves attribute-level and value-level ambiguities by setting the correct entities for the corresponding keywords (phrases) in the query. |
get_dialogs(dialog_id=None, query_id=None) | – | Gets a specific dialog (if dialog_id is provided), a specific query in a dialog (if both dialog_id and query_id are provided), or all dialogs (if neither is provided). Returns the requested entities as JSON specifications. |
delete_dialogs(dialog_id=None, query_id=None) | – | Deletes a specific dialog (if dialog_id is provided), a specific query in a dialog (if both dialog_id and query_id are provided), or all dialogs (if neither is provided), practically resetting the corresponding NL4DV instance. Returns the deleted entities as JSON specifications. |
undo() | – | Deletes the most recently processed query. Returns the deleted entity as a JSON specification. |
render_vis(query=None) | – | Processes the input query and returns a VegaLite() object of the best, most relevant visualization. Returns: VegaLite(). Note: The "data_value" / "data_url" caveat described for analyze_query() applies here too; because this API directly outputs a VegaLite() object, it might be hard to supply the data, so please use analyze_query() instead. |
set_data(data_url=None, data_value=None) | data_url or data_value | Sets the dataset to query against. |
set_alias_map(alias_url=None, alias_value=None) | alias_url or alias_value | Sets the alias values. |
set_thresholds(thresholds=None) | – | Overrides the default thresholds, e.g., for string matching. |
set_explicit_followup_keywords(explicit_followup_keyword_map=None) | – | Overrides the default explicit_followup_keywords map. Format: each key is a keyword string and each value is a list containing exactly one 2-tuple, whose first element is the noun form of the follow-up operation (it MUST be one of: addition, removal, replacement) and whose second element is the verb form (add, remove, replace). See the sketch after this table. |
set_implicit_followup_keywords(implicit_followup_keyword_map=None) | – | Overrides the default implicit_followup_keyword_map. Format: each key is a keyword string and each value is a list containing exactly one 2-tuple, whose first element is the concatenated version of the token with no spaces and whose second element is the verb form of the follow-up operation (add, remove, replace). |
set_importance_scores(scores=None) | – | Sets the scoring weights that govern how attributes, tasks, and visualizations are detected. |
set_attribute_datatype(attr_type_obj=None) | – | Overrides the attribute datatypes detected by NL4DV. |
set_dependency_parser(config=None) | – | Sets the dependency parser used in the task detection module. |
set_reserve_words(reserve_words=None) | ["A"] | Sets custom STOPWORDS that should NOT be removed from the query because they carry domain meaning, e.g., "A", although an article (like 'a/an/the'), should be retained in a grades dataset. |
set_ignore_words(ignore_words=None) | ["movie"] | Sets the words that should be IGNORED in the query, i.e., NOT lead to the detection of attributes and tasks. |
set_label_attribute(label_attribute=None) | ["Model"] | Sets the label attribute(s), i.e., attributes that "describe" a data point, so that, e.g., "Correlate horsepower and MPG for sports car models" does NOT map "models" to an explicit attribute when two explicit attributes are already present. |
get_metadata() | – | Gets the metadata object after processing the dataset. |
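As a concrete example, here is a typical setup-then-query flow using the methods above; the dataset URL, parser config, and ignore words are taken from earlier examples in this document:

from nl4dv import NL4DV

nl4dv_instance = NL4DV()
nl4dv_instance.set_data(data_url="https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/movies-w-year.csv")
nl4dv_instance.set_dependency_parser(config={"name": "spacy", "model": "en_core_web_sm", "parser": None})
nl4dv_instance.set_ignore_words(ignore_words=["movie"])  # "movie" should not trigger attribute/task detection
output = nl4dv_instance.analyze_query("create a barchart showing average gross across genres")

And here is a sketch of the map format described for set_explicit_followup_keywords(); the keyword strings are illustrative assumptions:

explicit_followup_keyword_map = {
    # key: keyword string; value: a list with exactly one 2-tuple of
    # (noun form, verb form) of the follow-up operation.
    "add": [("addition", "add")],
    "remove": [("removal", "remove")],
    "swap": [("replacement", "replace")],
}
nl4dv_instance.set_explicit_followup_keywords(explicit_followup_keyword_map=explicit_followup_keyword_map)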
NL4DV can be installed as a Python package and imported in your own awesome applications!
NL4DV is written in Python 3. Please ensure you have a Python 3 environment already installed.
Clone this repository (master branch) and enter (`cd`) into it.
Create a new virtual environment:

virtualenv --python=python3 venv

Activate it using:

source venv/bin/activate (MacOSX/Linux)
venv\Scripts\activate.bat (Windows)
Install dependencies.
python -m pip install -r requirements.txt
Bump up the version in setup.py and create a Python distributable.
python setup.py sdist
This will create a new file, nl4dv-*.*.*.tar.gz, inside the dist directory.
Install the above file in your Python environment using:
python -m pip install <PATH-TO-nl4dv-*.*.*.tar.gz>
Verify by opening your Python console and importing it:

$ python
>>> from nl4dv import NL4DV
NL4DV was created by Arpit Narechania, Arjun Srinivasan, Rishab Mitra, Alex Endert, John Stasko of the Georgia Tech Visualization Lab, along with Subham Sah and Wenwen Dou of the UNC Charlotte Visualization Center.
We thank the members of the Georgia Tech Visualization Lab for their support and constructive feedback.
If you use NL4DV, please cite the relevant paper(s):

@misc{sah2024generatinganalyticspecificationsdata,
title = {Generating Analytic Specifications for Data Visualization from Natural Language Queries using Large Language Models},
author = {{Sah}, Subham and {Mitra}, Rishab and {Narechania}, Arpit and {Endert}, Alex and {Stasko}, John and {Dou}, Wenwen},
year = {2024},
eprint = {2408.13391},
archivePrefix = {arXiv},
primaryClass = {cs.HC},
url = {https://arxiv.org/abs/2408.13391},
howpublished = {Presented at the NLVIZ Workshop, IEEE VIS 2024},
}
@inproceedings{mitra2022conversationalinteraction,
title = {Facilitating Conversational Interaction in Natural Language Interfaces for Visualization},
author = {{Mitra}, Rishab and {Narechania}, Arpit and {Endert}, Alex and {Stasko}, John},
booktitle={2022 IEEE Visualization Conference (VIS)},
url = {https://doi.org/10.48550/arXiv.2207.00189},
doi = {10.48550/arXiv.2207.00189},
year = {2022},
publisher = {IEEE}
}
@article{narechania2021nl4dv,
title = {{NL4DV}: A {Toolkit} for Generating {Analytic Specifications} for {Data Visualization} from {Natural Language} Queries},
shorttitle = {{NL4DV}},
author = {{Narechania}, Arpit and {Srinivasan}, Arjun and {Stasko}, John},
journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)},
doi = {10.1109/TVCG.2020.3030378},
year = {2021},
publisher = {IEEE}
}
If you have any questions, feel free to open a GitHub issue or contact Arpit Narechania.
The software is available under the MIT License.