Documentation

Setup

Installation

  1. NL4DV requires a 64-bit Python 3 environment. Windows users must ensure that the Microsoft C++ Build Tools are installed; macOS users must ensure that Xcode is installed.

  2. Install using one of the below methods:

    1. PyPI.

      To install, run

      pip install nl4dv==3.0.0

    2. A local distributable. Download

      nl4dv-3.0.0.tar.gz   or   nl4dv-3.0.0.zip

      Accordingly, run pip install nl4dv-3.0.0.tar.gz or pip install nl4dv-3.0.0.zip.

    Note: We recommend installing NL4DV in a virtual environment as it avoids version conflicts with globally installed packages.
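
    For example, on macOS/Linux (mirroring the commands used later in this documentation; adapt the activation step for Windows):

      virtualenv --python=python3 venv
      source venv/bin/activate
      pip install nl4dv==3.0.0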

Post Installation


Instructions for LLM-based Mode (v3)

NL4DV requires an OpenAI API key for its LLM-based mode. Please refer to OpenAI's guide to generate an API key.
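
To avoid hard-coding the key in scripts, you can instead read it from an environment variable (the variable name below is just an illustration):

  import os

  # Read the OpenAI API key from an environment variable (hypothetical name).
  gpt_api_key = os.environ.get("OPENAI_API_KEY")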

Instructions for Semantic Parsing Mode (v1 + v2)

  1. NL4DV installs nltk by default, but a few nltk datasets/models/corpora must be installed separately. Download the popular nltk artifacts using:

    python -m nltk.downloader popular
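
    Alternatively, the same artifacts can be downloaded from within Python:

      import nltk

      # Downloads the same "popular" collection of datasets/models/corpora.
      nltk.download("popular")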

  2. NL4DV requires a third-party Dependency Parser module to infer tasks. Download and install one of:

    1. Stanford CoreNLP (recommended):

      • Download the English model of Stanford CoreNLP version 3.9.2 and copy it to `examples/assets/jars/` or a known location.

      • Download the Stanford Parser version 3.9.2 and after unzipping the folder, copy the `stanford-parser.jar` file to `examples/assets/jars/` or a known location.

        Note: This requires Java to be installed and the JAVA_HOME / JAVAHOME environment variables to be set.

    2. Stanford CoreNLPServer:

      • Download the Stanford CoreNLPServer, unzip it in a known location, and cd into it.

      • Start the server using the below command. It will run on http://localhost:9000.

        java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000

        Note: This requires Java to be installed and the JAVA_HOME / JAVAHOME environment variables to be set.

    3. spaCy:

      • Install an English spaCy language model in your environment, e.g., python -m spacy download en_core_web_sm (the model used in the sample code below).

Sample Code for LLM-based Mode

If everything went well in the Installation and Post Installation steps above, you are all set.
Use the below Sample Code to get started. A more detailed sample file can be found here.

from nl4dv import NL4DV

# Your dataset must be hosted on GitHub for the LLM-based mode to function.
data_url = "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/movies-w-year.csv"  # paste your data URL

# Choose the processing mode: "gpt" for the LLM-based mode or "semantic-parsing" for the rule-based mode.
processing_mode = "gpt"

# Enter your OpenAI API key
gpt_api_key = "[OpenAI KEY HERE]"

# Initialize an instance of NL4DV
nl4dv_instance = NL4DV(data_url=data_url, processing_mode=processing_mode, gpt_api_key=gpt_api_key)

# Define a query
query = "create a barchart showing average gross across genres"

# Execute the query
output = nl4dv_instance.analyze_query(query)

# Print the output
print(output)

Output

{
  "query": "create a barchart showing average gross across genres",
  "dataset": "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/cars-w-year.csv",
  "attributeMap": {"..."},
  "taskMap": {"..."},
  "visList": ["..."],
  "followUpQuery": false,
  "contextObj": null
}

"attributeMap" ▸

{
    "Worldwide Gross": {
        "name": "Worldwide Gross",
        "queryPhrase": ["gross"],
        "inferenceType": "explicit",
        "isAmbiguous": false,
        "ambiguity": []
    },
    "Genre": {
        "name": "Genre",
        "queryPhrase": ["genres"],
        "inferenceType": "explicit",
        "isAmbiguous": false,
        "ambiguity": []
    }
}

"taskMap" ▸

{
    "derived_value": [
        {
            "task": "derived_value",
            "queryPhrase": "average",
            "operator": "AVG",
            "values": [],
            "attributes": [
                "Worldwide Gross"
            ],
            "inferenceType": "explicit"
        }
    ]
}

"visList"▸

[
    {
        "attributes": [
            "Worldwide Gross",
            "Genre"
        ],
        "queryPhrase": "barchart",
        "visType": "barchart",
        "tasks": [
            "derived_value"
        ],
        "inferenceType": "explicit",
        "vlSpec": {
            "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
            "mark": {
                "type": "bar",
                "tooltip": true
            },
            "encoding": {
                "y": {
                    "field": "Worldwide Gross",
                    "type": "quantitative",
                    "aggregate": "mean",
                    "axis": {
                        "format": "s"
                    }
                },
                "x": {
                    "field": "Genre",
                    "type": "nominal",
                    "aggregate": null
                }
            },
            "transform": [],
            "data": {
                "url": "https://raw.githubusercontent.com/nl4dv/nl4dv/master/examples/assets/data/cars-w-year.csv",
                "format": {
                    "type": "csv"
                }
            }
        }
    }
]

Visualization (when the Vega-Lite spec is rendered) ▸

[Rendered example: a bar chart showing the average Worldwide Gross for each Genre]

Sample Code for Semantic Parsing Mode (Single-Turn Conversational Interaction)

If everything went well in the Installation and Post Installation steps above, you are all set.
Use the below Sample Code to get started. A more detailed sample file can be found here.

from nl4dv import NL4DV
import os

# Initialize an instance of NL4DV
# ToDo: verify the path to the source data file; modify accordingly.
nl4dv_instance = NL4DV(data_url=os.path.join(".", "examples", "assets", "data", "movies-w-year.csv"))

# using Stanford CoreNLP
# ToDo: verify the paths to the jars; modify accordingly.
dependency_parser_config = {
    "name": "corenlp",
    "model": os.path.join(".", "examples", "assets", "jars", "stanford-english-corenlp-2018-10-05-models.jar"),
    "parser": os.path.join(".", "examples", "assets", "jars", "stanford-parser.jar")
}

# using Stanford CoreNLPServer
# ToDo: verify the URL to the CoreNLPServer. modify accordingly.
# dependency_parser_config = {"name": "corenlp-server", "url": "http://localhost:9000"}

# using Spacy
# ToDo: ensure that the below spacy model is installed. if using another model, modify accordingly.
# dependency_parser_config = {"name": "spacy", "model": "en_core_web_sm", "parser": None}

# Set the Dependency Parser
nl4dv_instance.set_dependency_parser(config=dependency_parser_config)

# Define a query
query = "create a barchart showing average gross across genres"

# Execute the query
output = nl4dv_instance.analyze_query(query)

Sample Code (Multi-Turn Conversational Interaction)

Use the below Sample Code to get started with Conversational Interaction. A more detailed sample file can be found here.

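A minimal multi-turn sketch (the follow-up query and parameter usage are illustrative; see analyze_query() in the API Reference below):

from nl4dv import NL4DV
import os

# Initialize NL4DV and configure a dependency parser, as in the single-turn example above.
nl4dv_instance = NL4DV(data_url=os.path.join(".", "examples", "assets", "data", "movies-w-year.csv"))
nl4dv_instance.set_dependency_parser(config={"name": "spacy", "model": "en_core_web_sm", "parser": None})

# First query: a standalone request that starts the dialog.
output = nl4dv_instance.analyze_query("create a barchart showing average gross across genres", dialog=False)

# Follow-up query: with dialog="auto", NL4DV infers from the query itself whether
# it is a follow-up (reflected in "followUpQuery" and "contextObj" in the output).
output = nl4dv_instance.analyze_query("add rotten tomatoes rating", dialog="auto")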

Docker

NL4DV is containerized as a Docker image. The image comes pre-installed with NL4DV, spaCy, Stanford CoreNLP, and a few datasets, along with a demo web application. Pull it using:

docker pull arpitnarechania/nl4dv

Note: This mode of installation does not require the Post Installation steps. For more information, follow the detailed instructions in the GitHub repository (nl4dv-docker).

Message from Creators

As we plan new features and improvements to the toolkit, we recommend that users and developers be aware of the following:

  • Dependency parser output variations.
    The dependency trees returned by CoreNLP, CoreNLP Server, and spaCy sometimes differ. The current parsing logic was developed for CoreNLP and hence works best with it; however, we are upgrading the rules to work consistently across all dependency parsers.

  • Attribute data types.

    Verify the attribute types (e.g., nominal, temporal) that NL4DV detects and override any that are incorrect, as wrong types will most likely lead to erroneous visualizations (see the set_attribute_datatype() snippet after the date-format list below). The current attribute-datatype detection logic is based on heuristics; we are working towards a major improvement that semantically infers the data type from both the attribute's name and its values.

    For Temporal attributes, NL4DV relies on regular expressions to detect common date formats, listed below in priority order (in case of conflicts). Format codes follow the 1989 C standard; each format is followed by examples.

    %m*%d*%Y or %m*%d*%y where * ∈ {. - /}

    • 12.24.2019
    • 12/24/2019
    • 1-24-19
    • 09.24.20

    %Y*%m*%d or %y*%m*%d where * ∈ {. - /}

    • 2019.12.24
    • 2019/12/24
    • 19-1-24
    • 20.09.24

    %d*%m*%Y or %d*%m*%y where * ∈ {. - /}

    • 24.12.2019
    • 24/12/2019
    • 24-1-19
    • 24.09.20

    %d*%b*%Y or %d*%B*%Y or %d*%b*%y or %d*%B*%y where * ∈ {. - / space}

    • 8-January-2019
    • 31 Dec 19
    • 1/Jan/19

    %d*%b or %d*%B where * ∈ {. - / space}

    • 8-January
    • 31 Dec
    • 1/Jan

    %b*%d*%Y or %B*%d*%Y or %b*%d*%y or %B*%d*%y where * ∈ {. - / space}

    • January-8-2019
    • Dec 31 19
    • Jan/1/19

    %Y

    Only the following series:
    • 18XX (e.g., 1801)
    • 19XX (e.g., 1929)
    • 20XX (e.g., 2010)
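
    For example, misdetected types can be overridden via set_attribute_datatype() (the attribute names here are illustrative):

      # Force "Year" to be Temporal ("T") and "Rank" to be Ordinal ("O").
      nl4dv_instance.set_attribute_datatype({"Year": "T", "Rank": "O"})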

  • Filter task.
    NL4DV applies the filter task by matching the condition against each data point, but it does not encode the involved attributes in the visualization (see the fragment below). This was a design decision to avoid recommending an overly complex visualization with too many encoded attributes.
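
    For instance, for a query like "show average gross across genres for action movies", the filter would appear inside the output vlSpec's "transform" array rather than as an extra encoding. A hypothetical fragment (not verbatim NL4DV output):

      "transform": [
          {"filter": {"field": "Genre", "oneOf": ["Action"]}}
      ]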

  • Thresholds and Match scores.
    These are currently set based on heuristics and prior research; we encourage users/developers to modify them to suit their specific requirements (see the snippet below).
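
    For example, the defaults can be overridden via set_thresholds() (the keys are from the API Reference; the values here are illustrative):

      # Slightly relax synonym and string-similarity matching.
      nl4dv_instance.set_thresholds({"synonymity": 90, "string_similarity": 80})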

Applications

Follow these steps to run the example applications:

  • Download or Clone the repository using

    git clone https://github.com/nl4dv/nl4dv.git

  • cd into the examples directory and create a new virtual environment.

    virtualenv --python=python3 venv

  • Activate it using:

    source venv/bin/activate (MacOSX/ Linux)

    venv\Scripts\activate.bat (Windows)

  • Install dependencies.

    python -m pip install -r requirements.txt

  • Manually install nl4dv in this virtual environment using one of the above instructions.

  • Run python app.py.

  • Open your favorite browser and go to http://localhost:7001 to see the demo application.


Showcase

For the Jupyter Notebook application,

  • cd into the examples directory.

  • Install and enable the Vega extension in the notebook using

    • jupyter nbextension install --sys-prefix --py vega

    • jupyter nbextension enable vega --py --sys-prefix

  • Launch the notebook using jupyter notebook.

    Make sure your Jupyter notebook uses a (virtual) environment that has NL4DV installed. Go to examples/applications/notebook and launch Single-Turn-Conversational-Interaction.ipynb for a demo of NL4DV's single-turn (standalone) conversational capabilities, or Multi-Turn-Conversational-Interaction.ipynb for NL4DV's follow-up capabilities.

API Reference

NL4DV exposes a simple, intuitive API for developers to consume. The methods below can be called on the nl4dv_instance object created by initializing NL4DV, e.g., nl4dv_instance = NL4DV().

NL4DV(*)

  Params:
    • data_url (str)
    • data_value (list|dict|pandas.DataFrame)
    • alias_url (str)
    • alias_value (dict)
    • label_attribute (str)
    • ignore_words (list)
    • reserve_words (list)
    • dependency_parser_config (dict)
    • thresholds (dict)
    • importance_scores (dict)
    • attribute_datatype (dict)
    • debug (bool)
    • explicit_followup_keywords (dict)
    • implicit_followup_keywords (dict)
    • processing_mode (str='gpt' or str='semantic-parsing', optional)
    • gpt_api_key (str, optional)

  Description: The NL4DV constructor. Each of these parameters can also be set via a separate function call; see below.

  Returns: nl4dv_instance.

analyze_query(query=None, debug=False, verbose=False, dialog=None, dialog_id=None, query_id=None)

  Params:
    • query (str)
    • debug (bool)
    • verbose (bool)
    • dialog (bool or str='auto', optional)
    • dialog_id (str, optional)
    • query_id (str, optional)

  Description: Analyzes the input query.

  Returns: a JSON specification of attributes, tasks, and visualizations.

  Note: If the dataset was input via the "data_value" parameter, then to minimize the storage footprint of the output JSON, the Vega-Lite spec (vlSpec) does NOT include the dataset values (under the "data" > "values" property); the developer is expected to supply these to render the visualization. However, if the dataset was input via the "data_url" parameter, the vlSpec includes this data configuration by default (under the "data" > "url" property).
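
  For example, if the dataset was supplied via data_value, you might inject the values back into the spec before rendering (a sketch; "output" is the result of analyze_query() and "data" is the same list of records that was passed as data_value):

    # Pick the top-ranked visualization and supply inline data for rendering.
    vl_spec = output["visList"][0]["vlSpec"]
    vl_spec["data"] = {"values": data}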

update_query(ambiguity_obj=None)

  Params:
    • ambiguity_obj (dict)

      Example:
      {
        "dialog_id": "0", "query_id": "0",
        "attribute": {"medals": "Gold Medals"},
        "value": {"hockey": "Ice Hockey", "skating": "Speed Skating"}
      }

  Description: Resolves attribute-level and value-level ambiguities by mapping the ambiguous keywords (phrases) in the query to the correct entities.

get_dialogs(dialog_id=None, query_id=None)

  Params:
    • dialog_id (str)
    • query_id (str)

  Description: Gets a specific dialog (if dialog_id is provided), a specific query in a dialog (if both dialog_id and query_id are provided), or all dialogs (if neither is provided). Returns the requested entities as JSON specifications.

delete_dialogs(dialog_id=None, query_id=None)

  Params:
    • dialog_id (str)
    • query_id (str)

  Description: Deletes a specific dialog (if dialog_id is provided), a specific query in a dialog (if both dialog_id and query_id are provided), or all dialogs (if neither is provided), practically resetting the corresponding NL4DV instance. Returns the deleted entities as JSON specifications.

undo()

  Description: Deletes the most recently processed query. Returns the deleted entity as a JSON specification.

render_vis(query=None)

  Params:
    • query (str)

      Example: "visualize mpg"

  Description: Processes the input query and returns a VegaLite() object of the best, most relevant visualization.

  Returns: VegaLite()

  Note: If the dataset was input via the "data_value" parameter, then to minimize the storage footprint, the Vega-Lite spec does NOT include the dataset values (under the "data" > "values" property); the developer is expected to supply these to render the visualization. If the dataset was input via the "data_url" parameter, the vlSpec includes this data configuration by default (under the "data" > "url" property). Because this API directly outputs a VegaLite() object, it can be hard to supply the data; prefer analyze_query() in that case.
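
  For instance, in a Jupyter notebook (a sketch; the object renders inline when it is the cell's last expression):

    # Returns a VegaLite() object of the best, most relevant visualization.
    vis = nl4dv_instance.render_vis(query="visualize mpg")
    vis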

set_data(data_url=None, data_value=None)

  Params:
    • data_url (str: path to local file or URL)

      Example: "euro.csv" (sample) | "euro.tsv" (sample) | "euro.json" (sample)

    • data_value (list|dict|pandas.DataFrame)

      Example:
        - list: [{"acceleration": 19, "salary": 1000}, {"acceleration": 21, "salary": 1320}]
        - dict: {"acceleration": [19, 21], "salary": [1000, 1320]}
        - DataFrame: a pandas.DataFrame() instance

  Description: Sets the dataset to query against.

set_alias_map(alias_url=None, alias_value=None)

  Params:
    • alias_url (str: path to local file or URL)

      Example: "aliases/euro.json" (sample)

    • alias_value (dict)

      Example: "aliases/cars.json" where the JSON looks like
      {"MPG": ["miles per gallon"], "Horsepower": ["hp"]}

  Description: Sets the attribute aliases.

set_thresholds(thresholds=None)

  Params:
    • thresholds (dict)

      Example: {"synonymity": 95, "string_similarity": 85}

  Description: Overrides the default thresholds, such as those for string matching.

set_explicit_followup_keywords(explicit_followup_keyword_map=None)

  Params:
    • explicit_followup_keyword_map (dict)

      Example: {"put": [("addition", "add")], "add": [("addition", "add")]}

  Description: Overrides the default explicit_followup_keywords map. The key must be the keyword string, and the value must be a list containing exactly one 2-tuple. The first element of the 2-tuple is the noun form of the follow-up operation (it MUST be one of: addition, removal, replacement); the second element is the verb form (add, remove, replace).

set_implicit_followup_keywords(implicit_followup_keyword_map=None)

  Params:
    • implicit_followup_keyword_map (dict)

      Example: {"also": [("also", "add")], "as well": [("aswell", "add")]}

  Description: Overrides the default implicit_followup_keywords map. The key must be the keyword string, and the value must be a list containing exactly one 2-tuple. The first element of the 2-tuple is the token concatenated without spaces; the second element is the verb form of the follow-up operation (add, remove, replace).

set_importance_scores(scores=None)

  Params:
    • scores (dict)

      Example:
      {"attribute": {
          "attribute_exact_match": 1,
          "attribute_similarity_match": 0.9,
          "attribute_alias_exact_match": 0.8,
          "attribute_alias_similarity_match": 0.75,
          "attribute_synonym_match": 0.5,
          "attribute_domain_value_match": 0.5,
      },
      "task": {
          "explicit": 1,
          "implicit": 0.5,
      },
      "vis": {
          "explicit": 1
      }}

  Description: Sets the scoring weights used when detecting attributes, tasks, and visualizations.

set_attribute_datatype(attr_type_obj=None)

  Params:
    • attr_type_obj (dict)

      Example:
      {"Year": "T",    # Temporal
       "Rank": "O",    # Ordinal
       "Salary": "Q",  # Quantitative
       "Gender": "N"}  # Nominal

  Description: Overrides the attribute datatypes detected by NL4DV.

set_dependency_parser(config=None)

  Params:
    • config (dict)

      Example: {"name": "corenlp-server", "url": "http://192.168.99.102:9000"}

  Description: Sets the dependency parser used by the task detection module.

set_reserve_words(reserve_words=None)

  Params:
    • reserve_words (list)

      Example: ["A"]  # "A", although an article (like "a/an/the"), should be retained in a grades dataset.

  Description: Sets custom STOPWORDS that should NOT be removed from the query, as they might be present in the domain.

set_ignore_words(ignore_words=None)

  Params:
    • ignore_words (list)

      Example: ["movie"]

  Description: Sets the words that should be IGNORED in the query, i.e., NOT lead to the detection of attributes and tasks.

set_label_attribute(label_attribute=None)

  Params:
    • label_attribute (str)

      Example: "Model"  # "Correlate horsepower and MPG for sports car models" should NOT detect an explicit attribute for "models" since two explicit attributes are already present.

  Description: Sets the label attribute, i.e., the attribute that identifies individual data points (e.g., car models), so that references to it do not trigger spurious attribute detection.

get_metadata()

  Description: Gets the metadata object after processing the dataset.
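
Putting a few of these calls together (a sketch; the dataset path, parser choice, and query are illustrative):

from nl4dv import NL4DV

nl4dv_instance = NL4DV(data_url="euro.csv")
nl4dv_instance.set_dependency_parser(config={"name": "spacy", "model": "en_core_web_sm", "parser": None})
nl4dv_instance.set_ignore_words(ignore_words=["player"])
output = nl4dv_instance.analyze_query("show average salary across teams")

# If the output flags ambiguities (isAmbiguous=true), resolve them and update:
# nl4dv_instance.update_query(ambiguity_obj={...})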

Build

NL4DV can be installed as a Python package and imported in your own awesome applications!

  1. NL4DV is written in Python 3. Please ensure you have a Python 3 environment already installed.

  2. Clone this repository (master branch) and enter (`cd`) into it.

  3. Create a new virtual environment.

    virtualenv --python=python3 venv

  4. Activate it using:

    source venv/bin/activate (MacOSX/ Linux)

    venv\Scripts\activate.bat (Windows)

  5. Install dependencies.

    python -m pip install -r requirements.txt

  6. Make your changes.
  7. Bump up the version in setup.py and create a Python distributable.

    python setup.py sdist

  8. This creates a new distributable, nl4dv-*.*.*.tar.gz, inside the dist directory.

  9. Install the above file in your Python environment using:

    python -m pip install <PATH-TO-nl4dv-*.*.*.tar.gz>

  10. Verify by opening your Python console and importing it:

    $ python
    >>> from nl4dv import NL4DV

  11. Enjoy! NL4DV is now available for use as a Python package.

Credits

NL4DV was created by Arpit Narechania, Arjun Srinivasan, Rishab Mitra, Alex Endert, John Stasko of the Georgia Tech Visualization Lab, along with Subham Sah and Wenwen Dou of the UNC Charlotte Visualization Center.

We thank the members of the Georgia Tech Visualization Lab for their support and constructive feedback.

Citations

2024 IEEE VIS NLVIZ Workshop Track

@misc{sah2024generatinganalyticspecificationsdata,
    title={Generating Analytic Specifications for Data Visualization from Natural Language Queries using Large Language Models},
    author={{Sah}, Subham and {Mitra}, Rishab and {Narechania}, Arpit and {Endert}, Alex and {Stasko}, John and {Dou}, Wenwen},
    year={2024},
    eprint={2408.13391},
    archivePrefix={arXiv},
    primaryClass={cs.HC},
    url={https://arxiv.org/abs/2408.13391},
    howpublished = {Presented at NLVIZ Workshop, IEEE VIS 2024},
    }

2022 IEEE VIS Conference Short Paper Track

@inproceedings{mitra2022conversationalinteraction,
  title = {Facilitating Conversational Interaction in Natural Language Interfaces for Visualization},
  author = {{Mitra}, Rishab and {Narechania}, Arpit and {Endert}, Alex and {Stasko}, John},
  booktitle={2022 IEEE Visualization Conference (VIS)},
  url = {https://doi.org/10.48550/arXiv.2207.00189},
  doi = {10.48550/arXiv.2207.00189},
  year = {2022},
  publisher = {IEEE}
}

2021 IEEE TVCG Journal Full Paper (Proceedings of the 2020 IEEE VIS Conference)

@article{narechania2021nl4dv,
title = {{NL4DV}: A {Toolkit} for Generating {Analytic Specifications} for {Data Visualization} from {Natural Language} Queries},
shorttitle = {{NL4DV}},
author = {{Narechania}, Arpit and {Srinivasan}, Arjun and {Stasko}, John},
journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)},
doi = {10.1109/TVCG.2020.3030378},
year = {2021},
publisher = {IEEE}
}

Contact Us

If you have any questions, feel free to open a GitHub issue or contact Arpit Narechania.

License

The software is available under the MIT License.