Demo¶
Setup¶
- CRISP-T is a Python package that can be installed using pip and used from the command line. Your system should have Python 3.11 or higher and pip installed. You can download and install Python for your operating system here. Optionally, CRISP-T can be imported in Python scripts or Jupyter notebooks, but this is not covered in this demo. See the documentation for more details.
- Install CRISP-T with `pip install crisp-t[ml]` or `uv pip install crisp-t[ml]`.
- (Optional) Download the COVID narratives data to a `crisp_source` folder in your home directory or the current directory using `crisp --covid covidstories.omeka.net --source crisp_source`. You may use any other source of textual data (e.g. journal articles, interview transcripts) in .txt or .pdf format in the `crisp_source` folder or the folder you specify with the --source option.
- (Optional) Download the Psychological Effects of COVID dataset to the `crisp_source` folder. You may use any other numeric dataset in .csv format in the `crisp_source` folder or the folder you specify with the --source option.
- Create a `crisp_input` folder in your home directory or the current directory for keeping imported data for analysis.
Import data¶
- Run the following command to import data from the `crisp_source` folder to the `crisp_input` folder. --source reads data from a directory (.txt and .pdf files and a single .csv) or from a URL.
crisp --source crisp_source --out crisp_input
- Ignore warnings related to pdf files.
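Conceptually, the import step walks the source folder and collects text into a corpus. A minimal stdlib-only sketch of the .txt side of this (the function name and corpus layout are illustrative, not CRISP-T's internals):

```python
import tempfile
from pathlib import Path

def load_txt_corpus(source_dir):
    """Collect every .txt file in source_dir into a list of documents."""
    docs = []
    for path in sorted(Path(source_dir).glob("*.txt")):
        docs.append({"id": path.stem, "text": path.read_text(encoding="utf-8")})
    return docs

# Build a throwaway source folder with two narratives
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "story1.txt").write_text("Lockdown changed my routine.")
    (Path(d) / "story2.txt").write_text("Masks became part of daily life.")
    corpus = load_txt_corpus(d)

print(len(corpus))  # → 2
```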
Perform Exploratory tasks using NLP¶
- Run the following command to perform topic modelling and assign topics (keywords) to each narrative.
- --inp crisp_input below is optional, as it defaults to the `crisp_input` folder.
crisp --inp crisp_input --assign --out crisp_input
- The results will be saved in the same `crisp_input` folder, overwriting the corpus file.
- You may run several other analyses (see documentation for details) and tweak parameters as needed.
- Hints will be provided in the terminal.
From now on, we will use the `crisp_input` folder as the input folder unless specified otherwise, as that is the default.
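Topic/keyword assignment can be pictured as picking each narrative's most frequent informative terms. A toy term-frequency sketch (not CRISP-T's actual model, which does proper topic modelling):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "to", "in", "was", "is", "my"}

def assign_keywords(text, k=3):
    """Assign the k most frequent non-stopword tokens as keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(k)]

doc = ("During lockdown I wore a mask every day. The mask mandate changed "
       "shopping, and shopping online became normal during lockdown.")
keywords = assign_keywords(doc)
print(keywords)
```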
Explore results¶
crisp --print "documents 10"
- Notice that we have omitted --inp, as it defaults to the `crisp_input` folder. If you want to use a different folder, use --inp to specify it. The --out option helps to save intermediate results in a different folder.
- The above command prints the first 10 documents in the corpus.
- Next, let us see the metadata assigned to each document.
crisp --print "documents metadata"
- Notice keywords/topics assigned to each narrative.
- You will notice interviewee and interviewer keywords. These are assigned based on the presence of these words in the narratives and may not be useful.
- You may remove these keywords by using --ignore with --assign and checking the results again.
crisp --clear --assign --ignore interviewee,interviewer --out crisp_input
crisp --print "documents metadata"
- The --clear option clears the cache before running the analysis.
⚠️ While analysing multiple datasets, use the `crisp --clear` option to clear the cache before switching datasets. ⚠️
- Now you will see that these keywords are removed from the results.
- It prints the first 5 documents by default.
crisp --print "metadata clusters"
- Prints the clusters assigned to each document based on keywords.
- There are many other options to explore the results. See documentation for details.
- Let us choose narratives that contain 'mask' keyword and show the concepts/topics in these narratives.
crisp --inp crisp_input --clear --filters keywords=mask --topics
- The above results will not be saved as --out is not specified.
- Notice time, people as topics in this subset of narratives.
- If --filters is used, only the filtered documents are used for the analysis. When using filters you should explicitly specify --inp and --out options with different folders to avoid overwriting the input data.
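Filtering by keyword metadata, as --filters keywords=mask does, amounts to a simple membership test. An illustrative sketch with a hypothetical document layout:

```python
def filter_by_keyword(docs, keyword):
    """Keep only documents whose assigned keywords include `keyword`."""
    return [d for d in docs if keyword in d.get("keywords", [])]

docs = [
    {"id": 1, "keywords": ["mask", "lockdown"]},
    {"id": 2, "keywords": ["vaccine"]},
    {"id": 3, "keywords": ["mask", "work"]},
]
subset = filter_by_keyword(docs, "mask")
print([d["id"] for d in subset])  # → [1, 3]
```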
Quantitative exploratory analysis¶
- Let us do a k-means clustering of the COVID .csv dataset.
crisp --include relaxed,self_time,sleep_bal,time_dp,travel_time,home_env --kmeans
- Notice 3 clusters with different centroids (the number of clusters can be changed with the --num option).
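k-means itself is simple enough to sketch in a few lines; this toy version (seeded from the first k points, unlike production implementations) shows the assign-then-recompute loop behind --kmeans:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters

# Two well-separated groups of toy 2-D survey scores
points = [(1, 1), (8, 8), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```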
Confirmation¶
- Let us add a relationship between numb:self_time and text:work in the corpus for future confirmation with LLMs.
crispt --add-rel "text:work|numb:self_time|correlates"
- Let us do a regression analysis to see how `relaxed` is affected by other variables.
crisp --include relaxed,self_time,sleep_bal,time_dp,travel_time,home_env --regression --outcome relaxed
- self_time has a positive correlation with relaxed.
- What about a decision tree analysis?
crisp --include relaxed,self_time,sleep_bal,time_dp,travel_time,home_env --cls --outcome relaxed
- Relaxed is converted to binary variable internally for classification.
- Ideally, you should do the binary conversion externally based on domain knowledge.
- Notice that self_time is the most important variable in predicting relaxed.
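The internal binary conversion can be pictured as a median split; the sketch below (the `threshold` parameter is where domain knowledge would come in) is an illustration, not CRISP-T's exact rule:

```python
from statistics import median

def binarize(values, threshold=None):
    """Convert a numeric outcome to 0/1 by a cut-off (median by default).
    In practice the cut-off should come from domain knowledge."""
    cut = median(values) if threshold is None else threshold
    return [1 if v > cut else 0 for v in values]

relaxed = [2, 5, 3, 7, 6, 1]
print(binarize(relaxed))  # median is 4.0 → [0, 1, 0, 1, 1, 0]
```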
Topological Data Analysis (Rudkin, S., & Dlotko, P., 2024)¶
- Let us do a TDA analysis to see the shape of the data.
- Parameters to --tdabm are specified as follows: outcome:variables:radius
crispt --tdabm relaxed:self_time,sleep_bal,time_dp,travel_time:0.6 --out crisp_input
- Let us visualize the TDA network.
crispviz --tdabm --out viz_out/
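The Ball Mapper construction used in this style of TDA can be sketched as a greedy cover, assuming the standard algorithm: every point not yet inside a landmark's ball of the given radius becomes a new landmark, and landmarks whose balls share a point are joined by an edge. A toy 2-D version:

```python
import math

def ball_mapper(points, radius):
    """Greedy Ball Mapper cover: a point becomes a new landmark when no
    existing landmark ball (of the given radius) covers it; landmarks
    whose balls share a point are connected by an edge."""
    landmarks = []
    for p in points:
        if not any(math.dist(p, l) <= radius for l in landmarks):
            landmarks.append(p)
    balls = [{i for i, p in enumerate(points) if math.dist(p, l) <= radius}
             for l in landmarks]
    edges = [(a, b)
             for a in range(len(balls))
             for b in range(a + 1, len(balls))
             if balls[a] & balls[b]]
    return landmarks, balls, edges

# Three nearby points on a line plus one far away
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (3.0, 0.0)]
landmarks, balls, edges = ball_mapper(points, radius=0.6)
print(len(landmarks), edges)  # → 3 [(0, 1)]
```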
Sense-making by triangulation¶
Now let us try out a csv dataset with text and numeric data.¶
- Download the SMS Smishing Collection Data Set from Kaggle and convert the text file to csv, adding the headers id, CLASS and SMS. Convert CLASS to numeric (0 for ham, 1 for smish) and add id as serial numbers.
- Place the csv file in a new `crisp_source` folder.
- Import the csv file to the `crisp_input` folder using the following command.
crisp --source crisp_source/ --unstructured SMS
- Notice that the text column SMS is specified with --unstructured option. This creates CRISP documents from the text column.
- Now assign topics to the documents. Note that this also assigns clusters.
crisp --assign
- Now print the results to examine.
crisp --print "metadata clusters"
- Let us choose cluster 1 and see the SMS classes in this cluster (0 = ham, 1 = smish).
crisp --filters cluster=1 --print "dataframe stats"
- Next, let us check if the SMS texts converge towards predicting the CLASS (ham/smish) variable with an LSTM model.
crisp --lstm --outcome CLASS
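As a stand-in for the LSTM (which needs the ml extra and real training data), a bag-of-words perceptron on toy SMS-like messages shows the same idea of text converging towards the binary CLASS label. This is a deliberately different, simpler technique than what --lstm trains:

```python
# Toy training data standing in for the SMS corpus (1 = smish, 0 = ham)
train = [
    ("win a free prize now", 1),
    ("urgent claim your prize", 1),
    ("free entry win cash", 1),
    ("see you at lunch", 0),
    ("call me when you are home", 0),
    ("meeting moved to friday", 0),
]

def featurize(text):
    return set(text.split())

weights = {}
for _ in range(10):  # perceptron epochs
    for text, label in train:
        score = sum(weights.get(w, 0) for w in featurize(text))
        pred = 1 if score > 0 else 0
        if pred != label:  # nudge word weights toward the true label
            for w in featurize(text):
                weights[w] = weights.get(w, 0) + (1 if label == 1 else -1)

def predict(text):
    return 1 if sum(weights.get(w, 0) for w in featurize(text)) > 0 else 0

print(predict("claim your free prize"), predict("lunch on friday"))  # → 1 0
```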
MCP Server for agentic AI. (Optional, but LLMs may be better at sense-making!)¶
Try out the MCP server with the following command. (LLMs will offer course corrections and suggestions)¶
- load corpus from /Users/your-user-id/crisp_input
- use available tools
- What are the columns in df?
- Do a regression using time_bp,time_dp,travel_time,self_time with relaxed as outcome
- Interpret the results
- Do self_time or related concepts occur frequently in documents?
- can you ignore "interviewer,interviewee" and assign topics again? Yes.
- What are the topics in documents with keyword "work"?
Visualization¶
Let's visualize the clusters in 2D space using PCA.¶
crispviz --ldavis --out viz_out/
- The visualization will be saved in the `viz_out` folder. Open the HTML file in a browser to explore.
Let's generate a word cloud of keywords in the corpus.¶
crispviz --wordcloud --out viz_out/
- The word cloud will be saved in the `viz_out` folder.
More examples — comprehensive CLI usage¶
The following grouped examples show common and advanced usage patterns for the three CLIs: crisp, crispt, and crispviz. These are practical, copy-pasteable command lines that demonstrate option combinations and formats discussed in this demo and cheatsheet.
A. Data import & basic workflow (crisp)¶
Import a folder with text files and a CSV; specify unstructured text column¶
crisp --source ./raw_data --unstructured "comments" --out ./crisp_input
Import but limit text files and CSV rows when ingesting large sources¶
crisp --source ./raw_data --out ./crisp_input --num 10 --rec 500
Import CSV placed in the source folder; ignore specific stopwords/columns¶
crisp --source ./survey --unstructured "comments" --ignore "interviewer,interviewee" --out ./survey_corpus
B. Filtering and linking (crisp + crispt) — examples¶
Exact-match filters (both = and : separators supported)¶
crisp --inp ./crisp_input --filters category=Health --topics
crisp --inp ./crisp_input --filters category:Health --topics
Special link filters (text→df and df→text)¶
# Filter dataframe rows that are linked from documents via embeddings
crispt --inp ./crisp_input --filters embedding:text --out ./linked_by_embedding
# Filter documents that are linked from dataframe rows via temporal links
crispt --inp ./crisp_input --filters temporal:df --out ./linked_docs
Legacy shorthand mappings — both map to embedding:text or temporal:text¶
crispt --inp ./crisp_input --filters =embedding
crispt --inp ./crisp_input --filters :temporal
ID linkage: filter to a single ID, or sync remaining docs↔rows with blank value¶
# Filter to specific id
crisp --inp ./crisp_input --filters id=12345 --nlp
# Sync documents and dataframe rows by ID after other filters
crisp --inp ./crisp_input --filters id: --out ./synced_output
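The blank-value id: filter can be read as an intersection on shared IDs; an illustrative sketch with hypothetical document and row layouts:

```python
def sync_by_id(docs, rows):
    """Keep only documents and rows that share an id (the `id:` filter)."""
    shared = {d["id"] for d in docs} & {r["id"] for r in rows}
    return ([d for d in docs if d["id"] in shared],
            [r for r in rows if r["id"] in shared])

docs = [{"id": 1}, {"id": 2}, {"id": 4}]
rows = [{"id": 2, "score": 7}, {"id": 3, "score": 5}, {"id": 4, "score": 9}]
docs2, rows2 = sync_by_id(docs, rows)
print([d["id"] for d in docs2], [r["id"] for r in rows2])  # → [2, 4] [2, 4]
```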
C. Text analysis quick examples (crisp)¶
Topic modeling and then assign topics to documents¶
crisp --inp ./crisp_input --topics --assign --out ./crisp_input_analyzed
Run sentiment and summary together¶
crisp --inp ./crisp_input --sentiment --summary --num 5
Run all NLP analyses (coding dictionary, topics, categories, summary, sentiment)¶
crisp --inp ./crisp_input --nlp
D. Machine learning & cross-modal examples (crisp)¶
Run k-means clustering on numeric CSV columns¶
crisp --inp ./survey_corpus --kmeans --num 4 --include age,income,score
Classification (SVM + Decision Tree) using a DataFrame outcome column¶
crisp --inp ./survey_corpus --cls --outcome satisfaction_binary --include a,b,c --aggregation majority
Neural net (requires crisp-t[ml])¶
crisp --inp ./survey_corpus --nnet --outcome target_col --include feat1,feat2
LSTM using text documents aligned by id column in CSV¶
crisp --inp ./survey_corpus --lstm --outcome CLASS
E. Corpus management & inspection (crispt) — examples¶
Create a new corpus and add documents¶
crispt --id my_corpus --name "Study A" --doc "1|Intro|This is the first document" --out ./my_corpus
Add metadata and a relationship¶
crispt --inp ./my_corpus --meta "source=field" --add-rel "text:work|numb:self_time|correlates" --out ./my_corpus
Remove a document and clear relationships¶
crispt --inp ./my_corpus --remove-doc 1 --clear-rel --out ./my_corpus
Inspect dataset columns, row counts, or specific rows¶
crispt --inp ./my_corpus --df-cols
crispt --inp ./my_corpus --df-row-count
crispt --inp ./my_corpus --df-row 12
Print usage: two supported formats¶
# Multi-flag form
crispt --inp ./my_corpus --print documents --print 10
# Single-string form
crispt --inp ./my_corpus --print "dataframe metadata"
F. Semantic & embedding features (crispt) — examples¶
Semantic search for similar documents (requires embedding backend)¶
crispt --inp ./my_corpus --semantic "patient anxiety" --num 8 --rec 0.45
Find documents similar to a list of document IDs¶
crispt --inp ./my_corpus --similar-docs "1,2,3" --num 5
Semantic-chunks: search within specific document chunks (use with --doc-id)¶
crispt --inp ./my_corpus --doc-id 5 --semantic-chunks "query phrase" --rec 0.6
Embedding linking and stats¶
crispt --inp ./my_corpus --embedding-link "cosine:3:0.7" --embedding-stats --out ./emb_links
crispt --inp ./emb_links --filters embedding:df --out ./docs_linked_to_rows
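Assuming the `cosine:3:0.7` argument reads as metric:top_k:threshold, embedding linking can be sketched as a cosine-similarity top-k match. Toy 2-D vectors stand in for real embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def link_by_embedding(doc_vecs, row_vecs, top_k, threshold):
    """For each document, link the top_k most similar rows above threshold."""
    links = {}
    for d, dv in doc_vecs.items():
        scored = sorted(((cosine(dv, rv), r) for r, rv in row_vecs.items()),
                        reverse=True)
        links[d] = [r for s, r in scored[:top_k] if s >= threshold]
    return links

doc_vecs = {"doc1": (1.0, 0.0), "doc2": (0.0, 1.0)}
row_vecs = {"row1": (0.9, 0.1), "row2": (0.1, 0.9)}
links = link_by_embedding(doc_vecs, row_vecs, top_k=1, threshold=0.7)
print(links)  # → {'doc1': ['row1'], 'doc2': ['row2']}
```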
G. Temporal utilities (crispt) — examples¶
Link by time (nearest, window with seconds, or sequence)¶
crispt --inp ./my_corpus --temporal-link "nearest:timestamp"
crispt --inp ./my_corpus --temporal-link "window:timestamp:300" # ±300 seconds
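The window variant can be pictured as keeping rows whose timestamp falls within ± the given number of seconds of each document; a sketch with hypothetical timestamp fields:

```python
from datetime import datetime, timedelta

def window_link(docs, rows, seconds):
    """Link each document to rows whose timestamp is within ±seconds of it."""
    delta = timedelta(seconds=seconds)
    return {
        d["id"]: [r["id"] for r in rows if abs(r["ts"] - d["ts"]) <= delta]
        for d in docs
    }

t0 = datetime(2021, 3, 1, 12, 0, 0)
docs = [{"id": "d1", "ts": t0}]
rows = [
    {"id": "r1", "ts": t0 + timedelta(seconds=120)},   # inside ±300 s
    {"id": "r2", "ts": t0 + timedelta(seconds=900)},   # outside ±300 s
]
print(window_link(docs, rows, seconds=300))  # → {'d1': ['r1']}
```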
Temporal summaries, sentiment trends, and topics over periods¶
crispt --inp ./my_corpus --temporal-summary W
crispt --inp ./my_corpus --temporal-sentiment W:mean
crispt --inp ./my_corpus --temporal-topics W:5
H. Visualization examples (crispviz)¶
Word frequency + topic wordcloud + LDA interactive visualization¶
crispviz --inp ./crisp_input_analyzed --out viz_out --freq --wordcloud --ldavis
Top terms with custom top-n and bins¶
crispviz --inp ./crisp_input --out viz_out --top-terms --top-n 30 --bins 80
Correlation heatmap with selected numeric columns¶
crispviz --inp ./survey_corpus --out viz_out --corr-heatmap --corr-columns "age,income,score"
Graph visualization filtered by node types and a different layout¶
crispviz --inp ./my_corpus --out viz_out --graph --graph-nodes document,keyword --graph-layout circular
I. Small tips & parameter semantics¶
- --rec for `crispt` semantic commands can be a similarity threshold (float, default 0.4), while --rec for some `crisp` commands is used as an integer count; check the command context.
- --num defaults differ by context (e.g., `crispt` search default is 5; `crisp` analysis default is 3).
- --aggregation accepts `majority|mean|first|mode` and controls how values from multiple documents mapping to one numeric row are aggregated for ML tasks.
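The --aggregation choices can be sketched directly; `majority` here assumes binary (0/1) per-document values, which is an illustrative simplification:

```python
from statistics import mean, mode

def aggregate(values, how):
    """Collapse several per-document values onto one numeric row."""
    if how == "majority":
        return 1 if sum(1 for v in values if v == 1) > len(values) / 2 else 0
    if how == "mean":
        return mean(values)
    if how == "first":
        return values[0]
    if how == "mode":
        return mode(values)
    raise ValueError(how)

preds = [1, 0, 1, 1]  # four documents mapped to the same row
print(aggregate(preds, "majority"), aggregate(preds, "mean"))  # → 1 0.75
```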
J. Full-run example (import → analyze → visualize)¶
# 1) Import
crisp --source ./raw_data --unstructured "comments" --out ./crisp_input
# 2) Run NLP + sentiment + save
crisp --inp ./crisp_input --topics --assign --sentiment --out ./crisp_input_analyzed
# 3) Link by embedding and run regression on linked set
crispt --inp ./crisp_input_analyzed --embedding-link "cosine:1:0.7" --out ./linked
crisp --inp ./linked --outcome satisfaction_score --regression --out ./final_results
# 4) Create visualizations
crispviz --inp ./final_results --out viz_out --ldavis --wordcloud --corr-heatmap