{
"cells": [
{
"cell_type": "markdown",
"id": "60bb467d-861d-4b07-a48d-8e5aa177c969",
"metadata": {},
"source": [
"# Chat Extraction\n",
"\n",
"This benchmark combines classification, summarization, and extraction in one a combined task. The model is\n",
"expected to output formatted json in the expected schema."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "47de0d20-d20b-44be-9e41-d2275f0866e8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# %pip install -U --quiet langchain langchain_benchmarks\n",
"# %pip install -U openai rapidfuzz fireworks-ai anthropic"
]
},
{
"cell_type": "markdown",
"id": "af75ce4b-f159-4917-9249-01ee88b1b8fc",
"metadata": {},
"source": [
"For this code to work, please configure LangSmith environment variables with your credentials,\n",
"in addition to your LLM providers' API keys."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c401de19-814e-4bd7-bb9c-7ea6e217985c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"import uuid\n",
"\n",
"uid = uuid.uuid4().hex[:4] # Avoid conflicts in project names\n",
"\n",
"# Get your API key from https://smith.langchain.com/settings\n",
"api_keys = [\n",
" \"LANGCHAIN_API_KEY\",\n",
" \"OPENAI_API_KEY\",\n",
" \"ANTHROPIC_API_KEY\",\n",
" \"FIREWORKS_API_KEY\",\n",
"]\n",
"for key in api_keys:\n",
" if key not in os.environ:\n",
" os.environ[key] = getpass.getpass(f\"Enter your {key}: \")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "60f22779-a948-4833-8e8c-ace9ef17f56f",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset Chat Extraction already exists. Skipping.\n",
"You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6.\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"Name | Chat Extraction |
\n",
"Type | ExtractionTask |
\n",
"Dataset ID | 00f4444c-9460-4a82-b87a-f50096f1cfef |
\n",
"Description | A dataset meant to test the ability of an LLM to extract and infer\n",
"structured information from a dialogue. The dialogue is between a user and a support\n",
"engineer. Outputs should be structured as a JSON object and test both the ability\n",
"of the LLM to correctly structure the information and its ability to perform simple \n",
"classification tasks. |
\n",
"\n",
"
"
],
"text/plain": [
"ExtractionTask(name='Chat Extraction', dataset_id='https://smith.langchain.com/public/00f4444c-9460-4a82-b87a-f50096f1cfef/d', description='A dataset meant to test the ability of an LLM to extract and infer\\nstructured information from a dialogue. The dialogue is between a user and a support\\nengineer. Outputs should be structured as a JSON object and test both the ability\\nof the LLM to correctly structure the information and its ability to perform simple \\nclassification tasks.', schema=, instructions=ChatPromptTemplate(input_variables=['dialogue'], messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='You are a helpdesk assistant responsible with extracting information and generating tickets. Dialogues are between a user and a support engineer.')), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['dialogue'], template='Generate a ticket for the following question-response pair:\\n\\n{dialogue}\\n'))]))"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_benchmarks import clone_public_dataset, registry\n",
"\n",
"task = registry[\"Chat Extraction\"]\n",
"\n",
"# Clone the dataset to your tenant\n",
"clone_public_dataset(task.dataset_id, dataset_name=task.name)\n",
"\n",
"\n",
"task"
]
},
{
"cell_type": "markdown",
"id": "86f1378a-9a62-477e-bdb8-a7fd10915b62",
"metadata": {},
"source": [
"#### Schema\n",
"\n",
"Each extraction task has an expected output schema defined in a Pydantic BaseModel object, which we can use to\n",
"get a JSON schema object."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "12e302e6-9b3d-42a4-b612-d672c591e8f0",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'title': 'GenerateTicket',\n",
" 'description': 'Generate a ticket containing all the extracted information.',\n",
" 'type': 'object',\n",
" 'properties': {'issue_summary': {'title': 'Issue Summary',\n",
" 'description': 'short (<10 word) summary of the issue or question',\n",
" 'type': 'string'},\n",
" 'question': {'title': 'Question',\n",
" 'description': 'Information inferred from the the question.',\n",
" 'allOf': [{'$ref': '#/definitions/QuestionCategorization'}]},\n",
" 'response': {'title': 'Response',\n",
" 'description': 'Information inferred from the the response.',\n",
" 'allOf': [{'$ref': '#/definitions/ResponseCategorization'}]}},\n",
" 'required': ['issue_summary', 'question', 'response'],\n",
" 'definitions': {'QuestionCategory': {'title': 'QuestionCategory',\n",
" 'description': 'An enumeration.',\n",
" 'enum': ['Implementation Issues',\n",
" 'Feature Requests',\n",
" 'Concept Explanations',\n",
" 'Code Optimization',\n",
" 'Security and Privacy Concerns',\n",
" 'Model Training and Fine-tuning',\n",
" 'Data Handling and Manipulation',\n",
" 'User Interaction Flow',\n",
" 'Technical Integration',\n",
" 'Error Handling and Logging',\n",
" 'Customization and Configuration',\n",
" 'External API and Data Source Integration',\n",
" 'Language and Localization',\n",
" 'Streaming and Real-time Processing',\n",
" 'Tool Development',\n",
" 'Function Calling',\n",
" 'LLM Integrations',\n",
" 'General Agent Question',\n",
" 'General Chit Chat',\n",
" 'Memory',\n",
" 'Debugging Help',\n",
" 'Application Design',\n",
" 'Prompt Templates',\n",
" 'Cost Tracking',\n",
" 'Other'],\n",
" 'type': 'string'},\n",
" 'Sentiment': {'title': 'Sentiment',\n",
" 'description': 'An enumeration.',\n",
" 'enum': ['Negative', 'Neutral', 'Positive'],\n",
" 'type': 'string'},\n",
" 'ProgrammingLanguage': {'title': 'ProgrammingLanguage',\n",
" 'description': 'An enumeration.',\n",
" 'enum': ['python', 'javascript', 'typescript', 'unknown', 'other'],\n",
" 'type': 'string'},\n",
" 'QuestionCategorization': {'title': 'QuestionCategorization',\n",
" 'type': 'object',\n",
" 'properties': {'question_category': {'$ref': '#/definitions/QuestionCategory'},\n",
" 'category_if_other': {'title': 'Category If Other',\n",
" 'description': \"question category if the category above is 'other'\",\n",
" 'type': 'string'},\n",
" 'is_off_topic': {'title': 'Is Off Topic',\n",
" 'description': 'If the input is general chit chat or does not pertain to technical inqueries about LangChain or building/debugging applications with LLMs/AI, it is off topic. For context, LangChain is a library and framework designed to assist in building applications with LLMs. Questions may also be about similar packages like LangServe, LangSmith, OpenAI, Anthropic, vectorstores, agents, etc.',\n",
" 'type': 'boolean'},\n",
" 'toxicity': {'title': 'Toxicity',\n",
" 'description': 'Whether or not the input question is toxic',\n",
" 'default': 0,\n",
" 'exclusiveMaximum': 6,\n",
" 'minimum': 0,\n",
" 'type': 'integer'},\n",
" 'sentiment': {'$ref': '#/definitions/Sentiment'},\n",
" 'programming_language': {'$ref': '#/definitions/ProgrammingLanguage'}},\n",
" 'required': ['question_category',\n",
" 'is_off_topic',\n",
" 'sentiment',\n",
" 'programming_language']},\n",
" 'ResponseType': {'title': 'ResponseType',\n",
" 'description': 'An enumeration.',\n",
" 'enum': ['resolve issue',\n",
" 'provide guidance',\n",
" 'request information',\n",
" 'give up',\n",
" 'none',\n",
" 'other'],\n",
" 'type': 'string'},\n",
" 'ResponseCategorization': {'title': 'ResponseCategorization',\n",
" 'type': 'object',\n",
" 'properties': {'response_type': {'$ref': '#/definitions/ResponseType'},\n",
" 'response_type_if_other': {'title': 'Response Type If Other',\n",
" 'type': 'string'},\n",
" 'confidence_level': {'title': 'Confidence Level',\n",
" 'description': 'The confidence of the assistant in its answer.',\n",
" 'exclusiveMaximum': 6,\n",
" 'minimum': 0,\n",
" 'type': 'integer'},\n",
" 'followup_actions': {'title': 'Followup Actions',\n",
" 'description': 'Actions the assistant recommended the user take.',\n",
" 'type': 'array',\n",
" 'items': {'type': 'string'}}},\n",
" 'required': ['response_type', 'confidence_level']}}}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"task.schema.schema()"
]
},
{
"cell_type": "markdown",
"id": "b462f7b8-fd42-4613-ab5f-5f3cbbc37d28",
"metadata": {},
"source": [
"## Define an extraction chain\n",
"\n",
"Let's build the extraction chain that we can use to get structured information from the emails."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ade7077c-4602-4e5b-ad6d-3eb43cbd0247",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4-1106-preview\", temperature=0).bind_functions(\n",
" functions=[task.schema],\n",
" function_call=task.schema.schema()[\"title\"],\n",
")\n",
"\n",
"\n",
"def format_run(dialogue_input: dict):\n",
" question = dialogue_input[\"question\"]\n",
" answer = dialogue_input[\"answer\"]\n",
" return {\n",
" \"dialogue\": f\"\\n{question}\\n\\n\"\n",
" f\"\\n{answer}\\n\"\n",
" }\n",
"\n",
"\n",
"output_parser = JsonOutputFunctionsParser()\n",
"extraction_chain = (\n",
" format_run\n",
" | task.instructions\n",
" | llm\n",
" | output_parser\n",
" # Wrap as 'output' so to be unified for the evaluators\n",
" | (lambda x: {\"output\": x})\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f66ed218-e1db-49b5-bde3-40ebec961723",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'output': {'issue_summary': 'Running Llama 2 Locally',\n",
" 'question': {'question_category': 'Implementation Issues',\n",
" 'is_off_topic': False,\n",
" 'sentiment': 'Neutral',\n",
" 'programming_language': 'unknown'},\n",
" 'response': {'response_type': 'provide guidance', 'confidence_level': 1}}}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"extraction_chain.invoke(\n",
" {\"question\": \"how do i run llama 2 locally?\", \"answer\": \"Llama.cpp of course.\"}\n",
")"
]
},
{
"cell_type": "markdown",
"id": "87a64f76-65ae-4367-b43f-f2be3431e7af",
"metadata": {},
"source": [
"Now it's time to measure our chain's effectiveness!"
]
},
{
"cell_type": "markdown",
"id": "3821e4b0-8e67-418a-840c-470fcde42df0",
"metadata": {},
"source": [
"## Evaluate\n",
"\n",
"Let's evaluate the chain now."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "aab7514e-a6ef-4c21-b90f-d9cbefcf5af1",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'gpt-4-1106-preview-5689' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=0c022691-a7ac-4545-b2bc-58aab2d476e8\n",
"\n",
"View all tests for Dataset Chat Extraction at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6\n",
"[------------------------------------------------->] 27/27"
]
},
{
"data": {
"text/html": [
"Experiment Results:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" feedback.json_edit_distance | \n",
" feedback.json_schema | \n",
" feedback.toxicity_similarity | \n",
" feedback.sentiment_similarity | \n",
" feedback.confidence_level_similarity | \n",
" feedback.question_category | \n",
" feedback.off_topic_similarity | \n",
" feedback.programming_language_similarity | \n",
" error | \n",
" execution_time | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 27.000000 | \n",
" 27.0 | \n",
" 27.0 | \n",
" 27.0 | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 0 | \n",
" 27.000000 | \n",
"
\n",
" \n",
" unique | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 0.283000 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.940741 | \n",
" 0.555556 | \n",
" 0.888889 | \n",
" 0.592593 | \n",
" NaN | \n",
" 6.949585 | \n",
"
\n",
" \n",
" std | \n",
" 0.181282 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.093064 | \n",
" 0.506370 | \n",
" 0.320256 | \n",
" 0.500712 | \n",
" NaN | \n",
" 1.639494 | \n",
"
\n",
" \n",
" min | \n",
" 0.049430 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.800000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" NaN | \n",
" 4.248728 | \n",
"
\n",
" \n",
" 25% | \n",
" 0.104149 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.800000 | \n",
" 0.000000 | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" NaN | \n",
" 5.679244 | \n",
"
\n",
" \n",
" 50% | \n",
" 0.336343 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" NaN | \n",
" 6.558088 | \n",
"
\n",
" \n",
" 75% | \n",
" 0.378270 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" NaN | \n",
" 8.300396 | \n",
"
\n",
" \n",
" max | \n",
" 0.594255 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" NaN | \n",
" 10.123084 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" feedback.json_edit_distance feedback.json_schema \\\n",
"count 27.000000 27.0 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.283000 1.0 \n",
"std 0.181282 0.0 \n",
"min 0.049430 1.0 \n",
"25% 0.104149 1.0 \n",
"50% 0.336343 1.0 \n",
"75% 0.378270 1.0 \n",
"max 0.594255 1.0 \n",
"\n",
" feedback.toxicity_similarity feedback.sentiment_similarity \\\n",
"count 27.0 27.0 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.0 1.0 \n",
"std 0.0 0.0 \n",
"min 0.0 1.0 \n",
"25% 0.0 1.0 \n",
"50% 0.0 1.0 \n",
"75% 0.0 1.0 \n",
"max 0.0 1.0 \n",
"\n",
" feedback.confidence_level_similarity feedback.question_category \\\n",
"count 27.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.940741 0.555556 \n",
"std 0.093064 0.506370 \n",
"min 0.800000 0.000000 \n",
"25% 0.800000 0.000000 \n",
"50% 1.000000 1.000000 \n",
"75% 1.000000 1.000000 \n",
"max 1.000000 1.000000 \n",
"\n",
" feedback.off_topic_similarity \\\n",
"count 27.000000 \n",
"unique NaN \n",
"top NaN \n",
"freq NaN \n",
"mean 0.888889 \n",
"std 0.320256 \n",
"min 0.000000 \n",
"25% 1.000000 \n",
"50% 1.000000 \n",
"75% 1.000000 \n",
"max 1.000000 \n",
"\n",
" feedback.programming_language_similarity error execution_time \n",
"count 27.000000 0 27.000000 \n",
"unique NaN 0 NaN \n",
"top NaN NaN NaN \n",
"freq NaN NaN NaN \n",
"mean 0.592593 NaN 6.949585 \n",
"std 0.500712 NaN 1.639494 \n",
"min 0.000000 NaN 4.248728 \n",
"25% 0.000000 NaN 5.679244 \n",
"50% 1.000000 NaN 6.558088 \n",
"75% 1.000000 NaN 8.300396 \n",
"max 1.000000 NaN 10.123084 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langsmith.client import Client\n",
"\n",
"from langchain_benchmarks.extraction.tasks.chat_extraction import get_eval_config\n",
"\n",
"client = Client()\n",
"\n",
"eval_config = get_eval_config()\n",
"\n",
"test_run = client.run_on_dataset(\n",
" dataset_name=task.name,\n",
" llm_or_chain_factory=extraction_chain,\n",
" evaluation=eval_config,\n",
" verbose=True,\n",
" project_name=f\"gpt-4-1106-preview-{uid}\",\n",
" project_metadata={\n",
" \"arch\": \"openai-functions\",\n",
" \"model\": \"gpt-4-1106-preview\",\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d9828990-f498-4d3f-9e51-76d72bf8f4e9",
"metadata": {},
"source": [
"## Compare to Claude-2\n",
"\n",
"Let's compare our results to Anthropic's Claude-2. We will mimic the function calling interface."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1be9d1cb-b9b6-4d77-b0d5-63a6784626d6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from typing import Any, Dict, Type\n",
"\n",
"from langchain.chat_models import ChatAnthropic\n",
"from langchain.output_parsers.xml import XMLOutputParser\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.pydantic_v1 import BaseModel\n",
"\n",
"claude_prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are a data extraction bot tasked with extracting and inferring information from dialogues and generating tickets. Always respond \"\n",
" \"only with XML based on the following JSON schema:\\n{schema}\",\n",
" ),\n",
" (\n",
" \"user\",\n",
" \"Generate a ticket from the following question-response pair:\\n\"\n",
" \"\\n{dialogue}\\n\\n\"\n",
" \"Remember, respond directly with this format:\\n\"\n",
" \"<{function_call}>\\n...\\n{function_call}>\"\n",
" \"RESPOND ONLY IN XML THEN STOP.\",\n",
" ),\n",
" ]\n",
")\n",
"prompt = claude_prompt.partial(\n",
" schema=task.schema.schema_json(), function_call=task.schema.schema()[\"title\"]\n",
")\n",
"\n",
"claude = ChatAnthropic(model=\"claude-2\", temperature=0, max_tokens_to_sample=2048)\n",
"\n",
"\n",
"class MergeSchema:\n",
" \"\"\"Merge the XML Output Parser schema into the output.\"\"\"\n",
"\n",
" def __init__(self, schema: Type[BaseModel]):\n",
" self.schema = schema\n",
"\n",
" @property\n",
" def _func_name(self) -> str:\n",
" return self.schema.__name__\n",
"\n",
" def _merge_schema(self, parsed_output: Any, schema: Type[BaseModel]):\n",
" merged_output = {}\n",
" if isinstance(parsed_output, dict):\n",
" items = parsed_output.items()\n",
" elif isinstance(parsed_output, list):\n",
" items = [(k, v) for item in parsed_output for k, v in item.items()]\n",
" else:\n",
" return parsed_output\n",
"\n",
" for key, value in items:\n",
" if key in schema.__fields__:\n",
" field_info = schema.__fields__[key]\n",
" if isinstance(value, list):\n",
" if issubclass(field_info.type_, (BaseModel, dict)):\n",
" result = self._merge_schema(value, field_info.type_)\n",
" elif all(\n",
" isinstance(item, dict) and item.keys() == {\"item\"}\n",
" for item in value\n",
" ):\n",
" result = [next(iter(item.values())) for item in value]\n",
" else:\n",
" result = value\n",
" else:\n",
" result = value\n",
" else:\n",
" result = value\n",
" if key in merged_output:\n",
" if isinstance(merged_output[key], list):\n",
" merged_output[key].append(result)\n",
" else:\n",
" merged_output[key] = [merged_output[key], result]\n",
" else:\n",
" merged_output[key] = result\n",
"\n",
" return merged_output\n",
"\n",
" def __call__(self, parsed_output: dict) -> Dict[str, Any]:\n",
" merged_output = {}\n",
" if self._func_name not in parsed_output:\n",
" return parsed_output\n",
" return {\n",
" self._func_name: self._merge_schema(\n",
" parsed_output[self._func_name], self.schema\n",
" )\n",
" }\n",
"\n",
"\n",
"def try_parse(llm_output, config):\n",
" try:\n",
" output_chain = XMLOutputParser() | MergeSchema(task.schema)\n",
" parsed = output_chain.invoke(llm_output, config)\n",
" # Wrap as 'output' so to be unified for the evaluators\n",
" return {\"output\": parsed.get(\"GenerateTicket\")}\n",
" except Exception as e:\n",
" return {\"output\": llm_output, \"error\": str(e)}\n",
"\n",
"\n",
"claude_extraction_chain = format_run | prompt | claude | try_parse"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "cea759e7-a51a-4abd-9869-f928bea80da2",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'output': {'issue_summary': 'How to run Llama locally',\n",
" 'question': {'question_category': 'Implementation Issues',\n",
" 'is_off_topic': 'false',\n",
" 'toxicity': '0',\n",
" 'sentiment': 'Neutral',\n",
" 'programming_language': 'unknown'},\n",
" 'response': {'response_type': 'provide guidance',\n",
" 'confidence_level': '3',\n",
" 'followup_actions': ['Ask clarifying questions about the specific issue',\n",
" 'Provide documentation or examples for running Llama locally']}}}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = claude_extraction_chain.invoke(\n",
" {\"question\": \"how do i run llama 2 locally?\", \"answer\": \"Llama.cpp of course.\"}\n",
")\n",
"result"
]
},
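{
"cell_type": "markdown",
"id": "e2c4a6f0-5b1d-4e7a-9c3f-0a1b2c3d4e5f",
"metadata": {},
"source": [
"The output above returns `is_off_topic`, `toxicity`, and `confidence_level` as strings (`'false'`, `'0'`, `'3'`), since XML text carries no type information. The sketch below is one hedged way to coerce such leaves into booleans and integers before comparing them with typed reference values. `coerce_leaf` and `coerce_tree` are hypothetical helper names, not part of LangChain or `langchain_benchmarks`, and the chain in this notebook does not apply them.\n",
"\n",
"```python\n",
"from typing import Any\n",
"\n",
"\n",
"def coerce_leaf(value: Any) -> Any:\n",
"    # Hypothetical helper: turn XML text such as 'false' or '3' into a bool or int.\n",
"    if not isinstance(value, str):\n",
"        return value\n",
"    text = value.strip()\n",
"    lowered = text.lower()\n",
"    if lowered in ('true', 'false'):\n",
"        return lowered == 'true'\n",
"    try:\n",
"        return int(text)\n",
"    except ValueError:\n",
"        # Anything else (for example 'Neutral') stays a string.\n",
"        return value\n",
"\n",
"\n",
"def coerce_tree(node: Any) -> Any:\n",
"    # Recursively apply coerce_leaf to every leaf of a parsed dict/list structure.\n",
"    if isinstance(node, dict):\n",
"        return {key: coerce_tree(child) for key, child in node.items()}\n",
"    if isinstance(node, list):\n",
"        return [coerce_tree(child) for child in node]\n",
"    return coerce_leaf(node)\n",
"\n",
"\n",
"parsed = {\n",
"    'question': {'is_off_topic': 'false', 'toxicity': '0', 'sentiment': 'Neutral'},\n",
"    'response': {'confidence_level': '3'},\n",
"}\n",
"coerced = coerce_tree(parsed)\n",
"```\n",
"\n",
"A blanket rule like this would also convert a free-text field whose content happens to be a number, so a schema-aware version that checks each field's declared type is likely the safer design."
]
},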
{
"cell_type": "code",
"execution_count": 10,
"id": "7723e6f4-b214-46a8-9286-93116fe893d8",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'claude-2-json-schema-to-xml-5689' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=3f590999-a9d1-48be-83dd-e84acb99a195\n",
"\n",
"View all tests for Dataset Chat Extraction at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6\n",
"[------------------------------------------------->] 27/27"
]
},
{
"data": {
"text/html": [
"Experiment Results:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" feedback.json_edit_distance | \n",
" feedback.json_schema | \n",
" feedback.toxicity_similarity | \n",
" feedback.sentiment_similarity | \n",
" feedback.confidence_level_similarity | \n",
" feedback.question_category | \n",
" feedback.off_topic_similarity | \n",
" feedback.programming_language_similarity | \n",
" error | \n",
" execution_time | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.0 | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.0 | \n",
" 27.000000 | \n",
" 0 | \n",
" 27.000000 | \n",
"
\n",
" \n",
" unique | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 0.371950 | \n",
" 0.777778 | \n",
" 1.0 | \n",
" 0.925926 | \n",
" 0.970370 | \n",
" 0.481481 | \n",
" 0.0 | \n",
" 0.444444 | \n",
" NaN | \n",
" 10.556105 | \n",
"
\n",
" \n",
" std | \n",
" 0.108628 | \n",
" 0.423659 | \n",
" 0.0 | \n",
" 0.181007 | \n",
" 0.072403 | \n",
" 0.509175 | \n",
" 0.0 | \n",
" 0.506370 | \n",
" NaN | \n",
" 1.790352 | \n",
"
\n",
" \n",
" min | \n",
" 0.105033 | \n",
" 0.000000 | \n",
" 1.0 | \n",
" 0.500000 | \n",
" 0.800000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" NaN | \n",
" 8.435542 | \n",
"
\n",
" \n",
" 25% | \n",
" 0.312445 | \n",
" 1.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" NaN | \n",
" 9.077631 | \n",
"
\n",
" \n",
" 50% | \n",
" 0.390000 | \n",
" 1.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" NaN | \n",
" 10.059124 | \n",
"
\n",
" \n",
" 75% | \n",
" 0.462694 | \n",
" 1.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.0 | \n",
" 1.000000 | \n",
" NaN | \n",
" 11.795210 | \n",
"
\n",
" \n",
" max | \n",
" 0.537678 | \n",
" 1.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.0 | \n",
" 1.000000 | \n",
" NaN | \n",
" 15.072743 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" feedback.json_edit_distance feedback.json_schema \\\n",
"count 27.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.371950 0.777778 \n",
"std 0.108628 0.423659 \n",
"min 0.105033 0.000000 \n",
"25% 0.312445 1.000000 \n",
"50% 0.390000 1.000000 \n",
"75% 0.462694 1.000000 \n",
"max 0.537678 1.000000 \n",
"\n",
" feedback.toxicity_similarity feedback.sentiment_similarity \\\n",
"count 27.0 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 1.0 0.925926 \n",
"std 0.0 0.181007 \n",
"min 1.0 0.500000 \n",
"25% 1.0 1.000000 \n",
"50% 1.0 1.000000 \n",
"75% 1.0 1.000000 \n",
"max 1.0 1.000000 \n",
"\n",
" feedback.confidence_level_similarity feedback.question_category \\\n",
"count 27.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.970370 0.481481 \n",
"std 0.072403 0.509175 \n",
"min 0.800000 0.000000 \n",
"25% 1.000000 0.000000 \n",
"50% 1.000000 0.000000 \n",
"75% 1.000000 1.000000 \n",
"max 1.000000 1.000000 \n",
"\n",
" feedback.off_topic_similarity \\\n",
"count 27.0 \n",
"unique NaN \n",
"top NaN \n",
"freq NaN \n",
"mean 0.0 \n",
"std 0.0 \n",
"min 0.0 \n",
"25% 0.0 \n",
"50% 0.0 \n",
"75% 0.0 \n",
"max 0.0 \n",
"\n",
" feedback.programming_language_similarity error execution_time \n",
"count 27.000000 0 27.000000 \n",
"unique NaN 0 NaN \n",
"top NaN NaN NaN \n",
"freq NaN NaN NaN \n",
"mean 0.444444 NaN 10.556105 \n",
"std 0.506370 NaN 1.790352 \n",
"min 0.000000 NaN 8.435542 \n",
"25% 0.000000 NaN 9.077631 \n",
"50% 0.000000 NaN 10.059124 \n",
"75% 1.000000 NaN 11.795210 \n",
"max 1.000000 NaN 15.072743 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"claude_test_run = client.run_on_dataset(\n",
" dataset_name=task.name,\n",
" llm_or_chain_factory=claude_extraction_chain,\n",
" evaluation=eval_config,\n",
" verbose=True,\n",
" project_name=f\"claude-2-json-schema-to-xml-{uid}\",\n",
" project_metadata={\n",
" \"arch\": \"claude-json-schema-xml-output\",\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5d34455c-e9d3-4fb0-b8d7-a3ee4a4b6ae0",
"metadata": {},
"source": [
"So it looks like edit distance is pretty good, but the schema validation leaves something to be desired.\n",
"\n",
"We're defining the schema in JSON then requesting XML. Let's try keeping it unified."
]
},
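{
"cell_type": "markdown",
"id": "7d9b3c21-4f6e-4a08-b1c5-9e8f7a6b5c4d",
"metadata": {},
"source": [
"The `feedback.json_edit_distance` column reports a distance between the predicted and reference JSON, so lower values mean a closer match. As a rough illustration of the idea only, the sketch below computes a Levenshtein distance between key-sorted serializations and divides it by the longer length. `levenshtein` and `normalized_json_distance` are hypothetical names; the evaluator returned by `get_eval_config` may canonicalize and normalize differently, so these numbers are not expected to match the table above.\n",
"\n",
"```python\n",
"import json\n",
"\n",
"\n",
"def levenshtein(a: str, b: str) -> int:\n",
"    # Classic dynamic-programming edit distance, keeping one row at a time.\n",
"    previous = list(range(len(b) + 1))\n",
"    for i, char_a in enumerate(a, start=1):\n",
"        current = [i]\n",
"        for j, char_b in enumerate(b, start=1):\n",
"            cost = 0 if char_a == char_b else 1\n",
"            current.append(\n",
"                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)\n",
"            )\n",
"        previous = current\n",
"    return previous[-1]\n",
"\n",
"\n",
"def normalized_json_distance(prediction: dict, reference: dict) -> float:\n",
"    # Serialize with sorted keys so that key order does not affect the score.\n",
"    pred_text = json.dumps(prediction, sort_keys=True)\n",
"    ref_text = json.dumps(reference, sort_keys=True)\n",
"    longest = max(len(pred_text), len(ref_text))\n",
"    return levenshtein(pred_text, ref_text) / longest if longest else 0.0\n",
"```\n",
"\n",
"Two outputs that differ only in key order score 0.0 under this sketch, while any change to a value raises the score toward 1.0."
]
},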
{
"cell_type": "markdown",
"id": "a9612d56-08a1-4f24-a961-af7f7916997d",
"metadata": {},
"source": [
"## Try with XSD Schema Definition\n",
"\n",
"In this variant, let's see if Claude performs better if we keep our structure consistent."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b9914571-d3f2-4f48-bdbb-2dfcfb03f26d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from typing import Any, Dict, Type\n",
"\n",
"from langchain.chat_models import ChatAnthropic\n",
"from langchain.output_parsers.xml import XMLOutputParser\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.pydantic_v1 import BaseModel\n",
"\n",
"# This is the schema the model will populate\n",
"xsd = \"\"\"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
"\"\"\"\n",
"\n",
"prompt = claude_prompt.partial(schema=xsd, function_call=task.schema.schema()[\"title\"])\n",
"\n",
"claude_extraction_chain = format_run | prompt | claude | try_parse"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "26dc6d70-b745-4fd3-9592-1a13a3f2751f",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'output': {'issue_summary': 'How to run Llama locally',\n",
" 'question': {'question_category': 'LLM Integrations',\n",
" 'is_off_topic': 'false',\n",
" 'toxicity': '0',\n",
" 'sentiment': 'Neutral',\n",
" 'programming_language': 'unknown'},\n",
" 'response': {'response_type': 'provide guidance',\n",
" 'confidence_level': '3',\n",
" 'followup_actions': ['Install Llama locally', 'Add Llama to path']}}}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = claude_extraction_chain.invoke(\n",
" {\n",
" \"question\": \"how do i run llama 2 locally?\",\n",
" \"answer\": \"Llama.cpp of course. Afterwords remember to install it, then add it to your path!\",\n",
" }\n",
")\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f8d58656-108d-48d2-ba16-815fc9bdebcc",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'claude-2-xsd-to-xml-5689' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=dc7656d8-00ef-4048-9ce5-38ef72af593c\n",
"\n",
"View all tests for Dataset Chat Extraction at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6\n",
"[------------------------------------------------->] 27/27"
]
},
{
"data": {
"text/html": [
"Experiment Results:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" feedback.json_edit_distance | \n",
" feedback.json_schema | \n",
" feedback.toxicity_similarity | \n",
" feedback.sentiment_similarity | \n",
" feedback.confidence_level_similarity | \n",
" feedback.question_category | \n",
" feedback.off_topic_similarity | \n",
" feedback.programming_language_similarity | \n",
" error | \n",
" execution_time | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.0 | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.000000 | \n",
" 27.0 | \n",
" 27.000000 | \n",
" 0 | \n",
" 27.000000 | \n",
"
\n",
" \n",
" unique | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 0.394232 | \n",
" 0.518519 | \n",
" 1.0 | \n",
" 0.907407 | \n",
" 0.970370 | \n",
" 0.370370 | \n",
" 0.0 | \n",
" 0.518519 | \n",
" NaN | \n",
" 11.128319 | \n",
"
\n",
" \n",
" std | \n",
" 0.117880 | \n",
" 0.509175 | \n",
" 0.0 | \n",
" 0.197924 | \n",
" 0.072403 | \n",
" 0.492103 | \n",
" 0.0 | \n",
" 0.509175 | \n",
" NaN | \n",
" 4.845637 | \n",
"
\n",
" \n",
" min | \n",
" 0.116608 | \n",
" 0.000000 | \n",
" 1.0 | \n",
" 0.500000 | \n",
" 0.800000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" NaN | \n",
" 7.833285 | \n",
"
\n",
" \n",
" 25% | \n",
" 0.332400 | \n",
" 0.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" NaN | \n",
" 8.888438 | \n",
"
\n",
" \n",
" 50% | \n",
" 0.380435 | \n",
" 1.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 1.000000 | \n",
" NaN | \n",
" 9.629613 | \n",
"
\n",
" \n",
" 75% | \n",
" 0.456592 | \n",
" 1.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.0 | \n",
" 1.000000 | \n",
" NaN | \n",
" 11.143679 | \n",
"
\n",
" \n",
" max | \n",
" 0.644007 | \n",
" 1.000000 | \n",
" 1.0 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 1.000000 | \n",
" 0.0 | \n",
" 1.000000 | \n",
" NaN | \n",
" 32.068304 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" feedback.json_edit_distance feedback.json_schema \\\n",
"count 27.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.394232 0.518519 \n",
"std 0.117880 0.509175 \n",
"min 0.116608 0.000000 \n",
"25% 0.332400 0.000000 \n",
"50% 0.380435 1.000000 \n",
"75% 0.456592 1.000000 \n",
"max 0.644007 1.000000 \n",
"\n",
" feedback.toxicity_similarity feedback.sentiment_similarity \\\n",
"count 27.0 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 1.0 0.907407 \n",
"std 0.0 0.197924 \n",
"min 1.0 0.500000 \n",
"25% 1.0 1.000000 \n",
"50% 1.0 1.000000 \n",
"75% 1.0 1.000000 \n",
"max 1.0 1.000000 \n",
"\n",
" feedback.confidence_level_similarity feedback.question_category \\\n",
"count 27.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.970370 0.370370 \n",
"std 0.072403 0.492103 \n",
"min 0.800000 0.000000 \n",
"25% 1.000000 0.000000 \n",
"50% 1.000000 0.000000 \n",
"75% 1.000000 1.000000 \n",
"max 1.000000 1.000000 \n",
"\n",
" feedback.off_topic_similarity \\\n",
"count 27.0 \n",
"unique NaN \n",
"top NaN \n",
"freq NaN \n",
"mean 0.0 \n",
"std 0.0 \n",
"min 0.0 \n",
"25% 0.0 \n",
"50% 0.0 \n",
"75% 0.0 \n",
"max 0.0 \n",
"\n",
" feedback.programming_language_similarity error execution_time \n",
"count 27.000000 0 27.000000 \n",
"unique NaN 0 NaN \n",
"top NaN NaN NaN \n",
"freq NaN NaN NaN \n",
"mean 0.518519 NaN 11.128319 \n",
"std 0.509175 NaN 4.845637 \n",
"min 0.000000 NaN 7.833285 \n",
"25% 0.000000 NaN 8.888438 \n",
"50% 1.000000 NaN 9.629613 \n",
"75% 1.000000 NaN 11.143679 \n",
"max 1.000000 NaN 32.068304 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"claude_xsd_test_run = client.run_on_dataset(\n",
" dataset_name=task.name,\n",
" llm_or_chain_factory=claude_extraction_chain,\n",
" evaluation=eval_config,\n",
" verbose=True,\n",
" project_name=f\"claude-2-xsd-to-xml-{uid}\",\n",
" project_metadata={\n",
" \"arch\": \"claude-xml\",\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3df7ce82-73a7-4913-9569-1066d982b528",
"metadata": {},
"source": [
"The `json_schema` score went down (mean 0.78 for the previous Claude run vs. 0.52 here), meaning that, counter-intuitively, the output is less friendly to our parser than before.\n",
"\n",
"Let's try an open-source model: `llama-v2-34b-code-instruct`."
]
},
{
"cell_type": "markdown",
"id": "102df41d-2c93-4ffc-a09a-4198ea5b6acc",
"metadata": {},
"source": [
"## Try with Llama 2\n",
"\n",
"`llama-v2-34b-code-instruct` is an open-source model intended to perform well on both code generation and general-purpose tasks.\n",
"Let's benchmark it."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "27cc37f1-2dc3-4d8e-a380-3c8296bf105a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import json\n",
"\n",
"from langchain.chat_models import ChatFireworks\n",
"from langchain.output_parsers.json import parse_json_markdown\n",
"\n",
"llama_prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\n",
" \"system\",\n",
" \"You are a data extraction bot tasked with extracting and inferring information from dialogues and generating tickets. Always respond \"\n",
" \"only with json based on the following JSON schema:\\n{schema}\",\n",
" ),\n",
" (\n",
" \"user\",\n",
" \"Generate a ticket from the following question-response pair:\\n\"\n",
" \"\\n{dialogue}\\n\\n\"\n",
" \"Remember, respond directly with this format:\\n\"\n",
" '{{\"{function_call}\": ...}}\\n'\n",
" \"RESPOND ONLY IN JSON THEN STOP.\",\n",
" ),\n",
" ]\n",
")\n",
"\n",
"prompt = llama_prompt.partial(\n",
" schema=task.schema.schema_json(), function_call=task.schema.schema()[\"title\"]\n",
")\n",
"\n",
"llm = ChatFireworks(\n",
" model=\"accounts/fireworks/models/llama-v2-34b-code-instruct\",\n",
" temperature=0,\n",
" model_kwargs={\"max_tokens\": 4000},\n",
")\n",
"\n",
"\n",
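"def parse_output(ai_message):\n",
"    content = ai_message.content\n",
"\n",
"    # strict=False lets json.loads accept control characters (e.g. raw newlines) inside strings\n",
"    def parser(text):\n",
"        return json.loads(text, strict=False)\n",
"\n",
"    try:\n",
"        parsed = parse_json_markdown(content, parser=parser)\n",
"        # Unwrap the {\"GenerateTicket\": ...} envelope requested in the prompt\n",
"        if isinstance(parsed, dict) and \"GenerateTicket\" in parsed:\n",
"            return {\"output\": parsed[\"GenerateTicket\"]}\n",
"        return {\"output\": parsed}\n",
"    except json.JSONDecodeError:\n",
"        # Fall back to the raw text when the reply is not valid JSON\n",
"        return {\"output\": content}\n",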
"\n",
"\n",
"fireworks_extraction_chain = format_run | prompt | llm | parse_output"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "266e2273-2fd7-42c2-986b-c08a07cbcc96",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'output': {'issue_summary': 'How to run Llama 2 locally',\n",
" 'question': {'question_category': 'Implementation Issues',\n",
" 'is_off_topic': False,\n",
" 'toxicity': 0,\n",
" 'sentiment': 'Neutral',\n",
" 'programming_language': 'cpp'},\n",
" 'response': {'response_type': 'Resolve Issue',\n",
" 'confidence_level': 5,\n",
" 'followup_actions': ['Please provide more information about the environment (OS, versions, etc.) and the specific issue you are experiencing.']}}}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result = fireworks_extraction_chain.invoke(\n",
" {\"question\": \"how do i run llama 2 locally?\", \"answer\": \"Llama.cpp of course.\"}\n",
")\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "9f4f4b39-d1b0-4f89-aa09-4fe261296dbc",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for project 'llama-v2-34b-code-instruct-5689' at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=dc2e0648-7e65-4d60-a149-15c24bca943b\n",
"\n",
"View all tests for Dataset Chat Extraction at:\n",
"https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6\n",
"[------------------------------------------------->] 27/27"
]
},
{
"data": {
"text/html": [
"<h3>Experiment Results:</h3>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" feedback.json_edit_distance feedback.json_schema \\\n",
"count 17.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.399687 0.333333 \n",
"std 0.097771 0.480384 \n",
"min 0.197279 0.000000 \n",
"25% 0.325069 0.000000 \n",
"50% 0.413203 0.000000 \n",
"75% 0.471366 1.000000 \n",
"max 0.552430 1.000000 \n",
"\n",
" feedback.toxicity_similarity feedback.sentiment_similarity \\\n",
"count 27.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.444444 0.444444 \n",
"std 0.506370 0.423659 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 0.000000 \n",
"50% 0.000000 0.500000 \n",
"75% 1.000000 1.000000 \n",
"max 1.000000 1.000000 \n",
"\n",
" feedback.confidence_level_similarity feedback.question_category \\\n",
"count 27.000000 27.000000 \n",
"unique NaN NaN \n",
"top NaN NaN \n",
"freq NaN NaN \n",
"mean 0.540741 0.074074 \n",
"std 0.439632 0.266880 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 0.000000 \n",
"50% 0.800000 0.000000 \n",
"75% 1.000000 0.000000 \n",
"max 1.000000 1.000000 \n",
"\n",
" feedback.off_topic_similarity \\\n",
"count 27.000000 \n",
"unique NaN \n",
"top NaN \n",
"freq NaN \n",
"mean 0.518519 \n",
"std 0.509175 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 1.000000 \n",
"75% 1.000000 \n",
"max 1.000000 \n",
"\n",
" feedback.programming_language_similarity error execution_time \n",
"count 27.000000 0 27.000000 \n",
"unique NaN 0 NaN \n",
"top NaN NaN NaN \n",
"freq NaN NaN NaN \n",
"mean 0.222222 NaN 4.738518 \n",
"std 0.423659 NaN 3.162978 \n",
"min 0.000000 NaN 3.224190 \n",
"25% 0.000000 NaN 3.595067 \n",
"50% 0.000000 NaN 3.744033 \n",
"75% 0.000000 NaN 4.211040 \n",
"max 1.000000 NaN 18.660901 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"llama_v2_test_run = client.run_on_dataset(\n",
" dataset_name=task.name,\n",
" llm_or_chain_factory=fireworks_extraction_chain,\n",
" evaluation=eval_config,\n",
" verbose=True,\n",
" project_name=f\"llama-v2-34b-code-instruct-{uid}\",\n",
"    project_metadata={\"arch\": \"llama-json\", \"model\": \"llama-v2-34b-code-instruct\"},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1b039225-01cf-481a-87a6-4e880e9b1dcd",
"metadata": {},
"source": [
"## Compare Results\n",
"\n",
"Let's look at the underlying results. Joining the runs into a single DataFrame lets us compare the models both in aggregate and example by example."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "6eb19db1-43b8-4866-a3d2-f211ba92ab8b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df = (\n",
" test_run.to_dataframe()\n",
" .join(claude_test_run.to_dataframe(), rsuffix=\"_claude\")\n",
" .join(claude_xsd_test_run.to_dataframe(), rsuffix=\"_claude_xsd\")\n",
" .join(llama_v2_test_run.to_dataframe(), rsuffix=\"_llama_v2\")\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "c292b4ed-8331-4068-82fa-7cea2725e24d",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" inputs.answer \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a Pour joindre les deux outputs, vous pouvez uti... \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe Hmm, I'm not sure. \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 To run Llama2 using pandas, you can follow the... \n",
"140a4819-0046-469d-b4df-8e747ddae112 To clear the conversation in ConversationalRet... \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 To perform the task of creating an app that in... \n",
"\n",
" inputs.question \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a je travail sur python. je souhaite joindre ces... \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe example for dalle agent \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 how do I run llama2 using pandas \n",
"140a4819-0046-469d-b4df-8e747ddae112 if Im useing ConversationalRetrievalChain how ... \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 I want to create an app which:\\n- chats with u... \n",
"\n",
" outputs.output \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a {'issue_summary': 'Joining two outputs in Pyth... \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe {'issue_summary': 'Example for DALL-E Agent', ... \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 {'issue_summary': 'Running Llama2 with Pandas'... \n",
"140a4819-0046-469d-b4df-8e747ddae112 {'issue_summary': 'Clearing Conversation in Co... \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 {'issue_summary': 'Building an app with Langch... \n",
"\n",
" reference.output \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a {'question': {'toxicity': 0, 'sentiment': 'Neu... \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe {'question': {'toxicity': 0, 'sentiment': 'Neu... \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 {'question': {'toxicity': 0, 'sentiment': 'Neu... \n",
"140a4819-0046-469d-b4df-8e747ddae112 {'question': {'toxicity': 0, 'sentiment': 'Neu... \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 {'question': {'toxicity': 0, 'sentiment': 'Neu... \n",
"\n",
" feedback.json_edit_distance \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0.089219 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0.171103 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0.594255 \n",
"140a4819-0046-469d-b4df-8e747ddae112 0.353261 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 0.562950 \n",
"\n",
" feedback.json_schema \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 1 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 1 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 1 \n",
"140a4819-0046-469d-b4df-8e747ddae112 1 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1 \n",
"\n",
" feedback.toxicity_similarity \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 0 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 0 \n",
"\n",
" feedback.sentiment_similarity \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 1.0 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 1.0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 1.0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 1.0 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1.0 \n",
"\n",
" feedback.confidence_level_similarity \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 1.0 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0.8 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 1.0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 1.0 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 0.8 \n",
"\n",
" feedback.question_category ... \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 1 ... \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0 ... \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0 ... \n",
"140a4819-0046-469d-b4df-8e747ddae112 0 ... \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1 ... \n",
"\n",
" feedback.json_edit_distance_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0.552239 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe NaN \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 NaN \n",
"140a4819-0046-469d-b4df-8e747ddae112 0.393643 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 0.436747 \n",
"\n",
" feedback.json_schema_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 1 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 0 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1 \n",
"\n",
" feedback.toxicity_similarity_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0.0 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0.0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0.0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 1.0 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1.0 \n",
"\n",
" feedback.sentiment_similarity_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0.5 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0.0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0.0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 0.5 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 0.5 \n",
"\n",
" feedback.confidence_level_similarity_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0.8 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0.0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0.0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 0.8 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1.0 \n",
"\n",
" feedback.question_category_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 0 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 0 \n",
"\n",
" feedback.off_topic_similarity_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 0 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 1 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1 \n",
"\n",
" feedback.programming_language_similarity_llama_v2 \\\n",
"23a81130-2ad9-46cf-ad27-46589bcea94a 1 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe 0 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 0 \n",
"140a4819-0046-469d-b4df-8e747ddae112 0 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 1 \n",
"\n",
" error_llama_v2 execution_time_llama_v2 \n",
"23a81130-2ad9-46cf-ad27-46589bcea94a None 3.981128 \n",
"598316ec-f5e2-4b4d-83a8-36adb18e12fe None 10.942758 \n",
"d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 None 3.628600 \n",
"140a4819-0046-469d-b4df-8e747ddae112 None 3.711707 \n",
"7b0a9dd9-68ce-41a1-9f9d-067d93175477 None 4.410890 \n",
"\n",
"[5 rows x 56 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(5)"
]
},
{
"cell_type": "markdown",
"id": "da665f3c-4ef6-474d-8ab5-284434060bec",
"metadata": {},
"source": [
"#### Aggregate metrics, side by side"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "b5b936c2-d676-4931-bb13-ec06ab55d401",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
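"# Note: `rsuffix` is only applied to overlapping column names. After `.add_suffix(\".gpt-4\")`\n",
"# nothing overlaps in the first join, so the Claude columns keep their bare names below.\n",
"df = (\n",
"    test_run.get_aggregate_feedback()\n",
"    .add_suffix(\".gpt-4\")\n",
"    .join(claude_test_run.get_aggregate_feedback(), rsuffix=\".claude\")\n",
"    .join(claude_xsd_test_run.get_aggregate_feedback(), rsuffix=\".claude_xsd\")\n",
"    .join(llama_v2_test_run.get_aggregate_feedback(), rsuffix=\".llama_v2\")\n",
")"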
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "1a151781-9c69-43c3-84d7-5617ee0e7d63",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import HTML, display\n",
"\n",
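"# Strip the per-model suffix to recover the metric names. The unsuffixed Claude columns\n",
"# (e.g. `feedback.json_schema`) collapse to a bare 'feedback' entry here.\n",
"feedback_columns = sorted(\n",
"    {col.rsplit(\".\", 1)[0] for col in df.columns if col.startswith(\"feedback.\")}\n",
")\n",
"\n",
"\n",
"def render_metric(df, metric):\n",
"    # Select every model's column for this metric and show only the mean/std rows\n",
"    sub_cols = [col for col in df.columns if col.startswith(metric)]\n",
"    display(HTML(f\"<h3>{metric.split('.')[-1]}</h3>\"))\n",
"    display(df[sub_cols][df.index.isin([\"mean\", \"std\"])])"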
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "97892d06-ac72-43fa-8e1e-ff33b284940d",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"['feedback',\n",
" 'feedback.confidence_level_similarity',\n",
" 'feedback.json_edit_distance',\n",
" 'feedback.json_schema',\n",
" 'feedback.off_topic_similarity',\n",
" 'feedback.programming_language_similarity',\n",
" 'feedback.question_category',\n",
" 'feedback.sentiment_similarity',\n",
" 'feedback.toxicity_similarity']"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feedback_columns"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "090284d7-29b6-4ea7-b193-ebc159fae143",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<h3>execution_time</h3>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" execution_time.gpt-4 execution_time execution_time.claude_xsd \\\n",
"mean 6.949585 10.556105 11.128319 \n",
"std 1.639494 1.790352 4.845637 \n",
"\n",
" execution_time.llama_v2 \n",
"mean 4.738518 \n",
"std 3.162978 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"render_metric(df, \"execution_time\")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "8f4cf5f5-dd75-4318-9bf4-25b63fa1b895",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<h3>feedback</h3>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" feedback.json_edit_distance.gpt-4 feedback.json_schema.gpt-4 \\\n",
"mean 0.283000 1.0 \n",
"std 0.181282 0.0 \n",
"\n",
" feedback.toxicity_similarity.gpt-4 feedback.sentiment_similarity.gpt-4 \\\n",
"mean 0.0 1.0 \n",
"std 0.0 0.0 \n",
"\n",
" feedback.confidence_level_similarity.gpt-4 \\\n",
"mean 0.940741 \n",
"std 0.093064 \n",
"\n",
" feedback.question_category.gpt-4 feedback.off_topic_similarity.gpt-4 \\\n",
"mean 0.555556 0.888889 \n",
"std 0.506370 0.320256 \n",
"\n",
" feedback.programming_language_similarity.gpt-4 \\\n",
"mean 0.592593 \n",
"std 0.500712 \n",
"\n",
" feedback.json_edit_distance feedback.json_schema ... \\\n",
"mean 0.371950 0.777778 ... \n",
"std 0.108628 0.423659 ... \n",
"\n",
" feedback.off_topic_similarity.claude_xsd \\\n",
"mean 0.0 \n",
"std 0.0 \n",
"\n",
" feedback.programming_language_similarity.claude_xsd \\\n",
"mean 0.518519 \n",
"std 0.509175 \n",
"\n",
" feedback.json_edit_distance.llama_v2 feedback.json_schema.llama_v2 \\\n",
"mean 0.399687 0.333333 \n",
"std 0.097771 0.480384 \n",
"\n",
" feedback.toxicity_similarity.llama_v2 \\\n",
"mean 0.444444 \n",
"std 0.506370 \n",
"\n",
" feedback.sentiment_similarity.llama_v2 \\\n",
"mean 0.444444 \n",
"std 0.423659 \n",
"\n",
" feedback.confidence_level_similarity.llama_v2 \\\n",
"mean 0.540741 \n",
"std 0.439632 \n",
"\n",
" feedback.question_category.llama_v2 \\\n",
"mean 0.074074 \n",
"std 0.266880 \n",
"\n",
" feedback.off_topic_similarity.llama_v2 \\\n",
"mean 0.518519 \n",
"std 0.509175 \n",
"\n",
" feedback.programming_language_similarity.llama_v2 \n",
"mean 0.222222 \n",
"std 0.423659 \n",
"\n",
"[2 rows x 32 columns]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<h3>confidence_level_similarity</h3>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" feedback.confidence_level_similarity.gpt-4 \\\n",
"mean 0.940741 \n",
"std 0.093064 \n",
"\n",
" feedback.confidence_level_similarity \\\n",
"mean 0.970370 \n",
"std 0.072403 \n",
"\n",
" feedback.confidence_level_similarity.claude_xsd \\\n",
"mean 0.970370 \n",
"std 0.072403 \n",
"\n",
" feedback.confidence_level_similarity.llama_v2 \n",
"mean 0.540741 \n",
"std 0.439632 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<h3>json_edit_distance</h3>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" feedback.json_edit_distance.gpt-4 feedback.json_edit_distance \\\n",
"mean 0.283000 0.371950 \n",
"std 0.181282 0.108628 \n",
"\n",
" feedback.json_edit_distance.claude_xsd \\\n",
"mean 0.394232 \n",
"std 0.117880 \n",
"\n",
" feedback.json_edit_distance.llama_v2 \n",
"mean 0.399687 \n",
"std 0.097771 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<h3>json_schema</h3>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" feedback.json_schema.gpt-4 feedback.json_schema \\\n",
"mean 1.0 0.777778 \n",
"std 0.0 0.423659 \n",
"\n",
" feedback.json_schema.claude_xsd feedback.json_schema.llama_v2 \n",
"mean 0.518519 0.333333 \n",
"std 0.509175 0.480384 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<h3>off_topic_similarity</h3>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
" feedback.off_topic_similarity.gpt-4 feedback.off_topic_similarity \\\n",
"mean 0.888889 0.0 \n",
"std 0.320256 0.0 \n",
"\n",
" feedback.off_topic_similarity.claude_xsd \\\n",
"mean 0.0 \n",
"std 0.0 \n",
"\n",
" feedback.off_topic_similarity.llama_v2 \n",
"mean 0.518519 \n",
"std 0.509175 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"programming_language_similarity
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" feedback.programming_language_similarity.gpt-4 | \n",
" feedback.programming_language_similarity | \n",
" feedback.programming_language_similarity.claude_xsd | \n",
" feedback.programming_language_similarity.llama_v2 | \n",
"
\n",
" \n",
" \n",
" \n",
" mean | \n",
" 0.592593 | \n",
" 0.444444 | \n",
" 0.518519 | \n",
" 0.222222 | \n",
"
\n",
" \n",
" std | \n",
" 0.500712 | \n",
" 0.506370 | \n",
" 0.509175 | \n",
" 0.423659 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" feedback.programming_language_similarity.gpt-4 \\\n",
"mean 0.592593 \n",
"std 0.500712 \n",
"\n",
" feedback.programming_language_similarity \\\n",
"mean 0.444444 \n",
"std 0.506370 \n",
"\n",
" feedback.programming_language_similarity.claude_xsd \\\n",
"mean 0.518519 \n",
"std 0.509175 \n",
"\n",
" feedback.programming_language_similarity.llama_v2 \n",
"mean 0.222222 \n",
"std 0.423659 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"question_category
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" feedback.question_category.gpt-4 | \n",
" feedback.question_category | \n",
" feedback.question_category.claude_xsd | \n",
" feedback.question_category.llama_v2 | \n",
"
\n",
" \n",
" \n",
" \n",
" mean | \n",
" 0.555556 | \n",
" 0.481481 | \n",
" 0.370370 | \n",
" 0.074074 | \n",
"
\n",
" \n",
" std | \n",
" 0.506370 | \n",
" 0.509175 | \n",
" 0.492103 | \n",
" 0.266880 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" feedback.question_category.gpt-4 feedback.question_category \\\n",
"mean 0.555556 0.481481 \n",
"std 0.506370 0.509175 \n",
"\n",
" feedback.question_category.claude_xsd \\\n",
"mean 0.370370 \n",
"std 0.492103 \n",
"\n",
" feedback.question_category.llama_v2 \n",
"mean 0.074074 \n",
"std 0.266880 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"sentiment_similarity
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" feedback.sentiment_similarity.gpt-4 | \n",
" feedback.sentiment_similarity | \n",
" feedback.sentiment_similarity.claude_xsd | \n",
" feedback.sentiment_similarity.llama_v2 | \n",
"
\n",
" \n",
" \n",
" \n",
" mean | \n",
" 1.0 | \n",
" 0.925926 | \n",
" 0.907407 | \n",
" 0.444444 | \n",
"
\n",
" \n",
" std | \n",
" 0.0 | \n",
" 0.181007 | \n",
" 0.197924 | \n",
" 0.423659 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" feedback.sentiment_similarity.gpt-4 feedback.sentiment_similarity \\\n",
"mean 1.0 0.925926 \n",
"std 0.0 0.181007 \n",
"\n",
" feedback.sentiment_similarity.claude_xsd \\\n",
"mean 0.907407 \n",
"std 0.197924 \n",
"\n",
" feedback.sentiment_similarity.llama_v2 \n",
"mean 0.444444 \n",
"std 0.423659 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"toxicity_similarity
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" feedback.toxicity_similarity.gpt-4 | \n",
" feedback.toxicity_similarity | \n",
" feedback.toxicity_similarity.claude_xsd | \n",
" feedback.toxicity_similarity.llama_v2 | \n",
"
\n",
" \n",
" \n",
" \n",
" mean | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.444444 | \n",
"
\n",
" \n",
" std | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.506370 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" feedback.toxicity_similarity.gpt-4 feedback.toxicity_similarity \\\n",
"mean 0.0 1.0 \n",
"std 0.0 0.0 \n",
"\n",
" feedback.toxicity_similarity.claude_xsd \\\n",
"mean 1.0 \n",
"std 0.0 \n",
"\n",
" feedback.toxicity_similarity.llama_v2 \n",
"mean 0.444444 \n",
"std 0.506370 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for metric in feedback_columns:\n",
" render_metric(df, metric)"
]
},
{
"cell_type": "markdown",
"id": "d1641d5b-362d-4aae-9f42-ccb4726b8229",
"metadata": {},
"source": [
"## Next Steps\n",
"\n",
"Try it out yourself! You can see some additional experiments on Open Source models in [this repo](https://github.com/hinthornw/llama-extraction)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}