Video: Making the Most of Your AI Tools: How to Build Effective AI Prompts and Playbooks | Duration: 1752s | Summary: Making the Most of Your AI Tools: How to Build Effective AI Prompts and Playbooks | Chapters: Welcome and Introduction (0s), Introduction to Prompting (1s), Effective AI Prompting (76s), Benchmarking AI Playbooks (481s), Refining Community Screens (1255s), Closing Q&A Session (1626s)
Transcript for "Making the Most of Your AI Tools: How to Build Effective AI Prompts and Playbooks": Hi, everyone. Thanks for joining today. As Jennifer said, my name is Maya Lesh, and I'm a senior legal knowledge engineer at Agiloft. Prior to making the move into legal engineering, I practice as a corporate lawyer, primarily representing start ups and high growth companies from inception through to capital raising and exits. Today, as a legal knowledge engineer, I work to build out AI powered tools and features within the Agiloft system. So today, like Jennifer said, our focus is going to be on how to prompt effectively, and I'll demonstrate this using two of our tools, which are PromptLab and Screens. I will leave some time for questions at the end, so please feel free to send them through as they come up. And if there are any that we can't get to today, you can always send them through to us over at Agiloft, and we will get back to you after the webcast. And on that note, we will get started. Okay. So right here, we are in the setup page of our prompt log tool, which is where you can first create and test prompts or view and edit prebuilt templates. Now, Prompthub is our tool that enables users to create and deploy custom prompts or agents that automate high volume contract tasks. One really important point that I like to highlight is, most often, if we're seeing poor results from AI, it's not because of the technology, but it's typically because the AI isn't given rules that it can actually follow. So what we're going to cover today is how to write rules that the AI can use. I will navigate to what we'll call our basic prompt. And a basic prompt might be something like summarize the risks in this contract. Now if we want to use this for a risk summary, there's nothing inherently wrong with it, but it is vague and it's unspecific. So a prompt like this means that the AI will have to figure out how we want it to answer the question and what we're expecting. So over a broad, larger scale contract review, we generally won't see consistent outputs using this sort of, prompt. So I will run the prompt here so that we can see on one of our documents what this looks like. And looking at our results, you can see that we do get a summary, but exactly what this summary includes is not always going to be consistent. So it does involve the AI figuring out what it should summarize for us, what it thinks the key risks are, and what format we'll want it on. And, again, while this might not be a problem on a one off basis, as you run this across more and more contracts, the results generally won't be consistent and will probably be less practical or fit for use. So now if we move to a stronger prompt, we can have a look here. And we see that in this version, we give a clear task, an expected output format, and context. You may have heard the term context engineering, and this means giving the AI the right background so that it can apply your rules consistent consistently. As you work with prompting, these levers can be played around with, and you can see and understand how each of these levers will affect your results. And, again, that's the task, the expected output format, and the context that become really important. This time, if we run the prompt on our document and generate the results here, we can see that the results are more actionable, and they're specifically what we asked for, both in terms of format and content. 
So prompting in this way, with specific instructions and an expected output, leads to consistent and reliable results. When instructions are broad, as in our previous example, the output will be broad; when instructions are precise, the output will be precise. You may also have heard the phrase "garbage in, garbage out," meaning the output you get from the AI will only be as good as the input it's given. So when thinking about effective prompting, a good rule of thumb is to write your prompt the way you'd instruct a junior team member. Prompting to build agents, or using an LLM at all, will always involve including these elements to some degree: making sure the task, the context, and the format are clearly articulated in the prompt. That's a general statement about what should be included, and the specifics might differ depending on the use case, but it's a good thing to keep in mind whenever we're prompting an LLM, whether for agents or anything else.

An easy way to see how adjusting those elements affects our results is prompting for different audiences. Here we have a prompt similar to the one before, asking for a risk summary, but this time for a procurement team rather than a legal team. As you can see in this prompt, we're asking for a risk summary for a different audience: before we had a legal team, and now we have a procurement manager. We're also asking it to use plain business language, avoiding legal jargon, and we're asking for a three-bullet summary instead of a five-bullet summary, which really targets the kind of output we want to see. We'll run that same document through the prompt again, and the output once again reflects exactly what we asked for: a different tone and a different focus than before, and three bullets instead of five. This kind of tailored prompting leads to tailored results. Once you've refined your prompt, you can save it as an agent and set up actions to run based on specific triggers. That way, instead of rewriting prompts each time, your team can reuse these prompt templates across contracts at scale, which is where the time savings really come in.

The last piece I want to briefly cover while we're still in PromptLab is benchmarking, which is how you validate and improve your AI results. You saw how I tested a prompt against a given contract; we started with a custom prompt here, but we also have the ability to start with an out-of-the-box template. We modify those prompts as needed, constantly testing them against our contracts and assessing the results. If I were benchmarking here, I would simply keep adding contracts, running the same prompt against them, and seeing how it performs, all the while keeping track of scores and performance. When benchmarking, the key is to start with contracts you know well, test on a mix of easy and tricky examples with varied wording and edge cases, and keep refining your prompts until the results are consistent.
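As a rough illustration of what that benchmarking bookkeeping amounts to, here's a minimal sketch in plain Python. It is not PromptLab's API: run_prompt is just a placeholder for however you call your AI tool, and it assumes a simplified setup where each contract in the benchmark set has been labeled by a reviewer. The scores it returns are the accuracy, precision, and recall measures I'll touch on in a moment.

```python
# A minimal benchmarking sketch, not PromptLab's API. run_prompt is a
# placeholder for however you call your AI tool; each benchmark contract is
# labeled by a reviewer as containing the issue or not, and the AI either
# flags it or doesn't.

def run_prompt(prompt: str, contract_text: str) -> bool:
    """Placeholder: return True if the AI flags the issue in this contract."""
    raise NotImplementedError("Call your AI tool here.")

def benchmark(prompt: str, labeled_contracts: list[tuple[str, bool]]) -> dict[str, float]:
    """labeled_contracts: (contract_text, issue_actually_present) pairs drawn
    from documents you already know well, mixing easy and tricky examples."""
    results = [(run_prompt(prompt, text), present) for text, present in labeled_contracts]
    tp = sum(1 for flagged, present in results if flagged and present)
    fp = sum(1 for flagged, present in results if flagged and not present)
    fn = sum(1 for flagged, present in results if not flagged and present)
    tn = sum(1 for flagged, present in results if not flagged and not present)
    return {
        "accuracy": (tp + tn) / len(results),             # right vs. wrong overall
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # when it flags something, is it correct?
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # did it catch everything it should?
    }
```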
Looking at our output here, we have the result on the new document, and what we're checking is that the prompt is still performing as expected, comparing it to how it performed previously, and we continue this throughout the process until we're happy with the results. When it comes to actual scoring, overall accuracy is generally enough for most use cases; that's simply whether the AI was right or wrong. But if you want to go deeper, you can also look at precision, which asks: when the AI says something, is what it said correct? And recall, which asks: did the AI catch everything it should have? Depending on your use case, overall accuracy might be all you need, but measuring precision and recall gives you more granular detail. Recall often tends to be more important, to ensure the AI isn't missing anything. Looking at these measures is a good way to narrow down where accuracy needs improvement: if the AI is often missing things, you'll want to adjust your prompt differently than if it's constantly getting things wrong; and if it's getting things wrong, you may need to explain the concept in more detail in your prompt.

Now I'll jump over to building AI playbooks in Screens, and we can look at how prompting works in that setting. One important thing to note is that anyone with subject matter expertise can build a playbook here. Just like PromptLab, it all works off natural language prompting, and no technical expertise is necessary. Here I'm in our Screens web app, which is where we create and iterate on our playbooks, which within Agiloft we call screens. We also have our Word add-in, which lets you take advantage of many other features. I should add that this is the third part of our Agiloft series on smarter contracts. If you haven't already seen parts one and two, they go into much more detail on the capabilities of both PromptLab and Screens and everything you can do with them, so I highly recommend checking them out after this webcast.

Back to Screens. For today, I've created a simple screen with two standards: a basic standard and a stronger version of that standard. Once we have our entire screen created, which again is our AI playbook, we can run it against our contracts and assess what requires negotiation, and there are a lot of other great features available once we run a screen on our documents, like redlining and risk summaries. But in terms of the prompting piece, within a screen we have both standards and questions. Standards are pass/fail criteria that a contract must meet, while questions are open-ended and give us additional insight or information about the contract. So a standard might be something like "the governing law of the contract must be Delaware," while a question might be "what is the governing law of the contract?" If we look at the standards set up in this screen, the basic standard for limitation of liability says that the contract should have a reasonable liability cap. This is vague, and because "reasonable" is subjective, it can work, but the information we've given the AI is very basic and leaves a lot of room for interpretation.
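As a rough illustration of that standard-versus-question distinction, here's how the governing-law examples might be represented. The field names are my own, not the Screens schema.

```python
# Illustrative only: the standard vs. question distinction from the talk,
# using the governing-law example. Field names are mine, not the Screens schema.

governing_law_standard = {
    "kind": "standard",   # pass/fail criterion the contract must meet
    "rule": "The governing law of the contract must be Delaware.",
}

governing_law_question = {
    "kind": "question",   # open-ended, returns information rather than a verdict
    "ask": "What is the governing law of the contract?",
}
```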
That basic standard is similar to the basic risk-summarization prompt we discussed earlier in PromptLab. Now compare it to a stronger standard, where we're very specific about the liability cap we want: we say exactly what it must be, at least one times the annual fees payable. I've also added additional guidance and instructions in the standard detail section to further explain to the AI what we're looking for. When prompting, clarity is key, so adding guidance to steer the AI can really help improve your prompts. As you can see here, I spelled out exactly what I consider a passing standard and exactly what constitutes a fail. Each specific scenario is bucketed into the appropriate pass or fail category, so the AI knows exactly what action to take if it comes across any of these scenarios. Effective prompting for AI playbook building means keeping each rule focused on a single issue, using objective pass/fail thresholds like these, and keeping instructions short and structured.

Now we can look at two contracts with varying liability caps and how they perform when the screen is run on them. Because I ran this screen in bulk on both of my contracts, we can see the results for both contracts for each standard. In the first standard, both contracts pass, as indicated by the symbol here. In the second standard, our more detailed, specific standard, one contract passes while the other fails. If we go in and look at the results for the first standard, we can see that both contracts passed, and we can audit the results to see the AI's reasoning, the source language in the contract that the reasoning comes from, and why it answered the way it did. Looking at the reason here, the AI said that the liability cap is set at 50% of the fees or $100,000, whichever is greater, and it considered that reasonable, so it passed this contract. Looking at our other document, the reasoning is very similar: it identifies the liability cap, which this time is equal to the fees paid, and finds that reasonable as well. So both contracts passed because both liability caps were considered reasonable. Again, there's nothing necessarily wrong with this, and you might agree that both are reasonable. It might work out like this, and it is a simple example. But when running a vaguer prompt like this across a larger set of documents, there will likely come a point where you don't agree with the AI's assessment of reasonableness, which is why it's important to be entirely clear about exactly what you want and leave no room for interpretation or subjectivity. If we go back to our stronger standard, we see that one contract failed and the other passed. Looking at the reasoning, we see a clear reason the document failed: the liability cap is, again, 50% of the fees paid, and we've given the AI specific guidance that the cap must be at least one times the annual fees paid, so it doesn't meet the requirement and fails. The other contract passes because it aligns with the specific requirement we set out at the start. When you spell out exactly what's acceptable and what isn't, you can imagine how you get consistency when looking across a wide range of documents and at a larger scale of contract review.
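As an illustration, here's roughly how that stronger standard could be spelled out so that every scenario falls into an explicit pass or fail bucket. Again, the field names are my own, not the actual Screens schema.

```python
# Illustrative only: spelling out the stronger liability-cap standard so every
# scenario lands in an explicit pass or fail bucket. Field names are mine,
# not the actual Screens schema.

liability_cap_standard = {
    "name": "Limitation of liability - minimum cap",
    "rule": "The liability cap must be at least 1x the annual fees payable.",
    "pass_if": [
        "The cap is expressed as 1x annual fees payable or higher.",
        "The cap is a fixed amount that equals or exceeds the annual fees payable.",
    ],
    "fail_if": [
        "The cap is below 1x annual fees, e.g. 50% of the fees paid.",
        "The cap is a fixed amount lower than the annual fees payable.",
    ],
}
```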
So, again, being clear and direct when prompting for AI playbooks is critical to getting the best results, because the clearer and more measurable your criteria, the more useful the results will be. A helpful framework for thinking about what makes a good prompt for a standard is to make sure you have clear spotting, evaluation, and decision rules for the AI to follow. Spotting means being clear on what the issue or clause itself is; here, we're looking at the liability cap. The evaluation piece is guiding the AI on exactly how it should evaluate the clause by giving it the relevant context and telling it what counts as a pass or a fail. And finally, your standard needs to clearly allow the AI to decide whether to pass or fail the clause based on the rules you've given it. Using this framework will help you create solid AI playbooks by structuring how you want clauses to be spotted, evaluated, and decided. Consistent formatting between standards also makes it easier to benchmark results and build playbooks.

Now, for benchmarking within Screens, the same general rules we discussed in PromptLab apply across all the tools, but the manner of benchmarking may differ. As you saw when I was checking our responses, we expanded and audited each result to see exactly where it comes from in the contract. Once we look at the reason and the source text, we can say whether it's right or wrong with a simple thumbs up or thumbs down. This process is how you move from experimenting with the AI to developing final playbooks that are reliable: you run your contracts, check all of your outputs, refine as needed, and continue to test. And while the same rules apply, remember that it's important to choose contracts representative of both pass and fail cases for your different standards, and to include contracts with varied language in your benchmarking set when it comes to answering questions. Doing this when you benchmark will help you properly assess all of your results.

One other great feature of Screens is our community, which is a library of expert playbooks that you can run directly as is, or customize. On this page, you can see a list of currently available screens. You can scroll through the different screens under each of the various categories and select one that fits your needs or purpose. Maybe that's this screen here, eight deal breakers in tech contracts. We can look at the description, and if it meets our needs, we can use it: all we have to do is click Get Screen, and it loads into our workspace in a few seconds. Looking through this screen, we know it includes eight standards, and we can review what those are. Maybe we're happy with most of them, but we come to this IP infringement indemnity standard and decide it's something we want to modify. Modifying a standard from a community screen is very simple: we just go in and edit the specific standard and include whatever modifying language we want. Maybe, in addition to what we have here, we want to require a defense obligation as well, so the counterparty must indemnify and defend. That's how we'd update our standard, and then we'd again want to include more specific, clarifying guidance.
So we have a pass under these circumstances, and we'd say fail if the contract excludes defense obligations. Very simple to modify. You can also see that we have the ability to instruct the AI on how to handle the standard when the contract is silent. In our case, we want to ensure the contract includes this language, so we select always fail. We have our risk settings as well; I won't change those for our purposes here, and we just save the modified prompt. Then, if we're happy with everything else, all we have to do is run the playbook on our contract. While this runs, note that what I included in that prompt keeps that framework in mind: I'm clearly telling the AI what it's looking for, how to evaluate it against my rules, setting those rules, and then telling it to decide, based on those rules, whether to pass or fail the standard. As I did just now, we can run the screen on a single contract at a time, or, as we saw before, we can do a bulk run across a larger number of contracts. Either way, you get results for every standard and question on every document you run the screen on, and that's how you can start your benchmarking.

Looking at our results, we see the results for all of the standards, including our modified standard. Once you refine a community screen, the next step is the same: test it against real contracts to ensure it's reliable before you roll it out. The auditing feature I spoke about before really helps with this. I just quickly check the reasoning and the source language, and if it looks correct, I give it a thumbs up. When I do that, the accuracy scores automatically update, and we also see the number of documents the standard has been validated against. If it was wrong and I hit thumbs down, we're prompted to refine the standard, and we can make those edits easily in the window that appears, whether that's adding more detail or splitting it into separate standards. We make those changes right here, save the prompt, and rerun. So when benchmarking, I just go through and read all of my standards, checking the reasoning and the source text and hitting thumbs up or thumbs down depending on whether each result is right or wrong.

The key takeaway from today is that the clearer you are, the better the AI performs. Whether we're using Screens or PromptLab, it comes down to the same principle: give the AI clear rules so that it can spot, evaluate, and decide the same way you would. Prompts should be precise, measurable, and broken into smaller, digestible steps. And before you roll content out widely, always test and audit it to make sure it performs as expected. This process makes contract reviews faster, more consistent, and more reliable.
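To tie that together, here's a rough sketch of the modified IP indemnity standard written out along spot, evaluate, and decide lines. The wording is illustrative only; it's not a prebuilt community screen standard.

```python
# Illustrative only: the modified IP infringement indemnity standard written
# out along the spot / evaluate / decide framework. The wording is mine, not
# a prebuilt community screen standard.

IP_INDEMNITY_STANDARD = """\
Spot: Locate the intellectual property infringement indemnity, including any
related defense obligations.

Evaluate: Check whether the counterparty must both indemnify and defend us
against third-party IP infringement claims.
- PASS if the clause includes both an indemnity and a defense obligation.
- FAIL if the contract excludes defense obligations.
- If the contract is silent on this issue, treat it as a FAIL.

Decide: Return PASS or FAIL, quote the source language relied on, and give a
one-sentence reason.
"""
```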
So I will stop sharing, and I'll now have a look at some of your questions. Okay, I see one here that asks: are these example prompts shared on community or the wiki? These aren't shared prompts; they're just ones I created for the purposes of the demo, though they are similar to a number of the templates we have out of the box. Our risk summarization template, for example, is very close to the prompts I've used today. I have another question here, which is how many contracts to benchmark. This really depends on your use case. What I'd recommend is starting with somewhere between five and ten contracts and seeing the results, seeing how the prompt performs. More important than the number is the variety of documents: make sure you're getting a wide variety of language so you can properly assess whether your prompt works across both standard language and edge-case language. It looks like those are the only questions we have, so unless anyone wants to fire off any others real quickly, I'll close this off. Thank you, everyone, for coming. If any questions come up after the fact, feel free to send them to us at Agiloft, and we'll get back to you after this webinar. Thanks again for joining, and I hope you have a wonderful rest of your day.