Artificial Intelligence For Enterprise Applications

Bill Appleton
13 min read · Aug 24, 2023

Millions of people are using the ChatGPT web service from OpenAI. Businesses use it to write emails, and high school kids use it to write essays. But when it comes to enterprise software, I think most companies are having trouble getting tangible benefits and consistent results from this new technology. Artificial Intelligence looks incredibly promising, but enterprise software must be robust, flexible, compliant, secure, and reliable. The new generation of AI services can perform real work, but how exactly?

As an early adopter of AI and a veteran software developer, I have travelled down this road and have some practical suggestions on how to integrate AI into an existing enterprise application. This whitepaper discusses my experience building a Prompt Engineering Platform with OpenAI for our Salesforce org management product, Metazoa Snapshot. I cover the various pitfalls and practical strategies discovered along the way.

OpenAI Timeline

Starting out in 2020, GPT 3.0 used the Completion API. But in 2022, GPT 3.5 introduced the Chat API, which was better at steering the language model and could produce more useful work. The maximum token limits also increased from 4K to 16K, and GPT 4.0 raises the limit to 32K. The number of tokens places a fundamental limit on the tasks that a model can perform. A single word might consume a few tokens, so 16K is not a lot of room. Both the chat question and answer must fit inside the token limit. For our project we started with GPT 3.5, upgraded to the 16K version, and are currently testing GPT 4.0 as the next step.

Sequential Prediction

Much has been written about the ability of Large Language Models (LLM) to predict the next word in a sentence. The basic strategy is to train a neural net on lots of text so that you have a probability distribution that can predict the next word. And then the next word, and the next. This strategy worked way better than anyone initially expected: suddenly the Completion API could finish sentences, create paragraphs, and even write entire documents when started off in the right direction.

When you construct a prompt, each subsequent word is surfing along on this probability wave. If the probabilities are high, then the prompt will deliver high quality results. But if you suddenly introduce an unlikely word like “alligator,” the wave will crash. The model will start casting about for words with lower probability. This is where hallucinations can happen. Some topics are more obscure than others, and if your prompt really is about alligators, then that’s the right word to use. At any rate, the probability wave is a good model to have in mind when writing prompts. This also explains why the LLM is so good with code. Simply put, code is more structured than language.

Another ramification of LLM architecture is that prompts are inherently sequential. If you instruct the language model to do something, and then tell it to do something else, and then add an additional requirement about the first thing, the model will likely become confused and ignore that last statement. Organize your prompt to be sequential in time for the best results. In the example prompt below, Step 3 is likely to be ignored because the LLM has already completed Steps 1 and 2.

Step 1) Create the XML element “firstName”
Step 2) Create the XML element “lastName”
Step 3) Don’t Create Elements If They Already Exist
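One practical fix is to state the constraint before the steps it governs:

Step 1) Do not create an XML element if it already exists
Step 2) Create the XML element “firstName”
Step 3) Create the XML element “lastName”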

Thinking about the LLM architecture, you can also see why AI is so much more powerful than search. When you search, a single phrase is matched against the Internet, and prioritized links are returned. But when the language model generates a response, every single word is matched against the Internet, and the probability distribution guides the result. In effect, far more search results are baked into the answer generated by AI.

Steering

Because of the sequential nature of a prompt, the Chat API starts with a “system” message that sets the context for the rest of the prompt. After that, the “user” and “assistant” messages alternate back and forth. This reminds me of the Socratic method for reaching true knowledge through dialectic and debate. The system message should be used to provide the “steering” for your prompt. For example, in our Org Management application we often use the following system message: “You are a helpful expert on the Salesforce Metadata API.” The rest of the prompt will continue to flow in a natural way from that starting point.
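For reference, here is a minimal sketch of that steering pattern using the openai Python package (the 0.x-style ChatCompletion API); the model name and question are placeholders, not our actual prompt:

```python
import openai  # pip install openai (0.x-style API)

openai.api_key = "YOUR_API_KEY"  # assumption: supplied by your own configuration

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",   # the 16K model discussed above
    temperature=0,               # minimize randomness for enterprise work
    messages=[
        # The system message provides the "steering" for the prompt
        {"role": "system", "content": "You are a helpful expert on the Salesforce Metadata API."},
        # The user message carries the actual question
        {"role": "user", "content": "What metadata types are involved in deploying a Custom Object?"},
    ],
)

print(response["choices"][0]["message"]["content"])
```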

The screenshot above shows the Metadata Studio prompt engineering platform we developed for our Snapshot product. The end user can select a shared prompt at left and click the purple submit button at the bottom of the screen. The results are presented after that. A prompt engineer can click on the next tab to design a new prompt. The picture illustrates the system message in red, followed by an example question and an example answer. The real user question appears in the open panel below, marked with the green icon. This is an example of “Few-Shot Learning,” which we discuss next.

Few-Shot Learning

You might have heard of Few-Shot Learning (FSL). People also talk about Zero-Shot or One-Shot Learning, which is where the term comes from. The dialectic design of the Chat API opens the possibility of using artificially constructed questions and answers to train the model. FSL enables a pre-trained model to generalize over new categories of data with meta-learning. In effect, you set the context, and then give the model an example question, an example answer, and then the real question.

If you can give the language model a puzzle and all the pieces, then meta-learning becomes an effective way to solve entirely new problems. In my experience this is a very effective technique for enterprise applications, especially when the LLM needs to return results in a specific format. Otherwise, the LLM will try to figure out what kind of answer you are looking for and may return arbitrary results, such as a written report, generated code, or raw data.
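Here is a sketch of how the example question and answer slot into the Chat API message list; the field names and XML below are illustrative, not taken from our production prompts:

```python
FENCE = "`" * 3  # triple backticks, kept out of the literal for readability

# The example answer demonstrates the exact output format we expect back
example_answer = (
    f"{FENCE}xml\n"
    "<fields>\n"
    "    <fullName>Region__c</fullName>\n"
    "    <label>Region</label>\n"
    "    <type>Text</type>\n"
    "    <length>80</length>\n"
    "</fields>\n"
    f"{FENCE}"
)

few_shot_messages = [
    {"role": "system", "content": "You are a helpful expert on the Salesforce Metadata API."},
    # Example question: shows the model the kind of request it will receive
    {"role": "user", "content": "Create a text field named Region on the Account object."},
    # Example answer: demonstrates the format to follow
    {"role": "assistant", "content": example_answer},
    # The real question comes last and inherits the demonstrated format
    {"role": "user", "content": "Create a text field named District on the Account object."},
]
```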

Be aware that in situations where the LLM already has a lot of training, FSL can be counterproductive. For example, GPT 3.5 already knows a lot about Apex Classes. If you provide FSL examples of Apex Classes, then the LLM will follow those provided patterns and may neglect other more comprehensive training about Apex. Be sure that is what you want. Always choose your FSL examples carefully.

Prompt Integration

Another ramification of the LLM architecture is that the system uses freeform text input and output for the questions and the answers. Think of the LLM as an incredible new kind of text parser with amazing flexibility. Because it leverages massive training from the Internet, you can throw lots of junk at the parser and it still works. The main reason ChatGPT became so popular so fast is that everybody can work with text.

But in the enterprise, the text-only model presents a lot of challenges. Companies need to present structured data as inputs and outputs. Sometimes enterprise data is gigantic in size and runs up against the token limits. Another issue is that the results generated by the LLM are probabilistic in nature. The Chat API has a “temperature” parameter that controls how random the results are. Even with the temperature set to zero, there can be unexpected results at times.

Any enterprise implementation of AI is going to have to deal with these issues. You will need the ability to seamlessly include enterprise data in the prompt. In fact, you probably need to give the end user the interactive ability to select targeted enterprise data sources for inclusion. This is especially true if the prompt is separate from any specific application context. Then when the LLM returns results, that text will need to be analyzed for new documents that have been created. These documents should be validated and potentially committed back to selected services and databases.

In our product, we ended up integrating OpenAI with Salesforce both coming and going. First off, Salesforce metadata can be seamlessly integrated into the prompt depending on design requirements from the Prompt Engineer. Then the LLM performs all kinds of transformations, modifications, and reporting on that metadata. After that, the results are scanned for new documents, and these can be validated against the Salesforce org with a test deployment. If there are errors, then these can be returned to the chat interface for resolution. If there are not any errors, then the newly created documents can be deployed to the org or saved to the desktop as a metadata file for Salesforce DX.

Markdown Blocks

Right about now you might be wondering how all this structured information is somehow magically combined with the raw text required by the Chat API. The answer is Markdown Blocks. Markdown is a lightweight markup language that you can use to add formatting elements to plain text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular (and simple) markup languages. This is why, in example prompts, you sometimes see triple backticks followed by a data type. Here is an example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<CustomObject>

</CustomObject>
```

In this case we are using triple backticks and the “xml” data type to delimit some Custom Object information from Salesforce. When working with the Chat API, the returned data type is often wrong. For example, Apex Classes are often marked as “java” and Apex Pages might be marked as “html.” This brings up a related requirement of being able to identify the actual type of information returned. In our application, we have custom handlers to distinguish (for example) Apex Triggers from Apex Classes and Apex Pages. This issue is one of many complications caused by unstructured input and output.
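As a sketch of what such a handler might look like (the detection rules here are simplified assumptions, not our production logic), you can pull the fenced blocks out of the response and inspect the content instead of trusting the declared language tag:

```python
import re

# Match fenced blocks: three backticks, an optional language tag, a body, three backticks
BLOCK = re.compile(r"`{3}(\w*)\n(.*?)`{3}", re.DOTALL)

def identify_salesforce_type(declared: str, body: str) -> str:
    """Guess the real asset type; the declared tag ("java", "html") is often wrong."""
    if "<apex:page" in body:
        return "ApexPage"
    if re.search(r"\btrigger\s+\w+\s+on\s+\w+", body):
        return "ApexTrigger"
    if re.search(r"\bclass\s+\w+", body):
        return "ApexClass"
    if body.lstrip().startswith("<?xml"):
        return "Metadata XML"
    return declared or "unknown"

def extract_documents(response_text: str):
    """Yield (type, body) pairs for every fenced block in the LLM response."""
    for declared, body in BLOCK.findall(response_text):
        yield identify_salesforce_type(declared, body), body
```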

If you are going to use Markdown in your enterprise prompt management system, then you need to do it consistently. If there are FSL examples or other grounding data, they need to follow the same Markdown pattern that you are expecting to receive. If you do not follow the pattern, then the LLM will probably not follow it either. You can also explicitly tell the LLM about the pattern in the system message. For example, we add the command: “You must enclose any code or data that is returned in a markdown block. The start of the block must have three backticks followed by the appropriate data type and a newline.”

Taking a step back, the use of Markdown is an extremely lightweight mechanism compared to other types of structured documents used in enterprise software. For example, the WSDL specification for SOAP has extremely complex requirements for basically doing the same thing: fencing off the different data types in a text file. However, the LLMs are very good at understanding both JSON and XML, and so Markdown may endure as a standard way to build prompts.

Token Limits

As mentioned above, the maximum token limits have increased from 4K in 2020 to 16K in 2022, and GPT 4.0 supports a 32K limit. The number of tokens places a fundamental limit on how much information the model can handle, and both the chat question and answer must fit inside it. The token limit is related to the size of the input layer on the neural net, so it is not easy or cheap to increase. Once you have reached the token limit, the next word cannot be predicted.
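If you need to check how much of the budget a prompt will consume, OpenAI’s tiktoken library can count tokens on the client side. A minimal sketch, assuming the 16K model discussed above:

```python
import tiktoken  # pip install tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo-16k") -> int:
    """Count the tokens that a piece of prompt text will consume."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # fallback for unknown models
    return len(encoding.encode(text))

grounding = "<CustomObject>...</CustomObject>"  # placeholder for real metadata
budget = 16_000  # remember that the answer must fit inside the same limit
print(count_tokens(grounding), "tokens used of", budget)
```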

In our use case for Metazoa Snapshot we work with Salesforce metadata. But the metadata in an enterprise org can be hundreds of megabytes in size. That is orders of magnitude larger than the number of available tokens. Because of similar problems, various methods have been invented to deal with token limits. These strategies are discussed below.

Function Calling

OpenAI has recently introduced Function Calling. In an API call, you can describe functions and the model will return a JSON object containing the arguments. At that point, the client application can call the function. This is a way to mine the prompt for structured arguments that can be used by the caller. Function Calling is mainly useful when the LLM knows the universe of potential arguments to return. Because of this, Function Calling has limited potential to identify Grounding Data. Function Calling may also require multiple round trips to the API for some scenarios.
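Here is a hedged sketch of the round trip, with a hypothetical get_object_metadata function standing in for whatever your client application actually exposes:

```python
import json
import openai  # 0.x-style API; function calling requires the mid-2023 models

# assumes openai.api_key is already configured

functions = [{
    "name": "get_object_metadata",  # hypothetical function in the client application
    "description": "Fetch metadata for a named Salesforce object",
    "parameters": {
        "type": "object",
        "properties": {
            "object_name": {"type": "string", "description": "API name, e.g. Account"}
        },
        "required": ["object_name"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "Describe the fields on the Account object."}],
    functions=functions,
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model returns structured arguments; the client calls the real function
    # and can send the result back to the model in a second round trip
    args = json.loads(message["function_call"]["arguments"])
    print("Model wants:", message["function_call"]["name"], args)
```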

OpenAI Plugins

Another potential solution to the token limit problem is OpenAI Plugins. For example, if the user’s question includes a URL, the LLM could use a plugin to fetch live content from a web page. This is nice, but be aware that the content generated by the plugin also consumes your token limit, so there is nothing magical about it. There is also a security issue here. The plugin is basically a REST API service available on the web. Plugins are not a secure solution for enterprise data that requires authentication.

Grounding Data

If the user’s question includes a URL, why couldn’t you just include that information for reference in the question anyway? Grounding simply means including relevant data along with the user’s question. In our experience, grounding data is an effective strategy for extending the capabilities of the LLM. In many cases this requires parsing the user’s question for hints about what they are interested in. For example, if there is a URL in the question, then they want information on that web page. If they mention a certain custom object or metadata type, then they want information about that asset.
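Here is a simplified sketch of that kind of hint parsing, with hypothetical fetch_url and fetch_metadata callbacks standing in for your own integrations:

```python
import re

FENCE = "`" * 3  # triple backticks for Markdown fencing
METADATA_NAMES = {"Account", "Contact", "Profile", "PermissionSet"}  # illustrative list

def build_grounded_question(question: str, fetch_url, fetch_metadata) -> str:
    """Append grounding data for any URLs or metadata assets mentioned in the question."""
    grounding = []
    for url in re.findall(r"https?://\S+", question):
        grounding.append(f"Content of {url}:\n{FENCE}text\n{fetch_url(url)}\n{FENCE}")
    for name in METADATA_NAMES:
        if name.lower() in question.lower():
            grounding.append(f"Metadata for {name}:\n{FENCE}xml\n{fetch_metadata(name)}\n{FENCE}")
    if grounding:
        return question + "\n\n" + "\n\n".join(grounding)
    return question
```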

Dynamic Data

In many cases the grounding data will overflow any available token limit. For example, in our product, the user might ask about the Account object. In some orgs, that object can be many megabytes in size. What if they ask about user Profiles? This is another potential data size explosion. In our case, we have developed structures that can dynamically shrink until they fit within the current token limit. As the limit expands in the future, we can supply more and more information without any change in the application. This strategy requires detailed knowledge about what parts of the metadata are more important than others. The priority information is used to dynamically shrink assets to provide the best grounding data possible.
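Our production structures are specific to Salesforce metadata, but as a rough sketch, the shrink-to-fit idea looks something like this (count_tokens is the tiktoken helper from the earlier sketch):

```python
def shrink_to_fit(sections, budget, count_tokens):
    """sections is a list of (priority, text) pairs, where a lower number means
    more important. Keep the highest-priority sections that fit the token budget."""
    kept = []
    used = 0
    for priority, text in sorted(sections, key=lambda s: s[0]):
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
        # Otherwise skip this section; a finer-grained version could trim
        # individual fields inside the section instead of dropping it outright
    return "\n".join(kept)
```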

Prompt Structure

We developed an XML structure that merges functionality from the Chat API with support for Salesforce metadata. In our prompt engineering platform, each Prompt is made up of several Messages. The first message must be a System Message, and after that, we alternate between User Messages and Assistant Messages.

Each Message is made up of any number of Content elements. Each Content element can be Text or Data. There are also settings to help guide the end user when a metadata asset is selected for the prompt. The dynamically selected metadata ends up being classified as either Text Content or Data Content as well. Here is a simplified hierarchy of the structure.

  • Prompt (XML Structure)
    • Messages (System, User, or Assistant)
      • Content (Text or Data)

When the Prompt is submitted, the Text Content and Data Content are serialized into the raw text message required by the Chat API, complete with Markdown fencing. When the answer is returned, the text is deserialized into a new Assistant Message that is added to the XML structure of the prompt. In this manner, the regular question-and-answer flow of the Chat API is seamlessly blended with Salesforce data and metadata.
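The element and attribute names below are illustrative rather than our actual schema, but a serializer along these lines shows the idea:

```python
import xml.etree.ElementTree as ET

def serialize_prompt(prompt_xml: str):
    """Turn a <Prompt> structure of Messages and Content elements into Chat API messages."""
    fence = "`" * 3
    messages = []
    for message in ET.fromstring(prompt_xml).findall("Message"):
        parts = []
        for content in message.findall("Content"):
            if content.get("kind") == "Data":
                # Data Content is fenced with Markdown and a data type
                parts.append(f"{fence}{content.get('format', 'xml')}\n{content.text}\n{fence}")
            else:
                parts.append(content.text or "")
        messages.append({"role": message.get("role"), "content": "\n".join(parts)})
    return messages
```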

Prompt Lifecycle

The XML structure of our Prompt object also supports management through the prompt lifecycle. The Prompt supports lifecycle stages including Design, Engineering, Review, Validation, Deployment, Monitoring, Improvement, and Archive. Each prompt tracks history across the lifecycle through various owners and timeframes. Each prompt has its own description, help movie, and rating system. Prompts also have a category, for example, Apex, Flows, Rules, etc.

The prompt can be shared with other team members. This allows a prompt engineer to create important new functionality for the administrative staff. The engineer can embed FSL questions and answers in the prompt, grounding data for meta-learning, and best practices stored as text. This capability enforces the best practices and guides the desired results when the prompts are used.

User Interface

The Chat API interface is amazing. You can type text and ask for things, and it understands you! But impressive as this is, an interface like this causes problems in an enterprise environment. There is an analogy here to SQL Injection Attacks. That is a security problem where malicious user input can delete records in a database. If the prompt is hooked up to important enterprise systems, the potential for freeform text entry to cause problems is vast and hard to predict.

Say that you have engineered a prompt to create a new Apex Trigger and Class that works with an existing Trigger Framework according to Best Practices. The end user should simply select the Custom Object that is used to create the Trigger and Class. You don’t want them modifying the Framework or making other decisions. The Best Practices that are used should be included as invisible grounding data which is part of the prompt and enforced as an enterprise policy.

Prompt Engineering for enterprise applications needs to leverage traditional UI frameworks that present picklists and checkboxes for the user to select, and these interfaces should generate textual descriptions for the Chat API that are controlled and predictable. This approach models the generated text for the prompt based on known inputs. The magic still happens behind the scenes, where the LLM writes and transforms and creates. But the interface to the LLM must be controlled and limited.
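As a trivial sketch of that idea, the UI supplies a picklist value and the application wraps it in fixed, engineer-approved wording (the wording here is made up for illustration):

```python
def trigger_prompt_from_selection(custom_object: str) -> str:
    """Generate controlled, predictable prompt text from a picklist selection.
    The wording is fixed by the prompt engineer; the user only picks the object."""
    return (
        f"Create an Apex Trigger and handler class for the {custom_object} object, "
        "following the trigger framework provided in the grounding data."
    )

# The UI picklist supplies the value; the user never types freeform text
print(trigger_prompt_from_selection("Invoice__c"))
```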

In the future, AI systems will need to go beyond integration with enterprise data. They need to consume complex prompts as input and make actual decisions as output. These prompts will need to make things happen in the real world, such as selling products, resolving support cases, and purchasing inventory. The text input and output model may eventually morph into something more structured, or it may be up to applications engineers to figure out how to harness this power.

Conclusion

This whitepaper has discussed my experience building our Metadata Studio Prompt Engineering Platform for Snapshot. If you are working on integrating AI with your enterprise application, then you are likely to encounter many of the same issues. I hope this whitepaper proves informative for your work.

support@metazoa.com

1-833-METAZOA
1-833-638-2962

https://www.metazoa.com

Twitter: @metazoa4sf

Facebook: https://www.facebook.com/metazoa4sf

LinkedIn: https://www.linkedin.com/company/18493594/


Bill Appleton

Bill is an expert on smart clients and API services. He is currently the CTO at Metazoa where he works on the Snapshot Org Management application.