Introduction
In this post I go into some detail of how the
different Copilots in Microsoft 365 operate in practice and show that not all
the Copilots are created equal. This information is useful from a
technical perspective, but also from a staff training perspective: when
you roll out Microsoft Copilot, people within the organisation need to
understand that the Copilots within the Office applications are not all the same and
are each tuned in different ways.
A Copilot is a Copilot is a Copilot?
When I first heard about Microsoft Copilot and saw the
similar looking Copilot frame on the right side of the screen, I figured it was probably just a common interface that could access the data from the application you had open at the time. However, after actually getting the opportunity to play with Microsoft Copilot in the various apps, it
becomes clear that it is actually a lot more complex than that. Each of the
Copilots within the apps has been tailored to respond in a context that makes
sense for the type of application you're using. The engineers at Microsoft
have achieved this using various methods of prompt engineering and orchestration in
the background.
I thought it would be useful to demonstrate the differences
in the way the various Copilots in different apps respond to the exact same prompt. For this demo I have chosen an innocuous
query that is not explicit and could be interpreted in different ways to see
what happens. The query I chose was “Tell me about the weather in Melbourne”.
This is not the kind of prompt you would really use in practice but is instead
something that I’ve chosen to highlight the differences in the way each Copilot responds to the prompt.
Let's start by querying the OpenAI ChatGPT 3.5
model to see how this foundation model interprets the request on its own. This offers a baseline for comparison, showing the difference the exact same prompt produces when put to the various Copilots.
1. ChatGPT 3.5
You will see here that the ChatGPT foundation model has
interpreted this question as a request for the specific
temperature in Melbourne right now. Because I wasn't explicit enough in
what I asked the model, I didn't get back any general information about expected temperature ranges in Melbourne.
In setting up the ChatGPT model, the OpenAI team appear to have designed the system to fail gracefully in cases where it thinks it is being asked for data more current than it knows about. This is an unfortunate trait of foundation models: they only know information up to the point their training was completed. It is interesting that it did not respond with more generic information about what
the expected temperatures are throughout the year, or historical information about
the weather (keep this in mind when we get to the Word Copilot example).
2. Bing Chat
Bing Chat is geared to behave much more like a web search
engine. You can see in the example above that it reached out to the web and pulled back information from
various websites about the current and upcoming temperatures in Melbourne. It also gave references to the websites it got this
information from.
The method used here is called Retrieval Augmented
Generation (RAG), a framework where the foundation model is not asked
for the answer to the question directly. Instead, Bing will first retrieve some
reputable sources for the kind of information being requested and provide that data as part of the prompt to the foundation model (often referred to as grounding the model with
data). The foundation model here has been used to interpret the retrieved data
instead of using its own "knowledge" from the data it was trained on. In this
case, Bing is functioning as an orchestration engine that retrieves data,
compiles it into an expanded prompt, and sends that to the ChatGPT model in
addition to your original query.
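The retrieve-then-ground flow can be sketched in a few lines of Python. This is a minimal illustration of the RAG pattern only, with made-up function names and sources; it is not Bing's actual implementation.

```python
# A minimal sketch of the Retrieval Augmented Generation (RAG) pattern:
# retrieve sources first, then compile them into an expanded prompt that
# grounds the foundation model. All names and data here are illustrative.

def build_grounded_prompt(user_query: str, retrieved_snippets: list[str]) -> str:
    """Compile retrieved sources plus the original query into one prompt."""
    sources = "\n".join(
        f"[{i + 1}] {snippet}" for i, snippet in enumerate(retrieved_snippets)
    )
    return (
        "Answer the question using ONLY the sources below, "
        "and cite them by number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {user_query}"
    )

# The orchestrator retrieves first, then grounds the model with that data.
snippets = [
    "bom.gov.au: Melbourne, current temperature 18 C, partly cloudy.",
    "weatherzone.com.au: Melbourne forecast, tops of 22 C this week.",
]
prompt = build_grounded_prompt("Tell me about the weather in Melbourne", snippets)
print(prompt)
```

The expanded prompt, not the bare question, is what the foundation model actually sees, which is why it can answer with current data and cite its sources.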
3. M365 Chat
When I asked the M365 Chat interface within Teams this
question, it responded that it couldn't find the answer and recommended that I use a web search. This
is because the M365 Chat Copilot uses a similar Retrieval Augmented Generation
(RAG) framework to Bing. Rather than searching the Internet for
information on the weather in Melbourne, it attempted a Semantic Index
search (Reference: https://learn.microsoft.com/en-us/microsoftsearch/semantic-index-for-copilot)
across the documents, emails, chats and other data within my Office 365 tenant. I didn't actually have any information within my tenancy on this topic at the
time. As a result, M365 Chat was unable to retrieve any information to pass on to the foundation model to produce an answer. What is interesting to me here is
that it didn't just ask the foundation model to have a go at telling me about
the weather in Melbourne, but instead apologised for not being able to find any
documents about it.
Note: In this case, the Microsoft 365 Chat
Copilot was configured to only have access to internal documents and was not
enabled for searching the Internet for data. This is a setting that administrators
have control over:
https://learn.microsoft.com/en-us/microsoft-365-copilot/manage-public-web-access
Of course, had I had documents that contained
information on the weather in Melbourne, it would have been able to answer me. Below
is an example of the output when there is a document containing information
about the weather in Melbourne. You will see here that the RAG approach has been
used to retrieve the data, and the document is referenced below the response:
What is also interesting about the previous response is that
this information was actually generated in Word, in a later example that I ran for this blog post. The data being displayed here is an interpretation of
information previously generated by the model. I find this interesting,
because when data like this keeps getting recycled through these models over time, will the quality
of the information start to degrade? Like a photocopy of a photocopy. Here's an interesting
article that goes into some more detail on what the long-term result of this could
be: https://cosmosmagazine.com/technology/ai/training-ai-models-on-machine-generated-data-leads-to-model-collapse/.
Always take care to check the information a Copilot outputs before using it.
4. Microsoft Word
Microsoft Word is usually used to create longer form
documents. As a result, Microsoft has tuned the way the foundation model is
prompted when you ask it questions in Word. When asked about the weather in Melbourne, the model responded with more of a Wikipedia-style
response, attempting to go into depth about what the climate is like in
Melbourne throughout the year.
This is a stark difference to the way the ChatGPT foundation
model answered this question, and it happens by design: Microsoft
realises that this is more likely what you want in a Word document rather
than wanting to know the temperature right now. The way they do this is by taking
the original query and adding additional "system prompt" information to it before sending it to the foundation model. This allows them to steer the output toward what you might want in a Word document. It's not
clear exactly what Microsoft includes in the prompt it sends to the foundation model, as you never get to see this additional information.
If you play around enough with ChatGPT, you can see that adding text like
"provide an extended response similar to a reference encyclopaedia" will cause
the model to give outputs more like this. I don't believe it's documented
anywhere exactly what Microsoft adds to the prompts to get these responses, as
the prompt engineering is a bit of secret sauce.
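To make the idea concrete, here's a rough sketch of how an app-specific system prompt could be prepended to the same user query. The system text and function are guesses for illustration only; Microsoft's actual prompts are not public.

```python
# A sketch of app-specific prompt steering: the same user query is paired
# with a hidden system prompt that shapes the style of the output. The
# system text below is an illustrative guess, not Microsoft's real prompt.

WORD_SYSTEM_PROMPT = (
    "You are a writing assistant inside a word processor. "
    "Provide an extended response similar to a reference encyclopaedia."
)

def compose_messages(system_prompt: str, user_query: str) -> list[dict]:
    """Prepend the hidden app-specific instructions to the user's query,
    using the chat message format common to OpenAI-style APIs."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]

messages = compose_messages(
    WORD_SYSTEM_PROMPT, "Tell me about the weather in Melbourne"
)
```

Swap the system prompt for an email-flavoured one and the same user query would come back looking like an email instead, which is essentially what we see across the different Copilots.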
5. PowerPoint
The PowerPoint Copilot is an even more interesting topic, as it
doesn't just produce text; it also adds pictures and makes design choices
when producing its output. You can see that for our example weather query it
produced a nice picture of Melbourne's botanical gardens and skyline, created a
meaningful heading, and added some dot points about the weather in Melbourne. It looks
pretty impressive as an output to such a basic query:
This is all the more impressive when you have some
understanding of what's going on in the background for the PowerPoint Copilot. There
is an interesting paper written by research staff at Microsoft about how this works. It can be found here: https://arxiv.org/abs/2306.03460
TLDR: For apps like PowerPoint, the Copilot needs to be able
to tell the application itself how to style the page, in addition to
generating text. This could be done by having the foundation model produce
code in a general scripting language (as GitHub Copilot does), however,
that method is prone to syntax errors. The researchers at Microsoft found that
it was safer to create a specialised domain specific language for describing the
layout of a document (more like the declarative languages used for
Terraform or PowerShell Desired State Configuration). The language, in
this case, is called Office Domain Specific Language (ODSL) and is designed to
use a minimal number of tokens (words) and be easy to describe as input to
a foundation model. Here's an example of the language:
# Inserts new "Title and Content" slides after provided ones.
slides = insert_slides(precededBy=slides, layout="Title and Content")
When the prompt is sent to the model, it includes schema information
about the ODSL language and the format of the desired response. The
model then responds with a description of what each slide should look like,
in the desired ODSL format. The response is thoroughly checked and validated
to have the right format, then translated into a lower-level language by an
interpreter program, which gets executed by
PowerPoint. It is both very cool and a little crazy that the foundation models are
powerful enough to do these kinds of things.
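The validation step is the interesting safety property here: because the DSL is so constrained, the app can reject anything the model produces that doesn't fit the grammar before it ever executes. Here's a toy Python sketch of that idea; the grammar below is a made-up, heavily simplified stand-in for the real ODSL, which is not public.

```python
import re

# A toy validator in the spirit of the pipeline described above: before a
# model-generated program runs, every line is checked against the grammar
# the app expects. This regex is an illustrative stand-in, not real ODSL.
ODSL_LINE = re.compile(r"^\s*(?:#.*|(?:\w+\s*=\s*)?\w+\([^()]*\))\s*$")

def validate_odsl(program: str) -> bool:
    """Accept only comment lines and simple `result = call(args)` statements."""
    return all(
        ODSL_LINE.match(line) for line in program.splitlines() if line.strip()
    )

good = 'slides = insert_slides(precededBy=slides, layout="Title and Content")'
bad = "while True: delete_everything()"  # free-form code is rejected
```

Because only a narrow set of statement shapes is accepted, a hallucinated or malformed response fails validation instead of doing something unexpected inside PowerPoint.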
6. Outlook
When you write an email to your colleagues, you don't really want
to be known as the person who writes the dreaded War and Peace novel-length emails. Fortunately,
Microsoft are aware of this, and they took it into account when designing
the Outlook Copilot. This Copilot is designed to produce output
that looks, in both format and content, like an email. You can
see below that the simple weather-in-Melbourne prompt actually created what
looks and reads like an email. I must admit it did take a bit of artistic
licence and rambled on a bit more than I would have liked in this case, though:
7. Excel
The Excel Copilot is once again quite different from the
other Copilots. Asking it about the weather is not exactly what it's supposed to be
used for, but I asked it anyway, because, why not?:
In Excel, the Copilot is more for creating formulas and
reasoning over the data in your spreadsheets. In the current preview version,
the Copilot will only work on data that is in a defined table. This is likely
because the data needs to be ordered in such a way that it can be sent
as part of a prompt to the foundation model: it needs to retain all the column and
row information while keeping the token count low enough to be processed. I'm
not sure how Microsoft could process an entire, very large
spreadsheet (with the potential complexity of multiple sheets, scattered data, etc.) through the foundation models given their current token limits. Until
they figure this out, we may be stuck with only processing data that is in
defined, smaller tables.
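The constraint described above can be illustrated with a short sketch: a defined table serialises compactly, with its headers intact, and can be checked against a token budget before being sent to the model. Everything here is illustrative, including the common 4-characters-per-token rule of thumb, which is not Microsoft's actual method.

```python
# A sketch of why defined tables suit this pattern: ordered rows with
# headers serialise into a compact prompt whose size can be estimated
# up front. The token heuristic and limits are illustrative only.

def table_to_prompt(headers: list[str], rows: list[list[str]],
                    max_tokens: int = 2000) -> str:
    """Serialise a table for grounding, failing fast if it is too large."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    text = "\n".join(lines)
    estimated_tokens = len(text) // 4  # crude rule-of-thumb estimate
    if estimated_tokens > max_tokens:
        raise ValueError(
            f"Table too large to ground the model (~{estimated_tokens} tokens)"
        )
    return text

prompt = table_to_prompt(
    ["City", "Month", "Avg Max (C)"],
    [["Melbourne", "January", "26"], ["Melbourne", "July", "14"]],
)
```

A sprawling spreadsheet with scattered data has no such clean serialisation, which is consistent with the Copilot only working on defined tables for now.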
If you are wondering what the Excel Copilot can actually do
though, here’s an example of how you could ask the Excel Copilot to reason over
the data in a table and give you an answer:
Also, here’s an example of how you can ask the Excel Copilot
for a formula for producing a Fahrenheit column from a Celsius column:
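For reference, the underlying conversion the Copilot is being asked to produce is F = C × 9/5 + 32; in Excel a generated column formula would look something like =A2*9/5+32 (the exact formula Copilot produces may differ). The same logic in Python:

```python
# The Celsius-to-Fahrenheit conversion behind the requested column
# formula: F = C * 9/5 + 32.

def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32

print(celsius_to_fahrenheit(100))  # 212.0
```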
8. Microsoft Whiteboard
The Microsoft Whiteboard Copilot has yet another take on what it
produces from our modest weather question. It produced a bunch of sticky
notes with various things the weather in Melbourne could be. This is contextualised
more toward a brainstorming type of session, which is common when using a Whiteboard:
This is, once again, a fun and different take on how a foundation
model can be used to produce a more context-aware output for the application at hand.
The Wrap Up
As you can see, the Copilots across the
Microsoft Office apps are all very different beasts, and this is something that
people within your organisation should understand in order to get the most out
of the Copilot product set. It is certainly something to keep in mind when
training staff on potential use cases and determining which Copilot is
right for the task at hand. Cheers!