Update documentation on data flow in Go (and some small fixes for java) #18511

Open
wants to merge 7 commits into
base: main
Changes from 6 commits
369 changes: 369 additions & 0 deletions docs/codeql/codeql-language-guides/analyzing-data-flow-in-go.rst
@@ -0,0 +1,369 @@
.. _analyzing-data-flow-in-go:

Analyzing data flow in Go
=========================

You can use CodeQL to track the flow of data through a Go program to its use.

About this article
------------------

This article describes how data flow analysis is implemented in the CodeQL libraries for Go and includes examples to help you write your own data flow queries.
The following sections describe how to use the libraries for local data flow, global data flow, and taint tracking.

For a more general introduction to modeling data flow, see ":ref:`About data flow analysis <about-data-flow-analysis>`."

.. include:: ../reusables/new-data-flow-api.rst

Local data flow
---------------

Local data flow is data flow within a single function. Local data flow is usually easier, faster, and more precise than global data flow, and is sufficient for many queries.

Using local data flow
~~~~~~~~~~~~~~~~~~~~~

The ``DataFlow`` module defines the class ``Node`` denoting any element that data can flow through.
The ``Node`` class has a number of useful subclasses, such as ``ExprNode`` for expressions, ``ParameterNode`` for parameters, and ``InstructionNode`` for IR instructions.
You can map between data flow nodes and expressions, parameters, and IR instructions using the member predicates ``asExpr``, ``asParameter`` and ``asInstruction``:

.. code-block:: ql

    class Node {
      /** Gets the expression corresponding to this node, if any. */
      Expr asExpr() { ... }

      /** Gets the parameter corresponding to this node, if any. */
      Parameter asParameter() { ... }

      /** Gets the IR instruction corresponding to this node, if any. */
      IR::Instruction asInstruction() { ... }

      ...
    }

or using the predicates ``exprNode``, ``parameterNode`` and ``instructionNode``:

.. code-block:: ql

    /**
     * Gets the `Node` corresponding to `e`.
     */
    ExprNode exprNode(Expr e) { ... }

    /**
     * Gets the `Node` corresponding to the value of `p` at function entry.
     */
    ParameterNode parameterNode(Parameter p) { ... }

    /**
     * Gets the `Node` corresponding to `insn`.
     */
    InstructionNode instructionNode(IR::Instruction insn) { ... }

The predicate ``localFlowStep(Node nodeFrom, Node nodeTo)`` holds if there is an immediate data flow edge from the node ``nodeFrom`` to the node ``nodeTo``. You can apply the predicate recursively by using the ``+`` and ``*`` operators, or by using the predefined recursive predicate ``localFlow``, which is equivalent to ``localFlowStep*``.

For example, you can find flow from a parameter ``source`` to an expression ``sink`` in zero or more local steps:

.. code-block:: ql

    DataFlow::localFlow(DataFlow::parameterNode(source), DataFlow::exprNode(sink))
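
As a rough illustration of the kind of program such a predicate call relates, consider the following Go sketch. The function ``describe`` and its variable names are hypothetical and chosen only for this example. The parameter ``name`` reaches the argument of ``strings.ToUpper`` through the local variable ``alias`` using only value-preserving steps inside one function, so ``localFlow`` would be expected to hold between the parameter node and that argument expression.

```go
package main

import (
	"fmt"
	"strings"
)

// describe is a hypothetical function used only for illustration.
func describe(name string) string {
	alias := name                 // value-preserving step: parameter -> alias
	return strings.ToUpper(alias) // alias is the expression that the parameter flows to
}

func main() {
	fmt.Println(describe("gopher"))
}
```

Note that the result of ``strings.ToUpper`` is a new value computed from its argument, so the return value is not reached by value-preserving flow alone.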

Using local taint tracking
~~~~~~~~~~~~~~~~~~~~~~~~~~

Local taint tracking extends local data flow by including non-value-preserving flow steps. For example:

.. code-block:: go

    temp := x
    y := temp + ", " + temp

If ``x`` is a tainted string then ``y`` is also tainted.
Comment on lines +77 to +82
Member

A potential point for confusion with this example is that it may not be clear to readers whether temp := x requires taint flow analysis or just y := ....

Contributor Author

I agree it's an odd example, with the capacity for confusion. I've changed it to just (each language's version of) y := "Hello " + x in all the language guides where it appears.

Member

Although I'd be surprised if we have a language where (the equivalent of) temp := x requires taint-flow analysis, I'd be careful with just changing examples for other languages.

Contributor Author

I've advertised the change more widely among the language teams.



The local taint tracking library is in the module ``TaintTracking``. Like local data flow, a predicate ``localTaintStep(DataFlow::Node nodeFrom, DataFlow::Node nodeTo)`` holds if there is an immediate taint propagation edge from the node ``nodeFrom`` to the node ``nodeTo``. You can apply the predicate recursively by using the ``+`` and ``*`` operators, or by using the predefined recursive predicate ``localTaint``, which is equivalent to ``localTaintStep*``.

For example, you can find taint propagation from a parameter ``source`` to an expression ``sink`` in zero or more local steps:

.. code-block:: ql

    TaintTracking::localTaint(DataFlow::parameterNode(source), DataFlow::exprNode(sink))
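
To make the difference between the two kinds of step concrete, here is a small Go sketch. The function ``greet`` and its variable names are hypothetical. The assignment to ``same`` preserves the value of ``x``, so it is a data flow step; the concatenation assigned to ``derived`` produces a new value computed from ``x``, which is the kind of step that only taint tracking would be expected to follow.

```go
package main

import "fmt"

// greet is a hypothetical function used only for illustration.
func greet(x string) (string, string) {
	same := x               // value-preserving: same holds exactly the value of x
	derived := "Hello " + x // not value-preserving: derived is computed from x
	return same, derived
}

func main() {
	same, derived := greet("world")
	fmt.Println(same)
	fmt.Println(derived)
}
```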

Examples
~~~~~~~~

The following query finds the filename argument passed to ``os.Open(..)``:

.. code-block:: ql

    import go

    from Function osOpen, CallExpr call
    where
      osOpen.hasQualifiedName("os", "Open") and
      call.getTarget() = osOpen
    select call.getArgument(0)

However, this query only returns the argument expression itself, not the values that could flow into it. To find those, we can use local data flow to select all expressions that flow into the argument:

.. code-block:: ql

    import go

    from Function osOpen, CallExpr call, Expr src
    where
      osOpen.hasQualifiedName("os", "Open") and
      call.getTarget() = osOpen and
      DataFlow::localFlow(DataFlow::exprNode(src), DataFlow::exprNode(call.getArgument(0)))
    select src

Next, we can restrict the sources to function parameters rather than arbitrary expressions. The following query finds parameters that flow to the filename argument of ``os.Open(..)``:
Member

The wording of this could be improved with something like this:

Suggested change
Then we can make the source more specific, for example an access to a parameter. This query finds where a public parameter is passed to ``os.Open(..)``:
To restrict sources to only parameters, rather than arbitrary expressions, we can modify this query as follows:

Contributor Author

I slightly prefer the original. It has more narrative flow ("start with a simple query, then slowly add restrictions to get something more interesting"). Can you explain what you don't like about it, and how your suggestion is an improvement?

Member

The narrative is good, and I wouldn't change that. The intention with my suggestion was to keep it as well. However, the flow/grammar/style of the first sentence that's currently here isn't great:

  • Since the reader will have looked over / thought about / possibly played with the code before this, the Then ... feels disconnected here as an opening to the paragraph. It implies that we are continuing some line of work, even though the code example accomplished everything the previous paragraph set out to do. This is why, in my suggestion, the sentence begins by setting out what the next goal is.
  • "the source" is ambiguous and possibly misleading. Does it refer to the input to localFlow or (the) source discovered by evaluating the query? If a reader assumes the latter case, "the" implies that this query only ever finds one source.
  • Similarly with "access to a parameter".
  • The "[..], for example [..]" fragment doesn't read right. It might be better if there was a verb in it (e.g. ", for example by constraining [the source(s)] to function parameters")

Also, while going through this again, does Parameter give us parameters or variable accesses to parameters?

Contributor Author

Parameter extends DeclaredVariable, so I'm pretty sure it's parameters rather than variable accesses to parameters.


.. code-block:: ql

    import go

    from Function osOpen, CallExpr call, Parameter p
    where
      osOpen.hasQualifiedName("os", "Open") and
      call.getTarget() = osOpen and
      DataFlow::localFlow(DataFlow::parameterNode(p), DataFlow::exprNode(call.getArgument(0)))
    select p
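
For orientation, the following Go sketch shows the shape of code that a query like this is looking for. The helper ``openConfig`` and the file name used in ``main`` are hypothetical. The parameter ``path`` reaches the first argument of ``os.Open`` through a local variable, all within one function, so the parameter would be expected to appear in the results.

```go
package main

import (
	"fmt"
	"os"
)

// openConfig is a hypothetical helper used only for illustration.
// Its parameter path reaches os.Open through the local variable target.
func openConfig(path string) error {
	target := path
	f, err := os.Open(target)
	if err != nil {
		return err
	}
	return f.Close()
}

func main() {
	if err := openConfig("does-not-exist.conf"); err != nil {
		fmt.Println("open failed:", err)
	}
}
```

If ``path`` were instead passed to ``os.Open`` by a different function, the flow would cross a function boundary and local data flow would not be expected to find it.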

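The following query finds calls to formatting functions where the format string is not hard-coded, that is, where no string literal flows to the format string argument within the enclosing function: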
Member

"This" is potentially ambiguous as it could refer to the query above the text:

Suggested change
This query finds calls to formatting functions where the format string is not hard-coded.
The following query finds calls to formatting functions where the format string is not hard-coded.

I also think that "hard-coded" is a bit ambiguous. One person might understand this to mean "a constant that's provided directly as argument". Your interpretation here is: "a constant that is defined locally in the scope of the function somewhere". Another person might say "a constant that's defined anywhere". We should clarify what the query is intended to find.

Contributor Author

  • Lots of the language guides says "This query ...:". Would using a colon at the end of this sentence (and other equivalent ones) solve the problem? Or would you still like "The following"?
  • I agree it could be clearer, but that is a bigger change than I want to do in this PR.

Member

Would using a colon at the end of this sentence (and other equivalent ones) solve the problem? Or would you still like "The following"?

Yes, a colon would help, but resolving the ambiguity at the start of the sentence would still be good as well since it then doesn't require the reader to read/scan the entire sentence to know what it's about.

Contributor Author

I've tried to do this for multiple languages.


.. code-block:: ql

    import go

    from StringOps::Formatting::Range format, CallExpr call, Expr formatString
Member

Perhaps there could be a short sentence explaining what StringOps::Formatting::Range is or a link to other documentation?

    where
      call.getTarget() = format and
      formatString = call.getArgument(format.getFormatStringIndex()) and
      not exists(DataFlow::Node source, DataFlow::Node sink |
        DataFlow::localFlow(source, sink) and
        source.asExpr() instanceof StringLit and
        sink.asExpr() = formatString
      )
    select call, "Argument to String format method isn't hard-coded."
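
The Go sketch below contrasts the two situations that this query distinguishes. The function ``report`` and its names are hypothetical, and it is an assumption of this sketch that ``fmt.Sprintf`` is among the formatting functions the library models. In the first call a string literal flows locally to the format string, so the call would not be expected in the results; in the second call the format string comes from a parameter, so the call would be expected to be reported.

```go
package main

import "fmt"

// report is a hypothetical function used only for illustration.
func report(userFormat string, n int) (string, string) {
	layout := "%d items"
	fixed := fmt.Sprintf(layout, n)       // format string originates from a string literal
	dynamic := fmt.Sprintf(userFormat, n) // format string originates from a parameter
	return fixed, dynamic
}

func main() {
	fixed, dynamic := report("%d things", 3)
	fmt.Println(fixed)
	fmt.Println(dynamic)
}
```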

Exercises
~~~~~~~~~

Exercise 1: Write a query that finds all hard-coded strings used to create a ``url.URL``, using local data flow. (`Answer <#exercise-1>`__)
Member

Similar comment about "hard-coded" as above.


Global data flow
----------------

Global data flow tracks data flow throughout the entire program, and is therefore more powerful than local data flow. However, global data flow is less precise than local data flow, and the analysis typically requires significantly more time and memory to perform.

.. pull-quote:: Note

   .. include:: ../reusables/path-problem.rst

Using global data flow
~~~~~~~~~~~~~~~~~~~~~~

To use the global data flow library, implement the signature ``DataFlow::ConfigSig`` and apply the module ``DataFlow::Global<ConfigSig>``:
Member

The passive voice makes this ambiguous. I would write "We can use global data flow by [..]", but if the agreed on style dictates that the passive must be used, then something like "A query can use the global data flow library by [..]"

Contributor Author

How about this:

Suggested change
The global data flow library is used by implementing the signature ``DataFlow::ConfigSig`` and applying the module ``DataFlow::Global<ConfigSig>``:
To use the global data flow library, implement the signature ``DataFlow::ConfigSig`` and apply the module ``DataFlow::Global<ConfigSig>``:

I note that java has this: You use the global data flow library by implementing the signature ``DataFlow::ConfigSig`` and applying the module ``DataFlow::Global<ConfigSig>``: . I don't like it, though it is in the active voice.

Member

I also don't like using the second person ("You") for this. Third-person, active voice is best in my opinion. Your suggestion with the imperative voice is OK.

Contributor Author

Hmm, didn't your suggestion use "we", which is first person, rather than third person?

Member

Sorry, I meant to write "first person plural".

Contributor Author

I've adopted your suggestion for all language guides where this sentence appears.


.. code-block:: ql

    import go

    module MyFlowConfiguration implements DataFlow::ConfigSig {
      predicate isSource(DataFlow::Node source) {
        ...
      }

      predicate isSink(DataFlow::Node sink) {
        ...
      }
    }

    module MyFlow = DataFlow::Global<MyFlowConfiguration>;

These predicates are defined in the configuration:

- ``isSource`` - defines where data may flow from.
- ``isSink`` - defines where data may flow to.
- ``isBarrier`` - optional, defines where data flow is blocked.
Member

Is there a better description than "restricts"? All the predicates "restrict" data flow in some way.

Suggested change
- ``isBarrier`` - optional, restricts the data flow.
- ``isBarrier`` - optional, breaks the data flow.

Contributor Author

I've changed this to optional, defines where data flow is blocked in all the language guides where it appears.

- ``isAdditionalFlowStep`` - optional, adds additional flow steps.

The data flow analysis is performed using the predicate ``flow(DataFlow::Node source, DataFlow::Node sink)``:

.. code-block:: ql

    from DataFlow::Node source, DataFlow::Node sink
    where MyFlow::flow(source, sink)
    select source, "Data flow to $@.", sink, sink.toString()
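
To picture what global flow adds over local flow, consider the following Go sketch. All three functions are hypothetical stand-ins: ``readSetting`` plays the role of a source, ``store`` plays the role of a sink, and ``forward`` passes its argument through unchanged. The value crosses several function boundaries on its way from source to sink, which local data flow would not be expected to follow but a suitably configured global analysis would.

```go
package main

import "fmt"

// The functions below are hypothetical and exist only for illustration.

// readSetting stands in for a source of interesting data.
func readSetting() string {
	return "value-from-somewhere"
}

// forward passes its argument on unchanged.
func forward(s string) string {
	return s
}

// store stands in for a sink; its parameter s is where the data arrives.
func store(s string) string {
	return "stored:" + s
}

func main() {
	v := readSetting()
	fmt.Println(store(forward(v)))
}
```

In a configuration for this sketch, ``isSource`` would select the call to ``readSetting`` and ``isSink`` would select the argument of ``store``.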

Using global taint tracking
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Global taint tracking is to global data flow what local taint tracking is to local data flow. That is, global taint tracking extends global data flow with additional non-value-preserving steps. To use the global taint tracking library, apply the module ``TaintTracking::Global<ConfigSig>`` to your configuration instead of ``DataFlow::Global<ConfigSig>``:

.. code-block:: ql

    import go

    module MyFlowConfiguration implements DataFlow::ConfigSig {
      predicate isSource(DataFlow::Node source) {
        ...
      }

      predicate isSink(DataFlow::Node sink) {
        ...
      }
    }

    module MyFlow = TaintTracking::Global<MyFlowConfiguration>;

The resulting module has an identical signature to the one obtained from ``DataFlow::Global<ConfigSig>``.

Flow sources
~~~~~~~~~~~~

The data flow library contains some predefined flow sources. The class ``RemoteFlowSource`` (defined in ``semmle.go.security.FlowSources``) represents data flow sources that may be controlled by a remote user, which is useful for finding security problems.
Contributor

Should we note / recommend ActiveThreatModelSource?

Contributor Author

I think that's more advanced. This guide gives a simple approach that works. (Also, if we want to do that, we should do it separately for all the languages.)


Examples
~~~~~~~~

The following query shows a taint-tracking configuration that uses remote user input as data sources:

.. code-block:: ql

    import go

    module MyFlowConfiguration implements DataFlow::ConfigSig {
      predicate isSource(DataFlow::Node source) {
        source instanceof RemoteFlowSource
      }

      ...
    }

    module MyTaintFlow = TaintTracking::Global<MyFlowConfiguration>;
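
As a sketch of what "controlled by a remote user" means in a Go program, consider the HTTP handler below. The names ``handler`` and ``render`` and the query parameter ``name`` are hypothetical. The value read from the request URL is chosen by whoever sends the request, so reads like this are the kind of node a remote flow source would typically be expected to cover; exactly which reads are modeled is determined by the library, not by this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// handler is a hypothetical HTTP handler used only for illustration.
// The query parameter "name" is chosen by whoever sends the request.
func handler(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("name") // value controlled by a remote user
	fmt.Fprintf(w, "Hello, %s", name)
}

// render is a hypothetical helper that sends an in-process request to
// handler and returns the response body, without opening a network port.
func render(target string) string {
	req := httptest.NewRequest(http.MethodGet, target, nil)
	rec := httptest.NewRecorder()
	handler(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(render("/?name=gopher"))
}
```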

Exercises
~~~~~~~~~

Exercise 2: Write a query that finds all hard-coded strings used to create a ``url.URL``, using global data flow. (`Answer <#exercise-2>`__)
Member

Same concern about "hard-coded".


Exercise 3: Write a class that represents flow sources from ``os.Getenv(..)``. (`Answer <#exercise-3>`__)

Exercise 4: Using the answers from 2 and 3, write a query which finds all global data flow paths from ``os.Getenv`` to ``url.URL``. (`Answer <#exercise-4>`__)
Member

"Data flows" sounds odd to me. How about "data flow paths" instead?

Suggested change
Exercise 4: Using the answers from 2 and 3, write a query which finds all global data flows from ``os.Getenv`` to ``url.URL``. (`Answer <#exercise-4>`__)
Exercise 4: Using the answers from 2 and 3, write a query which finds all global data flow paths from ``os.Getenv`` to ``url.URL``. (`Answer <#exercise-4>`__)

Contributor Author

I've done this for all the language guides where it appears.


Answers
-------

Exercise 1
~~~~~~~~~~

.. code-block:: ql

    import go

    from Function urlParse, Expr arg, StringLit rawURL, CallExpr call
    where
      (
        urlParse.hasQualifiedName("net/url", "Parse") or
        urlParse.hasQualifiedName("net/url", "ParseRequestURI")
      ) and
      call.getTarget() = urlParse and
      arg = call.getArgument(0) and
      DataFlow::localFlow(DataFlow::exprNode(rawURL), DataFlow::exprNode(arg))
    select call.getArgument(0)

Exercise 2
~~~~~~~~~~

.. code-block:: ql

    import go

    module LiteralToURLConfig implements DataFlow::ConfigSig {
      predicate isSource(DataFlow::Node source) {
        source.asExpr() instanceof StringLit
      }

      predicate isSink(DataFlow::Node sink) {
        exists(Function urlParse, CallExpr call |
          (
            urlParse.hasQualifiedName("net/url", "Parse") or
            urlParse.hasQualifiedName("net/url", "ParseRequestURI")
          ) and
          call.getTarget() = urlParse and
          sink.asExpr() = call.getArgument(0)
        )
      }
    }

    module LiteralToURLFlow = DataFlow::Global<LiteralToURLConfig>;

    from DataFlow::Node src, DataFlow::Node sink
    where LiteralToURLFlow::flow(src, sink)
    select src, "This string constructs a URL $@.", sink, "here"

Exercise 3
~~~~~~~~~~

.. code-block:: ql

    import go

    class GetenvSource extends CallExpr {
      GetenvSource() {
        exists(Function m | m = this.getTarget() |
          m.hasQualifiedName("os", "Getenv")
        )
      }
    }

Exercise 4
~~~~~~~~~~

.. code-block:: ql

    import go

    class GetenvSource extends CallExpr {
      GetenvSource() {
        exists(Function m | m = this.getTarget() |
          m.hasQualifiedName("os", "Getenv")
        )
      }
    }

    module GetenvToURLConfig implements DataFlow::ConfigSig {
      predicate isSource(DataFlow::Node source) {
        source.asExpr() instanceof GetenvSource
      }

      predicate isSink(DataFlow::Node sink) {
        exists(Function urlParse, CallExpr call |
          (
            urlParse.hasQualifiedName("net/url", "Parse") or
            urlParse.hasQualifiedName("net/url", "ParseRequestURI")
          ) and
Comment on lines +345 to +348
Contributor

Suggested change
(
urlParse.hasQualifiedName("url", "Parse") or
urlParse.hasQualifiedName("url", "ParseRequestURI")
) and
urlParse.hasQualifiedName("url", ["Parse", "ParseRequestURI"]) and

Contributor Author

I deliberately didn't do this because I think it's a bit harder to understand. I think this guide should just be a simple approach that is easy to understand and which works.

Member

You could mention it as a note underneath the example. "For brevity, we could also shorten ... to ...".

Contributor Author

To my mind, the aim of this guide is not to give the best way to write things, but to help the reader use data flow and to be as clear as possible. I think that introducing a new notation to do the same thing does not help with that, especially when it isn't very easy to see at a glance what it is doing.

          call.getTarget() = urlParse and
          sink.asExpr() = call.getArgument(0)
        )
      }
    }

    module GetenvToURLFlow = DataFlow::Global<GetenvToURLConfig>;

    from DataFlow::Node src, DataFlow::Node sink
    where GetenvToURLFlow::flow(src, sink)
    select src, "This environment variable constructs a URL $@.", sink, "here"
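
The following Go sketch shows the kind of program this query is meant to report. The helpers ``endpoint`` and ``parseEndpoint`` and the environment variable name ``SERVICE_URL`` are hypothetical. The result of ``os.Getenv`` is returned from one function and passed as an argument to another before it reaches ``url.Parse``, so the flow spans function boundaries and calls for global rather than local data flow.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// endpoint and parseEndpoint are hypothetical helpers used only for illustration.
// SERVICE_URL is an assumed environment variable name.
func endpoint() string {
	return os.Getenv("SERVICE_URL") // source: value read from the environment
}

func parseEndpoint(raw string) (*url.URL, error) {
	return url.Parse(raw) // sink: first argument of url.Parse
}

func main() {
	u, err := parseEndpoint(endpoint())
	if err != nil {
		fmt.Println("invalid URL:", err)
		return
	}
	fmt.Println("host:", u.Host)
}
```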

Further reading
---------------

- `Exploring data flow with path queries <https://docs.github.com/en/code-security/codeql-for-vs-code/getting-started-with-codeql-for-vs-code/exploring-data-flow-with-path-queries>`__ in the GitHub documentation.


.. include:: ../reusables/go-further-reading.rst
.. include:: ../reusables/codeql-ref-tools-further-reading.rst