Scala Regex String Extraction


Introduction

— Joe Baker, Manager Software Engineering

From time to time the Engineering, Operations, and Security groups at Threat Stack contribute blog posts that share information on techniques and tools we’ve developed so we can do things faster, more accurately, and with fewer resources. These range from tips for using Scala in the real world, to improving our SOC 2 management process using a home-grown tool called sockembot, to insights into how we manage our on-call rotation using another home-built tool called Deputize (which we’ve since made available as open source).

Today’s post is by Alfredo Perez, one of our software engineers, and focuses on Scala Regex String Extraction.

If there’s anything you’d like to hear about, please Tweet us at @threatstack or contact us directly.

One of my favorite Scala patterns that I’ve learned and used here at Threat Stack is Regex String Extraction with pattern matching. It’s a simple, readable pattern that is very effective for extracting parts of a string. Its power comes from combining regular expression groups with Scala’s pattern matching.

I’ve used this pattern to solve many problems at Threat Stack, from scraping vulnerability data off the web to matching authentication header values. In the following example, I use it to extract values typically found in request and response headers, such as:

Request URL: https://start.duckduckgo.com/
Request Method: GET
Status Code: 304
Remote Address: 107.20.240.232:443

In this example, I’ve implemented a function that takes a string as input and returns a case class holding the extracted values, wrapped in an Option:

import scala.util.matching.Regex

case class RequestValues(url: String, method: String, status: String, remoteAddress: String)

def extractRequestValues(data: String): Option[RequestValues] = {
  // One capture group per value we want to pull out of the input.
  val r: Regex = """Request URL: (http.*?)
                   |Request Method: (GET|POST|PUT|DELETE)
                   |Status Code: ([0-9]{3})
                   |Remote Address: (.*?)""".stripMargin.r

  data match {
    // Matches only if the whole input fits the pattern; groups bind in order.
    case r(url, method, status, remoteAddress) =>
      Some(RequestValues(url, method, status, remoteAddress))
    case _ => None
  }
}

You can see that the regex pattern has four groups (where a group is enclosed in parentheses), one for each of the following:

  • url: (http.*?)
  • method: (GET|POST|PUT|DELETE)
  • status code: ([0-9]{3})
  • remote address: (.*?)

These then match up with our regex extractors in the pattern match block: r(url, method, status, remoteAddress)

The important thing about the extractors is that they extract from the groups sequentially. The names are up to you, and I could have just named them “a”, “b”, “c”, “d”.
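
To illustrate the point about sequential extraction, here is a minimal sketch that reuses the same regex with arbitrary variable names and with `_` wildcards for the groups we don't need. The `ExtractorNames` object and the `methodAndStatus` helper are hypothetical names introduced only for this illustration.

```scala
object ExtractorNames {
  // Same four-group pattern as above.
  private val r = """Request URL: (http.*?)
                    |Request Method: (GET|POST|PUT|DELETE)
                    |Status Code: ([0-9]{3})
                    |Remote Address: (.*?)""".stripMargin.r

  // Groups bind by position, so the names are arbitrary; `_` ignores a group.
  def methodAndStatus(data: String): Option[(String, String)] =
    data match {
      case r(_, b, c, _) => Some((b, c)) // b = 2nd group, c = 3rd group
      case _             => None
    }
}
```

Note that the pattern still lists four positions, one per group. If the number of positions differs from the number of groups, the case simply does not match.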

A couple of test runs show us our populated case class and a None for an invalid case:

val data = """Request URL: https://start.duckduckgo.com/
       |Request Method: GET
       |Status Code: 304
       |Remote Address: 107.20.240.232:443""".stripMargin

scala> extractRequestValues(data)
res0: Option[RequestValues] = Some(RequestValues(https://start.duckduckgo.com/,GET,304,107.20.240.232:443))

scala> extractRequestValues("foo data")
res1: Option[RequestValues] = None
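
One detail worth knowing about the `None` result: a `case r(...)` pattern only succeeds when the regex matches the entire input string, so extra text before or after the headers would also produce `None`. As a sketch of one way to extract a value from inside a larger string, Scala's `Regex` offers `unanchored`. The `UnanchoredExample` object, the `statusCode` helper, and the sample inputs are hypothetical and not part of the original example.

```scala
import scala.util.matching.Regex

object UnanchoredExample {
  // Anchored (the default): the pattern has to match the whole input.
  val anchored: Regex = """Status Code: ([0-9]{3})""".r
  // Unanchored: the pattern may match anywhere inside the input.
  val anywhere = anchored.unanchored

  def statusCode(text: String): Option[String] =
    text match {
      case anywhere(code) => Some(code)
      case _              => None
    }
}
```

With this sketch, `UnanchoredExample.statusCode("GET / -> Status Code: 304 (cached)")` would yield `Some("304")`, whereas the anchored version would not match that input. Be careful with a lazy group such as `(.*?)` at the very end of an unanchored pattern, since it is free to match nothing at all.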

Wrapping Up . . .

Hopefully the preceding example was helpful and gave some insight into how powerful and usable Regex String Extraction is. Whether you’re conducting searches, making updates, or carrying out validations, it can be a great way to save time and effort.
