Fast analytics for the Federal Government
Each day, executives in Federal agencies and departments balance the public’s growing need for services with budget discipline: "doing more with less". They rely on predictive analytics and machine learning to make government work better, ensuring tax compliance, enforcing the law, detecting fraudulent claims, and understanding public needs.
The most demanding analytics work is often ad hoc and time-sensitive, and requires the ability to scale up quickly. Consider the following scenarios:
- A hurricane looms. Risk analysts at the Federal Emergency Management Agency (FEMA) collaborate with partners in the National Oceanic and Atmospheric Administration (NOAA) to develop risk assessments and disaster plans.
- A plane crashes, killing all aboard. A team from the National Transportation Safety Board (NTSB) collaborates with the air carrier, aircraft manufacturer, and other interested parties to collect massive quantities of data about the incident.
- While considering new legislation, a Congressional committee asks an agency to analyze the impact of three possible scenarios. The agency has thirty days to respond.
While these examples are extraordinary, the demand for analysis in Federal agencies and departments fluctuates greatly over the course of the year. Budget cycles create needs for timely impact analysis and "what-if" scenarios. Seasonal fluctuations, such as tax season or health insurance enrollment season, create similar demands.
Moreover, the examples illustrate a common problem for data scientists in the Federal community: complex and diverse data sources frequently require integration "on the fly". Due to the ad hoc nature of the analysis, data scientists cannot always depend on pre-existing data warehouses; instead, they must import, transform, cleanse and integrate data sources "just in time" as they perform the analysis.
This requirement makes Apache Spark a natural choice for Federal data scientists. Spark integrates with a broad range of data sources, including relational databases, NoSQL databases, HDFS and many others, and its in-memory computation supports rapid data cleansing, transformation and analysis. Agencies as diverse as the NASA Jet Propulsion Laboratory and the United States Patent and Trademark Office successfully use Spark today for high-performance analytics.
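To make the "just in time" integration pattern concrete, here is a minimal PySpark sketch. It assumes a working Spark environment; the S3 paths, JDBC connection details, and column names are hypothetical placeholders, not references to any real agency dataset.

```python
# Ad hoc integration sketch: pull a raw file extract and a relational
# reference table, cleanse and join them in memory, then summarize.
# All paths, credentials, and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ad-hoc-integration").getOrCreate()

# Ingest a raw CSV extract (hypothetical bucket and layout).
claims = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://agency-bucket/claims/*.csv"))

# Pull a reference table from a relational database over JDBC
# (connection details are placeholders).
regions = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://db-host:5432/reference")
           .option("dbtable", "public.regions")
           .option("user", "analyst")
           .option("password", "********")
           .load())

# Cleanse, integrate, and summarize entirely in memory.
summary = (claims
           .dropDuplicates(["claim_id"])
           .filter(F.col("amount").isNotNull())
           .join(regions, on="region_code", how="left")
           .groupBy("region_name")
           .agg(F.sum("amount").alias("total_claims"),
                F.count("*").alias("claim_count")))

summary.show()
```

The same few lines of code work whether the sources are flat files, relational tables, or NoSQL stores, which is what makes Spark well suited to one-off, deadline-driven analysis.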
Implementing a Spark cluster, however, is no easy task; it requires special skills and training. Long procurement cycles for hardware and software, combined with complex purchasing procedures, make it difficult to ramp up infrastructure quickly when needs dictate.
Faced with similar challenges, managers in the private sector turn to the cloud. Managed services in the cloud are immediately available, with instant access to virtually unlimited computing power. When special circumstances dictate, a cloud analytics platform may be the only place to turn.
For Federal executives, however, security and compliance are critical concerns. Federal agencies work with highly sensitive data, including Controlled Unclassified Information (CUI), Personally Identifiable Information (PII), financial data, patient records, law enforcement data and many other types of data that must be protected. In 2011, the Office of Management and Budget (OMB) directed Federal agencies to comply with the FedRAMP process to assess, authorize and monitor security in cloud computing.
Fortunately, there are cloud platforms that are FedRAMP-compliant. Since 2011, Amazon Web Services has offered AWS GovCloud, an isolated AWS region designed to host sensitive data and regulated workloads. AWS GovCloud (US) has received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) under the FedRAMP High baseline. GovCloud has also received Level 3-5 Provisional Authorization under the Defense Information Systems Agency’s (DISA) Cloud Security Model (CSM). This means that DoD agencies can use GovCloud for all but Level 6 Classified workloads.
Responding to this improved security, Federal agencies and departments are expanding their use of public cloud. Analysts at Deutsche Bank estimate that spending on public cloud currently amounts to 1-5% of total government IT spending, but that interest is growing rapidly. And just last month, the U.S. State Department awarded a contract for real-time energy analytics to a service running on AWS GovCloud.
To combine the power of Apache Spark with the speed and agility of the cloud, choose a cloud-based analytics platform. When you evaluate a provider, look for five things:
- Apache Spark inside. Apache Spark is a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. It combines capabilities for SQL processing, streaming analytics, machine learning and graph analytics in a single platform, and supports Java, Scala, Python and R interfaces.
- Comprehensive security. Look for an analytics platform that runs natively on Amazon Web Services and supports GovCloud. The service should offer real-time exploration and advanced analytics on GovCloud.
- Access to AWS data. Your analytics platform should work with structured or unstructured data stored in Amazon S3; relational databases, such as Amazon Aurora, MySQL, PostgreSQL, Oracle and SQL Server; Amazon DynamoDB; Amazon Redshift; Elasticsearch; HDFS files and popular data storage formats, such as CSV, Parquet, Avro, RC, ORC and Sequence files; and streaming data sources, such as Amazon Kinesis or Apache Kafka.
- Low-maintenance managed service. Your team does not have extra time to spend managing infrastructure. Choose a platform that lets your users set the level of computing resources available to a job, secure those resources when the job executes, and release them when the job finishes. Your analytics team should be able to scale up and down as necessary, and your organization should pay only for what it uses.
- Integrated workspace. Your platform should support rapid-cycle iterative development and collaboration. Most projects include repeatable modules, such as reading data from a specific data source, or grouping and aggregating granular data into business metrics. Sharing modular code is critical to avoid rework and to ensure consistent metrics; a brief sketch of such a module follows this list.
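As one illustration of the "access to AWS data" and shared-module points, the sketch below shows a small reusable PySpark module that loads a Parquet dataset from S3 and rolls it up into standard monthly metrics. The bucket, dataset, and column names are hypothetical; the point is that the load and the metric definitions live in functions a whole team can share.

```python
# A shareable module for a common task: load Parquet data from S3 and
# aggregate it into consistent business metrics. Names are hypothetical.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def load_filings(spark: SparkSession, path: str) -> DataFrame:
    """Read a Parquet dataset from S3 (or HDFS) into a DataFrame."""
    return spark.read.parquet(path)

def monthly_metrics(filings: DataFrame) -> DataFrame:
    """Aggregate granular records into standard monthly metrics."""
    return (filings
            .withColumn("month", F.date_trunc("month", F.col("filed_at")))
            .groupBy("month", "program")
            .agg(F.count("*").alias("filings"),
                 F.sum("amount").alias("total_amount")))

if __name__ == "__main__":
    spark = SparkSession.builder.appName("shared-metrics").getOrCreate()
    filings = load_filings(spark, "s3a://agency-bucket/filings/")
    monthly_metrics(filings).show()
```

When every analyst calls the same metric function instead of rewriting the aggregation, the numbers reported to leadership stay consistent from one analysis to the next.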
Your department or agency has a job to do, with limited time and money. For rapid insight with Big Data, Apache Spark is the tool of choice for data scientists in the Federal community; for speed and agility, use a cloud-based managed service for Spark. Choose a managed service that offers comprehensive FedRAMP security, access to all of your data on AWS GovCloud, low maintenance, and an integrated workspace for your data scientists.
Photo credit: Andrea Izzotti / Shutterstock