Exploratory Data Analysis

Table of Contents

1. Amazon Athena
  1.1. What is Athena?
  1.2. Some Examples
  1.3. Athena + Glue
  1.4. Athena Cost Model
  1.5. Athena Security
  1.6. Athena anti-patterns
2. Amazon QuickSight
  2.1. What is QuickSight?
  2.2. QuickSight Data Sources
  2.3. SPICE
  2.4. QuickSight Use Cases
  2.5. Machine Learning Insights
  2.6. QuickSight Anti-Patterns
  2.7. QuickSight Security
  2.8. QuickSight User Management
  2.9. QuickSight Pricing
  2.10. QuickSight Dashboards
  2.11. QuickSight Visual Types
3. EMR (Elastic MapReduce)
  3.1. An EMR Cluster
  3.2. EMR Usage
  3.3. EMR / AWS Integration
  3.4. EMR Storage
  3.5. EMR promises
  3.6. So… what’s Hadoop?
  3.7. Apache Spark
    3.7.1. How Spark Works
    3.7.2. Spark Components
    3.7.3. Spark MLlib
    3.7.4. Spark Structured Streaming
    3.7.5. Spark Streaming + Kinesis
    3.7.6. Zeppelin + Spark
  3.8. EMR Notebook
  3.9. EMR Security
  3.10. EMR: Choosing Instance Types
4. SageMaker Ground Truth
  4.1. Who Are These Human Labelers?
  4.2. Ground Truth Plus
  4.3. Other Ways to Generate Training Labels
5. Lab: Preparing Data for TF-IDF on Spark and EMR
  5.1. TF-IDF
  5.2. TF-IDF In Practice
  5.3. Unigrams, Bigrams, etc.
  5.4. Using TF-IDF
1. Amazon Athena

Serverless interactive queries of S3 data

1.1. What is Athena?

- Interactive query service for S3 (SQL)
  - No need to load data, it stays in S3
- Presto under the hood
- Serverless!
- Supports many data formats
  - CSV (human readable)
  - JSON (human readable)
  - ORC (columnar, splittable)
  - Parquet (columnar, splittable)
  - Avro (splittable)
- Unstructured, semi-structured, or structured

1.2. Some Examples

- Ad-hoc queries of web logs
- Querying staging data before loading to Redshift
- Analyze CloudTrail/CloudFront/VPC/ELB etc. logs in S3
- Integration with Jupyter, Zeppelin, RStudio notebooks
- Integration with QuickSight
- Integration via ODBC/JDBC with other visualization tools

1.3. Athena + Glue
1.4. Athena Cost Model

- Pay-as-you-go
  - $5 per TB scanned
  - Successful or cancelled queries count, failed queries do not
  - No charge for DDL (CREATE/ALTER/DROP etc.)
- Save LOTS of money by using columnar formats
  - ORC, Parquet
  - Save 30-90%, and get better performance
- Glue and S3 have their own charges

1.5. Athena Security

- Access control
  - IAM, ACLs, S3 bucket policies
  - AmazonAthenaFullAccess / AWSQuicksightAthenaAccess
- Encrypt results at rest in S3 staging directory
  - Server-side encryption with S3-managed key (SSE-S3)
  - Server-side encryption with KMS key (SSE-KMS)
  - Client-side encryption with KMS key (CSE-KMS)
- Cross-account access in S3 bucket policy possible
- Transport Layer Security (TLS) encrypts data in transit (between Athena and S3)

1.6. Athena anti-patterns

- Highly formatted reports / visualization
  - That’s what QuickSight is for
- ETL
  - Use Glue instead
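Returning to the cost model in 1.4, the arithmetic can be sketched in a few lines of Python. This is only a rough estimator under stated assumptions: the function names are hypothetical, 1 TB is taken as 2**40 bytes, and any per-query minimums, rounding, or separate Glue and S3 charges are ignored. The 30% and 90% figures are simply the ends of the savings range quoted above, not measured results.

```python
TB = 1024 ** 4  # assumption: 1 TB is treated as 2**40 bytes


def athena_scan_cost(bytes_scanned, price_per_tb=5.0):
    """Estimate the charge for a query from the bytes it scanned ($5 per TB)."""
    return bytes_scanned / TB * price_per_tb


def cost_after_columnar(row_format_cost, savings_fraction):
    """Apply an assumed savings fraction (0.30 to 0.90 per the notes above)."""
    if not 0.0 <= savings_fraction <= 1.0:
        raise ValueError("savings_fraction must be between 0 and 1")
    return row_format_cost * (1.0 - savings_fraction)


# A hypothetical query that scans 2 TB of CSV data.
csv_cost = athena_scan_cost(2 * TB)
best_case = cost_after_columnar(csv_cost, 0.90)
worst_case = cost_after_columnar(csv_cost, 0.30)

print(f"CSV scan: ${csv_cost:.2f}")
print(f"Columnar (ORC/Parquet) estimate: ${best_case:.2f} to ${worst_case:.2f}")
```

The point of the sketch is that cost scales with bytes scanned, so anything that lets Athena read fewer bytes (columnar formats in particular) lowers the bill directly.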
2. Amazon QuickSight

Business analytics and visualizations in the cloud

2.1. What is QuickSight?

- Fast, easy, cloud-powered business analytics service
- Allows all employees in an organization to:
  - Build visualizations
  - Perform ad-hoc analysis
  - Quickly get business insights from data
  - Anytime, on any device (browsers, mobile)
- Serverless

2.2. QuickSight Data Sources

- Redshift
- Aurora / RDS
- Athena
- EC2-hosted databases
- Files (S3 or on-premises)
  - Excel
  - CSV, TSV
  - Common or extended log format
- Data preparation allows limited ETL

2.3. SPICE

- Data sets are imported into SPICE
  - Super-fast, Parallel, In-memory Calculation Engine
  - Uses columnar storage, in-memory, machine code generation
  - Accelerates interactive queries on large datasets
- Each user gets 10GB of SPICE
- Highly available / durable
- Scales to hundreds of thousands of users

2.4. QuickSight Use Cases

- Interactive ad-hoc exploration / visualization of data
- Dashboards and KPIs
- Analyze / visualize data from:
  - Logs in S3
  - On-premises databases
  - AWS (RDS, Redshift, Athena, S3)
  - SaaS applications, such as Salesforce
  - Any JDBC/ODBC data source

2.5. Machine Learning Insights

- Anomaly detection
- Forecasting
- Auto-narratives
2.6. QuickSight Anti-Patterns

- Highly formatted canned reports
  - QuickSight is for ad-hoc queries, analysis, and visualization
- ETL
  - Use Glue instead, although QuickSight can do some transformations

2.7. QuickSight Security

- Multi-factor authentication on your account
- VPC connectivity
  - Add QuickSight’s IP address range to your database security groups
- Row-level security
- Private VPC access
  - Elastic Network Interface, AWS Direct Connect

2.8. QuickSight User Management

- Users defined via IAM, or email signup
- Active Directory integration with QuickSight Enterprise Edition

2.9. QuickSight Pricing

- Annual subscription
  - Standard: $9 / user / month
  - Enterprise: $18 / user / month
- Extra SPICE capacity (beyond 10GB)
  - $0.25 (Standard) or $0.38 (Enterprise) / GB / month
- Month to month
  - Standard: $12 / user / month
  - Enterprise: $24 / user / month
- Enterprise edition
  - Encryption at rest
  - Microsoft Active Directory integration

2.10. QuickSight Dashboards
2.11. QuickSight Visual Types

- AutoGraph
- Bar charts
  - For comparison and distribution (histograms)
- Line graphs
  - For changes over time
- Scatter plots, heat maps
  - For correlation
- Pie graphs, tree maps
  - For aggregation
- Pivot tables
  - For tabular data
- Stories
3. EMR (Elastic MapReduce)

- Managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive & more
- EMR Notebooks
- Several integration points with AWS

3.1. An EMR Cluster

- Master node: manages the cluster
  - Single EC2 instance
- Core node: hosts HDFS data and runs tasks
  - Can be scaled up & down, but with some risk
- Task node: runs tasks, does not host data
  - No risk of data loss when removing
  - Good use of spot instances
3.2. EMR Usage

- Transient vs long-running clusters
  - Can spin up task nodes using Spot instances for temporary capacity
  - Can use reserved instances on long-running clusters to save $
- Connect directly to master to run jobs
- Submit ordered steps via the console
- EMR Serverless lets AWS scale your nodes automatically

3.3. EMR / AWS Integration

- Amazon EC2 for the instances that comprise the nodes in the cluster
- Amazon VPC to configure the virtual network in which you launch your instances
- Amazon S3 to store input and output data
- Amazon CloudWatch to monitor cluster performance and configure alarms
- AWS IAM to configure permissions
- AWS CloudTrail to audit requests made to the service
- AWS Data Pipeline to schedule and start your clusters

3.4. EMR Storage

- HDFS
- EMRFS: access S3 as if it were HDFS
  - EMRFS Consistent View: optional for S3 consistency
  - Uses DynamoDB to track consistency
- Local file system
- EBS for HDFS

3.5. EMR promises

- EMR charges by the hour
  - Plus EC2 charges
- Provisions new nodes if a core node fails
- Can add and remove task nodes on the fly
- Can resize a running cluster’s core nodes

3.6. So… what’s Hadoop?
3.7. Apache Spark
3.7.1. How Spark Works
3.7.2. Spark Components
3.7.3. Spark MLlib

- Classification: logistic regression, naïve Bayes
- Regression
- Decision trees
- Recommendation engine (ALS)
- Clustering (K-Means)
- LDA (topic modeling)
- ML workflow utilities (pipelines, feature transformation, persistence)
- SVD, PCA, statistics

3.7.4. Spark Structured Streaming
3.7.5. Spark Streaming + Kinesis
3.7.6. Zeppelin + Spark

- Can run Spark code interactively (like you can in the Spark shell)
  - This speeds up your development cycle
  - And allows easy experimentation and exploration of your big data
- Can execute SQL queries directly against SparkSQL
- Query results may be visualized in charts and graphs
- Makes Spark feel more like a data science tool!

3.8. EMR Notebook

- Similar concept to Zeppelin, with more AWS integration
- Notebooks backed up to S3
- Provision clusters from the notebook!
- Hosted inside a VPC
- Accessed only via AWS console

3.9. EMR Security

- IAM policies
- Kerberos
- SSH
- IAM roles

3.10. EMR: Choosing Instance Types

- Master node:
  - m4.large if < 50 nodes, m4.xlarge if > 50 nodes
- Core & task nodes:
  - m4.large is usually good
  - If the cluster waits a lot on external dependencies (e.g., a web crawler), t2.medium
  - Improved performance: m4.xlarge
  - Computation-intensive applications: high CPU instances
  - Database, memory-caching applications: high memory instances
  - Network / CPU-intensive (NLP, ML): cluster compute instances
- Spot instances
  - Good choice for task nodes
  - Only use on core & master if you’re testing or very cost-sensitive; you’re risking partial data loss
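The instance-type rules of thumb in 3.10 can be written down as a small lookup, shown here as an illustrative sketch only. The function names and workload keys are hypothetical, and because the notes leave a cluster of exactly 50 nodes unspecified, this sketch assumes it falls into the larger tier.

```python
# Guidelines for core & task nodes, keyed by hypothetical workload labels.
CORE_TASK_GUIDELINES = {
    "general": "m4.large",
    "waits_on_external_dependencies": "t2.medium",
    "improved_performance": "m4.xlarge",
    "computation_intensive": "high CPU instances",
    "database_or_memory_caching": "high memory instances",
    "network_or_cpu_intensive": "cluster compute instances",
}


def suggest_master_instance(node_count):
    """m4.large below 50 nodes, otherwise m4.xlarge (50 itself is an assumption)."""
    return "m4.large" if node_count < 50 else "m4.xlarge"


def spot_is_reasonable(node_role, testing_or_cost_sensitive=False):
    """Spot suits task nodes; core/master only when data-loss risk is acceptable."""
    if node_role == "task":
        return True
    return testing_or_cost_sensitive


print(suggest_master_instance(20))
print(CORE_TASK_GUIDELINES["waits_on_external_dependencies"])
print(spot_is_reasonable("core"))
```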
4. SageMaker Ground Truth

What is Ground Truth?

- Sometimes you don’t have training data at all, and it needs to be generated by humans first.
- Example: training an image classification model. Somebody needs to tag a bunch of images with what they are images of before training a neural network.
- Ground Truth manages humans who will label your data for training purposes.

But it’s more than that:

- Ground Truth creates its own model as images are labeled by people
- As this model learns, only images the model isn’t sure about are sent to human labelers
- This can reduce the cost of labeling jobs by up to 70%
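The idea of sending only uncertain items to humans can be illustrated with a confidence threshold. This is a conceptual sketch, not Ground Truth's actual API or algorithm: the function name, the 0.9 threshold, and the sample predictions are all made up for illustration.

```python
def route_for_labeling(predictions, confidence_threshold=0.9):
    """Split items into auto-labeled ones and ones that need a human.

    predictions maps an item id to a (predicted_label, confidence) pair
    produced by some model; the threshold value is an assumption.
    """
    auto_labeled = {}
    needs_human = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= confidence_threshold:
            auto_labeled[item_id] = label   # model is sure enough
        else:
            needs_human.append(item_id)     # send to a human labeler
    return auto_labeled, needs_human


# Hypothetical model output for four images.
predictions = {
    "img-001": ("cat", 0.98),
    "img-002": ("dog", 0.55),
    "img-003": ("cat", 0.91),
    "img-004": ("dog", 0.62),
}

auto_labeled, needs_human = route_for_labeling(predictions)
print("auto-labeled:", auto_labeled)
print("sent to humans:", needs_human)
```

As the model improves, more items clear the threshold and fewer reach the human queue, which is where the cost reduction comes from.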
4.1. Who Are These Human Labelers?

- Mechanical Turk
- Your own internal team
- Professional labeling companies

4.2. Ground Truth Plus

- Turnkey solution
- “Our team of AWS Experts” manages the workflow and team of labelers
  - You fill out an intake form
  - They contact you and discuss pricing
- You track progress via the Ground Truth Plus Project Portal
- Get labeled data from S3 when done
4.3. Other Ways to Generate Training Labels

- Rekognition
  - AWS service for image recognition
  - Automatically classify images
- Comprehend
  - AWS service for text analysis and topic modeling
  - Automatically classify text by topics, sentiment
- Any pre-trained model or unsupervised technique that may be helpful
5. Lab: Preparing Data for TF-IDF on Spark and EMR

5.1. TF-IDF

- Stands for Term Frequency and Inverse Document Frequency
- Important data for search: figures out what terms are most relevant for a document
- Term Frequency just measures how often a word occurs in a document
  - A word that occurs frequently is probably important to that document’s meaning
- Document Frequency is how often a word occurs in an entire set of documents, e.g., all of Wikipedia or every web page
  - This tells us about common words that just appear everywhere no matter what the topic, like “a”, “the”, “and”, etc.

So a measure of the relevancy of a word to a document might be:

$$\frac{\text{Term Frequency}}{\text{Document Frequency}}$$

Or:

$$\text{Term Frequency} \times \text{Inverse Document Frequency}$$

That is, take how often the word appears in a document, over how often it just appears everywhere. That gives you a measure of how important and unique this word is for this document.

5.2. TF-IDF In Practice

- We actually use the log of the IDF, since word frequencies are distributed exponentially. That gives us a better weighting of a word’s overall popularity.
- TF-IDF assumes a document is just a “bag of words”
  - Parsing documents into a bag of words can be most of the work
  - Words can be represented as a hash value (number) for efficiency
  - What about synonyms? Various tenses? Abbreviations? Capitalizations? Misspellings?
- Doing this at scale is the hard part
  - That’s where Spark comes in!

5.3. Unigrams, Bigrams, etc.

- An extension of TF-IDF is to not only compute relevancy for individual words (terms) but also for bi-grams or, more generally, n-grams.
- “I love certification exams”
  - Unigrams: “I”, “love”, “certification”, “exams”
  - Bi-grams: “I love”, “love certification”, “certification exams”
  - Tri-grams: “I love certification”, “love certification exams”

5.4. Using TF-IDF

A very simple search algorithm could be:

- Compute TF-IDF for every word in a corpus
- For a given search word, sort the documents by their TF-IDF score for that word
- Display the results
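The three search steps, together with the n-gram idea from 5.3, can be sketched in plain Python on a toy corpus. This is a minimal illustration rather than the Spark lab code: the function names and the three sample documents are made up, the tokenizer is deliberately naive (lowercase, split on whitespace), and IDF is taken as log(N / document frequency), which is one common variant among several.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive bag-of-words parser: lowercase and split on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf_index(docs, n=1):
    """Build {doc_id: {term: tf * log(N / df)}} for unigrams or n-grams."""
    term_counts = {
        doc_id: Counter(ngrams(tokenize(text), n)) for doc_id, text in docs.items()
    }
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    num_docs = len(docs)
    return {
        doc_id: {
            term: tf * math.log(num_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        for doc_id, counts in term_counts.items()
    }


def search(index, term):
    """Sort the documents containing the term by TF-IDF score, best first."""
    term = term.lower()
    scored = [(scores[term], doc_id) for doc_id, scores in index.items() if term in scores]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical three-document corpus.
docs = {
    "a": "I love certification exams",
    "b": "Spark makes big data exams feel small",
    "c": "I love Spark and I love EMR",
}

index = tf_idf_index(docs)
print(search(index, "love"))
print(ngrams("I love certification exams".split(), 2))
```

Note that with this IDF variant a term appearing in every document scores zero, which matches the intuition that words found everywhere say little about any one document. In the lab setting, the same pipeline would be distributed with Spark rather than run in a single process.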