Data
- Data lake (no structure enforced on ingest) vs. data warehouse (structured data, ready to be consumed)
- OLTP (transactional: inserts/updates) vs. OLAP (analytical queries)
- NoSQL: MongoDB, Cassandra
- Batch (bounded volume) vs. real-time/streaming (continuous)
- Data flow: ingestion > storing > processing > visualizing
- ETL: extract, transform, load vs. ELT: load to storage first, transform later (see the sketch after this list)
- Transformations: formatting, labeling, filtering, validating
- Sources and sinks (destination)
- Data modeling: conceptual (entities), logical (structures), physical (implementation)
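- A minimal ETL sketch in Python (file names and fields are hypothetical; in ELT the raw rows would be loaded to storage first and transformed inside the warehouse):
import csv, json

# Extract: read raw rows from the source (hypothetical file).
with open("orders_raw.csv", newline="") as source:
    rows = list(csv.DictReader(source))

# Transform: validate, filter, and format before loading (the "T" in ETL).
cleaned = [
    {"order_id": r["order_id"], "amount": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")  # drop rows with a missing amount
]

# Load: write to the sink (destination).
with open("orders_clean.json", "w") as sink:
    json.dump(cleaned, sink)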
Data Lifecycle
- Collect (Ingest) > Store > Process > Explore & Visualize
Data ingestion
- Realtime
- Large number of sources
- Requires low latency
- Pub/Sub -> Cloud Dataflow
- Batch
- Small number of sources
AIML
- AI > ML > Deep Learning (Neural network)
Services
Database types
- Relational vs. Non-relational/NoSQL.
- SQL (structured query language): consistency over performance.
- ACID: atomicity, consistency, isolation, durability (see the sketch after this list).
- Atomicity: all statements in a transaction commit together, or none do.
- Consistency: each transaction moves the database from one valid state to another (constraints stay consistent across tables).
- Isolation: concurrent transactions produce the same result as if they ran sequentially.
- Durability: once committed, data survives crashes and infrastructure failures.
- (+) Consistent, reliable, data integrity.
- (-) Scale, performance, unstructured data.
- NoSQL: performance over consistency.
- Unstructured, key-value, wide-column.
- Some offer ACID transactions (e.g. Datastore).
- Use cases: IoT, user profiles, high-speed analytics.
- Eventually consistent.
- Databases: Redis, MongoDB, Cassandra, HBase, Bigtable, RavenDB.
- (+) Scale, performance.
- (-) Consistency, data integrity.
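- A minimal sketch of atomicity using Python's built-in sqlite3 (the 'accounts' table is hypothetical): both updates of the transfer commit together, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # one transaction: both UPDATEs commit together, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on error the context manager has already rolled back (atomicity)

print(conn.execute("SELECT id, balance FROM accounts").fetchall())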
Persistent disk
- SAN
Cloud Storage
- Object storage
- Tiers:
- Multi-regional > Dual region > Regional
- Standard > Nearline > Coldline > Archive (same latency, higher retrieval cost for colder classes; see the sketch after this section)
- Versioning + redundancy
- Data lake
- Control
- IAM for broad (project/bucket) access; ACLs for granular (per-object) control
- Project > bucket > object level
- Data transfer
- Storage transfer service (sources: S3, HTTP, other GCS buckets)
- Data transfer appliance (100TB or 480TB - from corporate data center)
- Serving
- Static website
- Import to databases (e.g. Cloud SQL)
- Data analysis (e.g. BigQuery, CloudML, Cloud Dataproc)
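- A sketch with the google-cloud-storage Python client (bucket name, location, and object path are hypothetical); the storage class and location are set when the bucket is created:
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()  # uses Application Default Credentials

# Hypothetical bucket with the Nearline class (colder class = cheaper storage,
# same latency, higher retrieval cost).
bucket = client.bucket("my-example-bucket")
bucket.storage_class = "NEARLINE"
client.create_bucket(bucket, location="us-central1")

# Upload an object; it inherits the bucket's default storage class.
blob = bucket.blob("raw/2024/events.csv")
blob.upload_from_filename("events.csv")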
Logging and Monitoring
- Managed databases
- Built-in features: logs, audits, metrics: disk I/O, CPU, network
- No additional configuration needed
- No application-level data
- Unmanaged databases: running on GCE (e.g. MySQL, MongoDB, Cassandra)
- Custom features with Agents!
- Logging agent: stackdriver logging = application logs (configure with fluentd)
- Monitoring agent: stackdriver monitoring = application performance/metrics/alerts
- fluentd: open-source data collector; lets you unify data collection and consumption.
Cloud SQL: managed; regional; cross-region read replicas (RR)
- Relational database
- Regional
- Regional master/slave with regional persistent disk
- Regional or cross-regional read-replicas. Can failover/DR and promote read-replica in another region in case of failure.
- Binary logs: for point-in-time recovery, replication.
- Managed: backup, patches, upgrades (during maintenance window), monitoring and logging, auto disk size scale, failover (master/slave), network access control
- Storage up to ~10TB; regional read replicas
- Load through GCS
- SQL dump (.sql, tables/schema) or CSV (.csv) files; they can't contain triggers, views, or stored procedures.
- Compress to .gz to save cost
- Performance best practices
- IOPS scale with disk size
- Choose an instance with enough memory to hold your largest table (avoids page swapping)
- SQL best practices
- INNER JOIN vs. WHERE
- SELECT A, B, C vs. SELECT *
- Normalize tables
- A stopped instance isn't charged for compute, but storage is still charged.
- Transferring data to BigQuery goes through Cloud Storage (export to GCS, then load).
- Cloud SQL for PostgreSQL HA: active/standby instances share a regionally replicated persistent disk
Memorystore: managed redis
- Simple key-value pairs
Firestore: document store; no-ops; regional / multi-regional
- Serverless - Document store
- Collection - nested collection (sub-collection)
- Collection ‘crew’ document ‘jeanluc’
- Reference: to another collection
- Data is manipulated through the API (.add(), .update(); see the sketch after this list)
- Automatic single-field indexes
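- A sketch of those API calls with the google-cloud-firestore Python client (following the 'crew'/'jeanluc' example above; field names are hypothetical):
from google.cloud import firestore  # pip install google-cloud-firestore

db = firestore.Client()

# Collection 'crew', document 'jeanluc'; set() creates or overwrites the document.
jeanluc = db.collection("crew").document("jeanluc")
jeanluc.set({"rank": "captain", "ship": "enterprise"})

# update() changes individual fields; add() creates a document with an auto ID.
jeanluc.update({"rank": "admiral"})
db.collection("crew").add({"rank": "ensign", "name": "wesley"})

# Nested sub-collection under a document, and a reference to another document.
jeanluc.collection("missions").add({"name": "farpoint"})
jeanluc.update({"ship_ref": db.collection("ships").document("enterprise")})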
Datastore: ACID! Transactional! NoSQL! Replaced by Firestore
- Single Datastore database per project
- No-Ops (fully managed), highly scalable (NoSQL - multi-region/sharding/replication)
- Not suitable for:
- Analytics (use Bigtable/BigQuery instead)
- Extreme scale (10M+ reads/writes per second): use Bigtable; Datastore suits datasets under ~1TB
- If you don’t need ACID, use Bigtable
- Relational workloads / lift-and-shift migrations (use Cloud SQL)
- Near zero latency (use Cloud Memorystore)
- Multi-regional or regional: the location is set at creation and CANNOT be changed
- Changing it effectively means re-creating the database (historically, a new project)
- Datastore is transactional database vs. Bigtable is analytical database.
- Data organization
- Kind (category of entities) ~ table
- Entities have a hierarchy (entity groups within the same namespace) ~ rows
- Parent: referenced by the parent entity's key (e.g. Key(Users, 12345))
- Each entity has one or more properties and a key (auto-generated or a chosen column) ~ columns / primary key (see the sketch after this section)
- Indexes and querying
- Avoid: exploding indexes!
- Narrow index scope (selective, not all properties)
- Impact: storage and degraded performance
- Types: built-in (single property), and composite
- Composite indexes:
gcloud datastore create-indexes index.yaml
- Query
- Using the console’s GUI
- Using GQL (Google query language)
- Only indexed properties can be used in a query's WHERE clause
- Data consistency: performance vs. accuracy
- Strongly consistent: parallel processes see changes in order. Slower latency for transactions to complete. e.g. financial transactions (deposit / withdrawal)
- Eventually consistent: processes may see changes out of order, but eventually reach the correct state.
- Modes
- Cloud Firestore is the next generation of Cloud Datastore. You can create a Firestore database in Datastore mode.
- Native: Firestore mode, supports real-time synchronization
- Datastore: Datastore’s system behavior + Firestore’s storage layer.
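- A sketch of the kind/entity/key hierarchy with the google-cloud-datastore Python client (kind and property names are hypothetical):
from google.cloud import datastore  # pip install google-cloud-datastore

client = datastore.Client()

# Kind ~ table, entity ~ row, properties ~ columns, key ~ primary key.
parent_key = client.key("Users", 12345)            # Key(Users, 12345)
task_key = client.key("Tasks", parent=parent_key)  # child in the same entity group

task = datastore.Entity(key=task_key)
task.update({"description": "file report", "done": False})
client.put(task)

# Ancestor queries within an entity group are strongly consistent.
query = client.query(kind="Tasks", ancestor=parent_key)
query.add_filter("done", "=", False)
results = list(query.fetch())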
Bigtable: NoSQL, analytical; HBase-compatible; managed; cross-region clusters
- Single key
- The Bigtable paper inspired much of the NoSQL industry (e.g. HBase, Cassandra)
- Petabyte
- Ballpark cost: 1TB with 3 nodes ≈ $1.5K/mo
- Up to ~10,000 write queries per second per node
- Use cases
- Under ~1TB: use Datastore; at 1TB and beyond, consider Bigtable
- NoSQL, high throughput analytics
- Huge dataset
- Infrastructure
- Clients > Front-end server pool > Cluster (Nodes: store meta-data) > Colossus (SSTables: sharded) > Shared Log
- Separate storage (SSTables/tablets on Colossus) vs. compute (nodes)
- Configuration
- NOT a no-ops, must configure nodes
- Entire Bigtable project is called an “instance”, which has nodes and clusters
- Nodes are grouped into clusters
- 1 or more clusters per instance
- To achieve strong consistency, configure the app to use a single cluster
- Instance types
- DEV: single node, no replication
- PROD: 3+ nodes per cluster, replication, guaranteed throughput
- Can’t change HDD to SSD, need to copy data to a new instance
- Storage is auto-scaled
- Replication: data is synchronized between clusters, across-region
- Resizing: add/remove nodes/clusters with no downtime
- Changing disk type (HDD/SSD) requires new instance
- SSD gives better latency/IOPS; HDD is cheaper and acceptable for large, throughput-oriented scans.
- Interaction: cbt (cloud bigtable) tool or HBase shell
- Use Key Visualizer to identify hotspot node
- Data organization
- One big table: can have thousands of columns / billions of rows
- Table is sharded across tablets
- Table components
- Row Key: first column
- Columns are grouped into column families
- Each cell is addressed by {row key, column family:column qualifier} (e.g. cf1:cq1)
- Indexing and queries
- Only the row key is indexed
- Field promotion: move fields from column data to row key
- Schema design
- Per table: only the row key is indexed
- Keep all entity info in a single row. Related entities should be in adjacent rows.
- Well-defined row keys distribute load across nodes instead of a single node (avoids hotspotting)
- Combine all queried parameters into the row key; queries can only search on information in the row key (see the sketch after the cbt commands below)
- Best practices
- Reverse domain name: larger scope/qualifier > smaller scope/qualifier
- String identifier e.g. userID
- Timestamp (YYYYMMDD): should not be at front/alone
- For time series data: use tall+narrow instead of short+wide
- gcloud commands
gcloud components update
gcloud components install cbt
-- need to have a .cbtrc file with project name and instance name
cbt createtable TABLE-NAME
cbt ls
cbt createfamily TABLE-NAME FAMILY-NAME
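- A sketch of the row-key guidance above with the google-cloud-bigtable Python client (project, instance, table, and column names are hypothetical, and the table/family are assumed to already exist, e.g. via the cbt commands); note the reverse-domain + identifier + timestamp key, with the timestamp not leading:
import datetime
from google.cloud import bigtable  # pip install google-cloud-bigtable

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("metrics")  # assumes table exists

# Row key: combine the queried fields; keep the timestamp out of the leading
# position so sequential writes don't hotspot a single node.
ts = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
row_key = f"com.example#sensor42#{ts}".encode()

row = table.direct_row(row_key)
row.set_cell("cf1", b"temperature", b"21.5")  # family cf1, qualifier 'temperature'
row.commit()

# Reads are served by the row key only (the single index).
got = table.read_row(row_key)
print(got.cells["cf1"][b"temperature"][0].value)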
Cloud Spanner: managed; region/multi-region
- Challenges the CAP theorem (consistency, availability, partition tolerance)
- May sacrifice Availability for Consistency/Partition Tolerance
- Is called NewSQL: No-Compromise Relational Database Service
- Regional / Multi-regional / Global
- Five 9s (99.999%) SLA, i.e. roughly 5 minutes of downtime per year
- Support standard SQL format (ANSI 2011)
- Horizontal: add more instances (nodes) to share the load. Vertical: upgrade single instance with more CPU/RAM.
- Is NOT MySQL/PostgreSQL compatible. Migration to/from requires work!
- Architecture: similar to Bigtable
- Nodes (compute) are separate from storage; storage can spread across zones/regions
- Per node: ~10K reads/s; ~2K writes/s; 2TB storage
- Add nodes if CPU utilization is too high (recommendation is < 45%)
- Data is replicated across zones/regions consistently
- Avoid hotspotting: colocate write-heavy workloads in the same region as the leader
- IAM
- Project, Instance or Database level
- roles/spanner.PERMISSION
- Admin: full access
- Database Admin: create/edit/delete databases, grant access to databases
- Database Reader: read/execute database/schema
- Viewer: view instances, database - CANNOT modify or read from database
- Data organization and schema
- RDBMS: table, row, column. Supports joins, queries, etc.
- Tables are handled differently
- Tables are parent/child relationship
- Interleaved data layout (nested tables): e.g. Singers(1) > Albums(1,1) > Songs(1,2,1); the parent key plays the role of a foreign key built into the child's key (see the DDL sketch after this section)
- The parent's primary key is used in the child table and carried along into grandchildren.
- Avoid hotspotting: ‘cos nodes are distributed by key space (key range)
- No sequential numbers
- No timestamp (‘cos sequential)
- Data export from Spanner to Cloud Storage is done through Dataflow
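- A sketch of the interleaved Singers > Albums layout, issued through the google-cloud-spanner Python client (instance and database IDs are hypothetical); the child table repeats the parent's primary key as the leading part of its own key:
from google.cloud import spanner  # pip install google-cloud-spanner

client = spanner.Client()
database = client.instance("my-instance").database("music")

ddl = [
    """CREATE TABLE Singers (
         SingerId INT64 NOT NULL,
         Name STRING(1024)
       ) PRIMARY KEY (SingerId)""",
    # Child rows are stored physically next to their parent row (interleaved);
    # SingerId from the parent leads the child's composite primary key.
    """CREATE TABLE Albums (
         SingerId INT64 NOT NULL,
         AlbumId INT64 NOT NULL,
         Title STRING(1024)
       ) PRIMARY KEY (SingerId, AlbumId),
       INTERLEAVE IN PARENT Singers ON DELETE CASCADE""",
]
database.update_ddl(ddl).result()  # long-running operation; wait for the schema change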
Cloud Pub/Sub: no-ops; global
- Realtime: unbounded data vs. batch (bounded data)
- Continuous flow of data
- Often requires real-time collection, processing, storing in 1-step for low latency.
- Tightly vs. loosely coupled systems (Pub/Sub decouples producers from consumers)
- Intermediary / buffer / shock absorber
- Global-scale, no-ops, global availability, auto-scaling
- Guaranteed at-least-once delivery
- m-to-n asynchronous: a topic can have more than one subscription.
- Components: Topics, Messages, Publishers, Subscribers, Message Store (queue per subscriber)
- Support both push and pull
- Push: lower latency, closer to real time. Requires subscribers to expose webhook endpoints that accept HTTPS POST.
- Pull: suited to large volumes of messages (batch); see the sketch at the end of this section.
- IAM: project, topic or subscription level
- Admin, Editor, Publisher, Subscriber
- Pricing: data volume (per GB per month)
- Message ordering: not guaranteed by default. Can be mitigated with Dataflow or a timestamp attribute in the message.
- vs Kafka
- Global support
- Hybrid workload: connect Kafka (on premises) with Pub/Sub (GCP)
- Using connector: open source plug-ins that connect Kafka to GCP (Pub/Sub, Dataflow, BigQuery)
- Source connector: upstream connector (to Kafka)
- Sink connector: downstream connector (from Kafka)
- Monitoring subscriber health: using Stackdriver
- Backlog in Pub/Sub queue (subscriptions): slow subscribers or error in acknowledging messages.
- Pub/Sub Lite: zonal service.
- Backup: point-in-time recovery with Snapshots. Good for deploying a new application version with the option to roll back.
- Advance
- Seek: allows for replaying / discarding of messages
- Retain: messages stay in Topic after being ACKed.
- Snapshot: point-in-time rollback
- Limitations
- Message ordering
- Message size: <10MB
- Message retention duration: 10 minutes to 7 days
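- A sketch of publish and synchronous pull with the google-cloud-pubsub Python client (project, topic, and subscription names are hypothetical):
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

project = "my-project"

# Publisher: message data is bytes, plus optional attributes (e.g. a timestamp
# attribute a subscriber or Dataflow can use to re-order messages).
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "orders")
publisher.publish(topic_path, b'{"order_id": 42}', event_time="2024-01-01T00:00:00Z").result()

# Subscriber (pull): fetch a batch, process it, then acknowledge. Unacked
# messages are redelivered - at-least-once delivery.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "orders-sub")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
ack_ids = [m.ack_id for m in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})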
Cloud Dataflow: auto-scale, no-ops, stream and batch, regional
- Process both batch and real-time data. Solution: Beam + Cloud Dataflow
- Built on top of Apache Beam. Configuration is 100% code-based
- Integrates natively with other tools: Pub/Sub, BigQuery, AI Platform. Connectors: Bigtable, Apache Kafka
- IAM
- Only at the project level (i.e. all pipelines or none)
- Pipeline data access is separate from pipeline access
- Dataflow Admin / Developer / Viewer / Worker (Worker is used for service accounts): these roles alone don't grant the ability to view the pipeline's data.
- Workers use controller service account (GCE instance service account)
- Retrieve meta-data
- Can be replaced by user-managed controller service account
- Dataproc: managed Apache Hadoop/Spark service
- Choose Dataproc when migrating existing Hadoop, or when you prefer a DevOps (cluster) model over serverless
- Dataproc can’t process both stream and batch
- Key concepts
- Element: single entry of data (e.g. data row)
- PCollection: distributed data set, input / output
- Transform: data processing operation (step) in pipeline
- ParDo: parallel processing - transform applied to individual elements (e.g. extracting elements)
- DoFn: user-defined function used by ParDo
- Aggregation: e.g. GroupByKey / CoGroupByKey (merge keyed collections, like joining tables). See the pipeline sketch at the end of this section.
- Dealing with late/out of order data
- Resolved with: Windows + Watermarks + Triggers
- Windows: logically, by time span
- Watermarks: track event time vs. processing time (timestamp provided by the source and/or Pub/Sub) to estimate when a window's data has arrived
- Triggers: determine when results in window are emitted. Allow late-arriving data to be included.
- Best practices
- Handling errors: try-catch block. Output errors to new PCollection > Pub/Sub/GCS. Recycle ‘bad’ data.
- Know your window types:
- Global
- Fixed time
- Sliding time (overlapping fixed time windows)
- Session
- Can specify endpoint region: good for keeping data in a region (wrt regulatory requirements)
- MapReduce workload can be migrated to Dataflow
- Use DirectRunner to run jobs locally (good for DEV)
- Update Dataflow pipelines
- Update job (clone): creates a new job with the same name but a new jobId. Old transforms are mapped to new ones with a transform mapping; in-flight (buffered) data is transferred to the new job using that mapping.
- Cloud Dataflow SQL: Apache Beam SQL
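- A sketch of the Beam concepts above (PCollection, ParDo/DoFn, fixed windows, aggregation), runnable locally with the DirectRunner; the element values are hypothetical:
import apache_beam as beam  # pip install apache-beam
from apache_beam.transforms import window

class AddTimestamp(beam.DoFn):                      # DoFn: user code run by ParDo
    def process(self, element):
        user, value, event_time = element
        # Attach the event time so windowing uses it instead of processing time.
        yield window.TimestampedValue((user, value), event_time)

with beam.Pipeline() as p:                          # DirectRunner by default (good for DEV)
    (p
     | beam.Create([("ann", 3, 10), ("bob", 5, 70), ("ann", 2, 75)])  # a PCollection
     | beam.ParDo(AddTimestamp())                   # element-wise transform
     | beam.WindowInto(window.FixedWindows(60))     # 60-second fixed windows
     | beam.CombinePerKey(sum)                      # aggregation within each window
     | beam.Map(print))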
Cloud Dataproc: managed Hadoop/Spark clusters; not no-ops; regional
- Hadoop ecosystem: Hadoop, Spark (with MLlib for ML), Pig, Hive (data warehouse / analysis)
- MapReduce: take big data, distribute (map) to many workers, stitch (reduce) back results from individual/parallel processes/workers.
- Managed Apache Hadoop/Spark clusters (create cluster, deployment, job submission, manual scaling, monitoring)
- On-demand, managed but not no-ops
- Must configure cluster (manual scaling)
- Use case: migrate existing Hadoop/Spark system
- HDFS is available (but may not be optimal)
- IAM
- Project level only (primitive and predefined roles)
- Dataproc
- Editor: full
- Viewer: read only
- Worker: assigned to service accounts: read/write GCS, write to Cloud Logging.
- Configuration
- Master node > worker nodes > preemptible (PVM) worker nodes; HDFS lives on the worker nodes' persistent disks
- Master node contains YARN resource manager
- Autoscaling
- Optional secondary worker nodes with preemptible VMs
- Dataproc manages the preemptible VMs, re-adding them if they are preempted.
- Requires at least 2 standard worker nodes before adding preemptible nodes.
- Cost optimization: add preemptible nodes; prefer high-memory machines over SSDs
- More worker nodes + secondary nodes > faster processing
- Performance improvement: with local SSD (during cluster creation)
- Common initialization scripts through public GCS bucket (e.g. Kafka initialization)
- Pricing: underlying GCE instances (master nodes, worker nodes)
- Dataproc image versions (e.g. 2.0 on Debian or Ubuntu) include pre-packaged and optional libraries (Hadoop, Spark, Hive, Pig, Flink)
- Support custom Dataproc image with customized libraries
- To access the Dataproc Hadoop web GUI (port 8088, 9870).
- Open firewall port
- Using SOCKS proxy
- Migrating and optimizing
- Migrate data first, through GCS, Apache HBase to Bigtable, Apache Impala/Hive to BigQuery.
- POC with small data set first
- Optimizing cost:
- Delete the cluster once the jobs are done. Before deleting, move Hive/Impala data to BigQuery, HBase data to Bigtable, and HDFS data to GCS.
- Copy data to GCS: by installing connector or copy manually
- Update file prefix in scripts: from hdfs:// to gs://
- Use Dataproc, and run against/output to GCS.
- Consider migrating jobs to Dataflow/BigQuery/etc
- preemptible, faster processing, delete (compute) once done. Dataproc separates storage and compute.
- Optimise performance
- Use regional cloud storage bucket: data is nearer to the cluster -> better performance
- Use larger persistent disk; SSD over HDD
- Allocate more VMs
- Autoscaling
- Is NOT compatible with: HA clusters, HDFS storage, Spark structured streaming
- Native output connectors: GCS, Bigtable, BQ
BigQuery: serverless, no-ops, regional/multi-regional; the Hive equivalent on GCP
- Data warehouse: near real-time analysis of massive amount of data (PB)
- Storage and Analysis. Can process data on BQ or GCS
- Location: multi-region (US, EU), regional
- Support standard SQL (2011) or legacy SQL
- “3rd wave” of computing - Google Big Data Stack 2.0
- Storage (Colossus) and compute (Dremel) are separated, connected by a petabit network (Jupiter)
- How BQ works
- Jobs can scale up to thousands of CPU across many nodes.
- Types: Load, Export, Query, Copy
- Priority: Interactive, Batch
- Columnar data store (a column's data is stored in the same storage volume, so it is faster than going across different storage volumes). Not designed for updates.
- IAM
- Project, dataset, view, table
- Authorized view: allow users to access certain subset of data.
- Table is ACID compliant
- Admin: full
- Data Owner: full to owned datasets
- Data Editor: edit dataset
- Data Viewer: view datasets and tables
- Job User: run jobs
- User: run queries, and create datasets (but not tables)
- Use authorized views to grant access to query results without granting access to the underlying tables
- Dataset can be shared, or public (allUsers, allAuthenticatedUsers)
- Cache query: avoid querying original table again
- Per user only
- Standard SQL
project.dataset.table vs. legacy SQL [project:dataset.table]
- User defined function (UDF)
- Combine SQL with JavaScript/SQL functions / programming logic (e.g. loop)
- WebUI: only usable with Legacy SQL
- Load and export data: BigQuery transfer service
- Batch load: GCS
- Set Schema: “automatically detect”
- Streaming insert: Dataflow
- Read as external source: Drive, GCS, Bigtable
- Dataproc to BQ: pre-installed BigQuery connector (through GCS for staging)
- Export:
- To GCS
- Copy table from dataset to another dataset or clone
- Use a wildcard when exporting files > 1 GB (the export is split into multiple files)
- Query GCS:
- Supports: CSV, JSON, Avro, Parquet, Datastore backups, Firestore exports.
- Can load multiple files at once using a wildcard (*)
- BigQuery Transfer Service
- Import to BQ from other Google’s services such as Ads, DoubleClick, YouTube reports
- Best practices
- Avoid SELECT *
- De-normalize data: group data into a single table (good for reads, not for writes); use nested JSON (RECORD)
- LIMIT does NOT affect cost (cost is based on data scanned)
- Filter early with big WHERE clause
- Do biggest joins first
- Data format for import: best Avro compressed
- Partition instead of sharding into many tables (table-count limits, per-table overhead); usually by timestamp
- By ingestion time
- By custom column (timestamp or date)
- Clustered tables (require a partitioned table): act like an index, dividing reads by the specified column fields (see the sketch at the end of this section)
- The order of the clustering fields (the combined key) matters
- Further reduces data scanned, since blocks outside the clustering-key ranges are skipped
- Using slots (limit 2000 per project). Use flat-rate (pre-paid for slots) if you need more than 2000 slots.
- Multi-region dataset -> more reliable, but higher cost
- Backup: BQ keeps a built-in 7-day point-in-time snapshot history. Restore with the @<time> decorator in ms, e.g. table@-3600000
- Use approximation function instead of aggregation function (APPROX_COUNT_DISTINCT() vs. COUNT(DISTINCT))
- Preview tables and check the query price (dry run) before running
- Query processing details
- Wait (yellow), Read (purple), Compute (joins)(orange), Write (blue)
- WITH abc AS (...): named subquery (CTE), works like an inline view
- Schema
- Support nested table (table in a table) with RECORD type (JSON)
- Streaming inserts
- Streamed rows sit in the streaming buffer (up to ~90 min) before appearing in the preview/GUI, but they can be queried immediately.
- Monitoring and Logging
- Monitoring: resource, performance, capacity, alerts (for long queries), slot utilization
- Logging: has audit trails; contains query executed.
- Pricing
- Storage, query, streaming inserts
- Pay-as-you-go, flat rate (2000 slots), slots
- Data older than 90 days is automatically moved to long-term storage (cheaper). No need to move it to GCS.
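- A sketch with the google-cloud-bigquery Python client tying the best practices above together: create a column-partitioned, clustered table, then query only the needed columns with an early partition filter (project, dataset, table, and column names are hypothetical):
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
table_id = "my-project.sales.orders"

schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]
table = bigquery.Table(table_id, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="created_at")  # partition by column
table.clustering_fields = ["customer_id"]                                # clustering key
client.create_table(table)

# Avoid SELECT *; filter early on the partition column to limit data scanned.
sql = """
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.sales.orders`
    WHERE created_at >= TIMESTAMP('2024-01-01')
    GROUP BY customer_id
"""
for row in client.query(sql).result():
    print(row.customer_id, row.total)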
Machine learning
- Supervised learning: data with labels
- Regression: continuous, numeric variables (e.g. stock price)
- Classification: categorized values (e.g. gender, negative/positive, yes/no)
- Unsupervised learning: clustering, finding patterns
- Reinforcement learning: uses negative/positive rewards (reinforcement) to learn to complete a task
- Data
- Label: e.g. housing price
- Features: e.g. size, #rooms, location (fed to the model as feature columns)
- MSE (mean squared error) = (1/n) Σ (y_i − ŷ_i)²
- Transfer learning: reuse a pre-trained model on a new task (the approach behind AutoML)
- Linear model: z = wx + b (w = weight, b = bias); loss = L(w), which training minimizes
- Feature engineering: process data
- Missing-data types: MCAR, MAR, NMAR
- Imputation: replaced by mean/median, fixed constant, most frequent values, K-nearest neighbor
- Outliers and feature clipping
- One-hot encoding
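- A sketch of the imputation, clipping, and one-hot encoding steps above using pandas (column names and values are hypothetical):
import pandas as pd

df = pd.DataFrame({
    "size":  [70.0, None, 120.0, 95.0],       # numeric feature with a missing value
    "color": ["red", "blue", "red", "green"]  # categorical feature
})

# Imputation: replace missing numeric values with the column mean (could also be
# the median, a fixed constant, the most frequent value, or a KNN estimate).
df["size"] = df["size"].fillna(df["size"].mean())

# Feature clipping: cap outliers, here at the 95th percentile.
df["size"] = df["size"].clip(upper=df["size"].quantile(0.95))

# One-hot encoding: one 0/1 column per category.
df = pd.get_dummies(df, columns=["color"])
print(df)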
- Google knowledge graph search API
- AutoML Edge: export custom trained models, optimized for edge devices.
- Firebase for mobile device deployment
- AutoML Tables vs. BQML
- AutoML takes longer: it preprocesses the input data and tries different models
- Operating ML model
- Data drift / Concept drift
- IAM
- Project and models
- Models only
Neural network:
- An epoch = one pass through the training data
- Neural network adjusts
- Weights: multiplication to input values
- Bias: constant added to the weighted sum (the output when the weighted inputs contribute 0)
- Data: training, test, evaluation
- Parameters: variables the network adjusts automatically during training (e.g. weights, biases)
- Hyper-parameters: set before training; they control the training process (e.g. number of hidden layers) as well as how the data is fed in
- Types
- Model hyperparameters
- Algorithm hyperparameters
- Examples: batch size, training epochs, #hidden layers, regularization, learning rate
- Features: model inputs (e.g. x, x²). Feature engineering: combining/transforming features (e.g. feature crosses)
- Loss: error for training or testing data
- Learning rate: the size of each weight/parameter adjustment step against the loss; determines how fast (and how stably) training converges
- Deep and wide
- Wide - memorization: many features, fewer hidden layers
- Deep - generalization: many hidden layers
- Deep and wide combined: good for recommendation systems
- Over-fitting: too specific to training data
- Causes: not enough training data; too many features; model fitted to features unique to the training data (i.e. noise)
- Solution:
- more training data; variety of training data
- make model less complex: less but more relevant features (combine relevant features to one)
- remove noise: increase ‘regularization’ parameter
- Options
- Regularization
- Increase training data
- Feature selection
- Early stopping
- Cross-validation
- Dropout layers (neural networks): weights/neurons are randomly set to 0 during training
- Regularization:
- Add penalty if the model gets more complex
- Types
- L1: lasso regression - assigns greater importance to the most influential features and drives the rest toward zero. Good when there are many features
- L2: ridge regression - good for models whose features have roughly equal influence
- Under-fitting: error is high even on the training set (model too simple)
- Data source: GCS is recommended; BigQuery can also be used
- Platform: master node and worker nodes
- Prediction: single, batch
- ML API: encode input in base64 format
- AI platform
- Scale tier: defines master, worker, and parameter servers
- Basic, Standard, Premium, and Custom
- Local training:
gcloud ml-engine local train
Datalab: replaced by Notebooks
- Built on top of Jupyter (formerly iPython): interactive python
- Run on GCE (+VPC and Source Repository)
- Share notebook between users (i.e. share code)
Dataprep: offered in partnership with Trifacta (third party)
- Backed by Cloud Dataflow
- Process
- Import
- Transform with recipes
- Run
- Export results (GCS, BQ)
- Intelligent suggestions: for simple tasks
- IAM
- Dataprep User
- Dataprep Service Agent: gives Trifacta the necessary access to project resources
- Pricing: 1.16 * Dataflow job
- Can click on the black bar to select all missing values
- Join datasets: add a new Join step
Data Studio
- Web based BI - drag and drop report builder
- Use GWS identity for access/sharing permission, not IAM
- Files saved in Google Drive
- Connect to many services (Google, Google Cloud, and others)
- BQ, GCS, Cloud SQL, Spanner
- YouTube, Analytics, Sheets, AdWords
- Local upload
- Caching
- Query cache: remembers the queries issued by report components.
- Can NOT be turned off
- Will get from pre-fetch if not found
- Pre-fetch cache: 'smart cache'; predicts data that might be requested.
- Can be turned off
- Will get from live data set if not found
- Only active for data sources that use the owner's credentials for data access
- Cache expires in 12h
- Refresh: the Refresh Data button, or update the data-freshness setting on the connector
- Embed
- To website: using iframe
- Public report can be viewed by everyone. Private report can be viewed by users logged in with Google account
- Pricing
- Free, but data query (e.g. BigQuery query) is chargeable
Cloud Composer
- Fully managed Apache Airflow implementation
- Programmatically create, schedule, and monitor data workflow
- Cloud Composer = Apache Airflow + GKE + GCS
- workflows: Cloud Composer creates workflows using DAG files
- Orchestrate data pipelines
- Format: DAG (Directed Acyclic Graph), written in Python (see the sketch after this section)
- Setup
- Configure environment variables
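- A minimal Airflow DAG sketch of the kind Composer schedules (DAG ID, task IDs, and commands are hypothetical; operator import paths follow Airflow 2.x):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_export",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # Composer runs the Airflow scheduler for you
    catchup=False,
) as dag:
    # Tasks and their dependencies form the directed acyclic graph.
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")
    extract >> load               # 'extract' must finish before 'load' starts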
GAPS
- L1, L2 regularization
- Combining co-dependent and redundant features allows you to reduce the total number of features trained without sacrificing accuracy.
- L2 regularization is used to reduce weights for overfitted training models. L2 regularization is more relevant when all features have relatively equal weights/influence
- L1 regularization is able to reduce the weights of less important features to zero or near zero
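- A minimal numpy sketch of how the L1 and L2 penalties above enter the loss (the weights, base loss, and λ are hypothetical):
import numpy as np

w = np.array([2.0, -0.5, 0.0, 3.1])  # model weights
base_loss = 0.42                      # e.g. mean squared error on the training data
lam = 0.01                            # regularization strength (penalty grows with model complexity)

l1_loss = base_loss + lam * np.sum(np.abs(w))  # L1 (lasso): pushes small weights to zero
l2_loss = base_loss + lam * np.sum(w ** 2)     # L2 (ridge): shrinks all weights evenly
print(l1_loss, l2_loss)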