
Efficient DB Structure and Cost Management in BigQuery

Efficient database management isn't just about technical proficiency; it's also about optimizing cost, scalability, and performance. As the co-founder of BAN LLC, a data-driven venture specializing in analytics for Magic: The Gathering markets, I've developed a robust database structure in Google BigQuery, specifically tailored for extensive daily analytical insights.

Daily Structured Tables Created via R

Our workflow at BAN LLC involves daily automated uploads of datasets through meticulously crafted R scripts. These scripts not only standardize and simplify the database structure but also facilitate easy querying, analysis, and management. We maintain strict naming conventions for clarity and effective governance:

{dataset_name}.YYYY_MM_DD_table_name

Examples include:

  • premiums.2025_03_30_TCG_CK_Data
  • buylist_growth.2025_03_30_buylist_growth
  • vendor_growth.2025_03_30_vendor_growth
  • ck_velocity.2025_03_30_CK_VELOCITY
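
In production these tables are written by our R scripts, but the same upload is easy to sketch with the bq CLI. Below is a minimal, hypothetical version of a single day's load; the CSV path and the reliance on schema autodetection are placeholders rather than our actual pipeline:

# Hypothetical sketch of one daily upload via the bq CLI.
# Our real pipeline does this from R; the source file below is a placeholder.
TODAY=$(date '+%Y_%m_%d')   # e.g. 2025_03_30
bq load \
  --project_id="gaeas-cradle" \
  --source_format=CSV \
  --autodetect \
  "premiums.${TODAY}_TCG_CK_Data" \
  ./tcg_ck_data.csv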

Leveraging BigQuery CLI for Accurate Monitoring

To efficiently manage and monitor database growth, storage costs, and performance metrics, I frequently use the BigQuery command-line interface (CLI). Recently, I put together a small shell pipeline that extracts the size and creation date of every table created within the last seven days:

#!/usr/bin/env bash
# Ensure the Google Cloud SDK is installed (continually amazed how easy this setup is):
# brew install jq
# brew install gawk
# brew install --cask google-cloud-sdk
# gcloud init

PROJECT_ID="gaeas-cradle"
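# BSD/macOS date syntax; on GNU/Linux use: date -d '7 days ago' '+%s'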
ONE_WEEK_AGO=$(date -v -7d '+%s')

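# List all datasets, then all tables with creation timestamps (ms since epoch),
# keep only tables created in the last week, and look up each one's size.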
bq ls --project_id=$PROJECT_ID --format=json \
| jq -r '.[].datasetReference.datasetId' \
| while read DATASET; do
    bq ls --max_results=10000 --format=json $PROJECT_ID:$DATASET \
    | jq -r --arg DATASET "$DATASET" '.[] | select(.creationTime != null) | [$DATASET, .tableReference.tableId, .creationTime] | @tsv'
done \
| gawk -v one_week_ago="$ONE_WEEK_AGO" '($3/1000) >= one_week_ago' \
| while read DATASET TABLE CREATION; do
    json_output=$(bq show --format=json $PROJECT_ID:$DATASET.$TABLE)

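    # creationTime is epoch milliseconds; render it as a readable timestamp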
    creation_time=$(echo "$json_output" | jq -r '(.creationTime | tonumber / 1000 | strftime("%Y-%m-%d %H:%M:%S"))')
    size_bytes=$(echo "$json_output" | jq -r '.numActivePhysicalBytes // "0"')

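    # Convert the raw byte count into a human-readable unit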
    size_human=$(echo "$size_bytes" | awk '{
        size_bytes=$1+0
        if(size_bytes >= 1073741824)
          size=sprintf("%.2f GB", size_bytes/1073741824)
        else if(size_bytes >= 1048576)
          size=sprintf("%.2f MB", size_bytes/1048576)
        else if(size_bytes >= 1024)
          size=sprintf("%.2f KB", size_bytes/1024)
        else
          size=sprintf("%d B", size_bytes)
        print size
    }')

    echo "$DATASET $TABLE $creation_time $size_human"
done \
| sort -k3,4r   # newest first: sort on date, then time, descending


This explicit and reliable method ensures accurate tracking of storage utilization, providing insights critical for cost management and infrastructure planning. Could I have written a .py script for this? Yes. But I wanted some CLI fun in the terminal.

A Detailed Snapshot of Recent Data Activity

Here's a comprehensive look at recent datasets and tables managed over the past week, clearly illustrating the substantial volume and diversity of daily analytics:

Dataset          Table Name                   Creation Date         Size
buylist_growth   2025_03_30_buylist_growth    2025-03-30 00:50:33   1.26 MB
ck_funny_money   2025_03_30_CK_Credit         2025-03-30 00:49:39   253.62 KB
premiums         2025_03_30_TCG_CK_Data       2025-03-30 00:49:30   2.84 MB
tcgplayer        2025_03_30_TCGPLAYER         2025-03-30 00:49:36   734.16 KB
vendor_growth    2025_03_30_vendor_growth     2025-03-30 00:51:46   1.15 MB
ck_velocity      2025_03_30_CK_VELOCITY       2025-03-30 02:29:59   4.66 MB

Such structured and consistent data handling supports rapid analytics and reporting capabilities, integral to strategic decision-making. Naming conventions were not my strong suit back when I originally put this together. As a side note, "funny money" refers to the trade-in bonus that sites like Card Kingdom offer, usually a 15-30% "bump" in value, which is an incredible resource for growing asset value and avoiding fees. Never sleep on credit bumps.

Long-term Scalability and Impressive Data Growth

Consistent daily uploads have led to the substantial growth of our analytical data repositories. Annually, BAN LLC handles terabytes of structured data, spanning more than five years of detailed transactional and market data. This extensive historical dataset provides unparalleled insights into market dynamics, pricing trends, and consumer behavior, enabling us and our community of over 425 paying members to stay ahead of industry trends and leverage data-driven investment opportunities.

That accumulation has produced one of the industry's most comprehensive datasets, and it grows every single day.

SQL Proficiency and Strategic Cost Efficiency

At the core of our efficient database management strategy is a robust proficiency in SQL. By optimizing queries and sharding tables by day, we significantly minimize query costs: a query that targets a single day's table scans only that day's bytes, which is what BigQuery's on-demand pricing bills for. Additionally, our explicit scripting practices within R streamline storage utilization, further reducing operational costs while maximizing performance.
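
A quick way to verify this is a dry run, which reports the bytes a query would scan (the basis of on-demand billing) without executing it. A minimal sketch against one of the example tables above:

# Dry run: estimate scanned bytes for a single day's table without running the query.
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM `gaeas-cradle.premiums.2025_03_30_TCG_CK_Data`'

Because the query names a single day's table, only that shard is counted; selecting just the columns you need trims the bill further, since BigQuery's columnar storage charges per column scanned.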

Beyond technical optimization, our database strategy underpins the operational excellence of BAN LLC. By automating daily processes, we've been able to scale effectively, enhancing analytical capabilities without proportionally increasing costs.

Leveraging Advanced Analytics and Automation

Through BAN LLC, we also integrate advanced predictive modeling and machine learning to forecast market trends and identify arbitrage opportunities. Daily web scraping scripts, written in R and Python, continuously gather market data from over 20 global websites, ensuring our datasets remain comprehensive and timely.
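
For a sense of the cadence, here is a hypothetical scheduler sketch; the script names and paths are placeholders, not our production jobs, but the shape is accurate: a daily scrape followed by a BigQuery upload.

# Hypothetical crontab: scrape at 00:30, upload at 00:45, every day.
# Paths and script names are illustrative only.
30 0 * * * Rscript /opt/ban/scrape_markets.R >> /var/log/ban/scrape.log 2>&1
45 0 * * * Rscript /opt/ban/upload_to_bq.R >> /var/log/ban/upload.log 2>&1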

Additionally, dozens of stored procedures automate the generation of insightful reports, seamlessly feeding real-time information to our community via MTGBan and Discord integrations.
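
In BigQuery those procedures are invoked with CALL; a hypothetical example (the procedure name is illustrative, not one of ours):

# Hypothetical: run a report-generating stored procedure from the CLI.
bq query --use_legacy_sql=false \
  'CALL `gaeas-cradle.reports.daily_premium_summary`()'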

Conclusion

Our database management approach, rooted in clear structuring, automation via R scripting, and diligent CLI-based monitoring, demonstrates a deep commitment to technical excellence and operational efficiency. By continuously optimizing SQL strategies, effectively scaling data repositories, and integrating advanced analytics, we've created a highly sustainable, cost-effective, and scalable data environment. This approach not only underscores my technical capabilities but also emphasizes my strategic vision for long-term data-driven success.