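Machine Learning Tips Archive

VAR Model

2025-12-27T05:56:06+00:00

State Space Models

2025-12-27T05:47:11+00:00

Explanation of (linear Gaussian) state space models with Python code examples

Non-Gaussian and nonlinear models are not yet covered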

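LangGraph

2025-05-03T04:20:04+00:00

　　・Notes on what I learned from the following book and Udemy course

　　　　・『LangChainとLangGraphによるRAG・AIエージェント実践入門』

　　　　・『Complete Agentic AI Bootcamp With LangGraph and Langchain』

・Basic operations

　　・LangGraph concepts and terminology

　　・Basic multi-agent example built with LangGraph

　　・About checkpoints

　　・Multi-agent example 2

　　・Reducers and the basics of tool calling (using ToolNode)

　　・Using multiple tools (selecting and running the single most suitable tool)

　　・ReAct-style agents and memory (version without the create_react_agent function)

　　・Building a ReAct-style agent with the create_react_agent function

　　・About streaming output

　　・Applying human feedback

・Agentic RAG

　　・Letting an agent select the most suitable vector DB

　　・Applying self-reflection to vector search results

　　・Analyzing the query and dynamically optimizing the RAG query

・Agent design patterns

　　・Main agent design patterns

　　・Passive Goal Creator: a pattern that extracts concrete goals from the user's input

　　・Prompt Response Optimizer: a pattern that converts the generated goal or user request into a more effective prompt to obtain higher-quality answers from the LLM

　　・Single-Path Plan Generator: a pattern that generates a series of concrete steps for achieving the set goal

　　・Multi-Path Plan Generator: a pattern that generates multiple options at once during task decomposition and lets the executing agent pick the appropriate one each time, based on the runtime context

　　・Self-Reflection and Cross-Reflection: the agent reviews the results of its own task execution and improves them

　　・Role-Based Cooperation: AI agents cooperate to carry out tasks

　　・Decompose the work into subtasks and process them in order

　　・Apply parallelization when multiple nodes can run independently

　　・Route to different nodes via conditional branching

　　・A manager agent decomposes the work into subtasks and delegates them to worker agents

　　・An evaluator sub-agent evaluates the LLM output

・Application examples

　　・LangGraph app built with Streamlit

　　　　　▶ notebook

　　・Calling LangGraph through an API (FastAPI) and debugging with LangGraph Studio

　　　　　▶ notebook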

LangChain & RAG

2025-04-27T12:00:00+00:00

・About LCEL (LangChain Expression Language)
　▶ explanation notebook

・RAG basics

・Advanced RAG

・LangSmith

CNN, TCN

2024-08-29T10:47:14+00:00

The structure of CNN

　- Convolution Layer / Max Pooling / Padding / Stride and so on

▶ explanation_notebook

Channel in CNN

▶ explanation_notebook

Difference between Kernel and channel

▶ explanation_notebook

Parameters of nn.Conv1d of PyTorch

▶ explanation_notebook

Predicting Energy Production by TCN

▶ explanation_notebook

Analyzing the code of the TCN class in the Darts library

▶ explanation_notebook

Other TCN sample code

- This sample code does not apply left (causal) padding, so data leakage occurs

▶ explanation_notebook
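
The left-padding remark above can be illustrated with a small sketch. It uses plain Python lists instead of `nn.Conv1d`, and the kernel and series values are made up for illustration. With `kernel_size - 1` zeros added on the left only, the output at step t depends on inputs up to t; with symmetric padding it also depends on later inputs, which is the leakage.

```python
def conv1d(x, kernel, left_pad=0):
    """Plain 1D cross-correlation over x after prepending `left_pad` zeros."""
    padded = [0.0] * left_pad + list(x)
    k = len(kernel)
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(padded) - k + 1)
    ]


def conv1d_same_centered(x, kernel):
    """Symmetric ("same") padding: the output at step t also sees inputs after t."""
    k = len(kernel)
    pad = (k - 1) // 2
    padded = [0.0] * pad + list(x) + [0.0] * (k - 1 - pad)
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(x))
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 1.0, 1.0]  # hypothetical kernel of size 3

causal = conv1d(series, kernel, left_pad=len(kernel) - 1)
centered = conv1d_same_centered(series, kernel)

# Change only the last (future) value and recompute both versions.
changed = series[:-1] + [100.0]
causal_changed = conv1d(changed, kernel, left_pad=len(kernel) - 1)
centered_changed = conv1d_same_centered(changed, kernel)

print(causal[:-1] == causal_changed[:-1])      # True: earlier outputs ignore the future value
print(centered[:-1] == centered_changed[:-1])  # False: an earlier output saw the future value
```

For a dilated convolution the same idea applies with a left padding of `(kernel_size - 1) * dilation`.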

Random Seed & Initial Value Dependence

2024-08-28T12:05:47+00:00

The following answers were generated by GPT.

1. Influence of Random Seed in PyTorch

When you're running a PyTorch model for time series forecasting, the random seed is used to initialize the random number generator, which impacts various parts of your model training process.

This includes:

1. Weight Initialization: The initial weights of your neural network are randomly assigned. Different random seeds lead to different initial weights, which can result in different convergence paths and, consequently, different final models.

2. Batch Sampling: If you're using stochastic gradient descent (SGD) or any variant, the order in which data batches are fed into the model can change with different random seeds. This can also lead to different model training dynamics.

3. Dropout: If your model uses dropout, the specific neurons dropped during training are chosen randomly. Changing the random seed will change which neurons are dropped at each step.

Because of these factors, even with the same dataset and model architecture, changing the random seed can lead to different model parameters and, thus, different results.

How Popular Data Scientists Deal with Result Variability:

1. Multiple Runs with Different Seeds: A common approach is to train the model multiple times using different random seeds and then average the results. This helps in estimating the expected performance of the model more robustly.

2. Seed Fixation: To ensure reproducibility, data scientists often fix the random seed across the entire pipeline, including model initialization, data loading, and any other random processes.

3. Cross-Validation: Instead of relying on a single train-test split, they use cross-validation to assess the model's performance. This helps in understanding how the model performs across different subsets of the data, which can reduce the variability due to random seed changes.

4. Model Ensembling: By training multiple models with different random seeds and combining their predictions, they can create a more robust predictor that's less sensitive to any single model's random initialization.

5. Reporting Variance: When presenting results, they often report not just the mean performance but also the variance (or standard deviation) across different seeds. This provides a clearer picture of the model's stability.

By understanding and controlling for the effects of random seeds, you can make your models more robust and your results more reliable.
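
The "multiple runs" and "reporting variance" ideas above can be sketched as follows. `train_and_evaluate` is a hypothetical stand-in for a real training run, using only Python's `random` module rather than PyTorch; in an actual PyTorch pipeline the seed would be set with calls such as `torch.manual_seed`.

```python
import random
import statistics


def train_and_evaluate(seed):
    """Hypothetical stand-in for a full training run; returns a final loss."""
    rng = random.Random(seed)          # all randomness flows from the seed
    weight = rng.uniform(-1.0, 1.0)    # plays the role of weight initialization
    for _ in range(200):               # noisy gradient steps toward the optimum at 3.0
        grad = 2 * (weight - 3.0) + rng.gauss(0.0, 0.5)
        weight -= 0.05 * grad
    return (weight - 3.0) ** 2         # lower is better


seeds = [0, 1, 2, 3, 4]
scores = [train_and_evaluate(s) for s in seeds]

# Report the spread across seeds, not just a single number.
print("mean:", statistics.mean(scores))
print("std :", statistics.stdev(scores))

# Fixing the seed makes a run repeatable.
print(train_and_evaluate(42) == train_and_evaluate(42))
```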

2. Initial Value Dependent Problem

The "Initial value dependent problem" in Deep Learning refers to the sensitivity of a neural network's training process and final performance to the initial values of the model's parameters, especially the weights. This is particularly important because deep learning models are typically trained using iterative optimization methods like gradient descent, where the starting point (initial values of weights) can significantly influence the optimization path and the resulting model.

Key Aspects of the Initial Value Dependent Problem:

1. Convergence to Local Minima: Neural networks have highly non-convex loss landscapes with many local minima and saddle points. Depending on the initial weights, the optimization process might converge to a different local minimum. Some of these minima might be good (low loss and good generalization), while others might be suboptimal.

2. Training Dynamics: The trajectory taken during training, including how fast the model converges and whether it gets stuck in certain regions of the loss landscape, can vary based on the initial weights. Poor initialization can lead to slow convergence or even divergence.

3. Symmetry Breaking: In some network architectures, especially those with layers having identical neurons (like in fully connected layers or convolutional layers), initializing all weights to the same value can cause symmetry issues. If all neurons start with the same weights, they will receive the same gradient updates, resulting in them learning the same features, which limits the expressiveness of the model.

4. Vanishing / Exploding Gradients: Certain initial values can exacerbate the vanishing or exploding gradient problem, especially in deeper networks. If weights are initialized too small, gradients can vanish, making it difficult for the network to learn. If initialized too large, gradients can explode, leading to unstable updates.

Strategies to Mitigate Initial Value Dependent Problems:

1. Careful Weight Initialization: Modern initialization techniques like Xavier (Glorot) initialization, He initialization, and others are designed to set the initial weights in a way that maintains a good balance of the variance of the inputs and outputs across layers, reducing the chances of vanishing / exploding gradients.

2. Batch Normalization: This technique helps mitigate the effects of poor initialization by normalizing the inputs to each layer, which can stabilize and accelerate training.

3. Adaptive Learning Rates: Optimizers like Adam, RMSprop, and others adjust learning rates dynamically during training, helping the model recover from poor initializations.

4. Multiple Trials with Different Seeds: As mentioned before, running the training process multiple times with different initial values and then averaging the results or choosing the best-performing model can reduce the dependency on any particular initialization.

The initial value dependent problem highlights the importance of good practices in weight initialization and the need for robust optimization techniques to ensure that the deep learning models perform well regardless of the initial values.
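
As a rough illustration of the weight initialization schemes named above, the sketch below samples weights with the usual Xavier (Glorot) uniform and He normal formulas and compares the empirical spread with the theoretical target. The layer sizes are made up, and plain Python is used in place of PyTorch's initializers.

```python
import math
import random
import statistics


def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


def he_normal(fan_in, fan_out, rng):
    """He normal: N(0, std^2) with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
fan_in, fan_out = 256, 128  # hypothetical layer sizes

w_xavier = [w for row in xavier_uniform(fan_in, fan_out, rng) for w in row]
w_he = [w for row in he_normal(fan_in, fan_out, rng) for w in row]

# Empirical standard deviation next to the theoretical target of each scheme.
print(statistics.pstdev(w_xavier), math.sqrt(2.0 / (fan_in + fan_out)))
print(statistics.pstdev(w_he), math.sqrt(2.0 / fan_in))
```

Both schemes scale the spread of the weights with the layer width, which is what keeps the variance of activations and gradients roughly stable from layer to layer.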

3. The relationship between Initial value dependence and setting Random Seed

There is a direct relationship between the initial value dependent problem and the setting of the random seed in deep learning.

Connection Between Initial Value Dependence and Random Seed:

1. Weight Initialization: When a random seed is set in your code, it controls the random number generator that is responsible for initializing the weights of the neural network. Different random seeds will result in different initial weight values. Since the initial weights heavily influence the training process and the eventual model performance, changing the random seed can lead to variations in the outcome due to this initial value dependency.

2. Reproducibility: Setting a random seed allows you to control and reproduce the initial conditions of your model. This means that every time you run your training process with the same random seed, you should get the same sequence of random numbers, leading to the same weight initialization and other random processes (like dropout or data shuffling). This makes the experiments reproducible, which is crucial for debugging, tuning, and reporting results.

3. Addressing Initial Value Dependency: By experimenting with different random seeds, you can explore how sensitive your model is to initial conditions. If your model's performance varies significantly with different seeds, it suggests a strong initial value dependency. This can be mitigated by using techniques like better weight initialization strategies, ensembling, or simply averaging the results across multiple seeds.

Summary of the Relationship:

・Initial value dependence refers to the sensitivity of model performance to the starting weights, which are determined by random initialization.

・Setting a random seed controls this initialization process and other stochastic elements in the model, allowing for reproducibility and enabling you to study the effects of different initial conditions on your model's performance.

By understanding and managing the relationship between random seed settings and initial value dependence, you can make your deep learning models more robust and consistent.

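Data Independence in Sequence Data

2024-08-26T12:30:12+00:00

Below is GPT's answer to a point I did not understand in a conversation with a senior data scientist.

Question:

I currently have the following daily data.

・Foreign exchange trading data

・There are 255 trading days per year, so there are 255 records per year

While I was building an LSTM model with a sequence length of 60 days on this data, a data scientist said the following:

"With a sequence length of 60 days, there are only about 4 days of independent data per year..."

I do not really understand what this means. Please infer what this data scientist was trying to say.

Answer by GPT:

The data scientist was most likely pointing out a problem with overlapping sequences in the LSTM model. Specifically, when an LSTM model is built with a sequence length of 60 days, the data is handled as follows.

1. Overlapping sequences:

An LSTM model uses, for example, the first 60 days of data to make a prediction, then shifts forward by one day and again uses 60 days for the next prediction. With this approach each new sequence adds only one new day of data, and the remaining 59 days overlap with the previous sequence.

2. Number of independent sequences:

Following this reasoning, only 4 completely independent (non-overlapping) sequences can be obtained from one year (255 days) of data, since 255 ÷ 60 ≈ 4.25. Cutting the 255 days into consecutive 60-day blocks gives four full blocks (240 days) with 15 days left over, so there are effectively only 4 independent sequences.

3. The data independence problem:

The data scientist's point is that when many sequences overlap, the LSTM model may learn the same patterns over and over, which affects model performance. In practice, heavy overlap also makes the model more prone to overfitting.

So the remark "there are only about 4 days of independent data per year" most likely expresses a concern that most sequences overlap, leaving only about 4 independent samples per year.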
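
The arithmetic in the answer above can be checked with a short sketch. `make_windows` is a hypothetical helper that only enumerates window index ranges; the 255 records and the sequence length of 60 come from the question.

```python
def make_windows(n_records, seq_len, stride):
    """Return (start, end) index pairs of windows of length seq_len (end exclusive)."""
    return [(s, s + seq_len) for s in range(0, n_records - seq_len + 1, stride)]


n_records, seq_len = 255, 60

overlapping = make_windows(n_records, seq_len, stride=1)
independent = make_windows(n_records, seq_len, stride=seq_len)

print(len(overlapping))   # 196 windows, each sharing 59 days with its neighbour
print(len(independent))   # 4 windows with no shared days
print(independent)        # [(0, 60), (60, 120), (120, 180), (180, 240)]
```

Sliding by one day produces many training samples, but only four of them are built from entirely different days.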

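What Is the Data Independence Problem?

The data independence problem is an important consideration when working with time series data and recurrent neural network (RNN) models such as LSTM. It concerns how "independent" the data samples are, that is, whether each sample contributes to training without being tied to other samples.

1. What is data independence?

Data independence means that a data point or sequence is unrelated to other data points or sequences. Concretely, when a sample is used to train a model, what matters is how much that sample differs from the others and how much unique information it provides.

2. The independence problem in time series data

In time series data, data points are consecutive along the time axis, so strong relationships naturally arise between neighbouring points. For FX data, for example, today's exchange rate is strongly related to yesterday's rate.

An LSTM model learns from the temporal dependencies in the data and uses them for prediction. Problems arise, however, in the following cases.

・Overlapping sequences: As described above, setting the sequence length to 60 days in an LSTM model generates sequences like the following from one year of data.

　　・Sequence 1: day 1 to day 60 of the daily data

　　・Sequence 2: day 2 to day 61 of the daily data

　　・...and so on.

　In this case, sequence 1 and sequence 2 share 59 days of data.

　In other words, sequence 1 and sequence 2 are barely independent of each other.

・Effect on model training: When much of the data overlaps, the model learns similar patterns over and over. As a result, the model becomes prone to overfitting. Overfitting is the phenomenon in which a model scores very well on the training data while its generalization to unseen data (test data) deteriorates.

3. How to secure data independence

・Sampling: One option is to sample the data so that sequences do not overlap. For example, instead of shifting the window by one day, sample at 60-day intervals to create independent sequences.

・Data augmentation: Generating new data is another way to build a dataset with higher independence. The idea is to create slightly different data while preserving the characteristics of the time series.

・Using data from different periods: Building sequences from different periods lets the model learn sequences from different market environments and conditions. This makes it easier for the model to learn more general patterns.

4. Conclusion

Data independence is an important factor for maintaining the generalization performance of an LSTM model. Training on data with low independence risks a model that adapts too closely to specific patterns and fails on unseen data. It is important to consider ways of avoiding overlapping sequences and of building datasets with higher independence.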

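About sequence length

The idea of "sequence length" appears mainly in deep learning models for time series (in particular recurrent neural networks (RNN), LSTM, GRU, and Transformer), but it is not a concept unique to deep learning. As shown below, similar concepts exist in other methods.

1. Classical statistical time series models

In classical statistical time series models as well, it is important to decide how far back into the past the model should look. Concrete examples:

・Autoregressive (AR) model: An AR model expresses the current value as a linear combination of values over a fixed number of past periods (lags). The number of lags here can be regarded as the counterpart of sequence length in deep learning. An AR(1) model uses one past value, while an AR(2) model uses two.

・Moving average (MA) model: An MA model explains the current value using past error terms over a fixed number of lags. The number of lags used here is also a concept similar to sequence length.

・ARIMA model: The ARIMA (autoregressive integrated moving average) model combines AR and MA components and makes predictions based on lags of the data.

2. Sequence length in feature engineering

In machine learning models (for example, regression models and decision-tree-based models), a concept equivalent to sequence length also appears when past data is used as features. For example, when predicting a stock price from the prices of the previous 7 days rather than only the previous day, those 7 days correspond to the sequence length.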
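
The 7-day example above amounts to building lag features. A minimal sketch, with made-up prices and a hypothetical helper name `make_lag_features`:

```python
def make_lag_features(series, n_lags):
    """Turn a 1D series into (features, target) rows using the previous n_lags values."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]   # values at t-n_lags ... t-1
        target = series[t]                # value to predict
        rows.append((features, target))
    return rows


prices = [100, 101, 103, 102, 105, 107, 106, 108, 110, 109]  # made-up prices
rows = make_lag_features(prices, n_lags=7)

for features, target in rows:
    print(features, "->", target)
```

Each row can then be fed to a regression or tree-based model; `n_lags` plays the role that sequence length plays in an LSTM.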

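3. Deep learning models

In deep learning models, especially RNN-family models such as LSTM and GRU, sequence length is particularly important. Because these models capture sequential dependencies in past data, the sequence length, which determines the "length" or "range" of past data, directly affects model performance.

Transformer-based models also need to take the input sequence length into account. If the input sequence is too long, the computational cost increases; if it is too short, important information may be lost.

4. Conclusion

The concept of "sequence length" is especially prominent in deep learning, but it is not limited to deep learning. In the various modeling approaches for time series data, deciding how far back to use past data is important, and each approach has its own counterpart of "sequence length".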

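How Can the Data Independence Problem Be Avoided?

The data independence problem is hard to avoid completely with any model, but its impact can be minimized through the choice of model and the way the data is handled. The characteristics of each model family and the related independence issues are described below.

1. Statistical time series models

Statistical time series models (AR, MA, ARIMA, and so on) make predictions based on past values. Because these models presuppose dependence in the data, the lack of independence is, in a sense, built into them.

・Dependence is assumed: These models explicitly model the correlation and dependence on past data, so their purpose is to model the dependence rather than to achieve independence.

・Independence is hard to secure: For this reason, the fact that the data is not independent is not a problem in itself; what matters is capturing the dependence accurately.

2. Deep learning models (LSTM, GRU, Transformer, and so on)

Among deep learning models, recurrent neural network (RNN) models such as LSTM and GRU in particular learn information from past data sequences and use it to predict the future.

・Overlapping data: With these models the training sequences often overlap, so data independence can become an issue. When the same data points are fed to the model many times, the risk of overfitting increases.

・Augmentation and other measures are needed: The problem can be reduced by improving independence, for example by adjusting the sampling scheme or by using data augmentation techniques.

3. Traditional machine learning models

Traditional machine learning models such as decision trees, random forests, and support vector machines (SVM) often use past data points as features when handling time series data.

・Dependent on feature selection: The result depends on which past data points are used (for example, the values from 1, 3, and 7 days ago), so when the data is not independent, model training can be affected.

・Shuffling and careful splitting: When the lack of independence is a problem, the data needs to be handled with care, for example by shuffling the data or by splitting it appropriately into training and test sets.

4. Conclusion

The data independence problem is to some extent unavoidable for every model that handles time series data, but its impact can be minimized with data processing suited to the model and method. It is especially pronounced in deep learning models, but with appropriate sampling, data augmentation, model selection, and parameter tuning, effective predictions are possible while still taking the dependence in the data into account.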

Transformer

2024-08-24T07:13:16+00:00

・Transformer overview

▶︎ study notebook1

・Structure

▶︎ study notebook2

・Implementation example in PyTorch

・What are Embedding Layers

▶︎ study notebook1

・Time Series Data Encoding for Deep Learning

▶︎ study notebook1

・Transformer for Time Series

▶︎ study notebook1

・Transformer having features for Time Series Forecasting

▶︎ study notebook2

・Explanation of Positional Encoding

・Informer

▶︎ study notebook1

・Basic structure of Informer

Transfer Learning

2024-08-24T06:57:05+00:00

・What is transfer learning

▶︎ study notebook1

・Transfer learning usage examples

▶︎ study notebook1

Many-to-One and Many-to-Many tasks

2024-08-24T06:17:17+00:00

The example provided, both in the original and modified forms, represents a many-to-one task rather than a many-to-many task. Let's clarify the distinctions and why this example fits into the many-to-one category:

Many-to-One Tasks:

Definition:

In many-to-one tasks, the model processes a sequence of inputs to make a single output prediction. These tasks are common in scenarios where you have temporal or sequential data, and you want to predict a single outcome based on the entire sequence. Examples include sentiment analysis from a sequence of text, stock price prediction based on a series of historical data points, or weather forecasting from a sequence of meteorological measurements.

Example Context:

In the provided LSTM example, the task involves predicting a single stock price based on the past 7 days of data, where each day includes 5 features (high price, low price, volume, GDP, CPI). This is a classic example of a many-to-one task because you are using many inputs (the sequential data over 7 days) to predict a single output (the stock price at the end of the sequence).

Many-to-Many Tasks:

Definition:

Many-to-many tasks involve mapping a sequence of inputs to a sequence of outputs. This category includes tasks where each time step in the input sequence is associated with a corresponding output. Examples of many-to-many tasks are machine translation (translating a sentence in one language to another, word by word or phrase by phrase) and video frame prediction (predicting future frames based on past frames).

Differentiation:

The key difference from the many-to-one setup is that in many-to-many, you are interested in the outputs at each time step, not just a summary prediction at the end. An LSTM model for a many-to-many task would be structured to return an output at each time step of the sequence.

Summary of the Provided Example:

Given the objective to predict a single stock price from a sequence of 7 days' data, the task aligns with the many-to-one framework:

Input:

A sequence of data points (each with 5 features) representing 7 consecutive days.

Output:

A single predicted value (the stock price) at the end of the sequence.

The LSTM model is designed to process the input sequence through its recurrent layers, culminating in a single output prediction, which is why it's categorized as a many-to-one LSTM task in this context.
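
To make the difference concrete, here is a minimal sketch of how training samples could be laid out for each task type. The data is random and purely illustrative, and using the next day's price as the many-to-one target is an assumption of this sketch, not something the example above specifies.

```python
import random

rng = random.Random(0)
n_days, n_features, seq_len = 30, 5, 7

# Made-up daily data: 5 features per day plus the price to predict.
features = [[rng.random() for _ in range(n_features)] for _ in range(n_days)]
price = [rng.random() for _ in range(n_days)]

many_to_one = []
many_to_many = []
for start in range(n_days - seq_len):
    end = start + seq_len
    window = features[start:end]                              # 7 steps x 5 features
    many_to_one.append((window, price[end]))                  # a single target per window
    many_to_many.append((window, price[start + 1:end + 1]))   # one target per time step

x, y = many_to_one[0]
print(len(x), len(x[0]), "->", 1)
x, y_seq = many_to_many[0]
print(len(x), len(x[0]), "->", len(y_seq))
```

The inputs are identical in both cases; only the shape of the target changes, which is also what decides whether the model returns its last output or the output at every time step.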

RNN, LSTM

2024-08-24T06:13:45+00:00

・Basics

▶︎ study notebook1

・Linear Regression and Logistic Regression

▶︎ study notebook2

・Feed-Forward Neural Network

・RNN

▶︎ study notebook1

・LSTM

▶︎ study notebook1

・LSTM basics

▶︎ study notebook2

・Applying LSTM to time series data: example 1

▶︎ study notebook3

・Applying LSTM to time series data: example 2

▶︎ study notebook4

・LSTM with Features

▶︎ study notebook5

・LSTM Input and Output Structure

▶︎ documents

・Many-to-One and Many-to-Many tasks in LSTM

▶︎ study notebook6

・Applying LSTM to time series data: example 3

Modeling

2024-08-24T03:39:50+00:00

・Shape operations in PyTorch

▶︎ study notebook1

　　　　・reshape, view, squeeze, transpose, permute

▶︎ study notebook2

　　　　・About Unfold

▶︎ study notebook3

　　　　・About squeeze

▶︎ study notebook4

　　　　・sum along a specified dimension

・batch and mini batch

▶︎ study notebook1

・Rolling updates of the training data (e.g., every 2 years) versus using the full history

▶︎ study notebook1

・Differentiable Function (variable operations that keep backpropagation working)

▶︎ study notebook1

・Types of hyperparameters and parameter tuning

▶︎ study notebook1

・Types of parameters

▶︎ study notebook2

・Bayesian parameter tuning

・End-to-End Training PipelineとFine tuning

▶︎ study notebook1

・Creating a custom loss function

▶︎ study notebook1

・Data Leakage

▶︎ study notebook1

・multiply tensors with different dimension

▶︎ study notebook1

・Normalization method

▶︎ study notebook1

・Normalization method of dividing by first record

▶︎ study notebook2

・Rolling Window Normalization

・ Rolling Window Correlation

▶︎ study notebook1

・Output Feature Order

▶︎ study notebook1

・Model reproducibility

▶︎ study notebook1

・Splitting into train, validation, and test sets

▶︎ study notebook1

・Random Seed & Initial Value Dependent Problem

▶︎ explanation page

Data Leakage

2024-08-24T03:30:56+00:00

Data leakage in Normalization

Data leakage refers to a situation in a machine learning project where information from outside the training dataset is used to create the model. This can lead to overly optimistic performance estimates during training and validation, because the model has essentially been given access to data it wouldn't have in a real-world scenario, leading to poor performance on unseen data.

In the context of feature normalization, data leakage can occur if you normalize your entire dataset (including both training and test data) together, before splitting it into training and test sets. The correct procedure is to split your data first, fit the normalization parameters on the training data only, and then apply those same parameters to the test data.

How to Avoid Data Leakage During Normalization

1. Split Your Data First:

Before any preprocessing, including normalization, split your dataset into training, validation, and test sets. This ensures that the statistics computed during preprocessing never include information from the validation or test data.

2. Fit the Scaler on Training Data Only:

When normalizing (or standardizing) your features, you should fit your scaler (or normalization parameters) only on the training data. This means calculating the mean and standard deviation (for standard scaling) or the minimum and maximum values (for min-max normalization) using only the training data.

3. Apply the Same Transformation to Test Data:

After fitting the scaler on the training data, use the same parameters (mean, standard deviation, min, max, etc.) to transform the test data. This mimics the real-world scenario where the model is applied to new, unseen data, ensuring that the test data is scaled based on the distribution of the training data only.

Example in Python

Assuming you are using `scikit-learn` for normalization, here is how you might properly normalize your data to avoid data leakage:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data so the snippet runs; replace with your own feature matrix X and labels y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first (for time series, pass shuffle=False to keep chronological order)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a StandardScaler object
scaler = StandardScaler()

# Fit on the training data only
scaler.fit(X_train)

# Transform the training data
X_train_scaled = scaler.transform(X_train)

# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)
```

By following these steps, you can prevent data leakage during the normalization process, ensuring that your model's evaluation metrics accurately reflect its performance on truly unseen data.

PyTorch Basics

2024-08-24T03:03:57+00:00

・Dataset and DataLoader

▶︎ study_notebook1

▶︎ study_notebook2

・Softmax and Cross Entropy

▶︎ study_notebook1

・ Activation Function

▶︎ study_notebook1

・ Saving and Loading model

▶︎ study_notebook1

▶︎ study_notebook2

・ Adjust Learning Rate

▶︎ study_notebook1

▶︎ study_notebook2

▶︎ study_notebook3

・ Early Stopping

▶︎ study_notebook1

▶︎ study_notebook2

・Using Class and Sequence

▶︎ study_notebook1

・ Dropout Regularization

▶︎ study_notebook1

・ K-Fold Cross Validation

▶︎ study_notebook1

・About torch.compile()

▶︎ study_notebook1

・ Batch Normalization

▶︎ study_notebook1

・Model building workflow

▶︎ study_notebook1

・PyTorch Lightning

▶︎ study_notebook1

・How to rewrite code for Lightning: part 1

▶︎ study_notebook2

・How to use PyTorch Lightning DataModule

▶︎ study_notebook3

・How to rewrite code for Lightning: part 2

Compare Machine Learning Model and Deep Learning Model for Time Series

2024-08-24T01:38:06+00:00

These answers are made by GPT.

Traditional Machine Learning Models:

Pros:

・Interpretability: Traditional models like linear regression or decision trees are generally easier to interpret. You can understand how each variable contributes to the final prediction.

・Efficiency: They often require less training data and computational power compared to Deep Learning models.

・Good for linear relationships: They excel at capturing linear relationships between variables.

Cons:

・Limited complexity: They might struggle to capture complex non-linear relationships between variables.

・Feature engineering: Often require manual feature engineering to extract relevant information from the data, which can be time-consuming and domain-specific.

Deep Learning Models (Transformers):

Pros:

・Captures complex relationships: Transformers can learn and capture intricate non-linear relationships between variables, even if they're not explicitly defined.

・Automatic feature learning: They automatically learn features from the data, potentially reducing the need for manual feature engineering.

Cons:

・Black box nature: Understanding how a Transformer model arrives at a prediction can be challenging. This lack of interpretability might be a drawback in some cases.

・Data hungry: Transformers often require large amounts of high-quality training data to perform well. With limited data, they might underperform compared to simpler models.

・Computational cost: Training Transformers can be computationally expensive, especially for large datasets.

Choosing the Right Model:

If the relationships are likely linear and interpretability is crucial, a traditional model might be a good choice.

If the relationships are complex and interpretability is less of a concern, a Transformer model could be a better option, especially if you have a large amount of training data.

Here's an additional tip: You can consider using a hybrid approach. Combine a traditional model's interpretability with a Transformer's ability to capture complex relationships. For example, use a Transformer for feature extraction and then feed those features into a traditional model for prediction.

Ultimately, the best way to decide is to experiment with both types of models on your specific data and see which one performs better for your task.

Canonical Correlation Analysis (CCA)

2021-12-22T12:47:58+00:00

Python code is available at the link below

https://yoshi-cow.github.io/statistics.github.io/canonical_correlation_analysis.html

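Notes on Using Categorical Variables in Linear Regression

2021-12-04T04:51:47+00:00

Rank deficiency (multicollinearity)

Least-squares parameter estimation is computed as

　　$\hat{\alpha} = (X^\top X)^{-1} X^\top Y$

　($X$: design matrix, $X^\top$: transpose of $X$, $(X^\top X)^{-1}$: inverse of $X^\top X$, $Y$: target matrix)

However, when columns of the data are perfectly correlated (linearly dependent), rank deficiency occurs and $X^\top X$ has no inverse, so $\alpha$ cannot be computed.

For example, if sex is encoded as two 0/1 columns, "man" and "woman", then woman = 1 - man on every row, so together with the intercept column the columns are linearly dependent and rank deficiency occurs. When converting a categorical variable to numbers, therefore, keep only one of them, for example the man column. (If the machine learning model has a regularization term, the matrix becomes invertible and this is not a problem, but note that statistical time series models such as the AR model have no regularization term!)

Regularization example

・Ridge regression

　　　$\hat{\alpha} = (X^\top X + \lambda I)^{-1} X^\top Y$

　($X$: design matrix, $X^\top$: transpose of $X$, $(X^\top X + \lambda I)^{-1}$: inverse of $X^\top X + \lambda I$, $Y$: target matrix, $I$: identity matrix)

Adding $\lambda I$ as a regularization term removes the rank deficiency.

In the following example, $X^\top X$ has no inverse, so $\alpha$ cannot be computed, but adding $\lambda I$ as a regularization term removes the rank deficiency and $\alpha$ becomes computable.

The upper matrix has determinant 0 and therefore no inverse; adding $\lambda I$, as in the lower matrix, removes the rank deficiency, makes the determinant non-zero, and an inverse now exists.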
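
The determinant argument can be reproduced numerically with a small sketch. The data is made up (an intercept column plus "man" and "woman" dummy columns), the penalty value is hypothetical, and plain Python helpers are used instead of a linear algebra library.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Columns: intercept, man, woman. man + woman equals the intercept on every row.
X = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
]

gram = matmul(transpose(X), X)   # X^T X
lam = 0.1                        # hypothetical ridge penalty
ridge = [[gram[r][c] + (lam if r == c else 0.0) for c in range(3)] for r in range(3)]

print(det3(gram))    # 0: rank deficient, no inverse
print(det3(ridge))   # non-zero: invertible after adding lambda * I
```

Dropping either the man or the woman column from `X` also makes the determinant of the Gram matrix non-zero, which is the "keep only one column" fix described above.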

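VAR Model

2021-11-08T12:54:16+00:00

What is a VAR model?

The vector autoregressive (VAR) model generalizes the AR model to vectors. (Because it is an AR model, the data is assumed to be stationary.)

For example, given two time series, an index of personal consumption and an index of personal income, we can think of them as influencing each other as follows.

After consumption increases, income increases (because shops and other businesses prosper)
After income increases, consumption increases (because there is more money to spend)

A VAR model can capture this kind of situation.

In short, a VAR model is a time series model that expresses how several time series influence one another.

Structure of the VAR model

Example: bivariate VAR(1) model

As an illustration:

　Consumption in 2001 = c1 + Φ11 * consumption in 2000 + Φ12 * income in 2000 + noise

　Income in 2001 = c2 + Φ21 * consumption in 2000 + Φ22 * income in 2000 + noise

Both consumption and income use the same explanatory variables: past consumption and past income.

The disturbance terms ε1t and ε2t are white noise and are uncorrelated with their own past values. However, disturbances at the same time point may be correlated with each other.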
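
The two equations above can be simulated directly. The sketch below is only an illustration: the constants and coefficients are hypothetical (chosen so that the process is stationary), and for simplicity the two disturbances are drawn independently, although the model allows them to be correlated at the same time point.

```python
import random


def simulate_var1(c, phi, n_steps, seed=0, noise_std=1.0):
    """Simulate a 2-variable VAR(1): y_t = c + phi * y_{t-1} + eps_t."""
    rng = random.Random(seed)
    y = [0.0, 0.0]
    path = []
    for _ in range(n_steps):
        eps = [rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std)]
        y = [
            c[0] + phi[0][0] * y[0] + phi[0][1] * y[1] + eps[0],  # consumption
            c[1] + phi[1][0] * y[0] + phi[1][1] * y[1] + eps[1],  # income
        ]
        path.append(y)
    return path


# Hypothetical constants and coefficients.
c = [1.0, 2.0]
phi = [[0.5, 0.2],
       [0.3, 0.4]]

path = simulate_var1(c, phi, n_steps=500)
print(path[:3])
```

Each row of `phi` holds the coefficients of one equation, so `phi[0][1]` is Φ12, the effect of last period's income on this period's consumption.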

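Example with statsmodels

See the code at the link

Granger causality test

A method for judging whether one variable (or group of variables) helps improve the prediction of another variable (or group of variables)

(1) Income in 2001 = c2 + Φ21 * consumption in 2000 + Φ22 * income in 2000 + noise ①

(2) Income in 2001 = c2 + Φ22 * income in 2000 + noise ②

Noise ① is the prediction residual when the other series is also used

Noise ② is the prediction residual when the other series is not used

The residual sums of squares of these two prediction residuals are compared to test whether using the other series significantly reduces the prediction residual.

Null hypothesis: the prediction residual does not decrease
Alternative hypothesis: the prediction residual decreases -> Granger causality can be considered present

※ Note that this can only be applied to stationary data!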
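
The comparison of models (1) and (2) can be sketched as an F statistic built from the two residual sums of squares. Everything below is an illustrative assumption: the data is simulated so that past consumption really does drive income, and a small hand-written least-squares solver replaces a statistics library. In practice `statsmodels.tsa.stattools.grangercausalitytests` covers this test.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ols_rss(X, y):
    """Fit y = X beta by least squares and return the residual sum of squares."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))


# Made-up stationary data in which past consumption influences income.
rng = random.Random(0)
consumption, income = [0.0], [0.0]
for _ in range(300):
    c_prev, i_prev = consumption[-1], income[-1]
    consumption.append(0.5 * c_prev + rng.gauss(0.0, 1.0))
    income.append(0.6 * c_prev + 0.3 * i_prev + rng.gauss(0.0, 1.0))

y = income[1:]
X_full = [[1.0, c, i] for c, i in zip(consumption[:-1], income[:-1])]  # model (1)
X_restricted = [[1.0, i] for i in income[:-1]]                         # model (2)

rss_full = ols_rss(X_full, y)
rss_restricted = ols_rss(X_restricted, y)

n, k_full, n_restrictions = len(y), 3, 1
f_stat = ((rss_restricted - rss_full) / n_restrictions) / (rss_full / (n - k_full))
print(rss_restricted, rss_full, f_stat)
```

A large F statistic means that dropping past consumption makes the income forecast noticeably worse, which is evidence against the null hypothesis.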

Code example

See the code at the link

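Spurious Regression, Countermeasures, and Cointegration

2021-11-03T03:35:37+00:00

From the book 「時系列分析と状態空間モデルの基礎」

Spurious regression

"Spurious regression" is the phenomenon in which a regression between series that each have a unit root yields a statistically significant regression coefficient even though the series are unrelated.
Spurious regression can also occur in a regression between stationary AR processes.

Cause

The reason usually given for spurious regression is that the residuals are autocorrelated.

When the residuals are autocorrelated, the least-squares estimator loses its efficiency. That is, there is no longer any guarantee that the estimated parameters are the estimator with the smallest variance.

When the residuals have positive autocorrelation, the following problems occur:

The variance of the coefficients is underestimated
The coefficient of determination R² becomes too large
The t-test on the coefficients can no longer be used

How to check the residuals for autocorrelation

Use the Durbin-Watson test (DW test)

When the first-order autocorrelation of the residuals is 0, the DW statistic is known to be approximately 2. Checking how far the DW statistic is from 2 gives a rough idea of whether the residuals are autocorrelated.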
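
The behaviour of the DW statistic can be checked with a short sketch. The two residual series are simulated for illustration, and the function is a plain-Python version of the standard formula; `statsmodels.stats.stattools.durbin_watson` computes the same quantity.

```python
import random


def durbin_watson(residuals):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Residuals with strong positive autocorrelation: e_t = 0.9 * e_{t-1} + noise.
autocorrelated = [0.0]
for _ in range(1999):
    autocorrelated.append(0.9 * autocorrelated[-1] + rng.gauss(0.0, 1.0))

print(durbin_watson(white_noise))      # close to 2
print(durbin_watson(autocorrelated))   # well below 2
```

Because DW is approximately 2 * (1 - r), where r is the first-order autocorrelation of the residuals, values near 0 indicate strong positive autocorrelation and values near 4 strong negative autocorrelation.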

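Wikipedia's explanation of the DW test

Example of the DW test in Python

See the code at the link

How to prevent spurious regression

To prevent spurious regression, build past data into the model so that the model expresses the autocorrelation in the data.

Candidates include the ARIMAX model, the vector autoregressive (VAR) model, and state space models.

Generalized least squares (GLS) is also known as a regression approach that explicitly builds the autocorrelation of the residuals into the model.

Another option, for data with a unit root, is to run the regression on the differenced series.

Differencing turns a random walk into plain white noise, so spurious regression no longer occurs. However, cointegration must be checked before differencing: if the series are cointegrated, the differencing approach cannot be used.

General workflow:

1. Check for a unit root with a unit root test (ADF test or KPSS test)
2. If there is no unit root, apply generalized least squares (GLS), for example with the Prais-Winsten method
3. If there is a unit root, check for cointegration, and if there is none, run the regression on the differenced series

Unit root tests

ADF test

Assumption: $y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t$, $\varepsilon_t \sim \mathrm{W.N.}(\sigma^2)$
When $y_t$ in the equation above follows a unit root process, the AR characteristic equation has $z = 1$ as a root, so testing for a unit root means testing
　　　$\sum_{i=1}^{p} \phi_i = 1$
Letting
　　　$\rho = \sum_{i=1}^{p} \phi_i$
the test is carried out with

Null hypothesis: $\rho = 1$ (unit root process)
Alternative hypothesis: $0 < \rho < 1$ (weakly stationary process)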

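Code example

Cointegration

A regression between series that each have a unit root often turns out to be spurious, but this is not the case when the data is cointegrated.

For example, if the series x_t and y_t each have a unit root and a linear combination of y_t and x_t no longer has a unit root, the two are in a cointegration relationship.

A cointegration relationship arises when linearly combining x_t and y_t removes the random walk component and leaves a stationary series such as white noise.

Testing for cointegration: cointegration tests (the Engle-Granger method and others)

Cointegration test

　For data with unit roots, fit a regression line by OLS, then compute the residuals.

A unit root test is applied to the residuals; if the unit root has disappeared, the series are regarded as cointegrated. This is the cointegration test.

A regression equation is a kind of linear combination, so if the series are cointegrated the unit root disappears.

Because the unit root test is applied to residuals, the ordinary ADF and KPSS tests cannot be used.

The Phillips-Ouliaris test (PO test) is used instead. Its null hypothesis is "there is no cointegration relationship".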

AR, MA, ARIMA, and Terminology of Statistical Models

2021-11-03T02:18:31+00:00

See the link below

https://yoshi-cow.github.io/statistics.github.io/timeseries_analysis_basic_1.html

Code Examples for ARIMAX and Other Models with statsmodels

2021-11-03T02:13:34+00:00

See the link below

https://yoshi-cow.github.io/statistics.github.io/time_series_sample_code.html

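Model Identification

2021-11-03T02:12:29+00:00

How to determine the order of ARMA and related models

Deciding whether to difference: unit root test

Use the ADF test

Null hypothesis: a unit root exists
Alternative hypothesis: no unit root

Checking stationarity and invertibility of the model

An AR model is always invertible, so the invertibility condition of the MA terms is the invertibility condition of the ARMA model.
Both stationarity and invertibility can be checked by confirming that every root of the corresponding characteristic equation (the AR coefficients for stationarity, the MA coefficients for invertibility) has an absolute value greater than 1.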
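
For an AR(2) model the root condition above can be checked by hand. The sketch below solves the characteristic equation 1 - φ1 z - φ2 z² = 0 with the quadratic formula; the coefficient values are made up, and φ2 is assumed to be non-zero.

```python
import cmath


def ar2_roots(phi1, phi2):
    """Roots of the AR(2) characteristic equation 1 - phi1*z - phi2*z^2 = 0 (phi2 != 0)."""
    # Rewritten as phi2*z^2 + phi1*z - 1 = 0 and solved with the quadratic formula.
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return [(-phi1 + disc) / (2 * phi2), (-phi1 - disc) / (2 * phi2)]


def is_stationary_ar2(phi1, phi2):
    """Stationary when every root lies outside the unit circle (|z| > 1)."""
    return all(abs(z) > 1 for z in ar2_roots(phi1, phi2))


print(is_stationary_ar2(0.5, 0.3))   # True
print(is_stationary_ar2(0.5, 0.6))   # False: one root has absolute value below 1
```

The same check applied to the MA polynomial gives the invertibility condition.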

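Order selection

Functions for automatic order selection:

AR model: statsmodels.tsa.ar_model.ar_select_order
ARMA model: statsmodels.tsa.stattools.arma_order_select_ic

See the code at the link

Evaluation: testing the residuals for autocorrelation

If the ARIMA model has been estimated correctly, the residuals are white noise with no autocorrelation. If autocorrelation remains in the residuals, information useful for predicting the future is still left unused.
See the code at the link

Evaluation: testing the residuals for normality

The residual term of a time series model is assumed to be white noise following a normal distribution, so if the model is correct, the fitted residuals should also follow a normal distribution.
The Jarque-Bera test or the Shapiro-Wilk test is used to test normality.

How to read the summary output after fitting a model with statsmodels

See the blog post at the link

An easy-to-follow blog post on model identification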

Advanced Time Series Analysis with ARMA and ARIMA

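Distributed Processing of Streaming Data with AWS EMR

2021-09-19T07:50:23+00:00

I learned how to implement Spark Streaming on AWS EMR from the YouTube video below

■Notes on using EMR notebooks

Releases from emr-5.30.0 onward cause a kernel startup error, so select emr-5.29.0 when using EMR notebooks.

See the link for details

■Specification

Records from a csv file placed on EMR are sent to Kafka one at a time (as a substitute for streaming data); PySpark reads the data from Kafka and preprocesses it for visualization, and the result is visualized with Node.js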

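■Settings up to launching EMR

1. Create a security group
　EC2 -> Security Groups
　※ If the inbound rule allows all traffic from every IP address (anywhere), EMR raises a security error and the cluster is not created. If you want to allow every IP address, allow SSH only! (For the error, see link 1 and link 2.)
2. Create a key pair
　Because I log in to the console from Windows, the key file is created in .ppk format
　(the private key is downloaded automatically when it is created)
3. Download PuTTY (link)
　SSH tool used to log in to the AWS console from Windows
4. Create the EMR cluster (the following is set under advanced options)
　・Because Spark 3.1.1 is used, select "emr-6.3.0" as the release and also tick Spark in the software list
　・The only software selected this time is Hadoop and Spark
　　(selecting software that uses ZooKeeper installs ZooKeeper, which then conflicts with Kafka's own ZooKeeper when you install Kafka yourself, so be careful)
　・For now, set master node: 1 and slave nodes: 2, for 3 nodes in the cluster
　・For the EC2 key pair, select the key created in step 2
　・For both the master and core EC2 security groups, select the security group created in step 1
　・For now, select m4.large as the hardware
　※ When the cluster is created successfully, its status becomes "Waiting".
5. Test the connection to the AWS EMR console
　(1) Copy the "Master public DNS" of the created cluster
　(2) Click the "Connect to the Master Node Using SSH" link
　(3) On the local PC, start PuTTY and paste the copied "Master public DNS" into Host Name
　(4) In PuTTY's Category pane choose SSH -> Auth, then use Browse to select the downloaded ppk file
　(5) When the console opens, enter hadoop to log in
　　After a successful login, "EMR" is displayed in large letters.

※ If software that uses ZooKeeper is selected when creating the EMR cluster, ZooKeeper is installed as well, so be careful if you plan to install Kafka on EMR.
The ports EMR is listening on can be checked with "sudo lsof -i -n -P | grep zookeeper"
6. Install Kafka on EMR from the console
　This system receives streaming data with Kafka, reads and preprocesses it with PySpark, and sends it back to Kafka, so the first step is to install Kafka on EMR.
　Link to the Kafka site
　(1) Download Kafka (※ be sure to download the binary release!)
　　wget https://archive.apache.org/dist/kafka/2.8.0/kafka_2.12-2.8.0.tgz
　(2) Extract Kafka
　　tar -xzf kafka_2.12-2.8.0.tgz
　(3) Rename the folder
　　mv kafka_2.12-2.8.0 kafka
7. Start Kafka
　To start Kafka, first start ZooKeeper, which manages the metadata, and then start the Kafka broker.
　(1) Start ZooKeeper
　　Move into the kafka folder, then start it
　　cd kafka
　　bin/zookeeper-server-start.sh config/zookeeper.properties

　(2) Start the Kafka broker
　　ZooKeeper is running in the first console, so open another console and start the broker there.
　　Inside the kafka folder:
　　bin/kafka-server-start.sh config/server.properties
　(3) Create a test topic
　　Open another console and create a topic to check that Kafka has started correctly
　　 bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
　　※ The --zookeeper option was removed in Kafka 3.0.0 and later, so if you download 3.0.0 or later, check how to create topics!
　　Check the configuration of the created test topic with the following command
　　bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test

This completes launching EMR and accessing the console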

■ Settings for operating Kafka from Spark

Download Spark Integration For Kafka from the repository:

wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10_2.12/3.1.2/spark-streaming-kafka-0-10_2.12-3.1.2.jar

■ Sending and receiving streaming data with Kafka

Data is sent from the console (as a stand-in for streaming data).

Create the topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic orders_topic

Topic for sending the streaming data: orders_topic
Send the streaming data
Instead of real streaming data, data is read from a file on S3 and sent to Kafka from the console with kafka-console-producer.sh.

push_orders_data_in_topic.sh
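The idea of replaying a file "one record at a time" can also be sketched in Python. This is not the tutorial's push_orders_data_in_topic.sh, which uses kafka-console-producer.sh; it is a minimal stand-in under stated assumptions. The function name `stream_lines`, the `send` callback, and the sample columns are all hypothetical, and a plain list is used in place of a Kafka producer so the sketch runs without a broker.

```python
import io
import time


def stream_lines(lines, send, delay_sec=0.0, skip_header=True):
    """Pass each non-empty line to `send` as UTF-8 bytes, one record at a time."""
    sent = 0
    for index, line in enumerate(lines):
        if skip_header and index == 0:
            continue  # the first line is assumed to be a CSV header
        record = line.strip()
        if not record:
            continue  # ignore blank lines
        send(record.encode("utf-8"))
        sent += 1
        if delay_sec > 0:
            time.sleep(delay_sec)  # pace the records like a live stream
    return sent


# Hypothetical sample data; the real order file layout is not shown in these notes.
sample_csv = io.StringIO("order_id,status\n1,CLOSED\n2,PENDING\n\n3,COMPLETE\n")
received = []  # stands in for a Kafka producer
count = stream_lines(sample_csv, received.append)
print(count, received)
```

With a real broker, `send` could presumably be the `produce` method of a pykafka producer (pykafka is installed later in these notes), but that wiring is untested here.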

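■ Preprocessing with Spark and sending the result back to Kafka

The data received from Kafka is preprocessed for visualization with PySpark, and the result is sent to a separate Kafka topic for the receiving side.

Open another EMR console and create the topic for the preprocessed data
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic orders_ten_sec_data

Topic for sending the preprocessed data: orders_ten_sec_data
Install pykafka
  A library for working with Kafka topics from PySpark
pip install pykafka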

■ Preprocessing with Spark and visualizing the streaming data with Node.js

Open another console.
Install Node.js
Note: Node.js 8.x is already out of support, but the setup from the YouTube tutorial is used as-is.
curl --silent --location https://rpm.nodesource.com/setup_8.x | sudo bash -
sudo yum -y install nodejs
npm install express
npm install socket.io
npm install kafka-node
Read the CSV file and send it to Kafka from the console in place of streaming data (the push_orders_data_in_topic.sh script sends the stand-in data):
/bin/bash push_orders_data_in_topic.sh ../data/ordersdata ip--172-xxxx:9092 orders_data
Open a console for Spark and use spark-submit to run the program (spark_streaming_order_status.py) that reads the data from Kafka, preprocesses it, and sends it back:
spark-submit --jars spark-streaming-kafka-0-10-assembly_2.12-3.0.1.jar spark_streaming_order_status.py localhost:2181 orders_data
I tried to open the page from my local PC using the EMR URL, and only then realized that it is reachable only over an SSH connection.

Location of the program files from the YouTube tutorial above: link

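Distributed processing of IoT data collection and database storage with Spark

2021-09-04T05:00:25+00:00

Learning the flow of collecting IoT data and storing it in a database, following the specification below

Study book: 『アプリケーションエンジニアのための Apache Spark入門』 (Shuwa System)

The code is stored in the following GitHub repository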

https://github.com/yoshi-cow/IoT-spark-streaming

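Learning distributed processing (Apache Spark)

2021-08-28T07:13:24+00:00

Learning to code distributed data preprocessing and machine learning with PySpark

DataFrame basics
Code link

Basic DataFrame functions
Code link

SQL operations with Spark
Code link

Filters, user-defined functions, database input/output, etc.
Code link

Machine learning: supervised
Code link

Machine learning: decision-tree-based methods
Code link

Machine learning: unsupervised
Code link

Grid search and cross-validation
Code link

Choosing the number of features with PCA
Code link

Integrating Kafka and Spark
GitHub code link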

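(Unfinished) Example methods for clustering time-series data

2021-06-19T05:31:00+00:00

Notes from when I was unsure how to extract trends in a time-series analysis

1. DTW (dynamic time warping)

  Characteristics

  Computes the distance between every pair of points in the two series and finds the path with the smallest total distance
  Similarity can be computed even when the series differ in length or period
  Because every pair of points is compared, the computational cost is high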
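The characteristics above can be illustrated with a minimal dynamic-programming sketch of DTW. It is not the program linked from this note; it assumes one-dimensional series, absolute difference as the point-wise distance, and no window constraint. The function name `dtw_distance` is made up for the illustration.

```python
def dtw_distance(a, b):
    """Cumulative cost of the cheapest warping path between series a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: cheapest cumulative distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # point-wise distance (assumption)
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # same shape, different length -> 0.0
print(dtw_distance([0, 0, 0], [1, 1, 1]))     # constant offset of 1 -> 3.0
```

The nested loop visits every pair of points, which is where the high cost comes from: time and memory both grow with len(a) × len(b).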

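  Program example

  Result example

2. Feature point matching

  Originally used to find similar images
  Used here to cluster heat maps that visualize the time-series data by day of week and hour of day.

  Program example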

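State space model

2021-05-01T01:35:17+00:00

State equation: distribution of the state to be predicted  x_t = T_t * x_(t-1) + R_t * ξ_t  ξ_t ~ N(0, Q_t)
Observation equation: distribution of the observed value of the state  y_t = Z_t * x_t + ε_t  ε_t ~ N(0, H_t)
Filtering: corrects the predicted state using the observations obtained so far
Smoothing: corrects the states after all the data have been obtained (this does not improve forecast accuracy, but more information is used for the correction, so the influence of noise is reduced)
Program example 1 (forecasting model with exogenous variables, using stats)
Program example 2 (forecasting model built by subclassing)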

Local level model

State equation: μ_t = μ_(t-1) + ω_t  ω_t ~ N(0, σ²_ω)
Observation equation: y_t = μ_t + ν_t  ν_t ~ N(0, σ²_ν)

Local linear trend model

Trend component: δ_t = δ_(t-1) + ξ_t  ξ_t ~ N(0, σ²_ξ)
State equation: μ_t = μ_(t-1) + δ_(t-1) + ω_t  ω_t ~ N(0, σ²_ω)
Observation equation: y_t = μ_t + ν_t  ν_t ~ N(0, σ²_ν)
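To get a feel for the local linear trend model, the three equations can be simulated directly. The sketch below is an illustration only: the function name, the starting values `mu0` and `delta0`, and the noise levels in the example call are assumptions, not values from these notes.

```python
import random


def simulate_local_linear_trend(n, sigma_xi, sigma_w, sigma_v,
                                mu0=0.0, delta0=0.0, seed=0):
    """Draw n observations y_t from the local linear trend model."""
    rng = random.Random(seed)
    mu, delta = mu0, delta0
    ys = []
    for _ in range(n):
        # State equation: the level moves by the previous trend plus noise.
        mu = mu + delta + rng.gauss(0.0, sigma_w)
        # Trend component: the trend itself is a random walk.
        delta = delta + rng.gauss(0.0, sigma_xi)
        # Observation equation: the level is seen through observation noise.
        ys.append(mu + rng.gauss(0.0, sigma_v))
    return ys


# With all noise switched off the model reduces to a straight line.
print(simulate_local_linear_trend(5, 0.0, 0.0, 0.0, mu0=1.0, delta0=2.0))
# A noisy path (hypothetical standard deviations).
print(simulate_local_linear_trend(5, 0.1, 0.5, 1.0))
```

Setting `sigma_xi` and `delta0` to zero gives the local level model as a special case.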

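Kalman filter

Prediction step (for the local level model)

μ_t = μ_(t-1)|(t-1)  μ_(t-1)|(t-1): filtered estimate of the state at time t-1
P_t = P_(t-1)|(t-1) + σ²_ω  P_(t-1)|(t-1): error variance of the filtered state estimate μ_(t-1)|(t-1) at time t-1
^y_t = μ_t  ^y_t: predicted observation at time t
F_t = P_t + σ²_ν  F_t: variance of the prediction error of the observation at time t

Filtering

K_t = P_t / (P_t + σ²_ν) = P_t / F_t  K_t: Kalman gain
y_resid,t = y_t - ^y_t  y_resid,t: prediction error of the observation (innovation)
μ_t|t = μ_t + K_t * y_resid,t
P_t|t = (1 - K_t ) * P_t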
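The prediction and filtering steps above translate almost line by line into code. The following is a minimal sketch for the local level model only; the function name, the initial values `mu0` and `p0`, and the toy data are assumptions made for the illustration.

```python
def local_level_kalman_filter(y, sigma2_w, sigma2_v, mu0=0.0, p0=1.0):
    """Return (filtered state, filtered variance, Kalman gain) for each y_t."""
    mu_filt, p_filt = mu0, p0
    results = []
    for obs in y:
        # Prediction step
        mu_pred = mu_filt            # mu_t = mu_(t-1)|(t-1)
        p_pred = p_filt + sigma2_w   # P_t = P_(t-1)|(t-1) + sigma^2_w
        y_hat = mu_pred              # predicted observation
        f = p_pred + sigma2_v        # variance of the prediction error
        # Filtering step
        k = p_pred / f               # Kalman gain
        resid = obs - y_hat          # innovation
        mu_filt = mu_pred + k * resid
        p_filt = (1.0 - k) * p_pred
        results.append((mu_filt, p_filt, k))
    return results


# Toy data (hypothetical): values cycling around 11.
y = [10.0 + (i % 3) for i in range(50)]
filtered = local_level_kalman_filter(y, sigma2_w=1.0, sigma2_v=1.0)
print(round(filtered[-1][2], 3))  # the gain settles at about 0.618 for these variances
```

The gain does not depend on the data, only on the two variances: when σ²_ω is large relative to σ²_ν the gain approaches 1 and the filter follows the observations closely, and in the opposite case it approaches 0 and the filter smooths heavily.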

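Diffuse Kalman filter

  An improved version of the Kalman filter

  The initial value of the state is handled as follows

Give up on specifying the initial value of the state
Instead, set the initial value of the "variance of the state prediction error" to infinity

Note: The diffuse Kalman filter improves the accuracy of the first predictions, but over time its values become the same as those of the ordinary Kalman filter, so it is probably unnecessary if accuracy at the start of the series does not matter.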
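The note above can be checked numerically. The sketch below does not implement an exact diffuse initialization; it approximates "infinite" initial variance with a very large `p0` (an assumption), runs the local level filter twice on the same hypothetical data, and compares the two runs at the start and at the end.

```python
def filter_local_level(y, sigma2_w, sigma2_v, mu0, p0):
    """Filtered state estimates of the local level model for each y_t."""
    mu, p = mu0, p0
    estimates = []
    for obs in y:
        p_pred = p + sigma2_w            # prediction variance
        k = p_pred / (p_pred + sigma2_v)  # Kalman gain
        mu = mu + k * (obs - mu)          # filtered state
        p = (1.0 - k) * p_pred            # filtered variance
        estimates.append(mu)
    return estimates


# Hypothetical series around 100, far from the initial guess mu0 = 0.
y = [100.0 + (i % 5) for i in range(30)]
ordinary = filter_local_level(y, 1.0, 1.0, mu0=0.0, p0=1.0)
diffuse_like = filter_local_level(y, 1.0, 1.0, mu0=0.0, p0=1e7)

print(ordinary[0], diffuse_like[0])    # the first estimates differ clearly
print(ordinary[-1], diffuse_like[-1])  # the last estimates are practically equal
```

With the large initial variance the first gain is almost 1, so the first filtered value is essentially the first observation and the poor initial guess is forgotten immediately. The ordinary run needs a few steps to catch up, after which the two are indistinguishable.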

Smoothing

  Omitted

State space model program example 1

State space model program example 2 (subclass version)

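Time-series analysis memo

2021-01-31T12:54:37+00:00

Lag / stochastic process / expected value / variance, standard deviation / autocovariance / autocorrelation

Reference site: "時系列分析で登場する統計量・用語を一つずつ解説" (a term-by-term explanation of the statistics and terminology used in time-series analysis)

Original series / differenced series / log series / log-differenced series / seasonally adjusted series

Reference site: "時系列分析の基本的なモデルをわかりやすく解説" (an accessible explanation of the basic models of time-series analysis)

Stationarity (loosely, mean reversion)

The lower figure shows a case with no mean reversion

Autocorrelation

= a measure of how strongly a time series is influenced by its own past values

In the mean-reverting case (upper figure) the series is influenced only by the value one step earlier, whereas in the non-mean-reverting case (lower figure) it is influenced by values further back in the past

Heteroscedasticity
= the variance is not constant along the time axis

The upper figure is a case where the variance looks uniform. The lower figure is a case where the variance is not uniform

The images above are from 『なぜ統計学がビジネスの意思決定において大事なのか？』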

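Stationary process

A stochastic process whose probability distribution does not change with time or position

Unit root process

"時系列分析の単位根過程、ランダムウォークとは？" (What are unit root processes and random walks in time-series analysis?)

Definition: if y_t is a non-stationary process and its differenced series y_t - y_(t-1) = Δy_t is a stationary process, then y_t is a unit root process.
The random walk, which economic and financial data often follow, is one kind of unit root process. ARIMA and SARIMA models with a differencing order of 1 or more also describe unit root processes.
Regressing one unit root process on another, x_t on y_t, can produce a "spurious regression": a significant relationship is found between two series that are in fact completely unrelated. For example, it can suggest a relationship between stock prices and carbon dioxide concentration, two quantities that have nothing to do with each other.
Spurious regression therefore needs attention whenever regression analysis is applied to time-series data.
To avoid it, check beforehand whether the two series x_t and y_t follow unit root processes. A unit root test is used for this check.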
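A small simulation makes the spurious relationship visible. The sketch below is an illustration under assumptions: it uses Gaussian random walks, the sample correlation instead of a full regression, and arbitrary values for the series length, number of trials, and seed. It compares the average absolute correlation between independent random walks with the same quantity after differencing.

```python
import math
import random


def random_walk(n, rng):
    """Cumulative sum of n standard normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def correlation(a, b):
    """Pearson sample correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def difference(x):
    return [x[i] - x[i - 1] for i in range(1, len(x))]


def mean_abs_correlations(n_trials=200, n=200, seed=0):
    rng = random.Random(seed)
    levels, diffs = 0.0, 0.0
    for _ in range(n_trials):
        x, y = random_walk(n, rng), random_walk(n, rng)  # independent by construction
        levels += abs(correlation(x, y))
        diffs += abs(correlation(difference(x), difference(y)))
    return levels / n_trials, diffs / n_trials


level_corr, diff_corr = mean_abs_correlations()
print(level_corr, diff_corr)
```

Although the walks are generated independently, the correlation between their levels is typically far from zero, while the correlation between their differences stays close to zero. This is why checking for unit roots (and differencing where needed) comes before regression.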

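Ljung-Box test

The Ljung-Box test checks whether a time series has autocorrelation, under the following hypotheses.

H0: the autocorrelation is 0 at every lag from 1 to m

H1: at least one of the autocorrelations at lags 1 to m is non-zero

With a significance level of 5%, a p-value below 5% is taken to mean "there is significant (apparently meaningful) autocorrelation". A p-value of 5% or more means "it cannot be said that there is significant autocorrelation" (the result is inconclusive). The Ljung-Box test is also applied to the residuals after fitting a time-series model, to confirm that no autocorrelation remains in them. The motivation is that the part the model could not explain should really be noise. In statsmodels it is implemented as acorr_ljungbox().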
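To show what the test computes, the Q statistic can be written out in plain Python. This is a sketch for illustration, not a replacement for statsmodels' acorr_ljungbox(): it returns only the statistic, and the decision is made by comparing it with 18.307, the 5% critical value of the chi-square distribution with 10 degrees of freedom (valid for m = 10 on a raw series). The function name and the example series are assumptions.

```python
def ljung_box_q(x, m):
    """Ljung-Box Q = n(n+2) * sum_{k=1..m} rho_k^2 / (n-k)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, m + 1):
        # Sample autocorrelation at lag k
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        q += rho_k * rho_k / (n - k)
    return n * (n + 2) * q


CHI2_CRIT_5PCT_DF10 = 18.307  # 5% critical value, chi-square with 10 d.o.f.

trend = [0.1 * t for t in range(100)]  # strongly autocorrelated toy series
q_trend = ljung_box_q(trend, 10)
print(q_trend > CHI2_CRIT_5PCT_DF10)   # H0 of "no autocorrelation" is rejected

print(ljung_box_q([1, -1, 1, -1], 1))  # small case that can be checked by hand: 4.5
```

When the test is run on the residuals of a fitted model, the degrees of freedom are usually reduced by the number of estimated parameters, so the critical value changes accordingly.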

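ADF test

The augmented Dickey-Fuller (ADF) test checks whether a time series is stationary. Its null hypothesis is that the series has a unit root, so a small p-value is evidence in favor of stationarity.

In statsmodels it is implemented as adfuller().

From "時系列解析の検証に使われる2つの仮説検定方法" (two hypothesis tests used to validate time-series analysis)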

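Converting non-stationary -> stationary

Differencing
Taking the difference from the value one time step earlier can make a series that is non-stationary in the mean stationary.
         Δy_t = y_t - y_(t-1)
Differencing may be applied several times, until the series can be regarded as stationary. Differencing once gives the first difference, and differencing once more gives the second difference. For periodic data, seasonal differencing, which takes the difference from the value one period earlier, is sometimes used.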
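First, second, and seasonal differences are all the same operation with a different lag or number of passes. A minimal sketch, with a hypothetical helper name and made-up example series:

```python
def difference(x, lag=1):
    """Return x_t - x_(t-lag) for every t where both values exist."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


quadratic = [1, 4, 9, 16, 25]               # trend that one difference does not remove
first = difference(quadratic)
second = difference(first)
print(first)    # [3, 5, 7, 9]  still trending
print(second)   # [2, 2, 2]     constant after the second difference

seasonal = [1, 2, 3, 4, 2, 3, 4, 5]          # period 4 with a slow upward drift
print(difference(seasonal, lag=4))           # [1, 1, 1, 1]  seasonal pattern removed
```

Each pass shortens the series by `lag` points, which is worth remembering when aligning the differenced series with other data.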

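Log transform
Taking the logarithm of a strictly positive original series is effective when the variability changes in proportion to the trend. This is because the logarithm turns a multiplicative structure into an additive one.
         log(X×Y) = logX + logY
Log-difference transform
In finance, the original price series is often log-transformed and then differenced, and the transformed data are called "log returns". Log returns have several advantages. For example, they remove the unfairness of treating "a 100-yen stock rising by 1 yen" the same as "a 1000-yen stock rising by 1 yen".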
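The 100-yen versus 1000-yen example can be worked through numerically. A minimal sketch, with a hypothetical helper name and made-up prices:

```python
import math


def log_returns(prices):
    """Log-difference transform: log(p_t) - log(p_(t-1))."""
    return [math.log(prices[t]) - math.log(prices[t - 1])
            for t in range(1, len(prices))]


cheap = log_returns([100, 101])    # a 100-yen stock rises by 1 yen
costly = log_returns([1000, 1001]) # a 1000-yen stock rises by 1 yen
print(f"{cheap[0]:.5f}", f"{costly[0]:.5f}")

# Log returns add up over time: their sum is the log of the total price ratio.
path = [100, 110, 99, 120]
print(sum(log_returns(path)), math.log(120 / 100))
```

The same 1-yen move is roughly a 1% log return for the cheap stock and roughly 0.1% for the expensive one, so the two are put on a comparable scale. The additivity shown in the last line is another reason log returns are convenient.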

From "非定常な時系列データを変換して定常性を持たせる解析" (transforming non-stationary time-series data to make it stationary)

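Regular expression memo

2020-09-09T05:29:22+00:00

Extracting strings that do not contain a given pattern

Reference site:

http://www-creators.com/archives/1827

Strings that do not start with a specific pattern
  ^(?!PATTERN).*$
Strings that do not end with a specific pattern
  ^.*(?<!PATTERN)$
Strings that do not contain a specific pattern
  ^(?!.*PATTERN).*$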
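The three cases can be tried out with Python's `re` module. In this sketch the literal `abc` stands in for PATTERN and the sample strings are made up. The "does not end with" case is written with a negative lookbehind, which is an assumption about the intended expression.

```python
import re

PATTERN = re.escape("abc")  # placeholder pattern (assumption)

not_starting = re.compile(rf"^(?!{PATTERN}).*$")     # negative lookahead at the start
not_ending = re.compile(rf"^.*(?<!{PATTERN})$")      # negative lookbehind at the end
not_containing = re.compile(rf"^(?!.*{PATTERN}).*$") # lookahead scanning the whole string

samples = ["abcdef", "defabc", "xxabcxx", "xyz"]

starts_ok = [s for s in samples if not_starting.match(s)]
ends_ok = [s for s in samples if not_ending.match(s)]
contains_ok = [s for s in samples if not_containing.match(s)]

print(starts_ok)    # ['defabc', 'xxabcxx', 'xyz']
print(ends_ok)      # ['abcdef', 'xxabcxx', 'xyz']
print(contains_ok)  # ['xyz']
```

Python's lookbehind requires a fixed-width pattern, so the second form works for literal patterns like this one but not for patterns of variable length.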

Code for loading PostgreSQL data directly into a pandas DataFrame

2020-09-09T00:36:43+00:00

The code is from the following GitHub repository:
https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
