pdxtra module#
PDXtra is a data analysis library built on top of Pandas and geared toward
time-series data analysis. The library offers subclasses of traditional
Pandas Series and DataFrame objects which provide additional
convenience methods for querying and manipulating data while retaining all
of the functionality of the original Pandas API. The library also
provides TimeSeries and TimeSeriesDataFrame subclasses which are
specifically designed to be used for time-series and financial data analysis.
This distribution has been developed and tested on a standalone Python build and could, under rare circumstances, exhibit unexpected behavior on standard CPython. Although no differences between the execution of code in the standalone build and normal CPython have been found during development, I nevertheless wish to provide fair warning.
- class pdxtra.DataFrame(*args, **kwargs)#
Bases: DataFrame
- set_temp_index(column_name)#
Context manager for temporarily switching the index column. The index column is automatically reverted back when the context manager exits.
Parameters#
- column_name : str
The name of the column to set as the index column while inside the scope of the context manager.
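The set-and-revert behaviour described for this context manager can be approximated with plain Pandas. The following is a minimal sketch and not PDXtra's implementation: `temp_index` is a hypothetical helper name, and it assumes the column stays in the frame while it serves as the index.

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def temp_index(df, column_name):
    """Hypothetical stand-in: use `column_name` as the index inside the block."""
    original_index = df.index
    df.index = pd.Index(df[column_name])
    try:
        yield df
    finally:
        # Revert even if the body of the `with` block raises.
        df.index = original_index


frame = pd.DataFrame({"key": ["a", "b", "c"], "value": [10, 20, 30]})
with temp_index(frame, "key") as view:
    looked_up = view.loc["b", "value"]  # label-based lookup on the temporary index
restored_index = list(frame.index)  # the default integer index is back
```

Restoring the index in a `finally` clause keeps the frame consistent when the code inside the `with` block fails.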
- vlookup(values, column, method=None, tolerance=None)#
Performs an Excel-like vertical lookup. The “v” in the function name has a double meaning: it stands for both “vertical” and “vector”, because unlike the typical vertical lookup in Excel, the function returns a vector of values (numpy.array) from the specified column rather than a scalar. Can perform nearest match on the index column.
Parameters#
- values : Sequence
A sequence of values to search for in the index column.
- column : str
The name of the column from which to extract the values whose indices match those found in the index column.
- method : {None, str}, default None
Method used to determine whether or not a value is considered a match or near-match. From the Pandas documentation:
  - None (default): exact matches only.
  - pad / ffill: find the previous index value if no exact match.
  - backfill / bfill: use the next index value if no exact match.
  - nearest: use the nearest index value if no exact match. Tied distances are broken by preferring the larger index value.
- tolerance : {numeric, date, datetime, None}, default None
The maximum distance to search on either side of the index value when performing nearest match. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the index values and the values supplied in the function call. This argument is ignored when the method is None. See pandas.Index.get_indexer for further explanation.
Returns#
A 1-d Numpy array, which can be empty if no values are found.
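Because the parameters mirror those of pandas.Index.get_indexer, the lookup can be sketched with plain Pandas. This is an illustrative approximation under that assumption, not PDXtra's code; `vlookup_sketch` and the sample frame are hypothetical.

```python
import pandas as pd


def vlookup_sketch(df, values, column, method=None, tolerance=None):
    """Return the values of `column` at the index positions matching `values`."""
    if method is None:
        # Exact matching; the tolerance is ignored, as documented above.
        positions = df.index.get_indexer(values)
    else:
        positions = df.index.get_indexer(values, method=method, tolerance=tolerance)
    found = positions[positions >= 0]  # get_indexer marks misses with -1
    return df[column].to_numpy()[found]


prices = pd.DataFrame({"price": [1.0, 2.0, 3.0]}, index=[10, 20, 30])
exact = vlookup_sketch(prices, [20, 31], "price")
near = vlookup_sketch(prices, [19, 31, 50], "price", method="nearest", tolerance=2)
```

Values without a match are dropped, so the result can be shorter than the input sequence or empty.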
- xlookup(row_idx, column_name, nearest_match=False, tolerance=5, lookback=True)#
Perform an Excel-style ‘xlookup’ by extracting the value(s) at the intersection of a set of (x,y) coordinates, where the Y coordinate is an index value and the X coordinate is a column name. Can perform nearest match on numerical and date indexes.
Parameters#
- row_idx : {int, float, date, datetime}
The index value specifying which row is to be searched. When using ‘nearest_match’ this value becomes the pivot value supplied to the ‘nearest_to’ function.
- column_name : str
The name of the column from which to extract the value.
- nearest_match : bool, default False
If true, use the ‘nearest_to’ function to find the value of the index column which is closest to the ‘row_idx’ value.
- tolerance : int, default 5
The maximum distance to search on either side of the ‘row_idx’ value when performing nearest match. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the index values and the value assigned to ‘row_idx’. This argument is ignored when ‘nearest_match’ is false.
- lookback : bool, default True
Whether or not to prioritize looking back vs. looking forwards along the sorted values of the Series when performing nearest match. This argument is ignored when ‘nearest_match’ is false.
Returns#
A value from the column being indexed, or None if no value is returned on nearest match.
Raises#
KeyError: When the dataframe cannot be indexed using the coordinates provided.
TypeError: When nearest match is performed on a series which contains values that do not support subtraction between themselves and the other values contained in the Series, or when a comparison between differences and the tolerance is considered ambiguous.
Examples#
Find the value in column “integers” at index 2.
>>> import pdxtra as pdx
>>> df = pdx.DataFrame({"integers": [1, 2, 3, 4, 5]})
>>> df.xlookup(2, "integers")
3
Find the value in column “integers” where the index equals “d”.
>>> df.index = ["a", "b", "c", "d", "e"]
>>> df.xlookup("d", "integers")
4
Find the value in column “integers” where the index is nearest to 3.
>>> df.index = [2, 4, 6, 8, 10]
>>> df.xlookup(3, "integers", nearest_match=True)
1
Find the value in column “integers” where the index is nearest to 3, but look forwards instead of backwards.
>>> df.xlookup(3, "integers", nearest_match=True, lookback=False)
2
- class pdxtra.Series(*args, **kwargs)#
Bases: Series
- nearest_to(pivot, tolerance=5, lookback=True, float_method='fuzzy')#
Sort then search the series for a value which is nearest to the pivot value. If an exact match is found, then that match will be returned. All values in the series must support subtraction between themselves and the other values of the series.
Can look forwards and backwards for matches, but does so hierarchically. For instance, a forward lookup will only return the forward value if the distance between it and the pivot value is less than the distance between any of the lower values and the pivot value. The same logic holds true for lookbacks. Any value above or below the pivot value, whose distance from the pivot is greater than the absolute value of the pivot plus the tolerance, will be excluded from the search.
Note
This function can be called directly on a Series object.
Parameters#
- series : Series
The Pandas series to be searched. When called as a method from the Series class, this argument is assigned to the class instance.
- pivot : {int, float, date, datetime}
The value on which to match. The search window (as defined by the tolerance) “pivots” around this value.
- tolerance : {int, float, timedelta}, default 5
The maximum distance to search on either side of the pivot. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the series values and the pivot value.
- lookback : bool, default True
Whether or not to prioritize looking backwards vs. looking forwards along the sorted values of the series.
- float_method : str {fuzzy, precise}, default fuzzy
Method for comparing floats when determining whether or not the difference between the pivot and a given value exceeds the bounds set by the tolerance.
Warning
Using precise will only increase the likelihood that the value returned falls within bounds. Neither of the two methods, fuzzy or precise, can guarantee perfect comparison of floating-point numbers. See the documentation for numpy.isclose for additional information on how this method compares floating-point numbers when precise is used. For an array of floating-point numbers very close to zero (magnitudes much smaller than one), precise may produce false-positive comparisons. For columns which do not contain floating-point numbers, the default (fuzzy) should always be used.
Returns#
Either a value from the series or None if no match is found.
Raises#
TypeError: One or more of the values does not support subtraction between itself and the other values in the series, or a comparison between differences and the tolerance is considered ambiguous.
ValueError: The ‘float_method’ argument which was passed to the method is not one of the accepted strings.
Examples#
Get the value from the series which is nearest to seven.
>>> import pandas as pd
>>> import pdxtra as pdx
>>> values = pd.Series([1, 2, 3, 4, 8, 9])
>>> pdx.nearest_to(values, 7)
8
Get the value from the series which is nearest to three and use the default lookback=True for tie breaking.
>>> values = pd.Series([1, 2, 4, 5])
>>> pdx.nearest_to(values, 3)
2
Get the value from the series which is nearest to three and look forward to break the tie.
>>> pdx.nearest_to(values, 3, lookback=False)
4
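The tie-breaking rule can be sketched in a few lines of plain Python. This is not PDXtra's implementation: `nearest_to_sketch` is a hypothetical name, and the sketch assumes the search window simply excludes every value farther than the tolerance from the pivot.

```python
def nearest_to_sketch(values, pivot, tolerance=5, lookback=True):
    """Return the value closest to `pivot`, or None if none lies within `tolerance`."""
    candidates = [v for v in values if abs(v - pivot) <= tolerance]
    if not candidates:
        return None

    def rank(value):
        is_forward = value > pivot
        # On equal distances, False sorts first: prefer the backward value
        # when lookback is True and the forward value otherwise.
        tie_break = is_forward if lookback else not is_forward
        return (abs(value - pivot), tie_break)

    return min(candidates, key=rank)


nearest_single = nearest_to_sketch([1, 2, 3, 4, 8, 9], 7)
tie_backward = nearest_to_sketch([1, 2, 4, 5], 3)
tie_forward = nearest_to_sketch([1, 2, 4, 5], 3, lookback=False)
out_of_window = nearest_to_sketch([1, 2], 100)
```

An exact match has distance zero and therefore always wins, regardless of the lookback setting.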
- normalize(multiplier=1.5)#
Returns a series with outliers removed via the IQR method.
Note
This function can be called directly on a DataFrame object.
Parameters#
- multiplier : {int, float}, default 1.5
The value by which to multiply (extend) upper and lower bounds.
Returns#
A 1-d Numpy array.
Examples#
>>> import pdxtra as pdx
>>> s = pdx.Series([-30, 1, 2, 3, 30])
>>> s = pdx.normalize(s)
>>> s
array([1, 2, 3])
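For readers unfamiliar with the IQR method, the filter can be sketched with the standard library alone. This is an illustration, not PDXtra's code: `remove_outliers_iqr` is a hypothetical name, and the quartile convention used here (inclusive, linearly interpolated) is an assumption that may differ from the library's.

```python
from statistics import quantiles


def remove_outliers_iqr(values, multiplier=1.5):
    """Keep only the values inside [Q1 - m * IQR, Q3 + m * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # the interquartile range
    lower = q1 - multiplier * spread
    upper = q3 + multiplier * spread
    return [v for v in values if lower <= v <= upper]


data = [-30, 1, 2, 3, 30]
filtered = remove_outliers_iqr(data)                 # bounds are -2 and 6
kept_all = remove_outliers_iqr(data, multiplier=20)  # bounds are -39 and 43
```

A larger multiplier widens the bounds, so fewer values are treated as outliers.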
- true_ema(span)#
Calculate the true N-day exponential moving average of the Series.
Calculations for exponential moving averages in Pandas start on the last position of the span, which means that for a moving average with a span of N days to be a true N-day moving average, we need to add one to the minimum number of periods. This method produces an exponential moving average where ‘min_periods’ for the exponential moving window is always ‘span + 1’.
Parameters#
- span : int
From the Pandas documentation: “Specify decay in terms of span.” See the Pandas documentation on pandas.Series.ewm for a full explanation.
Returns#
A subclass of Pandas ExponentialMovingWindow.
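The effect of ‘min_periods = span + 1’ can be reproduced with plain Pandas. The snippet below is a sketch of the idea using pandas.Series.ewm directly and made-up prices; it is not the PDXtra method itself.

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 11.5, 12.5, 13.0])
span = 3

# Require span + 1 observations before the window yields a value.
window = prices.ewm(span=span, min_periods=span + 1)
ema = window.mean()

leading_gaps = int(ema.isna().sum())  # the first `span` positions stay NaN
```

Calling `mean()` on the window object is what produces the averaged series; the window alone only holds the configuration.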
- property linearize#
Returns the linear regression of a Series.
- class pdxtra.TimeSeries(*args, **kwargs)#
Bases: Series
- intersects(other)#
Returns the (x,y) coordinates for each point of intersection between the Series and the sequence supplied to the method. The coordinate vectors are returned as a 2-d Numpy array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For an array of N points, the coordinate pair at position “i” is given by the index [:, i].
Parameters#
- other : Sequence
A sequence of values forming some finite curve that is expected to intersect with the curve represented by the series. This sequence must be the same length as the series itself.
Returns#
A 2-d Numpy array.
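The layout of the returned array can be illustrated with a hand-built example. The numbers below are invented for illustration and are not the output of any PDXtra call.

```python
import numpy as np

# Row 0 holds the X values, row 1 the Y values; each column is one point.
coords = np.array([
    [1.5, 4.0, 7.25],   # X values
    [10.0, 12.5, 9.0],  # Y values
])

x_values = coords[0, :]
y_values = coords[1, :]
second_point = coords[:, 1]       # the (x, y) pair at position i = 1
as_pairs = coords.T.tolist()      # one [x, y] list per intersection
```

Transposing the array is a convenient way to iterate over the points one pair at a time.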
- macd(short_span=12, long_span=26)#
Converts the Series to moving average convergence-divergence.
Parameters#
- short_span : int, default 12
The length of the short rolling window.
- long_span : int, default 26
The length of the long rolling window.
Returns#
A PDXtra TimeSeries object.
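As a point of reference, the conventional MACD line is the short-span exponential moving average minus the long-span one. The sketch below uses plain Pandas with `adjust=False`; both that setting and the name `macd_sketch` are assumptions, and PDXtra's own calculation may differ (for example in how it handles the minimum number of periods).

```python
import pandas as pd


def macd_sketch(series, short_span=12, long_span=26):
    """Short-span EMA minus long-span EMA (conventional MACD line)."""
    short_ema = series.ewm(span=short_span, adjust=False).mean()
    long_ema = series.ewm(span=long_span, adjust=False).mean()
    return short_ema - long_ema


rising = pd.Series([float(i) for i in range(1, 61)])
macd_line = macd_sketch(rising)
```

For a steadily rising series the short average tracks the price more closely than the long one, so the line ends above zero.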
- near_intersects(other)#
Returns the (x,y) coordinates for each point of intersection between the series and the sequence supplied to the method. The coordinate vectors are returned as a 2-d Numpy array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For an array of N points, the coordinate pair at position “i” is given by the index [:, i]. Every pair represents the nearest non-interpolated coordinates that precede a point of intersection.
Parameters#
- other : Sequence
A sequence of values forming some finite curve that is expected to intersect with the curve represented by the series. This sequence must be the same length as the series itself.
Returns#
A 2-d Numpy array.
- class pdxtra.TimeSeriesDataFrame(*args, **kwargs)#
Bases: DataFrame
- intersects(column_1, column_2)#
Returns the (x,y) coordinates for each point of intersection of two finite curves as a 2-d array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For an array of N points, the coordinate pair at position “i” is given by the index [:, i].
Parameters#
- column_1 : str
The name of the first Series representing a finite curve.
- column_2 : str
The name of the second Series whose curve is expected to intersect with the first.
Returns#
A 2-d Numpy array.
- near_intersects(column_1, column_2)#
Returns the (x,y) coordinates for each point of intersection of two finite curves as a 2-d Numpy array where index [0, :] contains the X values and index [1, :] contains the Y values for all points of intersection. For an array of N points, the coordinate pair at position “i” is given by the index [:, i]. Every pair represents the nearest non-interpolated coordinates that precede a point of intersection.
Parameters#
- column_1 : str
The name of the first Series representing a finite curve.
- column_2 : str
The name of the second Series whose curve is expected to intersect with the first.
Returns#
A 2-d Numpy array.
- obv(close, volume)#
Calculates the on-balance volume of the close and volume columns specified. Specifically designed for financial, time-series data analysis.
Parameters#
- close : str
Name of the column which represents closing prices.
- volume : str
Name of the column which represents the trading volume.
Returns#
A PDXtra TimeSeries object.
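On-balance volume is conventionally a running total that adds the day's volume when the close rises and subtracts it when the close falls. The plain-Python sketch below follows that textbook definition with invented data; `obv_sketch` is a hypothetical name, and starting the total at zero is an assumption that may not match PDXtra.

```python
def obv_sketch(close, volume):
    """Running on-balance volume, starting from zero on the first row."""
    if not close:
        return []
    running = 0
    result = [running]
    for i in range(1, len(close)):
        if close[i] > close[i - 1]:
            running += volume[i]   # price rose: add the volume
        elif close[i] < close[i - 1]:
            running -= volume[i]   # price fell: subtract the volume
        result.append(running)     # unchanged price leaves the total as is
    return result


closes = [10.0, 11.0, 10.5, 10.5, 12.0]
volumes = [100, 200, 150, 120, 300]
obv_values = obv_sketch(closes, volumes)
```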
- pdxtra.downcast_dataframe(df, integers=True, floats=True)#
Downcast all numeric columns of the dataframe while retaining the sign of the values. The precision warnings associated with pandas.to_numeric also apply here.
Parameters#
- df : DataFrame
The dataframe to be downcast.
- integers : bool, default True
Whether or not to downcast integer columns.
- floats : bool, default True
Whether or not to downcast floating-point columns.
Note
Due to the way in which pandas.select_dtypes selects columns, nullable Pandas types will automatically be downcast and cannot be selectively excluded when calling the function. Instead, exclude the nullable type columns and downcast the view or copy.
Returns#
A Pandas DataFrame.
Raises#
DowncastTypeError: When attempting to downcast an individual Series.
Examples#
Downcast a dataframe containing an integer and a float column.
>>> import pdxtra as pdx
>>> df = pdx.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
>>> df.dtypes
a      int64
b    float64
dtype: object
>>> df = pdx.downcast_dataframe(df)
>>> df.dtypes
a       int8
b    float32
dtype: object
Downcast only the integer columns.
>>> df = pdx.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
>>> df.dtypes
a      int64
b    float64
dtype: object
>>> df = pdx.downcast_dataframe(df, floats=False)
>>> df.dtypes
a       int8
b    float64
dtype: object
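Since the description points to pandas.to_numeric, a comparable result can be obtained with plain Pandas. The following is a rough sketch under that assumption; `downcast_sketch` is a hypothetical name and the selection of columns may differ from what PDXtra does, particularly for nullable types.

```python
import pandas as pd


def downcast_sketch(df, integers=True, floats=True):
    """Downcast numeric columns to the smallest signed dtype that fits."""
    result = df.copy()
    if integers:
        for name in result.select_dtypes(include="integer").columns:
            # "integer" keeps the dtype signed (int8, int16, ...).
            result[name] = pd.to_numeric(result[name], downcast="integer")
    if floats:
        for name in result.select_dtypes(include="floating").columns:
            result[name] = pd.to_numeric(result[name], downcast="float")
    return result


frame = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
both = downcast_sketch(frame)
ints_only = downcast_sketch(frame, floats=False)
```

Working on a copy leaves the dtypes of the original frame untouched.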
- pdxtra.find_intersects(x, y1, y2)#
Returns the (x,y) coordinates for all points of intersection of two finite curves. Each array should represent the Y values of some curve whose X values match the values of the ‘x’ parameter.
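One common way to locate such points is to look for sign changes in the difference between the two curves and interpolate linearly between the neighbouring samples. The sketch below shows that technique in plain Python; `find_intersects_sketch` is a hypothetical name, it returns a list of pairs rather than a Numpy array, and it is not claimed to be PDXtra's algorithm.

```python
def find_intersects_sketch(x, y1, y2):
    """Return (x, y) pairs where the piecewise-linear curves y1 and y2 cross."""
    points = []
    for i in range(len(x) - 1):
        before = y1[i] - y2[i]
        after = y1[i + 1] - y2[i + 1]
        if before == 0:
            points.append((x[i], y1[i]))  # the curves touch exactly at a sample
        elif before * after < 0:
            # Sign change: interpolate to the point where the difference is zero.
            t = before / (before - after)
            xi = x[i] + t * (x[i + 1] - x[i])
            yi = y1[i] + t * (y1[i + 1] - y1[i])
            points.append((xi, yi))
    if len(x) > 0 and y1[-1] == y2[-1]:
        points.append((x[-1], y1[-1]))
    return points


crossing = find_intersects_sketch([0, 1, 2, 3], [0, 1, 2, 3], [3, 2, 1, 0])
```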
- pdxtra.find_near_intersects(x, y1, y2)#
Returns the (x,y) coordinates for all points that precede a point of intersection of two finite curves. Each array should represent the Y values of some curve whose X values match the values of the ‘x’ parameter.
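The non-interpolated variant can be sketched by reporting the sample just before each sign change in the difference between the curves. This is only an illustration: `find_near_intersects_sketch` is a hypothetical name, and taking the Y value from the first curve is an assumption, since the text does not say which curve supplies it.

```python
def find_near_intersects_sketch(x, y1, y2):
    """Return the sampled (x, y1) point that precedes each crossing."""
    points = []
    for i in range(len(x) - 1):
        before = y1[i] - y2[i]
        after = y1[i + 1] - y2[i + 1]
        if before * after < 0:  # the curves cross between samples i and i + 1
            points.append((x[i], y1[i]))
    return points


preceding = find_near_intersects_sketch([0, 1, 2, 3], [0, 1, 2, 3], [3, 2, 1, 0])
```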
- pdxtra.nearest_to(series, pivot, tolerance=5, lookback=True, float_method='fuzzy')#
Sort then search the series for a value which is nearest to the pivot value. If an exact match is found, then that match will be returned. All values in the series must support subtraction between themselves and the other values of the series.
Can look forwards and backwards for matches, but does so hierarchically. For instance, a forward lookup will only return the forward value if the distance between it and the pivot value is less than the distance between any of the lower values and the pivot value. The same logic holds true for lookbacks. Any value above or below the pivot value, whose distance from the pivot is greater than the absolute value of the pivot plus the tolerance, will be excluded from the search.
Note
This function can be called directly on a Series object.
Parameters#
- series : Series
The Pandas series to be searched. When called as a method from the Series class, this argument is assigned to the class instance.
- pivot : {int, float, date, datetime}
The value on which to match. The search window (as defined by the tolerance) “pivots” around this value.
- tolerance : {int, float, timedelta}, default 5
The maximum distance to search on either side of the pivot. If searching a series containing date or datetime objects, the user is responsible for providing the appropriate timedelta for computing differences between the series values and the pivot value.
- lookback : bool, default True
Whether or not to prioritize looking backwards vs. looking forwards along the sorted values of the series.
- float_method : str {fuzzy, precise}, default fuzzy
Method for comparing floats when determining whether or not the difference between the pivot and a given value exceeds the bounds set by the tolerance.
Warning
Using precise will only increase the likelihood that the value returned falls within bounds. Neither of the two methods, fuzzy or precise, can guarantee perfect comparison of floating-point numbers. See the documentation for numpy.isclose for additional information on how this method compares floating-point numbers when precise is used. For an array of floating-point numbers very close to zero (magnitudes much smaller than one), precise may produce false-positive comparisons. For columns which do not contain floating-point numbers, the default (fuzzy) should always be used.
Returns#
Either a value from the series or None if no match is found.
Raises#
TypeError: One or more of the values does not support subtraction between itself and the other values in the series, or a comparison between differences and the tolerance is considered ambiguous.
ValueError: The ‘float_method’ argument which was passed to the method is not one of the accepted strings.
Examples#
Get the value from the series which is nearest to seven.
>>> import pandas as pd
>>> import pdxtra as pdx
>>> values = pd.Series([1, 2, 3, 4, 8, 9])
>>> pdx.nearest_to(values, 7)
8
Get the value from the series which is nearest to three and use the default lookback=True for tie breaking.
>>> values = pd.Series([1, 2, 4, 5])
>>> pdx.nearest_to(values, 3)
2
Get the value from the series which is nearest to three and look forward to break the tie.
>>> pdx.nearest_to(values, 3, lookback=False)
4
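The warning about floating-point comparison is easier to appreciate with a concrete case. The snippet below uses only the standard library to show the general problem; it does not reproduce how PDXtra implements ‘fuzzy’ or ‘precise’, and the numbers are chosen purely for illustration.

```python
import math

pivot, candidate, tolerance = 0.3, 0.4, 0.1

# Mathematically the distance is exactly 0.1, but the stored result is
# slightly larger, so a strict comparison rejects the candidate.
distance = abs(candidate - pivot)
strict_within = distance <= tolerance

# Treating "close enough to the bound" as inside the window accepts it.
lenient_within = strict_within or math.isclose(distance, tolerance)

# An absolute tolerance of 1e-8 (the default `atol` of numpy.isclose) makes
# very small numbers compare as equal even though one is twice the other.
tiny_false_positive = math.isclose(1e-10, 2e-10, abs_tol=1e-8)
```

This is why neither comparison style can guarantee a correct answer at the edge of the search window.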
- pdxtra.normalize(series, multiplier=1.5)#
Returns a series with outliers removed via the IQR method.
Note
This function can be called directly on a DataFrame object.
Parameters#
- multiplier : {int, float}, default 1.5
The value by which to multiply (extend) upper and lower bounds.
Returns#
A 1-d Numpy array.
Examples#
>>> import pdxtra as pdx
>>> s = pdx.Series([-30, 1, 2, 3, 30])
>>> s = pdx.normalize(s)
>>> s
array([1, 2, 3])